Detecting flu infections with Google searches

Tobias Preis explains how Google search keyword data has been used to detect the spread of influenza infection.
7
Right now, let’s discuss one of the most fascinating examples of using online data. It’s an example which has already become famous, and which people have also started to criticise. So what am I talking about? I’m talking about the example where the Centers for Disease Control in Atlanta in the US teamed up with engineers at Google. It’s one of the most impressive nowcasting examples. As you remember, we talked about nowcasting. It’s about forecasting the present. It’s the problem of getting estimates when traditional methods are time-lagged. And the number of flu cases in a country like the US is one of these examples.
65
So the Centers for Disease Control in Atlanta are tasked with coming up with numbers for people in the US presenting themselves with influenza-like illness symptoms, ILI. These numbers, again, are aggregated through several layers from a network of doctors based in the US. And when the Centers for Disease Control get the final number, they know how many influenza-like illness cases there were about two weeks ago. And now the question is: how good can your decision be, to maybe close an airport, to make another kind of intervention, to change the real-world system to minimise the impact of a flu wave, if it is based on information which is two weeks old?
128.8
So when Google engineers and the CDC teamed up, they were looking into the possibility of finding relationships between online data, in particular what people search for online on Google, and the number of people presenting themselves to doctors with ILI symptoms. Google has a massive amount of data on what people are searching for. And what they did was a kind of brute-force analysis, correlating all possible search terms people search for with the number of flu cases in the US. So that’s a massive enterprise, to master the computational requirements for this exercise. And what they came up with in the end was a list of the search terms most closely related to the number of flu cases.
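The brute-force step described here can be sketched in a few lines. This is only a minimal illustration, not the method used in the study: the search terms and weekly numbers are invented, and Pearson correlation is an assumed choice of similarity measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no information
    return cov / sqrt(var_x * var_y)


def rank_terms(search_volumes, ili_cases):
    """Return (term, correlation) pairs, most strongly correlated first."""
    scores = [
        (term, pearson(series, ili_cases))
        for term, series in search_volumes.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical weekly data, invented purely for illustration.
ili_cases = [10, 12, 20, 35, 50, 42, 25, 14]
search_volumes = {
    "flu symptoms": [100, 115, 190, 340, 510, 400, 260, 130],
    "fever remedy": [80, 85, 120, 200, 260, 240, 150, 90],
    "football scores": [300, 280, 310, 290, 305, 295, 300, 285],
}

ranking = rank_terms(search_volumes, ili_cases)
for term, r in ranking:
    print(f"{term}: r = {r:.2f}")
```

In this toy data the two flu-related terms rise and fall with the case numbers and end up at the top of the ranking, while the unrelated term ends up at the bottom. With millions of candidate terms the same loop becomes the large computational exercise mentioned above.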
194.5
So very interestingly, this list contained a lot of flu-related terms: symptoms you might have, and symptoms you might search for online to find out what you should do if you have the flu. This information is, again, readily accessible right after its creation, that is, right after the search term is typed in. So by using this information, Google, together with the CDC, were able to provide improved forecasts of the present, nowcasts, of the number of ILI cases in the US. This work was published in a paper in Nature.
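To make the idea of a nowcast concrete, here is a minimal sketch under simplifying assumptions. It assumes a single search-volume series and a plain straight-line fit, which is a simplification chosen for illustration and not the model from the published study. All numbers, and the names `fit_line` and `REPORTING_DELAY`, are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical weekly data. Search volume is known up to the current week,
# while official ILI counts lag by REPORTING_DELAY weeks.
REPORTING_DELAY = 2
search_volume = [100, 120, 200, 350, 500, 420, 300, 180]
ili_reported = [10, 13, 19, 36, 50, 41]  # last two weeks not yet published

# Fit on the weeks where both series are available.
n_known = len(ili_reported)
intercept, slope = fit_line(search_volume[:n_known], ili_reported)

# Nowcast the weeks for which only search data exists so far.
nowcasts = [intercept + slope * v for v in search_volume[n_known:]]
print([round(v, 1) for v in nowcasts])
```

The two printed values are estimates for the weeks whose official figures have not arrived yet, which is exactly the gap that the two-week reporting delay leaves open.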
245.4
So based on this finding, Google set up a service called Google Flu Trends, which provides figures and estimates of the current level of flu in a number of countries all over the world. This is a service which has been running for quite a while now and which in some countries even got incorporated into official measures. So this would be a success story if the story stopped here. But it doesn’t. What people recognised over time was that this Google-based estimate of flu cases, from time to time, wasn’t as accurate as people might have hoped.
296.5
So in particular, a couple of years ago, when Google forecasts were compared with the actual number of flu cases in the US, there was a mismatch. Google dramatically overestimated the number of people having the flu. And people started to question why this might have happened and what the underlying reasons were. What we have seen in particular is that when new flu strains, in particular H1N1, were subject to public discussion, these Google-based estimates weren’t as precise as they could have been.
346.9
And so a number of people came up with the idea that, in particular, when a lot of people, triggered maybe by media coverage, are looking up flu and flu-related symptoms online, then this obviously no longer closely matches what is going on in the world right now in terms of actual disease cases. So we have to be careful about changes in the underlying behaviour of people, which might be triggered by all sorts of external factors. This is the second part of the Google Flu Trends story, where we need to be a little bit careful about how to use these sources.
397.4
But this is also not the end, because a number of people, including Suzy and me, came up with the idea of using adaptive nowcasting models. So instead of training a model over the entire time period for which you have data available and using it as a static model, we adapt our model. We train it on only short parts of the time series, so that it reflects online behaviour during that time period only. What Google and others did was to train one model in the first place and use this model without alterations and modifications, leading to mismatches when the behaviour of people changed.
454.5
By using shorter time windows in which we train our models and produce nowcasts for the next period, in the case of flu, for the next time step, and then retraining our model one week later, we are able to better incorporate search volume and its changing nature. So we will see how this develops over time and how successful this approach will be in various countries in the world. But the entire story really highlights the importance of being aware of changes in online behaviour and how we might be able to use this information on an ongoing basis.
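The difference between a static and an adaptive model can be sketched as follows. This is an illustration under stated assumptions rather than the authors’ actual model: it uses a straight-line fit on one search series, invented weekly data in which search volume per flu case doubles from week 8 onwards (standing in for a media-driven change in behaviour), and it assumes that the official figure for the previous week is already available when the model is retrained. The function names and the `WINDOW` length are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0  # no variation in x: fall back to the mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def static_nowcasts(search, ili, window):
    """Fit once on the first `window` weeks and never retrain."""
    intercept, slope = fit_line(search[:window], ili[:window])
    return [intercept + slope * v for v in search[window:]]


def adaptive_nowcasts(search, ili, window):
    """Refit every week on the preceding `window` weeks only."""
    estimates = []
    for t in range(window, len(search)):
        intercept, slope = fit_line(search[t - window:t], ili[t - window:t])
        estimates.append(intercept + slope * search[t])
    return estimates


def mean_abs_error(estimates, actual):
    return sum(abs(e - a) for e, a in zip(estimates, actual)) / len(actual)


# Hypothetical data: 10 searches per case for 8 weeks, then 20 per case.
ili = [10, 14, 20, 28, 36, 30, 22, 16, 12, 18, 26, 34, 28, 20]
search = [100, 140, 200, 280, 360, 300, 220, 160,
          240, 360, 520, 680, 560, 400]
WINDOW = 4

actual = ili[WINDOW:]
static = static_nowcasts(search, ili, WINDOW)
adaptive = adaptive_nowcasts(search, ili, WINDOW)

print("static MAE:  ", round(mean_abs_error(static, actual), 2))
print("adaptive MAE:", round(mean_abs_error(adaptive, actual), 2))
```

In this toy series the static model keeps overestimating after the behaviour change, because it still assumes the old ratio of searches to cases. The adaptive model is also wrong for a few weeks while its training window straddles the change, but it recovers once the window contains only recent data.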

Traditional measurements of the number of people who currently have the flu rely on flu patients visiting their doctor, and their doctor reporting flu cases to a central health authority, such as the Centers for Disease Control and Prevention (CDC) in the US. This data collection process can take a while, and as a result, there is usually a delay of one or two weeks in making the data available.

In a well-known study, Google collaborated with the CDC to investigate whether the number of flu infections could be estimated from data on how often Internet users had searched for flu-related keywords, such as flu symptoms. As we discussed in Week 2, data on how frequently Internet users are searching for keywords is available to Google with no delay, opening up an opportunity to get much quicker estimates of the number of people currently infected with the flu. After watching the video above, you can see what went wrong in January 2013, and how appropriate analyses might allow flu scare problems to be avoided and valuable information to be extracted from Google data.

This article is from the free online

Big Data: Measuring And Predicting Human Behaviour

Created by
FutureLearn - Learning For Life
