This content is taken from The University of Warwick's online course, Big Data: Measuring and Predicting Human Behaviour.

Skip to 0 minutes and 4 seconds Hi, Susie. Hi, Tobias. So how was the break? It was good. I had a good week. I hope it gave the learners a chance to catch up. It was so good to see so many of them on Twitter. Great, so as always, I've been gathering some questions the learners have been asking. So, ready to get started? Yep, let's go. OK, so the first question is, why would you want to Nowcast the flu? OK, so one of the points that we've been trying to get across in this course is that for people to make good decisions, they need the best possible understanding both of what's going to happen in the future, but also simply what's happening in the world right now.

Skip to 0 minutes and 46 seconds A lot of the time, the kind of decisions you're trying to make are about resource allocation and certainly in the health services, this is a really crucial question. So you have a number of resources you need to take care of. You need to work out when you're going to need hospital beds and who for. You need to work out where vaccines need to be. And you've also got the very important personnel themselves. You need to work out how they're going to be allocated. And so the goal of Nowcasting the flu is essentially to reduce the delay in measurements of the flu. So traditional measurements, as we discussed this week, are often delayed by one or two weeks.

Skip to 1 minute and 26 seconds And so what we're trying to do here is come up with better estimates of how many people have got the flu right now so that people in the health service have got the best possible understanding of the situation at the moment, so they can make these important decisions about allocation of resources, such as hospital beds or vaccines or health personnel. So how does this Nowcasting get affected by people that don't google? In particular, how about the elderly, and what happens when a family-- or if somebody's searching for their whole family? That is a very good question. So I think this issue of bias in big data sources is really important.

Skip to 2 minutes and 6 seconds And it would certainly be misplaced to think that just because we have really large data sets, we're actually measuring everybody and everything. That's clearly not the case. But what we find is useful with these data sets, such as data from Google, is simply that you do have access to this data very quickly, and you do have access to data on a very large sample of people, even if it's not everybody, and it's not equally distributed across society. And so the idea is to leverage these advantages, simply because we've seen that by traditional methods it's difficult to get these measurements quite so quickly. And it's also difficult to measure that larger sample of population.

Skip to 2 minutes and 56 seconds So, for example, with surveys, it could take a very long time to gather that sort of information, or it might just be too expensive to sample quite so many people. And what you then see from the statistics that we carry out, looking at this from a purely quantitative point of view, is that you can get value from that data. You can use that information to improve both your predictions of the future, but crucially here, your predictions of the present, so your estimates of what's going on right now. So the quantitative results show that it works.

Skip to 3 minutes and 29 seconds Although when looking at our models and our data in more detail, we absolutely do need to remember in the back of our minds that we're not going to have sampled everybody in society, because there are differences in who uses Google and other kinds of technology and who doesn't. And so a related question-- how accurate are the CDC measures as a measure of the number of people that have the flu? I saw that question, and I thought that was a really good point. So we know that we've got these CDC measures, these official measures, from people going to the doctor and saying they've got the flu and these statistics being collected.

Skip to 4 minutes and 5 seconds But it was pointed out that this isn't necessarily actually a measure of everybody who's got the flu right now, because, potentially, you could have the flu and not go to the doctor. I think this is a really good point. So, yes, it's true, what we see from the CDC is not necessarily a measurement of everybody who's got the flu.

Skip to 4 minutes and 26 seconds However, if you come back to this first point of why we'd actually want to Nowcast the flu in the first place, the fact that we need to make decisions about these health service resources, like hospital beds and vaccines and health personnel, you're thinking there of people who've been affected so badly by the flu that they do need the use of the health services. They possibly even need to be in hospital, because we know that sadly, people do die from the flu. And so from that point of view, the measure we have from the CDC is a measure of people who were sufficiently affected by the flu to decide to go to the doctor.

Skip to 5 minutes and 5 seconds And so, from that point of view, it's a relevant measure of the population who have the flu right now or the people who are affected by it sufficiently badly, even though it's not, as people have pointed out, a measure of necessarily everybody who's got the flu at the moment. So a lot of the learners are curious about how the adaptive model works. Oh, oh right, very good question Chanuki. No, that's a very, very good point. So how does it work? I mean, from a very abstract point of view, if you think of an adaptive model, I mean, obviously, it seems to be the opposite of a static model.

Skip to 5 minutes and 40 seconds So if we look for a moment at what a static model is and, particularly, in the context of Google Flu Trends, then this brings us back to how all this started. So Google teamed up with the CDC, and they were interested in correlating, in finding relationships between how many people have the flu, as measured, as Susie just explained, in CDC data, and what people are searching for online. So they were looking for correlations, or basically a relationship: a high correlation coefficient means that if one quantity tends to go up, then the second quantity tends to go up, and vice versa.
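As an illustrative aside, the correlation coefficient mentioned here can be computed in a few lines. The sketch below is in Python and uses made-up weekly numbers. The names `flu_cases` and `search_volume` and all the values are assumptions for illustration, not CDC or Google data, and this is not claimed to be how Google Flu Trends was built.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly figures, invented for this example only.
flu_cases = [120, 150, 310, 480, 520, 390, 210, 140]
search_volume = [30, 41, 70, 95, 110, 88, 52, 35]

# Computed once over the entire period: this is the "static" view.
r = pearson(flu_cases, search_volume)
print(round(r, 3))
```

A value close to 1 means the two series rise and fall together. A value close to -1 means one tends to rise when the other falls.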

Skip to 6 minutes and 21 seconds So they found that there are certain specific flu-related symptoms which people are searching for, and these terms seem to be closely related to the number of flu cases in the real world. So that's a correlation. That's something they've worked out using the entire time period, and this was basically the start of a very static model, which doesn't change. So, as we have seen in a number of videos this week, this is, to some extent, a problem, because human behaviour is changing over time. It's not the case that we always react or behave in the same way.

Skip to 6 minutes and 59 seconds And obviously, technology is also changing over time, changing biases and more people are using certain services and others stop using certain platforms and services, and this is quite problematic. So the adaptive part, or the model which makes use of adaptive features, tries to learn as it goes along this relationship again and again and again. Instead of actually using the entire time period in order to work out, for example, what is the correlation between one quantity, like the number of flu cases, and the second quantity, certain search activity, we actually use or look just at a certain time window for this specific relationship, and we neglect information which is earlier in time.
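The contrast between a static fit and a windowed fit can be sketched as follows. This is a minimal Python illustration under loud assumptions: the data are invented, the model is a deliberately simple one (flu cases approximated as a slope times search volume, with no intercept), and the helper names `fit_slope` and `rolling_slopes` are hypothetical. It is not the model used in the course or by Google.

```python
def fit_slope(search, flu):
    """Least-squares slope b for the model flu ~ b * search (no intercept)."""
    return sum(s * f for s, f in zip(search, flu)) / sum(s * s for s in search)

def rolling_slopes(search, flu, window):
    """Re-fit the slope on every run of `window` consecutive weeks."""
    return [
        fit_slope(search[i:i + window], flu[i:i + window])
        for i in range(len(search) - window + 1)
    ]

# Invented data: weekly flu cases, and searches per case that double
# halfway through (as if many more people had started searching online).
flu = [10, 30, 20, 50, 40, 60, 20, 50, 30, 60, 40, 70]
search = [2 * f for f in flu[:6]] + [4 * f for f in flu[6:]]

static_slope = fit_slope(search, flu)                    # one fit, whole period
adaptive_slopes = rolling_slopes(search, flu, window=4)  # one fit per window

print(round(static_slope, 3))
print([round(b, 3) for b in adaptive_slopes])
```

In this toy setting the static fit settles on a single compromise slope, while the windowed fits move from the old relationship to the new one as the window passes the change.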

Skip to 7 minutes and 50 seconds So we have a kind of sliding window which goes through our time series, and all the time we basically correlate or learn this relationship, and this we call an adaptive model, and we work out how the behaviour changed. And this can also address some of the points which Susie picked up on, basically that also the demographics of people searching online are changing, and more and more people are using, for example, a certain search engine. So this can actually keep track of all these different changes and can thereby improve the model by being a little bit flexible and adapting to whatever changes might happen in the real world. So what do "in sample" and "out of sample" mean?

Skip to 8 minutes and 39 seconds Oh right. OK, that's another great question. So let's start with "in sample." That's probably the easiest to explain. So "in sample," it comes back to what I just said, when we use the entire time period in order to correlate one quantity to another quantity, to build a model based on behaviour we see. Then we use all the data points which we have. So at every point in time during, for example, a 10-year period, we use information from this entire 10-year time period in order to say something about a very specific time period, for example, one month within these 10 years. So it is, to some extent, if you look from an application point of view, we would be cheating.

Skip to 9 minutes and 26 seconds We are not so precise in terms of what is future and what is past. So, if we-- after three years, we look at one specific month, and we use data on the entire 10 years, then, to some extent, we are looking-- we would look into the future, without actually having experienced the future. So we couldn't use this in a real world application or in a very robust model. So "in sample" basically means you have the entire sample, and you are using it in order to estimate the parameters of your model. "Out of sample" now actually tries to overcome this limitation.

Skip to 10 minutes and 0 seconds So you are trying to use at every point in time just the information which would have been available to you at that point in time. So, for example, if, in your 10-year period, you are just three years through, then you only use the three years which you have experienced in the past in order to say something about the next step. Let's think about our rolling model approach. Let's say we use three years, and we predict the next week, and then we roll this window over. For the following week we use the three-year period which starts one year-- sorry, one week after our data set started. So that's one way of explaining "out of sample."
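The rolling "out of sample" procedure just described can be sketched in Python. Everything here is an assumption made for illustration: the data are invented, a 4-week window stands in for the three years mentioned above, the model is a simple no-intercept slope, and the names `fit_slope` and `walk_forward_estimates` are hypothetical.

```python
def fit_slope(search, flu):
    """Least-squares slope b for the model flu ~ b * search (no intercept)."""
    return sum(s * f for s, f in zip(search, flu)) / sum(s * s for s in search)

def walk_forward_estimates(search, flu, window):
    """For each week t >= window, fit only on weeks [t - window, t),
    then estimate flu[t] from that week's search volume.
    No information from week t or later is used in the fit."""
    estimates = []
    for t in range(window, len(search)):
        b = fit_slope(search[t - window:t], flu[t - window:t])
        estimates.append(b * search[t])
    return estimates

# Invented data: searches per flu case double from week 6 onwards.
flu = [10, 30, 20, 50, 40, 60, 20, 50, 30, 60, 40, 70]
search = [2 * f for f in flu[:6]] + [4 * f for f in flu[6:]]

window = 4
estimates = walk_forward_estimates(search, flu, window)

# Compare each estimate with what was actually observed that week.
for week, (est, actual) in enumerate(zip(estimates, flu[window:]), start=window):
    print(week, round(est, 1), actual)
```

In this toy example the estimates go wrong for a few weeks right after the relationship changes, because the window still contains only old behaviour, and recover once the window has rolled past the change.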

Skip to 10 minutes and 42 seconds But obviously, in a broader sense, you also want to use "out of sample" tests in order to check what you actually have experienced "in sample." So another approach is actually if you train a model or if you want to measure and model certain things in the real world, then you build a model "in sample," but you hold back a little bit of your data set at the end-- the more the better, because these are actually data points where you can test your model without the model actually having used this data in order to calibrate and to measure certain parameters. So that's something which is important and makes a huge difference.
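Holding back the end of a data set can be sketched like this. It is a minimal Python illustration with invented numbers. The three-week hold-out size, the no-intercept slope model and the helper names `fit_slope` and `mean_abs_error` are all assumptions chosen to keep the example short.

```python
def fit_slope(search, flu):
    """Least-squares slope b for the model flu ~ b * search (no intercept)."""
    return sum(s * f for s, f in zip(search, flu)) / sum(s * s for s in search)

def mean_abs_error(search, flu, b):
    """Average absolute gap between observed flu and the estimate b * search."""
    return sum(abs(f - b * s) for s, f in zip(search, flu)) / len(flu)

# Invented weekly data, in time order.
search = [20, 60, 40, 100, 80, 120, 50, 90, 70, 110]
flu = [11, 29, 21, 52, 38, 61, 27, 43, 37, 52]

holdout = 3  # keep the last three weeks away from the fitting step
train_s, test_s = search[:-holdout], search[-holdout:]
train_f, test_f = flu[:-holdout], flu[-holdout:]

b = fit_slope(train_s, train_f)                        # calibrated "in sample" only
in_sample_error = mean_abs_error(train_s, train_f, b)
out_of_sample_error = mean_abs_error(test_s, test_f, b)

print(round(in_sample_error, 2), round(out_of_sample_error, 2))
```

The out-of-sample error is the more honest number, because the slope was fitted without ever seeing the held-back weeks.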

Skip to 11 minutes and 20 seconds So if something holds "out of sample," this is much more credible than if you have just experienced or measured this "in sample." Great, thanks both. It's been another good week. So what's coming up next? So next week we're going to be looking at how you can measure happiness. And so happiness is something we know matters to all of us, but traditionally, because it's a subjective feeling, it's been quite difficult to measure. And so we're going to be looking at some new approaches that help us address this question using both mobile phones and also data we have from what people are doing online. Yes, absolutely, these are topics we will touch on.

Skip to 11 minutes and 59 seconds But also we will talk about a very important issue, and that's privacy and ethics. I mean, obviously, in this ocean of big data, this is something we need to discuss at some point. This will come up next week. Excellent-- looking forward to it. I'll see you next week. Bye. Bye. See you again. Bye bye.

Week 6 round-up

In Week 6, we began to explore how big data might help us find ways to improve society’s health. Here’s a brief summary to help you prepare for Week 7.

You learned about Google Flu Trends and why, from time to time, these estimates were not so accurate. We also discussed how important it is to be cautious about how you conduct analyses with big data and to be aware of biases that may be present in the data.

You also heard from Bruno Gonçalves on how analysing people’s travel patterns can help predict when the peak of a flu pandemic will occur. He also discussed how data from social networks suggest that there’s a limit to the number of friends we can handle.

Finally, you completed the R code you’ve been writing to download data on what people are looking for on Wikipedia – well done! It was great to read all your ideas as to why there was a big spike of interest in “Friday” in 2011. We think it’s due to a certain song, which some of you found too…

Keep up all the great work!
