Skip to 0 minutes and 7 seconds So a couple of weeks ago we had the European elections. Now during the campaign, before the elections, political parties organise meetings to share ideas and programmes, provided they have any, which is not always the case. But also, quite importantly, to see how many supporters they can gather. Now just a couple of weeks before the elections, in Italy, something quite interesting happened where two of the major political parties– which are labelled A and B, just to avoid any discussion– organised meetings in the same place, but on two different days. Now on the news in the days following these events, there was quite a lot of debate on which of the two parties had managed to gather the most people.
Skip to 0 minutes and 55 seconds So this is a very simple and motivating introductory example on why we might be interested in counting crowds. You can of course think of many more examples like helping with the planning and design of people’s flow at events, security reasons, emergency situations, how many doctors, how many ambulances, how many policemen to send, think of a protest, or a fire alarm, looking at what happened today. You might be interested in knowing where people are during the day because demographic information through census data tells us where people live, but that’s not where they are.
Skip to 1 minute and 33 seconds So to gain some understanding in this context, I will show you some data that was made publicly available at the start of 2014 as part of a big data challenge set up by Telecom Italia, an Italian phone company. The data relate to Milan, in the north of Italy, and cover two months, November and December of 2013. So the first piece of information we have is the volume of phone calls and SMS over these two months. The volume of phone calls is aggregated in time intervals and is separated into a grid with 10,000 cells.
Skip to 2 minutes and 6 seconds In this slide, you can see a snapshot of this data where red indicates a higher volume and the red blob overlaps very well with the actual city centre. The second piece of information we have is the volume of access to the internet through smartphones. Again it’s very similar to the previous data, aggregated in time intervals and separated into 10,000 cells. And the last bit of information that we have is the complete set of geo-localised tweets that have been tweeted in these two months. So it’s a complete set of geo-localised tweets– roughly 400,000 tweets. So summarising, we have three bits of information– the volume of phone calls.
Skip to 2 minutes and 48 seconds From now on, when I say volume of phone calls I will actually mean phone calls and SMS. The volume of access to the internet through mobile phones, and the complete set of geo-localised tweets. Now how can we use this information to approach our problem of counting crowds? The idea that Suzy, Tobias and I had is that there are particular events which happen in localised spaces, and where the number of people is easily recorded and publicly available. In particular, we thought of football matches. Football matches happen in localised spaces– the football stadium. The number of people attending is easily determined by the number of tickets that have been sold.
Skip to 3 minutes and 33 seconds And so it seemed to us that football could provide a perfect case study to calibrate our model and gain some understanding. And also, very importantly, the football World Cup starts tonight, so this is the perfect day to talk about football. So we have this information, we have this data, we can plot the time series, for example, of the volume of phone calls inside the football stadium. And this is how it looks. So you can see a very clear and evident pattern. You have 10 spikes. And in these two months there have been precisely 10 football matches which are labelled in the figure. So the information seems to be there.
Skip to 4 minutes and 15 seconds A very similar time series could be built by looking at the volume of access to the internet, and I will not show it at the moment. And then if we use the last piece of information, which is the number of tweets whose coordinates are inside the stadium, even if the number of tweets isn’t too large, we can still pick up the 10 football matches. Now we have this information. We have the number of people that attended each match, because that’s easily available on the internet. So we can see if we can fit a model to it.
Skip to 4 minutes and 47 seconds So if we look at our first variable, the volume of phone calls versus the real attendance, you can see the points in the plot, and the blue line is a linear fit to these points, which is statistically significant. We can then change our predictive variable. Say we use the volume of access to the internet, and this is how it looks. And again, we have a linear fit. And again, it is statistically significant, with a very large R squared actually. And then the last variable we can use is the number of tweets inside the football stadium. And again, we have very similar behaviour, with a statistically significant fit.
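The kind of fit described here can be sketched in a few lines of Python. This is only a minimal illustration: the numbers are made up and are not the Milan data, the names `activity`, `attendance`, `fit_line` and `r_squared` are hypothetical, and the sketch does not test statistical significance.

```python
# Illustrative, made-up numbers: NOT the Milan data from the talk.
# One entry per match: mobile activity recorded in the stadium cell,
# and the official attendance figure.
activity = [120.0, 340.0, 410.0, 150.0, 500.0, 280.0, 460.0, 200.0, 380.0, 310.0]
attendance = [21000.0, 45000.0, 52000.0, 24000.0, 61000.0,
              38000.0, 57000.0, 30000.0, 48000.0, 41000.0]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


slope, intercept = fit_line(activity, attendance)
r2 = r_squared(activity, attendance, slope, intercept)
print(f"attendance ~ {intercept:.0f} + {slope:.1f} * activity, R^2 = {r2:.3f}")
```

An R squared close to 1 means the line explains most of the variation in attendance. With only 10 matches, a significance test on the slope would still be needed before trusting the fit.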
Skip to 5 minutes and 29 seconds OK. So we have seen that using three different predictive variables, we can build three different statistically significant models. Now we want to ask the question, can we extend our model using more than one predictive variable? Say two, or even all three of them? In principle, this question makes sense. But we know that we need to be careful because we might run into problems, namely multicollinearity, because we expect that people will be tweeting from their smartphones. So the number of tweets and access to the internet will probably be correlated. And indeed we have seen that if we add more than one predictive variable, the fit is not statistically significant anymore.
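One simple way to check for the multicollinearity mentioned here is to look at how strongly two predictors are correlated with each other. The sketch below is a hedged illustration, not the analysis from the talk: the numbers are made up, and the names `internet`, `tweets` and `pearson` are hypothetical. For exactly two predictors, the variance inflation factor is 1 / (1 - r^2), where r is their Pearson correlation.

```python
import math

# Illustrative, made-up numbers: NOT the Milan data from the talk.
# One entry per match for each candidate predictor.
internet = [120.0, 340.0, 410.0, 150.0, 500.0, 280.0, 460.0, 200.0, 380.0, 310.0]
tweets = [14.0, 33.0, 45.0, 13.0, 52.0, 30.0, 44.0, 22.0, 41.0, 29.0]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


r = pearson(internet, tweets)
# With two predictors, the variance inflation factor depends only on r.
vif = 1.0 / (1.0 - r ** 2)
print(f"correlation between predictors = {r:.3f}, VIF = {vif:.1f}")
```

When the correlation is close to 1, the second predictor carries almost the same information as the first, so adding it inflates the uncertainty of the coefficients without improving the fit much.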
Skip to 6 minutes and 12 seconds And running another comparison test tells us that we are not actually adding information to our system by adding predictive variables. We can quickly see this from the following two plots. In this one we have, on the x-axis, the fitted values as obtained using the volume of access to the internet. And on the y-axis we have the real attendance. So ideally, we want to be on the dashed line, which is the diagonal, where the two values match. Now, if we add one variable, what we expect is that the points will get closer to the diagonal because our model gets better. If we do it, this is what happens. So the points don’t really get closer to the diagonal.
Skip to 6 minutes and 58 seconds So even just visually, we can see that adding extra predictive variables does not add any information to our model. And we have done this for every possible combination of the three predictive variables that we have. So the last thing we can do is make predictions. Now, in this context, when I talk about predictions, I actually mean it in the sense of inferring, live, the number of people attending, given one of the three predictive variables. OK, so we have 10 football matches. We can fit our model with only nine of them. Say we leave the last one out. Now we have one point we can try and predict the attendance for.
Skip to 7 minutes and 41 seconds And then we can leave a different one out every time. So we can repeat this 10 times and plot it to see how we do. This is a way of testing the effectiveness of our model as well. So we’ll be looking at a plot of this form. Again we have the real attendance on the x-axis, and the predicted one on the y. So the diagonal is where an ideal model would be. So points that lie above the diagonal, like the one that you can see, will be overestimated because we are predicting 60,000 people, but actually only 20,000 people attended. And similarly, points below the diagonal are underestimated.
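The leave-one-out procedure just described can be sketched as follows. Again, this is a minimal illustration under stated assumptions: the numbers are made up rather than taken from the Milan data, the names `fit_line` and `leave_one_out` are hypothetical, and the sketch returns point predictions only, without the intervals that would accompany them in a full analysis.

```python
# Illustrative, made-up numbers: NOT the Milan data from the talk.
activity = [120.0, 340.0, 410.0, 150.0, 500.0, 280.0, 460.0, 200.0, 380.0, 310.0]
attendance = [21000.0, 45000.0, 52000.0, 24000.0, 61000.0,
              38000.0, 57000.0, 30000.0, 48000.0, 41000.0]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out(x, y):
    """Predict each point from a line fitted to all the other points."""
    predictions = []
    for i in range(len(x)):
        # Drop match i, fit on the remaining matches.
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        # Predict the attendance of the match that was left out.
        predictions.append(intercept + slope * x[i])
    return predictions


predictions = leave_one_out(activity, attendance)
for actual, predicted in zip(attendance, predictions):
    # Above the diagonal means overestimated, below means underestimated.
    label = "over" if predicted > actual else "under"
    print(f"actual {actual:8.0f}  predicted {predicted:8.0f}  ({label}estimated)")
```

Each prediction comes from a model that never saw the match it is predicting, which is what makes this a fair test on unseen data.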
Skip to 8 minutes and 21 seconds So if we run this over the 10 data points that we have, this is what we get. So the points, together with their 95% prediction intervals, lie quite close to the diagonal, and the intervals intersect the diagonal, which means that our model is performing very well, even on unseen data points– the points on which it has not been trained. So this is a very preliminary study and it would need to be extended. First of all to more data points, different types of events, different locations, different cities, maybe even different countries, to see how specific this is. OK.
Skip to 9 minutes and 7 seconds But it seems to be quite promising in trying to answer the question I posed at the start, how many people attended the two different events. Thank you. [APPLAUSE]
Counting crowds with mobile phone and Twitter data
When protests or political gatherings take place, how do organisers estimate how many people have attended? And why is this number often different to what the police report?
In this talk, Federico Botta explains how we can use data from mobile phones and Twitter to make surprisingly accurate estimates of the number of people in a given location at a given time. Might this provide a new answer to the age-old question of how to count crowds?
Federico Botta is a PhD student at the University of Warwick. His interests range from the analysis of complex and real-world networks to computational social sciences, with a strong focus on large behavioural datasets.
© Warwick Business School, The University of Warwick