Thank you all for coming. So my name is Chester Curme and today I’m going to talk about why we’re interested in online search data and how these data can tell us about events in the real world. So to motivate our project, suppose you have a question. Like this guy up here clearly has a question. And if you’re like me, the first thing you might do is you Google it. So you look for information on Google. Suppose you’re wondering if you should come to some conference that you’ve heard about.
You enter in some queries and then you find some information, and based on the information you find, you proceed with some action or decision, like you all come to the conference, OK. So internet search engines such as Google tend to provide extensive data on their search activity. And the interesting question which has emerged in recent years is whether or not we can use these search data to get a handle on this process - the information gathering stage - and whether or not we can use these information gathering signals to anticipate events in the real world.
So there’s a lot of different types of events that people could care about, but we chose to study financial market movements because there’s an abundance of data, they’re very easily quantified, and moreover, people are interested in them in general. So this is the big question, ‘can we use Google Trends data to anticipate financial market movements?’ Here’s an image of a Google Trends time series, just an example. It’s a time series for the word ‘bank’. And you can see that it may be non-stationary. It has some trends. And at any rate, we’re not exactly interested in the absolute search volume for some word, but rather, in how changes in these search volumes might be related to subsequent financial market movements.
So we have to difference this time series somehow. Right, and we’re going to difference it according to a simple moving average model. And in fact our… as you just saw, our method can be interpreted as a simulated trading strategy. So here’s how the strategy goes. It’s Sunday of week t, and you look up the Google Trends search volume for your favourite word. There it is in green. And then you compare it to the mean over the previous delta t weeks. So here I used delta t equal to 3.
And now if the search volume this week is lower than the mean, then you buy an index on Monday, and you sell it back the next Monday, picking up the return of the index in that week. And conversely, if the search volume this week is higher than the mean, then you sell the index on Monday, and you buy it back the next Monday, picking up minus the return that week. So we chose to trade the S&P 500 in this analysis, since we were looking at US-based Google Trends search volume. So in this way, any keyword that you think of can be associated with a cumulative return from this imaginary trading strategy.
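The strategy just described can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the talk: the function name, the input layout, the choice to compound simple weekly returns, and the decision not to trade when the volume exactly equals the mean are all assumptions made for the example.

```python
def strategy_cumulative_return(search_volume, next_week_returns, delta_t=3):
    """Cumulative return of the simulated strategy for one keyword.

    search_volume[t]     -- search volume observed on Sunday of week t
    next_week_returns[t] -- index return from the following Monday to the
                            Monday after (assumed layout, simple returns)
    delta_t              -- number of previous weeks in the moving average
    """
    cumulative = 1.0
    for t in range(delta_t, len(search_volume)):
        # Mean search volume over the previous delta_t weeks.
        baseline = sum(search_volume[t - delta_t:t]) / delta_t
        if search_volume[t] < baseline:
            position = 1    # volume fell: buy, pick up the return
        elif search_volume[t] > baseline:
            position = -1   # volume rose: sell, pick up minus the return
        else:
            position = 0    # tie: no trade (assumption, not from the talk)
        # Compounding is an assumption; summing returns is another option.
        cumulative *= 1.0 + position * next_week_returns[t]
    return cumulative - 1.0
```

Running this once per keyword, with the same index returns each time, gives one cumulative return per keyword, which is the quantity ranked in what follows.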
So the logical question is ‘what is the magic word that has the highest return?’ So you can run this simple analysis on thousands of keywords and look at the winners. And here are the top five. So apparently, the magic word is ‘show’, with a return of 253%.
So clearly, you can’t learn anything from this analysis. Because it’s not clear what, if anything, these keywords have to do with each other. And it might be a bad question to begin with because we could just be seeing the artefacts of noise. That is, this could just all be statistical fluctuations and we’re not really learning anything. So this is a common problem. And the remedy is often to reduce the dimensionality of your data set. And that is, when we’re dealing with keyword data, we want to take those keywords and we want to aggregate them into groups or topics. And there’s different methods to do this.
But especially when you’re using keyword data, you want the topics you find to have some semantic interpretation. So you want words that mean the same thing to be grouped together. And a popular algorithm to accomplish this is known as Latent Dirichlet Allocation or LDA. So LDA leverages the simple observation that words with related meanings tend to occur in documents together. So for example, these are just toy documents, but the word ‘debt’ is more closely related to the word ‘housing’ than it is to the word ‘orange’. And this is reflected in the co-occurrence of these words across documents.
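To make the co-occurrence idea concrete, here is a small sketch that counts how many documents each pair of words shares. It is not LDA itself, only the observation LDA builds on, and the three toy documents are hypothetical stand-ins, since the ones on the slide are not reproduced in this transcript.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each pair of words, the number of documents containing both."""
    counts = Counter()
    for doc in documents:
        # Use the set of distinct words so a pair counts once per document.
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy documents, invented for illustration.
toy_documents = [
    "debt housing crisis",
    "housing debt crisis debt",
    "orange apple banana",
]

counts = cooccurrence_counts(toy_documents)
print(counts[("debt", "housing")])  # number of documents containing both words
print(counts[("debt", "orange")])   # a Counter returns 0 for unseen pairs
```

In this toy corpus ‘debt’ and ‘housing’ share documents while ‘debt’ and ‘orange’ never do, which is the kind of signal a topic model can pick up without knowing what the words mean.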
So in this way we can hand a computer a large corpus of documents and the computer can figure out which words are related to each other without knowing anything about the words themselves per se. And you might point out at this point that we could just go to a thesaurus and find groups of related keywords that way. And that’s absolutely true. But for our purpose, we wanted a dynamic corpus, that is, a corpus that is continuously updated with current events, with events that we care about. So for this purpose, we went to Wikipedia. So now Wikipedia is our collection of documents. And we train an LDA model on the English Wikipedia.
So the LDA model goes to every Wikipedia page and says, you are a document, where a document is a probability distribution over some number of topics. And each topic has some probability of generating various words. So if I actually run an LDA on these three documents, I find two topics, one relating to fruit and the other relating to debt, housing and crisis. So this is obviously not the real Wikipedia. So you want to see the results from the real Wikipedia. And here’s a sample of the results. These are five arbitrary topics we found on Wikipedia. There’s something relating to academia or education. There’s business, another one relating to education, health care, maybe.
So you or I might disagree about what exactly these topics are referring to. So in order to settle this, we appealed to workers on the service Amazon Mechanical Turk. So I won’t go into too much detail about what Amazon Mechanical Turk is. But suffice it to say, it’s a service whereby you can farm out menial tasks to people all over the world for a small fee. So in this case, we paid workers to essentially take a vote on the names of these topics. And here are some of the labels that they came up with. And you can see that if two different topics were given the same label, I distinguished them with a Roman numeral. OK, that’s fine.
So here’s the full list of topics that we recovered from Wikipedia. This is my very wordy slide. And so I don’t expect you to read everything. But you can see that we’re recovering some meaningful topics. So we have energy, art, botany and fruit, for example. So we’re really running the whole gamut of human experience, which is sort of our objective, it’s to get some unbiased sample of topics. Now, remember that each one of these topics is a collection of words. We took the top 30 words from each LDA topic and said, OK, you’re a topic. And each one of those words can be associated with the cumulative returns from that trading strategy I mentioned.
So for example, here’s the distribution of returns from the topic food. The median is exactly zero, in this case, actually. And we can represent this distribution as a box plot, that little guy. Now, if you’ll permit me, I’m going to rotate this box plot and show you the box plot for all the topics. So here they are. The topics are arranged on the horizontal axis in order of their mean return. And on the vertical axis, you can see the actual distribution. And now, in order to get a sense of which of these topics might be related to subsequent financial market movements, we have to compare to some benchmark.
So we compare to a random strategy in which you trade randomly every week without any regard to what’s happening in the world around you. And when we compare to this random strategy, we find that two topics are significantly different, and those are politics 1 and business. So I shade them here by their mean return. So this analysis was conducted for a single value of delta t, which was delta t equals three weeks. And so we have to test how sensitive these results are to that lone parameter in our trading strategy. And when we do that, we get a plot that looks like this. So it’s kind of dark. But delta t is on the vertical axis.
And I shade in the box for a topic if the distribution of returns for that topic, for that value of delta t, was significantly different from the random strategy. And here we find politics 1 and business are again significant, and also politics 2 seems to be consistently significant here. And I’ll mention that there are 55 topics here and 15 values of delta t. So we’ve just done 825 statistical tests. And I’ll just mention that we do correct for the number of statistical tests when finding these results. Now, lastly, we can study how these trends vary in time by repeating the exact same analysis in moving time windows.
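The random benchmark and the test count mentioned above can be sketched as follows. The talk does not say how the random trades are drawn or which multiple-testing correction was applied, so the equal-probability buy or sell rule and the Bonferroni correction below are assumptions chosen purely for illustration, and both function names are hypothetical.

```python
import random


def random_strategy_return(next_week_returns, rng):
    """Cumulative return when buying or selling at random each week.

    Buy and sell are drawn with equal probability (an assumption), and
    simple weekly returns are compounded (also an assumption).
    """
    cumulative = 1.0
    for weekly_return in next_week_returns:
        position = rng.choice((1, -1))  # 1 = buy, -1 = sell
        cumulative *= 1.0 + position * weekly_return
    return cumulative - 1.0


def bonferroni_threshold(alpha, n_topics, n_delta_t):
    """Per-test significance level under a Bonferroni correction.

    Bonferroni is assumed here; the talk only says that the number of
    tests was corrected for.
    """
    return alpha / (n_topics * n_delta_t)


# 55 topics times 15 values of delta t gives the 825 tests from the talk.
threshold = bonferroni_threshold(0.05, n_topics=55, n_delta_t=15)

# Repeating the random strategy many times, e.g. with random.Random(seed),
# gives a reference distribution to compare each topic's returns against.
```

Under these assumptions, a topic would only be shaded in if its test against the random distribution came in below `threshold`, which is far stricter than an uncorrected 0.05.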
So here’s the distribution of returns for those three topics in 2004 to 2007. Here, business and politics 1 are significant. But as we enter the financial crisis, the returns from these topics grow tremendously until after the financial crisis, when we exit, the returns diminish again. So what we’re seeing here is that increases in search volumes for topics related to politics and business or finance tend to precede falls in the stock market. And there are different possible explanations. One possible explanation is that humans are loss averse. So, say you’re concerned about the state of the economy or your finances.
You might look for information about your investments or about what’s happening in the stock market or you might even appeal to politicians whom you think have some large impact on the state of the economy. And we hypothesise that we’re seeing traces of these information gathering signals in these results. So just to summarise, search data from internet search engines, like Google, provide the intriguing possibility of studying the information gathering processes that precede real world events. LDA is an effective means to reduce the dimensionality of keyword data when performing similar analyses so that you study underlying semantic factors of importance.
And lastly, in this particular analysis, we find that search volumes relating to politics and business can be linked to subsequent stock market moves, at least in the S&P 500. So thank you all, and I’m happy to take any questions.