Skip to 0 minutes and 8 seconds I think, as scientists, we often feel like a little kid in a playground, poking a bug for the very first time, fixated and amazed at what this bug is doing. When my sister, Monica, was four years old, she came up with her very first scientific theory. She noticed some families have lots of children, while other families have only one or two. Why, she thought, did this happen? At the time, our own family had six kids. Some of her friends had eight. Others had only one or two. This was her data set. Her little toddler mind started wondering, what’s the connection? What connects women and babies? Soon she figured it out. She realised what connects women and babies. Umbilical cords.
Skip to 1 minute and 1 second Yes, umbilical cords is the connection. But how do umbilical cords affect the number of children? Again, her little four-year-old mind started processing all the data it could. Soon she figured it out. She figured out how it worked. She had to tell someone. So off she went with her two sisters into their bedroom, where they held a little toddler symposium, her audience of two eagerly awaiting this revelation on life. Monica explained to them that girls are born with a roll of umbilical cord all ready to go. Each girl has a different length of cord. As they have babies, they use up their cord until there’s none left. And that is how you get families of different sizes.
Skip to 1 minute and 48 seconds Now for toddlers this is quite profound. Now they understand the meaning of life. But I’m sure you can guess the flaw with Monica’s little theory. It’s not that the theory is wrong. It’s that she’s committed the cardinal sin of data science. She’s assumed that because umbilical cords are a connection, they’re also a cause. Connection does not equal cause. So let’s see if we can find some data to help little Monica out. Just think what we all here do online. We send emails. We watch videos on YouTube. And we all search for information on Google. We search on how to do things, on things we’re interested in, and we even ask questions we’re not comfortable asking another person.
Skip to 2 minutes and 40 seconds All this searching online is being recorded. And we can see the volume of people searching for phrases online. Let’s have a look. Let’s try the word hurricane. Wow, we get these massive spikes in volume of people searching for hurricane. And these spikes occur at around the same time as hurricanes in the real world. That’s amazing. Here at the Data Science lab, we’re figuring out how to use online data just like this to discover incredible things about the real world. So let’s give it a shot. Now little Monica, the toddler scientist, had a data set of only a handful of families. But we’ve got the birth rates per 1000 people across America.
Skip to 3 minutes and 27 seconds Here at the Data Science lab we’re building a data engine that we call Tree. And in one line of code, Tree runs all the way across the Atlantic Ocean, knocks on Google’s door, and asks, what are US states with more births searching for online? It comes all the way back again with the answer. Google’s data engine, Google Correlate, gives us the top 100 search terms that correlate with birth rates in America, all sorted by correlation. People are searching for things like, five languages of love, pregnancy calendar, and hospital bag checklist. States with more births are searching more for hospital bag checklist. But we want to know what’s the main topic in all these phrases.
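The ranking step described here, scoring each search term by how closely its state-by-state volume tracks the birth rate and then sorting, can be sketched roughly as follows. This is only an illustration and not how Tree or Google Correlate is actually implemented: the function names `pearson` and `rank_terms`, the five hypothetical states, and every number below are made up for the example, and Pearson correlation is assumed as the measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_terms(target, search_volume, top_n=100):
    """Return (term, r) pairs sorted from most to least positively correlated."""
    scored = [(term, pearson(target, volumes))
              for term, volumes in search_volume.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Illustrative, made-up data: one value per hypothetical state, same order throughout.
birth_rate = [10.0, 11.5, 12.0, 13.5, 15.0]  # births per 1000 people
search_volume = {
    "hospital bag checklist": [0.8, 1.0, 1.1, 1.3, 1.6],
    "pregnancy calendar": [1.0, 0.9, 1.2, 1.3, 1.5],
    "cat not eating": [1.5, 1.4, 1.0, 0.9, 0.7],
}

for term, r in rank_terms(birth_rate, search_volume):
    print(f"{term}: r = {r:+.2f}")
```

Passing the negated birth rates as `target` reverses the ranking, which corresponds to asking what states with fewer births search for. A high score only says that two series move together across states; it says nothing about which one, if either, causes the other.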
Skip to 4 minutes and 16 seconds So we took the top 30 and asked random people around the US, what is the predominant topic in this list of phrases? And we limited them to only one word. 72% of people said these words were about pregnancy. People having babies are leaving a data trail online. Theoretically, we could use this to estimate how many babies are being born and where. All right, what happens if we ask Google the opposite question? What are US states with fewer births searching more for online? We get the top 100. We take the top 30. And we ask, what’s the most prominent topic? And it turns out, states with fewer births are searching more on cats.
Skip to 5 minutes and 8 seconds They’re typing in things like, Friskies, cat not eating. And you know how cats, all of a sudden, rush across the room, 50 miles an hour, looking around everywhere, and all of a sudden pouncing on a random spot in the room. People are searching for ghost mice. Now little Monica has had the pleasure of holding four of her newborn brothers and sisters. Even she understood that these were new scientists ready to explore the world. What Monica didn’t know - and she’s very lucky not to know this - she didn’t know that these budding explorers of the new world sometimes die. America has one of the lowest infant mortality rates in the world. But it’s slipping.
Skip to 5 minutes and 59 seconds If we rank all the nations by infant mortality rates in 1960, America is the 12th best country. By 2013 they are the 30th. Perhaps online data can show us more of its story. Now, on Google, states with lower infant mortality rates are searching more for food and frosting. Now we certainly can’t say why, but it could be as simple as they have more money and they’re searching more for fancy stuff to cook. But what about states with more babies dying? What are they searching for? They’re searching for credit and loans. And they’re also searching for STDs. States with more babies dying are typing in things like, loans for people with bad credit, I need a loan, and pictures of STDs.
Skip to 6 minutes and 58 seconds Now we’re certainly not suggesting that interest in bad credit somehow causes your babies to die. It could be the other way around. People whose babies are dying could be very desperate for money to save their lives. What we can say is there might be a connection between infant mortality rates, credit, and STDs. We found this connection with online data.
Skip to 7 minutes and 25 seconds Now, right now Tree is only relying on data from Google. But the wonderful team here at the Data Science lab are building incredible new data sources of huge amounts of data from Wikipedia, and soon Flickr and Twitter. The potential for all of this data is enormous. Just think about everything we’ve been talking about the last couple of days. But imagine, instead of birth rates, we used crime rates. We could estimate the level of crime, perhaps down to the city level. The US government could get a real time view of crime around the country. They could better allocate resources, keep people safer, and maybe keep people out of prison.
Skip to 8 minutes and 7 seconds Here at the Data Science lab we are working on something I believe is truly amazing. We are helping people to make incredible discoveries and tell meaningful stories with online data. We are doing this by building a data engine that we call Tree. Tree makes it easy and fun to work with online data. Inside each and every single one of you here is a little child scientist. What will you discover next, and what story will you tell? Thank you very much.
Telling stories with data
What do Internet users in US states with lower birth rates search for? What about states with higher infant mortality rates?
In this entertaining talk, Adrian Letchford concludes our course by revealing how the online search behaviour of US citizens varies according to their socio-economic situation.
Adrian Letchford is a research fellow in the Data Science Lab at Warwick Business School. With a background in computer science, Adrian has worked on a wide range of topics, including artificial intelligence, finance, veterinary science and Australian national security. Adrian’s current work focuses on building software tools that connect online behaviour to the real world.
© Warwick Business School, The University of Warwick