

Identifying individuals in big data

Is it really impossible to re-identify an individual from supposedly "anonymised" big datasets?
6.8
Welcome back. We have talked a lot about the possibilities of using these new kinds of online data to measure a little better what is going on in the world right now and, in some cases, even to predict what humans might do in the near future. So let’s step back for a moment to think about the risks in this new world. Data generated by our everyday interactions gets recorded by a number of institutions and companies. Obviously, Google knows a lot about our interaction with the search engine, maybe a lot about our email communication because we are using Gmail, and holds many, many other data sets based on our interactions with services provided by this company.
62.9
So data about us on an individual basis is very valuable to these companies. It can also pose a risk, in particular if some of these data sets leave the company. We have seen that a service like Google Trends provides us with the opportunity to learn on a very large scale what people are interested in. This doesn’t reveal any information about you or your neighbour, what you searched for last night or what your neighbour emailed yesterday morning. However, some data sets are provided in a very raw format, reveal data on a very individual basis and can, in some cases, even be accessed publicly.
114.1
So there was a very interesting study performed recently by Yves-Alexandre de Montjoye from the MIT Media Lab, together with a number of colleagues, including Sandy Pentland. This is a study published in Science. They looked at a very sensitive data set - or at least all of us would consider it to be very sensitive - and these are financial records, our credit card records. Credit card records are made available to a number of external institutions and stakeholders in a format which doesn’t necessarily allow you to re-identify individuals. These data sets are provided in an anonymous fashion. However, as we have seen in many, many other data sets, it is really hard to anonymise data perfectly.
173.8
And so this group of scientists looked at the question of to what extent we can re-identify individuals based on the places, the times and the prices of their transactions. So to be clear, there were no personal data available in this data set, no names, no telephone numbers, no account numbers. They only used the information on where, on a given day, you bought individual items, and used it to find the minimum number of data points needed to re-identify an individual. What they found is quite striking. It is enough to have four observations in space and time to uniquely re-identify 90% of the credit card users.
237.6
They used a sample of 1.1 million people over a period of three months. So when they had the information on when an individual was buying, for example, a coffee and maybe on another occasion was buying food in a restaurant, the location of the restaurant and maybe two more observations, then they were sure that this person could be linked to one existing stream in their data set, allowing them to reveal all other transactions which this person had made over time. So this can be quite scary and will probably prompt us to think about how we use, and how we interact in the future with, all these technological devices.
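The idea that a handful of place-and-time observations can single out one person can be illustrated with a minimal sketch. The toy data, the function name `unicity` and the exact-match rule below are assumptions made for illustration only; this is not the method or the data of the study, just a small instance of counting how often k known points fit exactly one trace.

```python
from itertools import combinations

# Hypothetical toy data: each pseudonymous user has a set of (shop, day) points.
traces = {
    "A": {("cafe", 1), ("bakery", 2), ("gym", 3)},
    "B": {("cafe", 1), ("bakery", 2), ("cinema", 3)},
    "C": {("cafe", 1), ("bookshop", 2), ("cinema", 3)},
}

def unicity(traces, k):
    """Fraction of k-point subsets of a user's trace that match only that user."""
    unique = 0
    total = 0
    for user, points in traces.items():
        for known in combinations(sorted(points), k):
            known = set(known)
            # Every user whose trace contains all k known points is a candidate.
            candidates = [u for u, pts in traces.items() if known <= pts]
            total += 1
            if candidates == [user]:
                unique += 1
    return unique / total if total else 0.0

for k in (1, 2, 3):
    print(k, round(unicity(traces, k), 2))
```

In this toy example, the share of uniquely identifying subsets rises as more points are known, which mirrors the intuition behind the result described above. Real data would need a tolerance in space and time rather than exact matches.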
295.5
It is probably worth thinking about ways to avoid revealing data which we don’t want to be revealed. And this credit card example highlights one important issue. If there is a second set of observations from a different data set which tells the world where you are and when, then this can be linked, in particular, to anonymous data sets in order to identify you. So let me give you one example. If you are using Twitter, for example, and you have switched on location-based services, then whenever you tweet you are, on an ongoing basis, revealing where you are at which point in time.
344.7
So it is easy to find four observations in time and space which could be linked, for example, to your history of credit card transactions. However, this example also highlights another major risk, where you are directly revealing where you are and when. You can probably think of the possible risks this comes with. If you are tweeting, for example, that you are really enjoying your holiday in a country which is very, very far away, then you’re revealing that you are not in your home country, you’re not in your hometown, and in particular, you are not at home.
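The linking step described here can be sketched as a simple set overlap. Everything in this example is hypothetical: the pseudonyms, the places, the function name `link` and the threshold of four matching points are illustrative assumptions, and exact matching of place and day is a simplification.

```python
# Hypothetical anonymised card traces: pseudonym -> set of (place, day) points.
anon_traces = {
    "card_001": {("cafe", 1), ("restaurant", 2), ("bookshop", 4),
                 ("station", 5), ("pharmacy", 6)},
    "card_002": {("cafe", 1), ("restaurant", 2), ("gym", 4), ("station", 5)},
}

# Hypothetical public, geotagged posts by one named person.
public_posts = {("cafe", 1), ("restaurant", 2), ("bookshop", 4), ("station", 5)}

def link(public_points, anon_traces, min_matches=4):
    """Return pseudonyms whose trace shares at least min_matches public points."""
    return [
        pseudonym
        for pseudonym, points in anon_traces.items()
        if len(public_points & points) >= min_matches
    ]

print(link(public_posts, anon_traces))
```

If exactly one pseudonym comes back, every other transaction in that trace can be attributed to the person who wrote the posts, even though the card data contains no names.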
391.2
However, if you are using location-based services on an ongoing basis, it is trivial, as you might imagine, to find out where your home is. So now you can put two and two together: you are publicly revealing that you are not at home and, at the same time, publicly revealing where your home is. So all these data sets with all these details about individuals will hopefully prompt us to think twice about how we use them and about how society might need to give itself new rules on how to deal with these data sets.

We often hear that big data sets which describe our behaviour have been “anonymised” to protect our privacy. Does this really work? Is it really impossible to re-identify an individual from these datasets?

Watch this video to hear about new research that suggests that most individuals can be uniquely identified simply from the time and location of four credit card transactions. What consequences does this have for our privacy?

This article is from the free online

Big Data: Measuring And Predicting Human Behaviour

Created by
FutureLearn - Learning For Life
