
Anonymisation fails

Data anonymisation aims to process a dataset containing personal data so as to create a new dataset from which that personal data cannot be recreated. That is the aim, but anonymisation is not an exact science: in some cases, even datasets that have been tested statistically can be processed and used in ways that identify individuals.

Protecting personal data in this way is normally done before the data is used operationally, used for research purposes, or shared more widely as shared or open data. It enables the data asset to be useful beyond its core collection purpose whilst protecting the individuals concerned.

In some cases anonymisation has not worked, which can lead to potentially damaging media coverage and fines under the General Data Protection Regulation (GDPR), but most importantly, and above all to be avoided, detriment to the individuals concerned.

Examples

One example involving Netflix and the Internet Movie Database (IMDb) illustrates the potential of the mosaic effect: an anonymised Netflix dataset was de-anonymised by correlating it with the IMDb database. This is an example of statistical de-anonymisation against high-dimensional micro-data, such as individual preferences, recommendations and transaction records.

The techniques were applied to the Netflix Prize dataset, which contained the anonymised movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. An adversary who knows only a little about an individual subscriber can easily identify that subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, the researchers successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
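The mechanics of such a linkage attack can be sketched in a few lines of code. The data below is entirely invented for illustration and the matching rule is deliberately simplistic; the research used far larger datasets and more robust statistical matching, but the principle is the same: a handful of ratings shared between an "anonymous" record and a public profile is enough to link the two.

```python
# Toy sketch of a linkage ("mosaic") attack: an anonymised ratings table
# is matched against a public table using only a few shared movie ratings.
# All names, tokens and ratings below are invented for illustration.

# "Anonymised" dataset: user identities replaced with opaque tokens.
anonymised = {
    "u_93f1": {"Movie A": 5, "Movie B": 1, "Movie C": 4},
    "u_7c20": {"Movie A": 2, "Movie B": 5, "Movie D": 3},
}

# Public dataset, e.g. reviews posted under a real name.
public = {
    "Alice": {"Movie A": 5, "Movie C": 4},
    "Bob":   {"Movie B": 5, "Movie D": 3},
}

def overlap_score(anon_ratings, public_ratings):
    """Count movies rated identically in both records."""
    return sum(
        1 for movie, rating in public_ratings.items()
        if anon_ratings.get(movie) == rating
    )

def best_match(token):
    """Link an anonymous token to the most similar public identity."""
    scores = {name: overlap_score(anonymised[token], ratings)
              for name, ratings in public.items()}
    return max(scores, key=scores.get)

print(best_match("u_93f1"))  # the token matching Alice's public ratings
```

Even this crude overlap count links each token to the correct identity, because individual preference data is so high-dimensional that a few data points are usually unique to one person.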

We are probably all aware of it already, but let’s also consider Facebook data and Cambridge Analytica. Users give their data to Facebook in return for a service that allows social interaction and networking. In this example, firms obtained users’ private information from the social media network to develop “political propaganda campaigns” in the UK and the US.

Although they may not have been aware of it, Facebook users’ data was used to understand and ultimately influence the behaviour and voting choices of individuals. By understanding what people respond to positively, parties could, for example, tailor campaigns to be more effective.

Another recent example reported in the press illustrated that sensitive information about the location and staffing of military bases and spy outposts around the world had been revealed by a fitness tracking company on a data visualisation map showing all the activity of tracked users. Whilst areas of online maps are often obfuscated to hide sensitive sites like these, the publication of Strava data could enable identification not just of the location of bases, but of the roads and movements within them.

Testing anonymisation should be done in a systematic way, and the process should be documented, for example in a Privacy Impact Assessment. To reach an acceptable level of risk, the organisation should also consider: the likelihood that someone would attempt to recreate personal data from the dataset; the possibility of integrating other available datasets with the sample data; and the technologies available to undertake these tasks.
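One simple, widely used measure in such assessments is k-anonymity: the size of the smallest group of records that share the same combination of quasi-identifiers (attributes such as age band or partial postcode that could be matched against other datasets). The sketch below is a minimal illustration using invented records, not an assessment tool; real assessments also weigh the attack likelihood and linkable datasets described above.

```python
# Minimal k-anonymity check on invented records. Each record is a tuple
# of quasi-identifiers: (age band, postcode prefix).
from collections import Counter

records = [
    ("30-39", "G1"), ("30-39", "G1"), ("30-39", "G1"),
    ("40-49", "G2"), ("40-49", "G2"),
    ("50-59", "G3"),  # a unique combination: a re-identification risk
]

def k_anonymity(rows):
    """Smallest group size sharing one quasi-identifier combination."""
    return min(Counter(rows).values())

print(k_anonymity(records))  # 1: at least one person is unique
```

A result of 1 means at least one individual is uniquely identifiable from the quasi-identifiers alone; organisations typically generalise or suppress values until k reaches an agreed threshold.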

However, this is not always possible. Remember that significant value is enabled when we share data, but as big data technologies become more advanced, anonymisation failures may increasingly occur.


This article is from the free online course:

The Power of Data in Health and Social Care

University of Strathclyde