Re-identification from de-identified data

In this article, Sebastian Binnewies looks at the security issues related to medical data.
© Griffith University

Despite the promise of electronic health records (EHRs) and machine learning, medical data needs to be kept secure and shared carefully.

Anonymity with de-identified datasets

Some datasets are made publicly available to advance research. These datasets, in particular, need to be sufficiently de-identified to prevent the identification of specific individuals. A recent study [1] has shown that the de-identification methods used by the Australian government were not strong enough and that individuals could be re-identified.

Failure of de-identification

Longitudinal medical billing records

In 2016, the Australian Department of Health made the medical billing records of 10% of Australians publicly available online. The records included all bills for medical and pharmaceutical items from 1984 to 2014 for which these people received a reimbursement.

Each record encodes a medical event, containing:

  • a code corresponding to the medical service or prescription
  • the state of the supplier and patient at the time of the event
  • a date
  • the price paid and reimbursed
  • a supplier ID number for medical items.

Patient information in these records consists of:

  • a patient ID number
  • the patient’s year of birth
  • gender.

Only the supplier ID and the patient ID number were obscured through encryption.
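
The record layout described above can be pictured as a small data structure. The following is only a sketch: the field names, types, and sample values are illustrative assumptions and not the schema of the released dataset.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BillingRecord:
    """One medical event. All field names here are hypothetical."""
    patient_id: str      # obscured through encryption in the release
    year_of_birth: int   # not encrypted
    gender: str          # not encrypted
    item_code: str       # medical service or prescription code
    supplier_state: str  # assumed here to mean an Australian state or territory
    patient_state: str
    event_date: date
    price_paid: float
    reimbursed: float
    supplier_id: str     # obscured through encryption in the release


def quasi_identifiers(record: BillingRecord) -> tuple:
    """Return unencrypted attributes that can help narrow down a person."""
    return (record.year_of_birth, record.gender, record.patient_state)


# Invented sample values, for illustration only.
example = BillingRecord(
    patient_id="ENC-PAT-0001",
    year_of_birth=1968,
    gender="F",
    item_code="ITEM-A",
    supplier_state="VIC",
    patient_state="VIC",
    event_date=date(2012, 3, 14),
    price_paid=120.0,
    reimbursed=85.0,
    supplier_id="ENC-SUP-0042",
)

print(quasi_identifiers(example))
```

Even with both ID fields encrypted, the remaining attributes stay readable, which is what the re-identification described next relies on.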

Re-identifying the individuals

Researchers at the University of Melbourne were able to re-identify individuals in this dataset using only a few publicly available facts. They did not have to decrypt any of the ID numbers, and they pointed out that the re-identification would be ‘straightforward for anyone with technical skills about the level of an undergraduate computing degree’ [1, p. 4].

So, how did they do it?

Identify unusual medical events

They started by searching for unusual medical events in the dataset. For example, they looked at year of birth, gender, and childbirth events in the government dataset to find mothers who gave birth late in life. The first two values are directly available as part of a patient’s unencrypted information. Childbirth events can be found by looking for medical events whose code corresponds to a childbirth-related hospital service. They found that in some years, the dataset contained only a single older woman who gave birth.
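
The search for unusual events can be sketched as a filter followed by a uniqueness check. This is a toy illustration, not the researchers' code: the rows, the `BIRTH` code, and the age threshold of 45 are all invented assumptions.

```python
from collections import defaultdict

# Toy events: (encrypted_patient_id, year_of_birth, gender, item_code, event_year).
# Every value is made up for illustration.
EVENTS = [
    ("p01", 1985, "F", "BIRTH", 2010),
    ("p02", 1987, "F", "BIRTH", 2010),
    ("p03", 1962, "F", "BIRTH", 2010),
    ("p04", 1990, "F", "GP", 2010),
    ("p05", 1960, "F", "BIRTH", 2011),
    ("p06", 1961, "F", "BIRTH", 2011),
    ("p07", 1988, "M", "GP", 2011),
]

CHILDBIRTH_CODES = {"BIRTH"}  # hypothetical stand-in for childbirth-related codes
OLDER_AGE = 45                # assumed threshold for "late in life"


def unique_older_mothers(events, min_age=OLDER_AGE):
    """Map each year to the one older patient with a childbirth event,
    keeping only years in which exactly one such patient exists."""
    by_year = defaultdict(set)
    for patient_id, year_of_birth, gender, code, event_year in events:
        approx_age = event_year - year_of_birth  # only the birth year is known
        if gender == "F" and code in CHILDBIRTH_CODES and approx_age >= min_age:
            by_year[event_year].add(patient_id)
    return {year: next(iter(ids)) for year, ids in by_year.items() if len(ids) == 1}


print(unique_older_mothers(EVENTS))
```

In this toy data, 2010 has exactly one qualifying patient, while 2011 has two and is therefore not reported. Note that the encrypted ID is never decrypted; it is only used to tell records apart.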

Correlate with publicly available information

Next, they looked for publicly available information, such as news reports and Wikipedia pages, to correlate with the information from the dataset. Unusual medical events or procedures involving famous people are often reported in the media. Through this correlation, the researchers found three unique matches.
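
The correlation step amounts to joining public facts against the unencrypted attributes and keeping only unambiguous matches. The sketch below assumes a simple exact match on a handful of hypothetical fields; all names and values are invented, and the researchers' actual matching procedure may have differed.

```python
# Toy dataset rows: encrypted ID plus unencrypted attributes (all invented).
DATASET = [
    {"patient_id": "p03", "year_of_birth": 1962, "gender": "F",
     "event": "childbirth", "event_year": 2010},
    {"patient_id": "p08", "year_of_birth": 1975, "gender": "M",
     "event": "knee surgery", "event_year": 2012},
    {"patient_id": "p09", "year_of_birth": 1975, "gender": "M",
     "event": "knee surgery", "event_year": 2012},
]

# Facts that might be gathered from news reports or encyclopedia pages (invented).
PUBLIC_FACTS = [
    {"name": "Person A", "year_of_birth": 1962, "gender": "F",
     "event": "childbirth", "event_year": 2010},
    {"name": "Person B", "year_of_birth": 1975, "gender": "M",
     "event": "knee surgery", "event_year": 2012},
]

MATCH_KEYS = ("year_of_birth", "gender", "event", "event_year")


def unique_matches(dataset, public_facts, keys=MATCH_KEYS):
    """Link each named person to a patient ID only when exactly one
    patient in the dataset agrees on every key."""
    matches = {}
    for fact in public_facts:
        candidates = {
            row["patient_id"]
            for row in dataset
            if all(row[key] == fact[key] for key in keys)
        }
        if len(candidates) == 1:
            matches[fact["name"]] = candidates.pop()
    return matches


print(unique_matches(DATASET, PUBLIC_FACTS))
```

Here only Person A is linked, because two patients fit Person B's facts. Since the release covered a sample of the population, a unique match in the data is a strong candidate rather than proof that the right person has been found; further facts would be needed to confirm it.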

Amendment to the national Privacy Act

Some may argue that three matches among the 2.9 million people in the dataset are not an issue. However, as soon as even one person can be re-identified, a breach of privacy has occurred.

In response to these findings, the Australian Government drafted an amendment to the national Privacy Act. Under the Re-identification Offence amendment, re-identification could lead to two years’ imprisonment [2]. While this could deter hackers with malicious intent, it does not solve the underlying issue of insufficient de-identification measures.

In the next step, we will shift our focus to the application of big data analytics in business.

Your task

What are your ideas?

  • How could the re-identification have been prevented?
  • Could it have been prevented by releasing data not just for 10% of the population but for 100% of the population?

Share your answers in the comments.


  1. Culnane C, Rubinstein BI, Teague V. Health Data in an Open World. arXiv preprint arXiv:1712.05627. 2017 Dec 15.

  2. Pearce R. Government hasn’t given up on ‘re-identification’ bill [Internet]. Computerworld; 2018 [updated 2018 Aug 14; cited 2018 Dec 22]. Available from: 

This article is from the free online course Big Data Analytics: Opportunities, Challenges and the Future.
