Re-identification from de-identified data
Despite the promise of electronic health records (EHRs) and machine learning, medical data needs to be kept secure and shared carefully.
Anonymity with de-identified datasets
Some datasets are made publicly available to advance research. These datasets in particular need to be sufficiently de-identified to prevent the identification of specific individuals. A recent study¹ has shown that the de-identification methods used by the Australian government were not strong enough and that individuals could be re-identified.
Failure of de-identification
Longitudinal medical billing records
In 2016, the Australian Department of Health made the medical billing records of 10% of Australians publicly available online. The records included all bills for medical and pharmaceutical items from 1984 to 2014 for which these people received a reimbursement.
Each record encodes a medical event, containing:
- a code corresponding to the medical service or prescription
- the Australian state of the supplier and of the patient at the time of the event
- a date
- the price paid and reimbursed
- a supplier ID number for medical items.
Patient information in these records consists of:
- a patient ID number
- the patient’s year of birth
- the patient’s gender
Only the supplier ID and the patient ID number were obscured through encryption.
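To make the structure concrete, the record layout described above can be sketched as two small Python data classes. This is only an illustration: the class names, field names, and sample values are hypothetical and do not reflect the actual format of the released files. Gender is included because the article describes it as part of a patient's unencrypted information.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class Patient:
    encrypted_patient_id: str  # obscured in the release
    year_of_birth: int         # released in the clear
    gender: str                # released in the clear


@dataclass(frozen=True)
class MedicalEvent:
    encrypted_patient_id: str   # obscured; links the event to a patient
    encrypted_supplier_id: str  # obscured
    item_code: str              # medical service or prescription code
    supplier_state: str         # state of the supplier at the time of the event
    patient_state: str          # state of the patient at the time of the event
    event_date: date
    price_paid: float
    reimbursed: float


def fields_in_the_clear(record):
    """Return the fields of a record that are not encrypted."""
    return {
        name: value
        for name, value in asdict(record).items()
        if not name.startswith("encrypted_")
    }


# Made-up sample values, for illustration only.
event = MedicalEvent("x9f3", "s77a", "ITEM-123", "VIC", "VIC",
                     date(1996, 5, 2), 80.0, 60.0)
print(fields_in_the_clear(event))
```

The point of the sketch is that encrypting the two ID fields leaves every other field readable, and those remaining fields are what the re-identification relied on.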
Re-identifying the individuals
Researchers at the University of Melbourne were able to re-identify individuals in this dataset using only a few publicly available facts. They did not have to decrypt any of the ID numbers, and they pointed out that the re-identification would be ‘straightforward for anyone with technical skills about the level of an undergraduate computing degree’.¹(p4)
So, how did they do it?
Identify unusual medical events
They started by searching for unusual medical events in the dataset. For example, they looked at year of birth, gender, and childbirth events in the government dataset to find mothers who had given birth late in life. The first two values are directly available as part of a patient’s unencrypted information. Childbirth events can be found by looking for medical events whose code corresponds to a childbirth-related service in hospital. They found that in some years, the dataset contained only a single older woman who gave birth.
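The search for unusual events can be sketched in a few lines of Python. This is a toy illustration, not the researchers' actual code: the data, the item code `BIRTH-01`, the field names, and the age threshold of 45 are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical code standing in for a childbirth-related hospital service.
CHILDBIRTH_CODES = {"BIRTH-01"}


def unique_late_births(patients, events, min_age=45):
    """Return {year: patient_id} for years with exactly one older mother."""
    mothers_by_year = defaultdict(set)
    for event in events:
        patient = patients[event["patient_id"]]
        if event["item_code"] not in CHILDBIRTH_CODES or patient["gender"] != "F":
            continue
        age = event["year"] - patient["year_of_birth"]
        if age >= min_age:
            mothers_by_year[event["year"]].add(event["patient_id"])
    # Keep only the years in which a single patient stands out.
    return {
        year: next(iter(ids))
        for year, ids in mothers_by_year.items()
        if len(ids) == 1
    }


# Toy data: patient IDs stand in for the encrypted IDs in the release.
patients = {
    "a1": {"year_of_birth": 1950, "gender": "F"},
    "b2": {"year_of_birth": 1952, "gender": "F"},
    "c3": {"year_of_birth": 1953, "gender": "F"},
    "d4": {"year_of_birth": 1970, "gender": "F"},
}
events = [
    {"patient_id": "a1", "item_code": "BIRTH-01", "year": 1996},
    {"patient_id": "d4", "item_code": "BIRTH-01", "year": 1996},
    {"patient_id": "a1", "item_code": "GP-VISIT", "year": 1997},
    {"patient_id": "b2", "item_code": "BIRTH-01", "year": 1998},
    {"patient_id": "c3", "item_code": "BIRTH-01", "year": 1998},
]

print(unique_late_births(patients, events))  # {1996: 'a1'}
```

In this toy data, 1996 has one mother aged 45 or over and so yields a unique candidate, while 1998 has two and yields none. Note that the encrypted ID is never decrypted; it is only used to group events by patient.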
Correlate with publicly available information
Next, they looked for publicly available information, such as news reports and Wikipedia pages, to correlate with the information from the dataset. Unusual medical events or procedures involving well-known people are often reported in the media. Through this correlation, the researchers found three unique matches.
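The correlation step amounts to matching a handful of public facts against the unencrypted fields of the dataset and keeping only the unambiguous matches. The following Python sketch shows the idea on made-up data; the function name, field names, and people are hypothetical, and the researchers' actual procedure may have differed.

```python
def link_public_facts(public_facts, dataset_rows):
    """Map each named person to a patient ID when exactly one row matches."""
    matches = {}
    for fact in public_facts:
        # Compare every publicly known attribute except the person's name.
        keys = [key for key in fact if key != "name"]
        hits = [
            row["patient_id"]
            for row in dataset_rows
            if all(row.get(key) == fact[key] for key in keys)
        ]
        if len(hits) == 1:  # ambiguous or missing matches are discarded
            matches[fact["name"]] = hits[0]
    return matches


# Toy rows derived from a de-identified dataset (IDs remain encrypted).
dataset_rows = [
    {"patient_id": "a1", "year_of_birth": 1950, "gender": "F",
     "event_year": 1996, "event": "childbirth"},
    {"patient_id": "b2", "year_of_birth": 1952, "gender": "F",
     "event_year": 1998, "event": "childbirth"},
    {"patient_id": "c3", "year_of_birth": 1952, "gender": "F",
     "event_year": 1998, "event": "childbirth"},
]

# Toy facts of the kind that might appear in a news report.
public_facts = [
    {"name": "Public figure A", "year_of_birth": 1950, "gender": "F",
     "event_year": 1996, "event": "childbirth"},
    {"name": "Public figure B", "year_of_birth": 1952, "gender": "F",
     "event_year": 1998, "event": "childbirth"},
]

print(link_public_facts(public_facts, dataset_rows))
# {'Public figure A': 'a1'}
```

Here only the first person is linked, because two rows fit the second person's facts. Assuming each patient's events share one encrypted ID, a single match of this kind would be enough to tie the rest of that patient's billing history to a name.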
Amendment to the national Privacy Act
Some may argue that three matches among 2.9 million people in the dataset are not an issue. However, as long as one person can be re-identified, a breach of privacy has occurred.
In response to these findings, the Australian Government drafted an amendment to the national Privacy Act. Under the Re-identification Offence amendment, re-identification could lead to two years’ imprisonment.² While this could deter hackers with malicious intent, it does not solve the underlying issue of insufficient de-identification measures.
In the next step we will shift our focus to the application of big data analytics in business.
What are your ideas?
- How could the re-identification have been prevented?
- Could it have been prevented by releasing data not just for 10% of the population but for 100% of the population?
Share your answers in the comments.
2. Pearce R. Government hasn’t given up on ‘re-identification’ bill [Internet]. Computerworld; 2018 [updated 2018 Aug 14; cited 2018 Dec 22]. Available from: https://www.computerworld.com.au/article/645154/government-hasn-t-given-up-re-identification-bill/
© Griffith University