
Ethics in statistics and data science

What ethical principles must data scientists apply and what key ideas should they consider in their work?

Data scientists must apply ethical principles to their work. Here, you are introduced to key ideas you will need to consider in any data science role, because decisions made with statistical and data science tools often affect many aspects of people’s lives.

The following issues are particularly important, and some have legislation attached to them. Check for the most up-to-date information for your country after completing this reading:

  • Data privacy
  • Data security
  • Data-driven decisions

1. Data privacy

Users customarily provide personal information when signing up for services such as banking, insurance, utility supplies, social media, and so on.

This information is greatly augmented by detailed records of browsing history, online shopping habits and geolocation data from mobile devices.

Can this information be used, shared or sold?

Should individuals be made aware that their personal data is being collected, and of its intended use?

In the European Union and the UK, the basic principles and rules for data protection are set out in the General Data Protection Regulation (GDPR).

Example 1: Facebook study of emotional contagion

Source: Kramer, A.D.I., Guillory, J.E., Hancock, J.T. 2014. Experimental evidence of massive-scale emotional contagion through social networks. PNAS 111 (24), pp. 8788–8790.

In January 2012, Facebook conducted an experiment to investigate whether emotional states can be transferred to others via emotional contagion.

A total of 689,003 Facebook users were randomly selected for the study without their knowledge.

Their exposure to emotional expressions was manipulated by reducing either positive or negative emotional content in their newsfeeds.

There was also a control group, with users whose newsfeed was not manipulated.

In each group, the percentage of positive and negative words produced by each person was measured.
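As a rough illustration of this kind of measurement, the sketch below counts the percentage of positive and negative words in a set of posts. It is purely hypothetical: the real study used the LIWC text-analysis software, and the tiny word lists here are invented for illustration only.

```python
# Hypothetical sketch of the word-counting measurement described above.
# The actual study used the LIWC tool; these word lists are invented.
POSITIVE = {"happy", "great", "love", "wonderful"}
NEGATIVE = {"sad", "awful", "hate", "terrible"}

def emotion_percentages(posts):
    """Return (% positive words, % negative words) across all posts."""
    words = [w.strip(".,!?").lower() for post in posts for w in post.split()]
    total = len(words) or 1                      # avoid division by zero
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 100 * pos / total, 100 * neg / total

print(emotion_percentages(["I love this wonderful day!", "Feeling sad today."]))
```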

The result of the study was striking: users’ emotional expression was measurably affected by the emotional content of their newsfeeds.

This is an interesting academic discovery, but was the study designed appropriately? It may be especially concerning that the Facebook users were unaware of their inclusion in the study.

The rationale for Facebook’s non-disclosure was that otherwise the participants would not have acted authentically.

On the other hand, the Facebook users recruited into the study might justifiably be upset, as the study design violated the requirement of informed consent.

The claim by Facebook that the study was consistent with their Data Use Policy, to which all users agree prior to creating a Facebook account, might not be valid because the consent must be case-specific rather than generic or ‘by default’.

Thus, there is an ethical dilemma that needs to be considered very carefully before conducting such a study.

Substantial fines can be imposed on businesses for violating data protection law.

Example 2: Amazon GDPR fine

Source: Data Privacy Manager. 2021. Luxembourg DPA issues €746 Million GDPR Fine to Amazon. [Online].

In July 2021, the Luxembourg National Commission for Data Protection (CNPD) issued a record fine of €746 million ($888 million) to Amazon.com Inc.

The fine followed a complaint by 10,000 people against Amazon in May 2018 through a French privacy rights group, La Quadrature du Net.

The CNPD investigation looked into how Amazon was processing the personal data of its customers, and found infringements regarding Amazon’s advertising targeting system that was carried out without proper consent.

2. Data security

If data contains private or personal information, such as credit card details, social security numbers or health records, it must be protected and stored securely to prevent unauthorised access.

Security measures used to protect data include encryption and anonymisation; a minimal code sketch of both follows the list below.

  • Encryption aims to encode data in a secret way (e.g. by scrambling it) so that it is extremely hard to decipher the code and retrieve the original data.
  • Anonymisation means the removal of data features that would make it possible to trace back and identify a particular person behind a piece of data.
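The sketch below is illustrative only, not production-grade security code: it assumes the third-party Python package cryptography is installed, and all data values are invented.

```python
# Minimal sketch of encryption and anonymisation (illustration only,
# not production security code). Assumes: pip install cryptography
import hashlib
from cryptography.fernet import Fernet

# --- Encryption: scramble data so it is unreadable without the key ---
key = Fernet.generate_key()                      # the key itself must be kept safe
cipher = Fernet(key)
token = cipher.encrypt(b"4111 1111 1111 1111")   # e.g. a card number
print(token)                                     # scrambled, unreadable bytes
print(cipher.decrypt(token))                     # recoverable only with the key

# --- Anonymisation: remove or coarsen identifying features ---
record = {"name": "Jane Doe", "postcode": "AB1 2CD", "spend": 42.50}
anonymised = {
    # Replacing the name with a one-way hash is strictly pseudonymisation;
    # full anonymisation would drop the identifier entirely.
    "user_id": hashlib.sha256(record["name"].encode()).hexdigest()[:12],
    "postcode": record["postcode"].split()[0],   # keep only the district
    "spend": record["spend"],
}
print(anonymised)
```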

Unfortunately, there are frequent, and sometimes successful, attempts by criminal hackers to steal personal data and use it for illegal activities, such as scams and phishing.

This is an ongoing battle in cyberspace: security measures are becoming ever more sophisticated and effective, but criminals are still often able to penetrate firewalls through software loopholes and steal sensitive data.

Example 3: Yahoo! data breaches

Source: Wikipedia. Yahoo data breaches. [Online].

In September 2016, the internet service company Yahoo! reported a data breach that occurred in 2014 and affected over 500 million Yahoo! user accounts.

An earlier data breach, which occurred in 2013 but was only reported by Yahoo! in December 2016, affected over 1 billion user accounts.

Details stolen by hackers included names, email addresses, telephone numbers, security questions and answers, dates of birth and passwords.

These data breaches are considered among the largest in history.

The breach revelations affected the sale of Yahoo! to Verizon Communications, which lowered its initial offer of $4.8 billion by $350 million (the deal closed in June 2017).

3. Data-driven decisions

Data plays a pivotal part in many decisions, such as whether or not someone gets a bank loan, an insurance policy or even a job.

Increasingly, such decisions are made automatically using algorithms trained on large databases. But such databases may contain inaccurate, outdated or biased information.

For instance, if a large company uses a CV-screening algorithm trained on former and current employees considered to be ‘good’, the recruitment outcomes are bound to reproduce existing hiring practices, possibly including gender or racial biases.

Due to such biases in the training data, the prediction algorithms may discriminate against certain parts of the population.
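The deliberately simplified sketch below shows how this can happen. Everything here is hypothetical: the data are randomly generated, and the ‘proxy’ feature stands in for any CV attribute correlated with gender. Even though gender is excluded from the model’s inputs, the model learns to screen candidates differently by gender, because the proxy carries the gender signal present in the biased historical labels.

```python
# Hypothetical sketch of bias reproduction in a CV-screening model.
# All data are randomly generated; assumes scikit-learn is installed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

gender = rng.integers(0, 2, n)                   # 0 or 1 (invented labels)
experience = rng.normal(5, 2, n)                 # years of experience
# A CV feature correlated with gender (e.g. membership of a club
# that historically skewed towards one group).
proxy = (gender + rng.normal(0, 0.5, n) > 0.5).astype(float)

# Historical hiring decisions that were biased towards gender == 1,
# independently of ability.
hired = (0.3 * experience + 1.5 * gender + rng.normal(0, 1, n) > 2.5)

# Train only on 'neutral-looking' features: gender itself is excluded.
X = np.column_stack([experience, proxy])
model = LogisticRegression().fit(X, hired)

# The model still scores the two groups very differently.
scores = model.predict_proba(X)[:, 1]
print("mean screening score, group 0:", round(scores[gender == 0].mean(), 3))
print("mean screening score, group 1:", round(scores[gender == 1].mean(), 3))
```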

A more serious issue arises when data analytics is used to manipulate what is offered to individuals, depending on their predictive profiles.

Such practices are, of course, widely used by retailers (such as Amazon) and online content providers (such as Netflix), and they can be very helpful in improving customers’ shopping experience.

However, there are examples where such tools have been badly abused.

Example 4: Cambridge Analytica scandal

Source: Wikipedia. Cambridge Analytica data scandal. [Online].

In the 2010s, the personal data of millions of Facebook users were collected, without their explicit consent, by British consulting firm, Cambridge Analytica.

The data were gathered through a series of quiz questions aimed not only at selected Facebook users, but also at their friends and friends of friends.

The algorithms harvested the data of up to 87 million Facebook accounts.

Psychological profiles configured from collected data were used by Cambridge Analytica to provide analytical assistance to the 2016 presidential campaigns of some of the Republican candidates.

Specifically, these profiles were used to tailor political advertisements sent to each Facebook user:

  • Voters who were classed as potential Republican supporters received triumphant visual images and stories about the Republican candidate’s growing momentum.
  • Swing voters were shown images of notable Republican supporters, combined with negative reports about Democratic candidates.

Cambridge Analytica’s actions were made public in 2017–2018 through the whistleblowing of a former employee.

These actions were widely condemned as a breach of political campaigning norms and as election interference.

As a direct consequence of this scandal, more than $100 billion was knocked off Facebook’s market capitalisation.

Cambridge Analytica declared bankruptcy and was dissolved in July 2019.

One difficult issue with machine-learning predictive algorithms is that they often operate like black boxes: it may not be clear how they have arrived at their conclusions.

Because of their increasingly widespread use, with potentially negative effects on minorities or the poor, such practices have been called “weapons of math destruction”.

Source: O’Neil, C. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York: Crown.

The urgent need to understand, interpret and explain the outputs of machine-learning predictive algorithms has placed a strong emphasis on developing Explainable Artificial Intelligence (XAI).
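As a small illustration of one common XAI technique, the sketch below uses permutation importance: each input feature is shuffled in turn, and the resulting drop in accuracy shows how much the model relies on that feature. The dataset and model are arbitrary choices for the example, and scikit-learn is assumed to be installed.

```python
# Sketch of one simple XAI technique: permutation importance.
# Dataset and model are arbitrary examples; assumes scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A 'black box' model: accurate, but its internal logic is hard to read.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn; features whose shuffling hurts accuracy
# the most are the ones the model relies on.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[i]}: {result.importances_mean[i]:.3f}")
```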

Many also argue that, in addition to stronger regulation of data science, machine learning and AI, there should be a pledge of good and ethical data science to which all professionals commit.

As one recent report on teaching data science at the undergraduate level has suggested, such an oath may, in part, read as follows:

Possible oath for data scientists

“I will remember that my data are not just numbers without meaning or context, but represent real people and situations, and that my work may lead to unintended societal consequences, such as inequality, poverty, and disparities due to algorithmic bias. My responsibility must consider potential consequences of my extraction of meaning from data and ensure my analyses help make better decisions.”

Source: National Academies of Sciences, Engineering, and Medicine. Data Science for Undergraduates: Opportunities and Options, Appendix D: Data Science Oath. 2018. The National Academies Press. p. 119.

Next steps

In the next research activities, you can practise for yourself and review your learning so far.
