Be mindful about your data!
In this step we will examine some real-life cases that illustrate caveats to be wary of when handling data, including possible distortion, bias, misrepresentation or misinterpretation of data.
1. Questions to bear in mind when working with data
There are many questions you should ask yourself while you work with data to make sure you are careful in collecting, representing and analysing data, and then interpreting and reporting the results:
- There is a phrase, “numbers don’t lie”, but can you always correctly understand what they are saying?
- Are you certain that the data is ‘faithful’ to the information of interest, or could there be some kind of distortion or bias?
- Do the numbers tell the ‘full’ story, or is some important part of evidence hidden or unavailable?
2. Sampling, non-response and response biases
Bias may be caused by poor sampling design.
2.1. Sampling bias
Example 1: US presidential election, 1936
Source: Squire, P. 1988. Why the 1936 Literary Digest poll failed. Public Opinion Quarterly. 52, pp.125–133.
In the 1936 presidential election in the United States, the Democratic President Franklin D. Roosevelt was re-elected, defeating the Republican Alf Landon, Governor of Kansas. Before the election, the popular weekly magazine Literary Digest conducted a poll by mailing out some 10 million postcard ‘ballots’, about 2.3 million of which were returned. The mailing addresses were compiled from club membership lists, telephone directories and registers of car owners. Based on these responses, the Digest predicted that Landon would win with 57% of the popular vote.
In reality, Landon lost the election with just 39% of the popular vote, while Roosevelt enjoyed a landslide victory with 62% of the popular vote, winning 46 of the 48 states. One reason for the dramatic failure of the Digest’s prediction was its focus on people with telephones in their homes, who accounted for only about 40% of households. This is an example of sampling bias, where the sample is not representative of the general population.
2.2. Non-response bias
Continuing with Example 1, an even bigger mistake was that the prediction relied on voluntary responses, which distorted the sample due to the so-called non-response bias. As it turned out, Landon supporters were more inclined to return their answers: 60% of non-respondents voted for Roosevelt, while only 40% of respondents did so. In contrast, the newly established Gallup Institute correctly predicted Roosevelt’s victory using a stratified sample of 50,000 people, which proved to be more representative of the general population of voters.
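The quoted figures let us sketch the size of this distortion. Below is a minimal back-of-the-envelope calculation (assuming, purely for illustration, that the quoted 40%/60% Roosevelt shares apply exactly to everyone who was mailed a ballot):

```python
mailed = 10_000_000       # postcard 'ballots' mailed out
returned = 2_300_000      # ballots actually returned
p_fdr_resp = 0.40         # Roosevelt share among respondents
p_fdr_nonresp = 0.60      # Roosevelt share among non-respondents

non_returned = mailed - returned
# Roosevelt's share across everyone who was mailed a ballot
overall = (returned * p_fdr_resp + non_returned * p_fdr_nonresp) / mailed
print(round(overall, 3))  # 0.554: a majority for Roosevelt, hidden by the 40% seen in responses
```

In other words, under these illustrative assumptions a clear Roosevelt majority among all those polled is turned into an apparent Landon majority by voluntary response alone.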
2.3. Response bias
A different example of data distortion in sample surveys is the response bias contained in the actual responses. One reason for that kind of bias might be that the question is asked in a leading way, making a certain response more likely. Another possible reason for bias is that the respondents may lie because they think the true answer is socially unacceptable. Even the order of questions may influence the results.
Example 2: Freedom of reporting
Source: Crossen, C. 1996. Tainted Truth: The Manipulation of Fact in America. New York: Simon & Schuster.
In a sample survey in the United States during the Cold War, there were two questions:
- “Should the United States let Russian newspaper reporters come into the country and send back whatever they want?”
- “Should Russia let American newspaper reporters come in and send back whatever they want?”
For the first question, the percentage of “yes” responses was 36% when it was asked first and 73% when it was asked second. This is an example of response bias.
3. Hidden data
Another possible cause of bias is failing to take into account all of the available data.
Example 3: The Challenger disaster, 1986
Source: Dalal, S.R., Fowlkes, E.B. and Hoadley, B. 1989. Risk analysis of the space shuttle: pre-Challenger prediction of failure. Journal of the American Statistical Association. 84, pp.945–957.
On 28 January 1986, the space shuttle Challenger broke up over the Atlantic Ocean shortly after launch from Cape Canaveral, Florida, killing all seven crew members aboard. The investigation attributed the disaster to the failure of the rubber O-ring seals in the shuttle’s right solid rocket booster. The O-rings lost elasticity due to the record-low air temperature of 36°F (2°C) on the morning of the launch. Before the fatal launch, the air temperature had been forecast to drop overnight to as low as 18°F (−8°C), rising to 26°F (−3°C) by the scheduled launch time of 09:38.
Could the disaster have been prevented by a more careful analysis of past data, given the freezing weather forecast? The O-ring issue was known to NASA (National Aeronautics and Space Administration, USA) from past launches.
The next table shows the data for the previous 23 space shuttle flights.
Table 1: Statistics of O-ring faults in space shuttle launches (1981–1986)
Flight | Date | Temperature (°F) | O-ring fault (1 = yes, 0 = no) |
---|---|---|---|
1 | 12/04/1981 | 66 | 0 |
2 | 12/11/1981 | 70 | 1 |
3 | 22/03/1982 | 69 | 0 |
4 | 11/11/1982 | 68 | 0 |
5 | 04/04/1983 | 67 | 1 |
6 | 18/06/1983 | 72 | 0 |
7 | 30/08/1983 | 73 | 0 |
8 | 28/11/1983 | 70 | 0 |
9 | 03/02/1984 | 57 | 1 |
10 | 06/04/1984 | 63 | 1 |
11 | 30/08/1984 | 70 | 1 |
12 | 05/10/1984 | 78 | 0 |
13 | 08/11/1984 | 67 | 1 |
14 | 24/01/1985 | 53 | 1 |
15 | 12/04/1985 | 67 | 1 |
16 | 29/04/1985 | 75 | 1 |
17 | 17/06/1985 | 70 | 1 |
18 | 29/07/1985 | 81 | 0 |
19 | 27/08/1985 | 76 | 0 |
20 | 03/10/1985 | 79 | 0 |
21 | 30/10/1985 | 75 | 1 |
22 | 26/11/1985 | 76 | 1 |
23 | 12/01/1986 | 58 | 1 |
The data is visualised in the following dot plots:
NASA’s decision to go ahead with the launch of the Challenger was based on an analysis of the faulty launches only (see the first plot in Figure 1). This subset has a sample mean of about 66.8°F and a median of 67°F, and the observed frequencies of faults do not indicate that the risk increases towards lower temperatures. However, looking at the entire dataset (see the second plot in Figure 1) changes the picture completely: fault-free launches occurred only at 66°F or above, strongly indicating that freezing temperatures, such as those forecast for the launch morning, present a great risk.
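The contrast between the two views of the data can be checked directly from Table 1. A short sketch (temperatures and fault indicators transcribed from the table):

```python
# Temperatures (°F) and O-ring fault indicators for the 23 flights in Table 1
temps  = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
          67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
faults = [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
          1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

# Split the flights at 66°F, the coldest fault-free launch on record
cold = [f for t, f in zip(temps, faults) if t < 66]
warm = [f for t, f in zip(temps, faults) if t >= 66]
print(sum(cold), len(cold))   # 4 4  -> every launch below 66°F had a fault
print(sum(warm), len(warm))   # 9 19 -> under half of the warmer launches did
```

Looking only at the 13 faulty launches hides exactly this pattern: all four launches below 66°F suffered O-ring faults.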
4. Survivorship bias
Sometimes, the most informative data is simply not available.
Example 4: Damage in surviving planes
Source: Casselman, W. 2016. The legend of Abraham Wald. [Online].
During World War II, Allied bomber planes would return to base after a mission with bullet hits in the fuselage and wings. Air commanders wanted to figure out where to reinforce the planes to improve their survival chances and cut losses. A straightforward idea would be to reinforce the parts that received more bullet hits. But the American statistician Abraham Wald, working in the Statistical Research Group (SRG) at Columbia University, advised otherwise!
He simply noted that the bullet hits observed on the surviving planes were evidently not damaging enough to bring those planes down, since they had returned safely. In contrast, the areas that showed no hits (such as the engines) were more likely to be critical for survival, even though there was no direct evidence from the planes that did not return. This insight leads to the counter-intuitive conclusion that the least-damaged areas should be reinforced. Such situations are referred to as survivorship bias: missing or unavailable data may contain more information than the data we have access to.
5. Underestimation of risk
A thoughtful and knowledgeable use of statistical theory is essential for the correct assessment of ‘rare’ events representing risk.
Example 5: NATS incident, 2023
Source: Reuters. 2023. UK air traffic meltdown “one in 15 million” event. [Online].
At 08:32 on 28 August 2023, the UK National Air Traffic Services (NATS) experienced a major incident resulting in the air traffic control system’s failure. The system received details of a flight which was due to cross UK airspace later that day. The system detected that two markers along the planned route had the same name – even though they were in different places. This triggered the system to automatically stop working for safety reasons so that no incorrect information was passed to NATS air traffic controllers.
This unfolded in just 20 seconds. Fixing the problem took a few hours, but it caused three days of chaos at airports. The NATS chief executive said that the system did “what it was designed to do, i.e. fail safely when it receives data that it can’t process.” He described it as “a one in 15 million” event – it was the first incident of its kind in the five years the current software had been in operation, during which it had processed more than 15 million flights.
This seems to imply that such an event occurs with a probability of about p = 1/15,000,000 per flight (of course, this is just an estimate based on a single occurrence). This probability looks extremely small, so NATS appear ‘innocent’.
However, let us ask: what is the probability that no such event occurs over n = 15,000,000 ‘trials’ (flights, in this case)?
Substituting p and n, and noting that np = 1, we get
(1 − p)^n ≈ e^(−np) = e^(−1) ≈ 0.36788 (about 37%).
That is to say, the probability that an incident happens at least once over the past five years is
1 − (1 − p)^n ≈ 1 − 0.36788 = 0.63212 (about 63%).
Thus, contrary to NATS’ assessment, such an event was actually quite likely to occur at some point over those five years.
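The calculation is easy to reproduce. A minimal sketch using the quoted estimates (p = 1/15 million per flight, n = 15 million flights):

```python
import math

p = 1 / 15_000_000        # estimated per-flight probability of such an incident
n = 15_000_000            # flights processed over the five years

none = (1 - p) ** n       # probability of no incident at all
at_least_one = 1 - none   # probability of at least one incident

print(round(none, 5), round(at_least_one, 5))   # 0.36788 0.63212
# Since n*p = 1, this matches the limiting value e**(-1):
print(round(math.exp(-1), 5))                   # 0.36788
```

The same approximation (1 − p)^n ≈ e^(−np) holds whenever p is small, which is why the answer depends only on the product np.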
6. Data misrepresentation
When reporting data or the results of statistical analysis, it is crucial to represent information in a fair and truthful way.
Example 6: University enrolment data
Source: Agresti, A., Franklin, C., Klingenberg, B. 2023. Statistics: The Art and Science of Learning from Data. Pearson. p.117.
Now consider the following example of student enrolment numbers at a US university from 2004 to 2012, comparing the total number of students with the number of STEM majors.
Table 2: University enrolment data
Year | Total students | STEM majors |
---|---|---|
2004 | 29,404 | 2,003 |
2005 | 29,693 | 1,906 |
2006 | 30,009 | 1,871 |
2007 | 30,912 | 1,815 |
2008 | 31,288 | 1,856 |
2009 | 32,317 | 1,832 |
2010 | 32,941 | 1,825 |
2011 | 33,878 | 1,897 |
2012 | 33,405 | 1,854 |
The following graph is a newspaper illustration of these numbers, displaying the growth of the total number of students compared to the intake in STEM (Science, Technology, Engineering and Mathematics).
Human figures of varying size are supposed to represent the numbers (indicated below and above the figures). Does this graph represent the data well? Take a moment to note down what you think is wrong with it and post your comment in the discussion area in the next section.
Here are some answers:
- The vertical axis is not labelled, which makes it difficult to match the size of a figure with the enrolment number.
- The sizes of the human figures are intended to represent the enrolment numbers, but these are shown completely out of proportion. For instance, in 2004 the STEM number (2,003) was about 7% of the total (29,404), but the height of the STEM figure is about two-thirds the height of the total figure.
- The use of solid blue figures versus outlined figures may distort the perception. It would be better to put the figures next to each other for a clearer comparison.
Overall, the design of this graph gives a misleading representation of the data.
Using a bar chart would seem a better choice than using human figures (see the next image).
This graph correctly displays the small proportion of STEM enrolments. It shows a slow but steady increase in total enrolments, but any trend for STEM is unclear because the counts are small compared with the totals.
It may be more meaningful to show the variation of the STEM percentage over the years, as shown in the next polygonal plot.
This plot now clearly shows a declining trend in the STEM enrolment percentage over the years.
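The plotted percentages follow directly from Table 2. A short sketch:

```python
# Enrolment numbers transcribed from Table 2
years = list(range(2004, 2013))
total = [29404, 29693, 30009, 30912, 31288, 32317, 32941, 33878, 33405]
stem  = [2003, 1906, 1871, 1815, 1856, 1832, 1825, 1897, 1854]

# STEM enrolment as a percentage of the total, year by year
pct = [100 * s / t for s, t in zip(stem, total)]
for y, p in zip(years, pct):
    print(y, round(p, 2))
# The share falls from about 6.8% in 2004 to about 5.6% in 2012
```

This is the transformation that turns two hard-to-compare absolute series into a single, directly interpretable trend.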
7. ‘Noise’ and ‘outliers’ may be informative
When analysing data, it is common to clean it up by removing noise or potential outliers. However, the wealth of available data should be treated with care: what at first looks like a nuisance may carry a valuable ‘message’.
Example 7: Discovery of CMB radiation
Source: Chodos, A. 2002. This Month in Physics History: June 1963: Discovery of the Cosmic Microwave Background. APS News. 11(7), p.2.
In 1963, the American physicists Arno Penzias and Robert Wilson used a Bell Labs radio telescope to collect measurements of radio signals from remote intergalactic regions of space. The idea was to isolate the weak background signal, undistorted by galaxy masses, which could be attributed to ‘residual’ cosmic processes that took place about 14 billion years ago. To do so, they had to eliminate all interference, e.g. by cooling the receiver with liquid helium.
Despite all their efforts, the remaining signal still contained an annoying background ‘noise’ in the microwave range, which seemed to come from all directions. They first thought this was just a statistical error, but the readings were so persistent that random fluctuation could not explain them. Penzias and Wilson concluded that the ‘noise’ was not noise at all but a valuable signal, proving the existence of what was later termed the Cosmic Microwave Background (CMB).
This discovery was the first evidence to support the Big Bang theory, which explains the expansion of the Universe from an initial state of high density and temperature. In 1978, Penzias and Wilson received the Nobel Prize in Physics for their serendipitous discovery of CMB radiation.
Similarly, care should be taken when dealing with ‘outliers’ (i.e. atypical, rare values in the data) – they may contain valuable information about risk (e.g. financial losses or environmental extremes).
Reflection
Do you have any other examples, from your interests or field of work, that highlight issues such as these?
Next steps
Now that you know more about possible errors and causes of bias, in the follow-up activity you are invited to practise your skills in assessing representation of various datasets by detecting possible flaws and suggesting ways of improvement.
Before moving on you may wish to engage with your peers in the following Share additional resources area.
Consider the question:
Do you have any other examples, from your interests or field of work, that highlight issues such as these?