Case studies: Inference from data

What is the value of using statistics in data science?

Previously, we established the value of using statistics in data science. Here, you explore four case studies and reflect on which research questions the data may help to answer.

The four case studies are:

  1. Women’s clothing sizes
  2. Hardness of water in Texas
  3. Effect of nitrogen on the growth of plants
  4. Discovery of argon (Ar)

1. Women’s clothing sizes

Source: This is a stylised dataset, adapted to the UK sizing system from a real dataset for French women; see Fashion Network, 2016 (refer to the ‘See also’ section below).

Women’s clothing sizes in the UK are labelled with even numbers from 6 to 26, which summarise key body measurements such as bust, waist and hip. Suppose a clothing manufacturer wishes to determine in what proportions to produce each size of a women’s garment. Producing equal quantities of every size would not be right, because some sizes are worn far more often than others.

To investigate, a company statistician explores data from a sample of 160 randomly chosen women. This data is shown in the table that follows.

Table 1:  Women’s clothing size data

Size Frequency Percentage
6 1 0.6%
8 11 6.9%
10 17 10.6%
12 29 18.1%
14 31 19.4%
16 27 16.9%
18 17 10.6%
20 13 8.1%
22 9 5.6%
24 3 1.9%
26 2 1.3%
Total 160 100%

They then visualise the data in the following bar chart:

A bar plot showing observed frequencies in the sample using vertical bars of the corresponding heights. Data taken from Table 1. The plot has a bell shape, with the mode at size 14, and is slightly asymmetrical, with a noticeable skewness to the right.

The statistician observes that the data is unimodal: the mode (the most frequent value) is size 14, which is also the median (the value separating the observations into two equal halves). The sample mean (the average of all observations) is slightly bigger, 14.8, which is due to a noticeable skewness to the right, meaning that the sizes above the median extend further from it than the sizes below. In particular, over 44% of women wear size 16 or above, compared with about 36% who wear a size below 14. This data provides a helpful guide to plausible proportions of the different sizes, which the company could follow.
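The summary statistics in this paragraph can be recomputed directly from Table 1. The following is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original study.

```python
from statistics import mean, median, mode

# Frequencies copied from Table 1: UK size -> number of women in the sample.
size_counts = {6: 1, 8: 11, 10: 17, 12: 29, 14: 31, 16: 27,
               18: 17, 20: 13, 22: 9, 24: 3, 26: 2}

# Expand the table into one entry per woman so that the standard
# library's summary functions can be applied directly.
sizes = [size for size, count in size_counts.items() for _ in range(count)]

n = len(sizes)
sample_mode = mode(sizes)
sample_median = median(sizes)
sample_mean = mean(sizes)

# Share of the sample on either side of the modal size.
share_below_14 = sum(c for s, c in size_counts.items() if s < 14) / n
share_16_plus = sum(c for s, c in size_counts.items() if s >= 16) / n

print(f"n = {n}, mode = {sample_mode}, median = {sample_median}, "
      f"mean = {sample_mean:.1f}")
print(f"below size 14: {share_below_14:.1%}; "
      f"size 16 or above: {share_16_plus:.1%}")
```

Comparing the two shares is a quick way to see the longer right tail: more of the sample sits above the modal size than below it.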

Note, however, that the size distribution is likely to depend on age. What if the manufacturer is focusing on younger women’s fashion, while the survey was carried out regardless of age? In that case, the statistician should use data from women randomly selected from the appropriate age cohort. Another interesting question is whether and how the size distribution changes over time. This may reflect fashion trends but also changes in anthropometric characteristics (for example, a growing prevalence of overweight).

2. Hardness of water in Texas

Source: Brase, C.H. and Brase, C.P. 2016. Understanding Basic Statistics. Cengage Learning. 7th ed., pp. 137–139.

In Texas, underground water is extremely important for growing crops. The hardness or softness of water is characterised using its ‘pH’. A pH less than 7.0 is acidic (‘soft’) and a pH above 7.0 is alkaline (‘hard’). If pH is too high, the water is unusable or needs expensive treatment to make it softer.

Note: If pH is too low, the water may irritate if used in a swimming pool or a shower, but this is not the case in Texas.

The dataset in the table that follows gives the pH levels of groundwater in a random sample of 103 Texas wells. The table shows that the water in this region tends to be quite hard.

Table 2:  Water pH data

pH Frequency
7.0 8
7.1 10
7.2 10
7.3 11
7.4 9
7.5 8
7.6 9
7.7 6
7.8 5
7.9 5
8.0 1
8.1 7
8.2 7
8.3 0
8.4 1
8.5 1
8.6 1
8.7 1
8.8 2
8.9 1

The data is visualised by the following histogram:

A histogram showing observed frequencies in the sample using 10 classes of width 0.2 each. The histogram is strongly skewed to the right, with the maximum class 8.8–9.0 lying far to the right from the mode class 7.2–7.4.

The data is unimodal, with the mode 7.3, whereas the median and mean of the sample are 7.5 and 7.6, respectively. This confirms that the data is skewed to the right, so there is a tendency towards higher values of pH. In particular, over 62% of wells have a pH higher than 7.3 but only 27% have a pH lower than 7.3, including just 8% with neutral water (pH = 7.0). This data can help the local authorities decide on their best policy to handle hard water in the region.
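The mode, median, mean and histogram classes can all be derived from Table 2 without listing the individual wells. The sketch below is one way to do this; it stores pH in tenths to keep the class arithmetic exact, and it assumes left-closed classes (7.0 up to but not including 7.2, and so on), which is a guess at the convention used for the histogram.

```python
# Frequencies copied from Table 2, keyed by pH in tenths (70 means pH 7.0)
# so that the class arithmetic below stays in exact integers.
ph_tenths_counts = {70: 8, 71: 10, 72: 10, 73: 11, 74: 9, 75: 8, 76: 9,
                    77: 6, 78: 5, 79: 5, 80: 1, 81: 7, 82: 7, 83: 0,
                    84: 1, 85: 1, 86: 1, 87: 1, 88: 2, 89: 1}

n = sum(ph_tenths_counts.values())

# Mode: the pH value with the largest frequency.
mode_tenths = max(ph_tenths_counts, key=ph_tenths_counts.get)

# Weighted mean, converted back to pH units.
mean_ph = sum(t * c for t, c in ph_tenths_counts.items()) / n / 10


def median_from_counts(counts):
    """Walk the cumulative counts until the middle observation is reached.

    Valid as written for an odd number of observations, as is the case here.
    """
    middle = (sum(counts.values()) + 1) // 2
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= middle:
            return value


median_tenths = median_from_counts(ph_tenths_counts)

# Ten classes of width 0.2, assumed left-closed: [7.0, 7.2), [7.2, 7.4), ...
class_counts = [0] * 10
for t, c in ph_tenths_counts.items():
    class_counts[(t - 70) // 2] += c

share_above_mode = sum(c for t, c in ph_tenths_counts.items()
                       if t > mode_tenths) / n

print(f"n = {n}, mode = {mode_tenths / 10}, median = {median_tenths / 10}, "
      f"mean = {mean_ph:.2f}")
print("class counts:", class_counts)
print(f"share of wells above the mode: {share_above_mode:.1%}")
```

Under the left-closed assumption, the largest class count falls in the second class (7.2 to 7.4), in line with the histogram description.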

Suppose that a certain crop can tolerate irrigation water with a pH not exceeding 7.95. What percentage of the wells in the region could be used for that crop? If we believe that the sample is representative of the wells in the region, the required percentage can be estimated as the number of wells with a pH of 7.9 or less (8+10+10+11+9+8+9+6+5+5=81) divided by the total number of wells, 103 (and multiplied by 100% to convert to percentage):  

\( \frac{81}{103} \cdot 100\% = 78.6\%. \)

This can be taken as an estimate of the required percentage, but because pH varies randomly across the wells, the actual figure over the entire region may well be higher or lower. Statistical methods provide tools to evaluate the possible error of this estimate. Specifically, as can be shown, we may be fairly confident that the required percentage across the entire region lies between 71.1% and 86.2%.
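The article does not say which method or confidence level lies behind the quoted bounds. As an illustration only, the sketch below computes a 95% normal-approximation (Wald) interval for the proportion, which is one common textbook choice. Under that assumption the interval comes out at roughly 70.7% to 86.6%, close to but not identical to the bounds quoted above, so the original calculation presumably used a somewhat different method.

```python
from math import sqrt
from statistics import NormalDist

usable_wells = 81   # wells with pH of 7.9 or less, from Table 2
total_wells = 103   # sample size

# Point estimate of the proportion of usable wells.
p_hat = usable_wells / total_wells

# Assumption: 95% confidence level and the normal approximation
# to the sampling distribution of a proportion (Wald interval).
confidence = 0.95
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96

# Estimated standard error of the sample proportion.
standard_error = sqrt(p_hat * (1 - p_hat) / total_wells)

lower = p_hat - z * standard_error
upper = p_hat + z * standard_error

print(f"estimate: {p_hat:.1%}")
print(f"95% Wald interval: {lower:.1%} to {upper:.1%}")
```

A larger sample would shrink the standard error and therefore the width of the interval, which is why the sample size matters as much as the estimate itself.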

3. Effect of nitrogen on the growth of plants

Source: Walpole, R.E., Myers, R.H., Myers, S.L., Ye, K. 2012. Probability & Statistics for Engineers & Scientists. Prentice Hall. 9th ed., pp. 4–5.

A study was conducted to assess the impact of nitrogen (N) on growth. Twenty seedlings of northern red oak were planted in a greenhouse, with the same type of soil and the same amount of sunshine and water. Half of the seedlings were treated with nitrogen and the other half were left untreated as a control. After 140 days, the stem weights (in grams) were recorded as shown in the following table.

Table 3:  Weights of stems in two samples 

Observation With nitrogen Without nitrogen
1 2.6 3.2
2 4.3 5.3
3 4.7 2.8
4 4.9 3.7
5 5.2 4.7
6 7.5 4.3
7 7.9 3.6
8 8.6 4.2
9 6.2 3.8
10 4.6 4.3
Mean 5.65 3.99

Plotting dot plots and box plots of the data helps us compare the two samples:

Two dot plots depicting the data with nitrogen (top) and without nitrogen (bottom). The dots in the top plot spread widely across the range from 2.6 to 8.6, whereas the dots in the bottom plot concentrate in the left half of this range, from 2.8 to 5.3.

Two box plots summarising the data with nitrogen (top) and without nitrogen (bottom). The top plot shows a wider spread of the data, with the median (5.1) lying towards the left end of the broad interquartile range. The bottom plot shows a narrower spread, with the median (4.0) located roughly in the middle of the interquartile range.

Both types of plot strongly suggest that nitrogen enhances the growth of plant stems, and we can expect statistical methods to support this conclusion.
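The numbers behind the box plots can be approximated from Table 3. The sketch below uses the default quartile rule of Python's `statistics.quantiles`; quartile conventions differ between software packages, so the box edges it prints are an assumption and may not match the plotted ones exactly.

```python
from statistics import mean, quantiles

# Stem weights in grams, copied from Table 3.
with_nitrogen = [2.6, 4.3, 4.7, 4.9, 5.2, 7.5, 7.9, 8.6, 6.2, 4.6]
without_nitrogen = [3.2, 5.3, 2.8, 3.7, 4.7, 4.3, 3.6, 4.2, 3.8, 4.3]


def summarise(sample):
    """Return the quantities a box plot displays, plus the mean."""
    q1, q2, q3 = quantiles(sample, n=4)   # default 'exclusive' method
    return {
        "min": min(sample),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(sample),
        "iqr": q3 - q1,
        "mean": mean(sample),
    }


treated = summarise(with_nitrogen)
control = summarise(without_nitrogen)

for label, summary in (("with nitrogen", treated),
                       ("without nitrogen", control)):
    print(label, {name: round(value, 2) for name, value in summary.items()})

print("difference in means:", round(treated["mean"] - control["mean"], 2))
```

With this quartile rule, the treated sample has both the higher median and the much wider interquartile range, which is the pattern described for the two box plots.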

4. Discovery of argon (Ar)

Source: Spanos, A. 2010. The discovery of argon: a case for learning from data? Philosophy of Science. 77(3), pp. 359–380 (refer to the ‘See also’ section below).

In 1904, Lord Rayleigh (1842–1919) was awarded the Nobel Prize in Physics for the discovery of argon, an inert gas in the atmosphere. The conventional belief at the time was that atmospheric air was a mixture of oxygen (O2), nitrogen (N2), and small quantities of carbon dioxide (CO2) and water vapour (H2O). Rayleigh’s laboratory procedure was to pass air through a hot copper tube, yielding atmospheric nitrogen. He then compared the atomic weights from such a sample with those of chemically pure nitrogen.

The experimental results are shown in the next table.

Table 4: Atomic weights for atmospheric and chemical nitrogen

Observation Atmospheric nitrogen Chemical nitrogen
1 2.31017 2.30143
2 2.30986 2.29890
3 2.31010 2.29816
4 2.31001 2.30182
5 2.31024 2.29869
6 2.31010 2.29940
7 2.31028 2.29849
8 2.31163 2.29889
9 2.30956 2.30074
10 2.31026 2.30054
Mean 2.31022 2.29971

Here, the sample means are very close to one another, so the difference between the two samples may not be evident from the table alone.

Using visual tools provides us with more insights into the data as shown in the two dot plots that follow.

Two dot plots depicting the data with atmospheric nitrogen (top) and chemical nitrogen (bottom). The plots show clearly the distinctly different ranges of these datasets, with values in the top plot strongly shifted to the right.

The plots indicate that the atomic weight of atmospheric nitrogen is higher. These results can be explained by hypothesising that air contains an unknown component which withstands the entire processing and remains in the atmospheric samples, hence its description as an ‘inert gas’ and the name ‘argon’ (from a Greek word for ‘inactive’).

However, because the numerical values are so close, there may still be doubts about whether the difference between the samples is significant enough.

Statistics is capable of resolving such doubts objectively.
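The article does not name the method it has in mind. As one possible illustration, the sketch below compares the difference between the two sample means in Table 4 with the variability within each sample, using Welch's t statistic (an assumed choice, not necessarily the one intended here).

```python
from math import sqrt
from statistics import mean, variance

# Values copied from Table 4.
atmospheric = [2.31017, 2.30986, 2.31010, 2.31001, 2.31024,
               2.31010, 2.31028, 2.31163, 2.30956, 2.31026]
chemical = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869,
            2.29940, 2.29849, 2.29889, 2.30074, 2.30054]


def welch_t(a, b):
    """Difference in sample means divided by its estimated standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


difference = mean(atmospheric) - mean(chemical)
t_statistic = welch_t(atmospheric, chemical)

# The two samples do not overlap at all, as the dot plots suggest.
no_overlap = min(atmospheric) > max(chemical)

print(f"difference in means: {difference:.5f}")
print(f"Welch t statistic:   {t_statistic:.1f}")
print(f"ranges do not overlap: {no_overlap}")
```

Although the difference in means is only about 0.01, it is more than twenty times its estimated standard error, because the measurements within each sample vary so little. A difference that looks negligible on its own scale can therefore still be very strong evidence.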

Next steps

This step introduced you to four real-world situations in which data can inform understanding and decision making. You saw how statistical thinking can help you formulate research questions that the data may answer. Next, you build on these ideas by considering how bias can unintentionally creep into data collection and skew the resulting data.

This article is from the free online

Statistical Methods

Created by
FutureLearn - Learning For Life
