Case studies: Inference from data
Previously, we established the value of using statistics in data science. Here, you explore four case studies and reflect on which research questions the data may help to answer.
The four case studies are:
- Women’s clothing sizes
- Hardness of water in Texas
- Effect of nitrogen on the growth of plants
- Discovery of argon (Ar)
1. Women’s clothing sizes
Source: This is a stylised dataset adapted to the UK sizing system from a real dataset for French women; see Fashion Network, 2016 (refer to the ‘See also’ section below).
Women’s clothing sizes in the UK are labelled with even numbers from 6 to 26, which summarise key body measurements such as bust, waist and hip. Suppose a clothing manufacturer wishes to determine in what proportions to produce each size of a women’s garment. It would not be right to produce clothing items equally in all sizes.
To investigate, a company statistician explores data from a sample of 160 randomly chosen women. This data is shown in the table that follows.
Table 1: Women’s clothing size data
Size | Frequency | Percentage |
---|---|---|
6 | 1 | 0.6% |
8 | 11 | 6.9% |
10 | 17 | 10.6% |
12 | 29 | 18.1% |
14 | 31 | 19.4% |
16 | 27 | 16.9% |
18 | 17 | 10.6% |
20 | 13 | 8.1% |
22 | 9 | 5.6% |
24 | 3 | 1.9% |
26 | 2 | 1.3% |
Total | 160 | 100% |
They then visualise the data in the following bar chart:
The statistician observes that the data is unimodal: the mode (the most frequent value) is size 14, which is also the median (the value separating the ordered observations into two equal halves). The sample mean (the average of all observations) is slightly larger, 14.8, because of a noticeable skewness to the right, meaning that sizes above the median extend further from it than sizes below. In particular, over 44% of the women wear size 16 or above, whereas only about 36% wear size 12 or below. This data provides a helpful guide to plausible proportions of different sizes, which the company could follow.
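The summary statistics quoted above can be recomputed directly from Table 1. The following is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original dataset.

```python
from statistics import mean, median, mode

# Frequencies copied from Table 1 (UK size -> number of women).
size_counts = {6: 1, 8: 11, 10: 17, 12: 29, 14: 31, 16: 27,
               18: 17, 20: 13, 22: 9, 24: 3, 26: 2}

# Expand the frequency table into one observation per woman.
sizes = [size for size, count in size_counts.items() for _ in range(count)]
n = len(sizes)

sample_mode = mode(sizes)      # most frequent size
sample_median = median(sizes)  # middle of the ordered observations
sample_mean = mean(sizes)      # arithmetic average

# Share of the sample wearing size 16 or above.
share_16_plus = sum(c for s, c in size_counts.items() if s >= 16) / n

print(n, sample_mode, sample_median, round(sample_mean, 1), f"{share_16_plus:.1%}")
```

Expanding the table into individual observations is wasteful for large samples, but it keeps the sketch short and lets the standard functions be applied unchanged.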
Note, however, that the size distribution is likely to depend on age. What if the manufacturer is focusing on younger women’s fashion, while the survey has been done regardless of age? In that case, the statistician should use data from women randomly selected from an appropriate age cohort. Another interesting question to explore is if and how the distribution of size is changing over time. This may reflect fashion trends but also changes in anthropometric characteristics (for example, related to a growing prevalence of overweight).
2. Hardness of water in Texas
Source: Brase, C.H. and Brase, C.P. 2016. Understanding Basic Statistics. Cengage Learning. 7th ed., pp. 137–139.
In Texas, underground water is extremely important for growing crops. The hardness or softness of water is characterised using its ‘pH’. A pH less than 7.0 is acidic (‘soft’) and a pH above 7.0 is alkaline (‘hard’). If pH is too high, the water is unusable or needs expensive treatment to make it softer.
Note: If pH is too low, the water may irritate if used in a swimming pool or a shower, but this is not the case in Texas.
The dataset in the table that follows represents pH levels in groundwater from a random sample of 103 Texas wells. The table shows that the water in this region tends to be quite hard.
Table 2: Water pH data
pH | Frequency |
---|---|
7.0 | 8 |
7.1 | 10 |
7.2 | 10 |
7.3 | 11 |
7.4 | 9 |
7.5 | 8 |
7.6 | 9 |
7.7 | 6 |
7.8 | 5 |
7.9 | 5 |
8.0 | 1 |
8.1 | 7 |
8.2 | 7 |
8.3 | 0 |
8.4 | 1 |
8.5 | 1 |
8.6 | 1 |
8.7 | 1 |
8.8 | 2 |
8.9 | 1 |
The data is visualised by the following histogram:
The data is unimodal, with the mode at 7.3, whereas the median and mean of the sample are 7.5 and 7.6, respectively. Because the mean exceeds the median, which in turn exceeds the mode, the data is skewed to the right, with a tail of higher pH values. In particular, over 62% of wells have a pH higher than 7.3 but only 27% have a pH below 7.3, including just 8% with neutral water (pH = 7.0). This data can help the local authorities decide on their best policy to handle hard water in the region.
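As a check on these figures, the sketch below recomputes the mode, median, mean and the quoted shares from Table 2. It uses only Python's standard library, and the variable names are illustrative.

```python
from statistics import mean, median, mode

# Frequencies copied from Table 2 (pH -> number of wells).
ph_counts = {7.0: 8, 7.1: 10, 7.2: 10, 7.3: 11, 7.4: 9, 7.5: 8, 7.6: 9,
             7.7: 6, 7.8: 5, 7.9: 5, 8.0: 1, 8.1: 7, 8.2: 7, 8.3: 0,
             8.4: 1, 8.5: 1, 8.6: 1, 8.7: 1, 8.8: 2, 8.9: 1}

# One reading per well; a zero count contributes no readings.
readings = [ph for ph, count in ph_counts.items() for _ in range(count)]
n = len(readings)

sample_mode = mode(readings)
sample_median = median(readings)
sample_mean = mean(readings)

# Shares of wells above, below and exactly at selected pH values.
share_above_mode = sum(c for ph, c in ph_counts.items() if ph > sample_mode) / n
share_below_mode = sum(c for ph, c in ph_counts.items() if ph < sample_mode) / n
share_neutral = ph_counts[7.0] / n

print(sample_mode, sample_median, round(sample_mean, 1))
print(f"{share_above_mode:.1%} {share_below_mode:.1%} {share_neutral:.1%}")
```

On the tabulated data this gives a mode of 7.3, a median of 7.5 and a mean of about 7.6, so the three measures increase in the order expected for a right-skewed sample.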
Suppose that a certain crop can tolerate irrigation water with a pH not exceeding 7.95. What percentage of the wells in the region could be used for that crop? If we believe that the sample is representative of the wells in the region, the required percentage can be estimated as the number of wells with a pH of 7.9 or less (8+10+10+11+9+8+9+6+5+5=81) divided by the total number of wells, 103 (and multiplied by 100% to convert to percentage):
\(\frac{81}{103}\cdot 100\% = 78.6\%.\)
This can be taken as an estimate of the required percentage, but due to random variation of pH across the wells, the actual figure over the entire region may well be higher or lower. Statistical methods provide tools to confidently evaluate the possible error of this estimate. Specifically, as can be shown, we may be pretty sure that the required percentage across the entire region would lie within the bounds of 71.1% and 86.2%.
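The point estimate for the crop question can be reproduced with a few lines of Python. This is a minimal sketch: the helper `share_at_or_below` is a hypothetical name introduced here for illustration, and the 7.95 threshold is the one from the example above.

```python
# Frequencies copied from Table 2 (pH -> number of wells).
ph_counts = {7.0: 8, 7.1: 10, 7.2: 10, 7.3: 11, 7.4: 9, 7.5: 8, 7.6: 9,
             7.7: 6, 7.8: 5, 7.9: 5, 8.0: 1, 8.1: 7, 8.2: 7, 8.3: 0,
             8.4: 1, 8.5: 1, 8.6: 1, 8.7: 1, 8.8: 2, 8.9: 1}


def share_at_or_below(counts, threshold):
    """Fraction of sampled wells whose pH does not exceed the threshold."""
    total = sum(counts.values())
    usable = sum(count for ph, count in counts.items() if ph <= threshold)
    return usable / total


estimate = share_at_or_below(ph_counts, 7.95)
print(f"{estimate:.1%}")  # 78.6%
```

The sketch stops at the point estimate. The bounds quoted above depend on a method and confidence level that the text does not specify, so they are not recomputed here.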
3. Effect of nitrogen on the growth of plants
Source: Walpole, R.E., Myers, R.H., Myers, S.L., Ye, K. 2012. Probability & Statistics for Engineers & Scientists. Prentice Hall. 9th ed., pp. 4-5.
A study was conducted to assess the impact of nitrogen (N) on growth. Twenty seedlings of northern red oak were planted in a greenhouse, with the same type of soil and the same amount of sunshine and water. Half of the seedlings were treated with nitrogen and the other half were left untreated as a control. After 140 days, the stem weights (in grams) were recorded as shown in the following table.
Table 3: Weights of stems in two samples
Observation | With nitrogen | Without nitrogen |
---|---|---|
1 | 2.6 | 3.2 |
2 | 4.3 | 5.3 |
3 | 4.7 | 2.8 |
4 | 4.9 | 3.7 |
5 | 5.2 | 4.7 |
6 | 7.5 | 4.3 |
7 | 7.9 | 3.6 |
8 | 8.6 | 4.2 |
9 | 6.2 | 3.8 |
10 | 4.6 | 4.3 |
Mean | 5.65 | 3.99 |
Plotting dot plots and box plots from the data helps us compare the two samples:
Both types of plots strongly suggest that nitrogen enhances the growth of plant stems, and we can expect that statistical methods should be able to support this conclusion.
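One common way to quantify such a comparison is a two-sample t statistic, which divides the difference in sample means by its estimated standard error. The source does not say which method it has in mind, so the sketch below, using Welch's form of the statistic, is an assumption chosen for illustration; `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

# Stem weights in grams, copied from Table 3.
with_nitrogen = [2.6, 4.3, 4.7, 4.9, 5.2, 7.5, 7.9, 8.6, 6.2, 4.6]
without_nitrogen = [3.2, 5.3, 2.8, 3.7, 4.7, 4.3, 3.6, 4.2, 3.8, 4.3]


def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


diff = mean(with_nitrogen) - mean(without_nitrogen)
t_stat = welch_t(with_nitrogen, without_nitrogen)

print(round(diff, 2), round(t_stat, 2))
```

A larger absolute value of the statistic means the gap between the means is large relative to the sampling noise. Turning it into a formal conclusion requires comparing it with a t distribution, which this sketch leaves out.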
4. Discovery of argon (Ar)
Source: Spanos, A. 2010. The discovery of argon: a case for learning from data? Philosophy of Science. 77(3), pp.359–380. (refer to ‘See also’ section below)
In 1904, Lord Rayleigh (1842–1919) was awarded the Nobel Prize in Physics for the discovery of argon, an inert gas in the atmosphere. The conventional belief at the time was that atmospheric air was a mixture of oxygen (O2), nitrogen (N2), and small quantities of carbon dioxide (CO2) and water vapour (H2O). Rayleigh’s laboratory procedure was to pass air through a hot copper tube, yielding atmospheric nitrogen. He then compared the atomic weights from such a sample with those of chemically pure nitrogen.
The experimental results are shown in the next table.
Table 4: Atomic weights for atmospheric and chemical nitrogen
Observation | Atmospheric nitrogen | Chemical nitrogen |
---|---|---|
1 | 2.31017 | 2.30143 |
2 | 2.30986 | 2.29890 |
3 | 2.31010 | 2.29816 |
4 | 2.31001 | 2.30182 |
5 | 2.31024 | 2.29869 |
6 | 2.31010 | 2.29940 |
7 | 2.31028 | 2.29849 |
8 | 2.31163 | 2.29889 |
9 | 2.30956 | 2.30074 |
10 | 2.31026 | 2.30054 |
Mean | 2.31022 | 2.29971 |
Here, the sample means are pretty close to one another, so the difference between the two samples may not be evident.
Using visual tools provides us with more insights into the data as shown in the two dot plots that follow.
The plots indicate that the atomic weight of atmospheric nitrogen is higher. These results can be explained by hypothesising that air contains an unknown component that survives the entire processing and remains in the atmospheric samples. This explains its description as an ‘inert gas’ and the name ‘argon’ (from a Greek word for ‘inactive’).
However, because the numerical values are so close, there may still be doubts about whether the difference between the samples is significant enough.
Statistics is capable of resolving such doubts objectively.
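A first numerical look at the data in Table 4 already hints at how such doubts might be resolved. The sketch below checks whether the two samples overlap at all and expresses the gap between the means in units of the within-sample spread. It is an informal illustration with assumed variable names, not the formal test the text refers to.

```python
from statistics import mean, stdev

# Measurements copied from Table 4.
atmospheric = [2.31017, 2.30986, 2.31010, 2.31001, 2.31024,
               2.31010, 2.31028, 2.31163, 2.30956, 2.31026]
chemical = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869,
            2.29940, 2.29849, 2.29889, 2.30074, 2.30054]

# Difference between the two sample means.
gap = mean(atmospheric) - mean(chemical)

# True when every atmospheric value exceeds every chemical value.
separated = min(atmospheric) > max(chemical)

# Gap expressed in units of the larger within-sample standard deviation.
gap_in_sds = gap / max(stdev(atmospheric), stdev(chemical))

print(round(gap, 5), separated, round(gap_in_sds, 1))
```

Although the means differ only in the third decimal place, the measurements within each sample vary even less, so the two samples do not overlap. This is the pattern the dot plots make visible.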
Next steps
This step introduced you to four real-world situations where data can inform understanding and decision making. You saw how applying statistical methods can help you create research questions that lead to useful data. Next, you build on these ideas by considering how bias can unintentionally creep into data collection and skew the resulting data.