
Shape of data distribution

What are the most effective methods for visualising the distribution of a quantitative variable?

In this step we look at how to characterise some qualitative features of a dataset. We describe the shape of a data distribution, focusing on modality, symmetry and skewness, and introduce what are known as ‘normal’ datasets.

There are three sections:

  1. Data modality
  2. Data symmetry and skewness
  3. ‘Normal’ datasets

1. Data modality

One important question to ask when describing the data distribution is about its modality:

  • Does the distribution have a single ‘peak’?

If so, the distribution is called unimodal, and the location of the peak is the mode.

A distribution with two distinct peaks is called bimodal.

The next image illustrates unimodal and bimodal distributions.

In practice, a bimodal distribution can arise in opinion polls when the population is polarised on a controversial issue. An example is public opinion on the 2016 UK Brexit referendum, in which UK citizens were asked to vote either for or against leaving the European Union.

A bimodal distribution can also be due to the observations coming from two different groups. For instance, in the next image, the histogram of the height of students shows two peaks: one for females and one for males (data taken from Example in the previous step on numerical summaries).
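The two-group explanation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two bell-shaped groups, their centres (64 and 70 inches) and their common spread (2 inches) are hypothetical values, not the student data, and `mixture_density` and `count_peaks` are helper names made up for this sketch. The code evaluates the combined density on a grid and counts its local maxima.

```python
from statistics import NormalDist


def mixture_density(x, components):
    """Density at x of a mixture given as a list of (weight, NormalDist) pairs."""
    return sum(weight * dist.pdf(x) for weight, dist in components)


def count_peaks(components, lo=50.0, hi=85.0, step=0.1):
    """Count local maxima of the mixture density on an evenly spaced grid."""
    n = int(round((hi - lo) / step))
    ys = [mixture_density(lo + i * step, components) for i in range(n + 1)]
    # A grid point is a peak if it is higher than its left neighbour
    # and at least as high as its right neighbour.
    return sum(1 for i in range(1, n) if ys[i - 1] < ys[i] >= ys[i + 1])


# Hypothetical height distributions in inches (illustrative values only).
group_a = NormalDist(mu=64, sigma=2)
group_b = NormalDist(mu=70, sigma=2)

one_group = [(1.0, group_a)]
two_groups = [(0.5, group_a), (0.5, group_b)]

print(count_peaks(one_group))   # 1 (unimodal)
print(count_peaks(two_groups))  # 2 (bimodal)
```

If the two hypothetical centres are moved closer together relative to the spread, the two peaks merge into one, so pooling two groups does not always produce a visibly bimodal shape.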

2. Data symmetry and skewness

Another important question to ask is about symmetry:

  • Is the distribution symmetric or skewed?

A unimodal distribution is symmetric if the side of the distribution below its central value (e.g. the mode) has about the same shape as the side above.

The distribution is skewed if one side of the distribution stretches longer than the other side.

The next image shows two plots that illustrate the skewness of data.

Histograms contrasting data skewed to the right and data skewed to the left. In each case, one side of the distribution stretches longer than the other.

In picturing the distribution features, such as symmetry and skew, it is common to use smooth curves to summarise the shape of a histogram. A smooth approximation of the histogram may be expected to become more accurate when more data is collected and smaller bins are used accordingly.

A real-life example of a variable with a symmetric distribution is the IQ (‘Intelligence Quotient’) which is a total score derived from a set of standardised tests to assess human intelligence (see Intelligence quotient, Wikipedia (online) in the ‘See also’ section).

Evidence from many statistical experiments suggests that the distribution of IQ in the human population is symmetric, with a mean of 100 and a standard deviation of 15.

Real-life examples of skewed distributions are:

  • Life span: this variable is skewed to the left, as relatively few individuals die young, while the mean life span (called life expectancy) is quite high.
  • Income: this variable is skewed to the right, as relatively few people are rich, with a stretched range of their income, while the median income is fairly low.

Comparing the sample mean and the sample median leads to the following conclusions:

  • In symmetric distributions, the mean and the median are approximately equal.
  • In skewed distributions, the mean is shifted away from the median in the direction of skewness.

This is because the longer tail of a skewed distribution contributes substantially to the sample mean, making it larger under right skewness and smaller under left skewness. The more highly skewed the distribution is, the more the mean and median tend to differ.
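The relationship between the mean and the median can be checked on small samples. The sketch below uses Python's standard `statistics` module; both datasets are hypothetical values invented for illustration, and `skew_direction` is a made-up helper that applies the mean-versus-median comparison as a rough heuristic rather than a formal measure of skewness.

```python
from statistics import mean, median


def skew_direction(data):
    """Rough heuristic: report the side towards which the mean is pulled."""
    m, med = mean(data), median(data)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "none"


# Hypothetical annual incomes (thousands): a few large values stretch the right tail.
incomes = [18, 20, 22, 24, 25, 27, 30, 35, 60, 150]

# Hypothetical life spans (years): a few small values stretch the left tail.
life_spans = [30, 55, 68, 72, 75, 78, 80, 82, 85, 88]

print(mean(incomes), median(incomes), skew_direction(incomes))
# 41.1 26.0 right
print(mean(life_spans), median(life_spans), skew_direction(life_spans))
# 71.3 76.5 left
```

In the income sample the single value of 150 raises the mean well above the median, while the median itself is unaffected by how extreme that value is.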

3. ‘Normal’ datasets

Datasets observed in practice often have histograms that are similar in shape. They reach their peak at or near the sample median and then decrease on both sides in a bell-shaped symmetric fashion. Such datasets are said to be (approximately) normal.

The next image illustrates a normal dataset, including a bell-shaped curve that approximates the histogram.

A histogram visualising a normal data set.

Suppose that we have an approximately normal dataset, and let \(\bar{x}\) and \(s_x\) be its sample mean and sample standard deviation. The following Empirical Rule specifies the approximate proportions of the data values that lie within a certain distance of the sample mean \(\bar{x}\).

Empirical rule

If a dataset is normal (‘bell-shaped’), then approximately:

  • 68% of the data lie within \(s_x\) of \(\bar{x}\),
  • 95% of the data lie within \(2 s_x\) of \(\bar{x}\),
  • all or nearly all observations lie within \(3 s_x\) of \(\bar{x}\).

Warning: the Empirical Rule applies only to symmetric, bell-shaped datasets and works poorly for skewed data.
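To see how such proportions are obtained for a given dataset, and why the warning matters, here is a minimal Python sketch. The helper name `proportion_within` and the right-skewed sample are hypothetical, chosen only for illustration; the code counts the share of values lying within \(k\) sample standard deviations of the sample mean.

```python
from statistics import mean, stdev


def proportion_within(data, k):
    """Share of values within k sample standard deviations of the sample mean."""
    centre = mean(data)
    spread = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    inside = [x for x in data if abs(x - centre) <= k * spread]
    return len(inside) / len(data)


# Hypothetical right-skewed sample (illustrative values only).
skewed = [18, 20, 22, 24, 25, 27, 30, 35, 60, 150]

for k in (1, 2, 3):
    print(k, proportion_within(skewed, k))
```

For this skewed sample, 90% of the values fall within one standard deviation of the mean, far from the 68% suggested by the Empirical Rule, because the single extreme value inflates the standard deviation.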

To illustrate the Empirical Rule, we can use the dataset of female student heights from the earlier reading, Numerical Summaries of Data. The data come from Agresti, A., Franklin, C., Klingenberg, B. 2023. Statistics: The Art and Science of Learning from Data, Pearson. pp. 95, 105.

Similarly to the previous step, it is not necessary to share the complete dataset. This is a short version of the data, enough to illustrate the Empirical Rule:

Variable          Minimum   First Quartile   Median   Mean    Third Quartile   Maximum
Height (inches)   56.00     64.00            65.00    65.28   67.00            77.00

The standard deviation is calculated to be 2.95.

The next image is a histogram of the data. The histogram is approximately bell-shaped. The mean and median are close, both about 65, which is consistent with the distribution being approximately symmetric.

Histogram visualising the distribution of female student height.

Let us now compare the Empirical Rule prediction with the actual percentages of the height data, knowing that \(\bar{x} = 65.28\) and \(s_x = 2.95\):

Table 1: Percentage of the height data within \(\bar{x} \pm k s_x\) intervals: actual and predicted by the Empirical Rule.

\(k\)   Interval       Observations   Percentage   Predicted %
1       (62.3, 68.2)   187            72%          68%
2       (59.4, 71.2)   248            95%          95%
3       (56.4, 74.1)   258            99%          100%
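The interval endpoints in Table 1 follow directly from the sample mean and standard deviation quoted above. A minimal Python sketch of the arithmetic (the function name `empirical_intervals` is made up for this illustration):

```python
def empirical_intervals(x_bar, s_x, ks=(1, 2, 3)):
    """Return {k: (lower, upper)} for the intervals x_bar ± k * s_x."""
    return {k: (x_bar - k * s_x, x_bar + k * s_x) for k in ks}


# Sample mean and standard deviation of the height data, as given in the text.
intervals = empirical_intervals(65.28, 2.95)

for k, (lower, upper) in intervals.items():
    print(f"k={k}: ({lower:.1f}, {upper:.1f})")
# k=1: (62.3, 68.2)
# k=2: (59.4, 71.2)
# k=3: (56.4, 74.1)
```

The observation counts in the table cannot be reproduced this way, since they require the complete dataset, which is not listed here.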

We see that the percentages predicted by the Empirical Rule are close to the actual ones. An explanation of the Empirical Rule is based on the so-called normal distribution which approximates bell-shaped data.

The next image shows a normal approximation as a smooth bell-shaped curve.

Histogram comparing female student height with the normal approximation shown by a red curved line.

The red curve here is plotted as the graph of a function \(y = f(x)\) defined by the formula

\(f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad x \in \mathbb{R},\)

where the location parameter \(\mu\) and the spread parameter \(\sigma > 0\) are replaced with the sample mean \(\bar{x}\) and the sample standard deviation \(s_x\), respectively.
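As a rough check of how this curve relates to the Empirical Rule, the sketch below codes the density formula directly in Python and compares it with the standard library's `statistics.NormalDist`. The function name `normal_density` is made up for this illustration; the values 65.28 and 2.95 are the sample mean and standard deviation quoted in the text. The area under the fitted curve within \(k\) standard deviations of the mean is then computed from the cumulative distribution function.

```python
import math
from statistics import NormalDist


def normal_density(x, mu, sigma):
    """Normal density f(x) with location mu and spread sigma > 0."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


x_bar, s_x = 65.28, 2.95  # sample mean and standard deviation from the text
fitted = NormalDist(mu=x_bar, sigma=s_x)

# The hand-written formula agrees with the library implementation.
assert math.isclose(normal_density(66.0, x_bar, s_x), fitted.pdf(66.0))

# Area under the fitted curve within k standard deviations of the mean.
shares = {k: fitted.cdf(x_bar + k * s_x) - fitted.cdf(x_bar - k * s_x) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} s_x: {share:.1%}")
# within 1 s_x: 68.3%
# within 2 s_x: 95.4%
# within 3 s_x: 99.7%
```

These areas do not depend on the particular values of the mean and standard deviation, which is why the same 68%, 95% and nearly 100% figures apply to any approximately normal dataset.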

The normal distribution is often referred to as the Gaussian distribution in honour of Carl Friedrich Gauss (1777–1855), a great German mathematician who discovered the normal distribution in connection with measurement errors in physics and astronomy.

This discovery was commemorated in a former German banknote of 10 Deutsche Mark, shown in the following image.

Next steps

You have now completed the step that reviewed some basic tools in descriptive statistics to produce graphical and numerical summaries of datasets. Next, you start to learn about RStudio and how to use it for data summaries, first by watching a video demonstration of basic RStudio principles, then by reading a detailed description of how to use basic R commands to produce data summaries in RStudio. From this point in the course, you will be getting hands-on experience with RStudio.

Before moving on you may wish to engage with your peers in the following Share area. 

Consider the question:

Can you think of another real-life variable that would likely have a: 

  1. Bimodal distribution?
  2. Left-skewed distribution?
This article is from the free online course Statistical Methods.

Created by
FutureLearn - Learning For Life
