Numerical summaries of data
By selecting the most suitable summary for data, you can identify its key features. In this step, we continue to review some basic concepts and tools of descriptive statistics, now focusing on numerical summaries of data.
1. Sample mode
A useful characteristic of a categorical dataset is its mode. The mode of a dataset is the observed value with the highest frequency.
Applied to categorical and discrete types of data, the mode indicates the most likely value in the observed sample, which in turn hints at the most typical value(s) in the general population. For continuous data, the sample mode is less useful because each value in the sample may have been observed only once. In that case, it is helpful to look at grouped data, e.g. by using a histogram, for which the mode makes more sense.
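As a rough illustration of the definition, the sketch below counts frequencies and returns every value that attains the highest count. The function name `sample_modes` and both small samples are hypothetical and are not taken from the course datasets.

```python
from collections import Counter


def sample_modes(values):
    """Return every value that attains the highest frequency, in first-seen order."""
    counts = Counter(values)
    if not counts:
        raise ValueError("the sample is empty")
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Hypothetical categorical sample (not the course data).
colours = ["red", "blue", "red", "green", "blue", "red"]
print(sample_modes(colours))  # ['red']

# Hypothetical continuous-looking sample: each value occurs once,
# so every value ties for the mode and the result is uninformative.
readings = [1.2, 3.4, 2.2, 5.1]
print(sample_modes(readings))  # [1.2, 3.4, 2.2, 5.1]
```

The second call shows why the mode is of little use for ungrouped continuous data: with no repeated values, all observations are returned.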
The following image shows a bar plot of the shark attack frequency data from Example 1 in 2.3 Graphical summaries of data. The longest bar indicates that the mode is the state of Florida (with 52.5% of all shark attacks).
Similarly, if we visualise the salary dataset from Example 2 in the previous step, the mode is 62 (i.e. $62,000) as shown by the longest bar in the next image.
2. Centrality and spread
For large quantitative datasets, it is useful to summarise the data by using some measures of centrality and spread. To explain these concepts, first, consider the two histograms shown in the following image that describe two different datasets. Intuitively, the ‘centre’ of the first histogram is around 0, while the second histogram is ‘shifted’ to the right, with a new centre around 2.
Next, consider the two histograms in the following image, which look different from the previous examples. Both histograms are centred around 0, but they differ in their spread: all values in the first dataset are within the interval from –3 to 3, while the values in the second dataset are more spread out, from –6 to 6. Centrality and spread can each be characterised numerically using a suitable statistic, a term for any computational ‘rule’, that is, any function of the data.
3. Sample median
One intuitively appealing measure of centrality is the median of the sample. The median is the middle value of the observations when they are ordered, say from smallest to largest.
In other words, the median is a value such that about half of the observations are smaller and the other half are bigger. This concept applies mostly to quantitative data, although it also makes sense for ordinal categorical data. Because of possible ties, the median may not separate the data values into exactly equal halves. Another complication arises if the number of values is even (e.g. 42); here, the median is usually taken as the average of the two values in the middle of the sample.
Practical rule to find the sample median
Given a dataset of size \(n\), order the observed values from smallest to largest.
If \(n\) is odd, the sample median is the value in position \(\frac{1}{2}(n+1)\).
If \(n\) is even, the sample median is the average of the values in positions \(\frac{1}{2}n\) and \(\frac{1}{2}n+1\).
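The practical rule above can be sketched in a few lines of Python. The function name `sample_median` and the example values are hypothetical. The only subtlety is that the rule counts positions from 1, whereas Python lists are indexed from 0.

```python
def sample_median(values):
    """Sample median following the practical rule (positions counted from 1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample is empty")
    if n % 2 == 1:
        # Position (n + 1) / 2 in 1-based counting is index (n - 1) // 2.
        return ordered[(n - 1) // 2]
    # Positions n / 2 and n / 2 + 1 are indices n // 2 - 1 and n // 2.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(sample_median([7, 1, 5]))     # 5   (odd n: the middle value)
print(sample_median([7, 1, 5, 3]))  # 4.0 (even n: average of 3 and 5)
```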
4. Sample mean, variance and standard deviation
The sample mean and variance are the usual measures of centrality and spread, respectively.
Suppose we have a dataset consisting of \(n\) numerical values \(x_1, x_2, \dots, x_n\).
The sample mean, denoted \(\bar{x}\), is the arithmetic average of the data values:
\[ \bar{x}=\frac{x_1+\dots+x_n}{n}=\frac{1}{n}\sum_{i=1}^n x_i \]
The sample variance, denoted \(s_x^2\) (or just \(s^2\)), is essentially the average of the squared deviations from the sample mean \(\bar{x}\), with divisor \(n-1\) rather than \(n\):
\[ s_x^2=\frac{(x_1-\bar{x})^2+\dots+(x_n-\bar{x})^2}{n-1}=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2 \]
The standard deviation \(s_x\) (or just \(s\)) is the square root of the sample variance:
\[ s_x=\sqrt{s_x^2}=\frac{1}{\sqrt{n-1}}\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2} \]
For instance, using these definitions for the salary dataset (shared as Example 2 in the previous reading), we can calculate the sample mean 62.4 and the standard deviation 4.247875.
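The three definitions translate directly into code. The sketch below is a minimal illustration; the function names and the small dataset are hypothetical (the salary dataset itself is not reproduced here). Note the divisor \(n-1\) in the variance.

```python
import math


def sample_mean(xs):
    """Arithmetic average of the data values."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = sample_mean(xs)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical dataset (not the salary data from the course).
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_mean(data))      # 5.0
print(sample_variance(data))  # 32 / 7, since the squared deviations sum to 32
print(sample_sd(data))
```

Python's standard `statistics` module offers `mean`, `variance` and `stdev`, which use the same \(n-1\) divisor and could be used instead of these hand-written helpers.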
The idea of standardisation: for a dataset \(x_1,\dots,x_n\), its standardised values (also called z-scores) are defined as
\[ z_i=\frac{x_i-\bar{x}}{s_x} \]
It is easy to check that the new sample mean and sample variance are ‘standardised’ to 0 and 1, respectively:
\[ \bar{z}=0,\quad s^2_z=1 \]
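For readers who want to see the check written out, here is a short derivation using only the definitions of \(\bar{x}\), \(s_x^2\) and \(z_i\), and assuming \(s_x>0\) (i.e. the data values are not all equal). For the mean:
\[ \bar{z}=\frac{1}{n}\sum_{i=1}^n \frac{x_i-\bar{x}}{s_x}=\frac{1}{n\,s_x}\left(\sum_{i=1}^n x_i-n\bar{x}\right)=\frac{1}{n\,s_x}\,(n\bar{x}-n\bar{x})=0 \]
For the variance, using \(\bar{z}=0\):
\[ s_z^2=\frac{1}{n-1}\sum_{i=1}^n (z_i-\bar{z})^2=\frac{1}{n-1}\sum_{i=1}^n \frac{(x_i-\bar{x})^2}{s_x^2}=\frac{s_x^2}{s_x^2}=1 \]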
Note that the standardised values are dimensionless, i.e. they do not depend on the units used for measurement.
For instance, the standardised values for the salary sample are the same whether the salary is measured in dollars or in thousands of dollars.
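The unit-independence of z-scores can be checked numerically. The sketch below uses a hypothetical helper `standardise` and made-up salary figures (not the course dataset), expressed once in thousands of dollars and once in dollars.

```python
import math


def standardise(xs):
    """Return the z-scores (x_i - mean) / s, with s computed using divisor n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mean) / sd for x in xs]


# Hypothetical salaries in thousands of dollars (not the course dataset).
salaries_k = [58, 60, 62, 62, 65, 70]
salaries_usd = [1000 * s for s in salaries_k]

z_k = standardise(salaries_k)
z_usd = standardise(salaries_usd)

# The two sets of z-scores agree up to floating-point rounding.
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(z_k, z_usd))
print(same)

# The standardised sample has mean close to 0 and variance close to 1.
z_mean = sum(z_k) / len(z_k)
z_var = sum((z - z_mean) ** 2 for z in z_k) / (len(z_k) - 1)
print(z_mean, z_var)
```

Because of floating-point arithmetic, the printed mean and variance are only approximately 0 and 1.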
5. Sample percentiles
Sample percentiles generalise the notion of the median.
For any \(0\le p\le 1\), the sample \(100p\)-percentile of a dataset is a value such that at least \(100p\%\) of the data are less than or equal to it and at least \(100(1-p)\%\) are greater than or equal to it.
If two data values satisfy this condition, then the sample \(100p\)-percentile is the average of these values.
The median is the 50-percentile (\(p=\frac12\)).
The 25-percentile (\(p=\frac14\)) is referred to as the first (or lower) quartile, while the 75-percentile (\(p=\frac34\)) is the third (or upper) quartile. The median is, therefore, the second quartile.
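The definition can be turned into a direct, if inefficient, computation: collect the data values that satisfy both conditions and average the smallest and largest of them. The sketch below is one such reading, not a definitive implementation. It assumes the two conditions are read as "at least", takes the percentile as a whole number `percent` (that is, \(100p\)) so that the comparisons use exact integer arithmetic, and uses a hypothetical function name and dataset.

```python
def sample_percentile(values, percent):
    """Sample percentile by the defining conditions; percent is a whole number 0..100."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample is empty")
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    # Keep the data values v for which at least percent% of the data are <= v
    # and at least (100 - percent)% of the data are >= v.
    qualifying = [
        v for v in ordered
        if 100 * sum(x <= v for x in ordered) >= percent * n
        and 100 * sum(x >= v for x in ordered) >= (100 - percent) * n
    ]
    # One qualifying value gives that value; two distinct ones are averaged.
    return (qualifying[0] + qualifying[-1]) / 2


# Hypothetical dataset of size 8 (not from the course).
data = [3, 1, 4, 1, 5, 9, 2, 6]
print(sample_percentile(data, 25))  # 1.5 (first quartile: average of 1 and 2)
print(sample_percentile(data, 50))  # 3.5 (median: average of 3 and 4)
print(sample_percentile(data, 75))  # 5.5 (third quartile: average of 5 and 6)
```

Statistical software often uses interpolation conventions that differ from this definition, so quartiles reported by other tools may differ slightly, especially for small samples.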
6. Box plots
Box plots (also called box-and-whisker plots) are often used to visualise some of the summary statistics of datasets.
A box plot is based on a five-number summary: minimum, first (lower) quartile, median, third (upper) quartile and maximum of the data. Note that the minimum and maximum are determined after excluding potential outliers, that is, unusually large or small values (see more detail below).
- It is drawn by starting with a straight-line segment from the smallest to the largest data value.
- A rectangular box is then imposed on the line, stretching from the first to the third quartile, with the median indicated by a cross line.
- Two segments of a dashed line (called whiskers) extend outside the box to the minimum and maximum values, respectively, excluding potential outliers.
- The length of the line segment on the box plot, equal to the largest minus the smallest data value, is called the range of the data.
- The length of the box itself, equal to the upper quartile minus the lower quartile, is called the inter-quartile range (IQR).
- By convention, a data value is classified as an outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile. The intuitive motivation for this rule is that we do not expect the ‘usual’ data values to occur too far from the ‘bulk’ of the data. The outliers are marked on the box plot by small circles.
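The quantities behind a box plot can be computed without drawing anything. The sketch below is a minimal illustration with a hypothetical function name and dataset. It relies on `statistics.quantiles` from the Python standard library with `method="inclusive"`; this is an assumption of convenience, and its interpolation convention may give quartiles that differ slightly from the percentile definition used in this step.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, IQR, whisker ends and outliers under the 1.5 x IQR convention."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values inside the fences determine the whisker ends; the rest are outliers.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "whisker_min": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_max": inside[-1],
        "iqr": iqr,
        "outliers": outliers,
    }


# Hypothetical dataset with one unusually large value (not from the course).
data = [52, 55, 57, 58, 60, 61, 62, 64, 90]
summary = box_plot_summary(data)
print(summary["q1"], summary["median"], summary["q3"])  # 57.0 60.0 62.0
print(summary["whisker_min"], summary["whisker_max"])   # 52 64
print(summary["outliers"])                              # [90]
```

Here the IQR is 5, so any value below 49.5 or above 69.5 is flagged. The value 90 is therefore marked as an outlier, and the upper whisker stops at 64.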
The next image is the box plot for the salary data from Example 2 in the previous step. Note that there are no outliers.
Box plots are especially useful when comparing several samples.
Example: Student heights
Source: Agresti, A., Franklin, C., Klingenberg, B. 2023. Statistics: The Art and Science of Learning from Data, Pearson. pp. 95, 105 (refer to ‘See also’ section below or download the file from the ‘download’ section).
This dataset comprises the heights (in inches) of 262 female and 117 male students at the University of Georgia. Inspecting the dataset reveals one extremely large observation: 92 inches (234 cm) for one female student. This is an outlier (i.e. an atypical observation), which we exclude from the analysis.
After removing the outlier, we can construct side-by-side box plots, as shown in the next image.
The plots clearly show that males are generally taller than females. The median (central line in the box) is 71 inches for males and 65 inches for females, with similar widths of the boxes (i.e. IQRs). Both samples include unusually low or high heights, marked by small circles as potential outliers (with just one outlier for males and quite a few for females).
Next steps
You have now discovered common graphical and numerical summaries that will be useful in this course. In the next step, you will learn the key features that characterise the shape of data distributions.