Types of data

What is the difference between a statistical variable of interest and observations of this variable recorded in the dataset?

In this step, you discover an important difference between a statistical variable of interest and observations of this variable recorded in the dataset. Then, you consider a useful classification of different types of data. 

1. Variables and observations

Statistical methods provide ways to visualise, measure and understand variability in the data.

Characteristics observed in the study are subject to variation across the sample, as well as in the general population.

For instance, there is variability in individuals’ weight and height, work and income, number of marriages and children, religious affiliation and political preferences, and so on.

Variables

  • The characteristics featured in the study are called variables.

Observations

  • The values observed for a variable are referred to as observations.

For instance, the sex attribution of an individual is a variable that can be ‘male’ (M) or ‘female’ (F), and it varies from person to person.

Another example is the level of rain (e.g. in millimetres [mm]) measured daily at a weather station. Possible values of this variable are non-negative numbers, which vary from day to day.
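As a minimal sketch of this distinction, each variable can be stored under a name, with its observations as the list of recorded values. The names `sex` and `daily_rain_mm` and all values below are invented for illustration.

```python
# Hypothetical recorded observations (all values invented for illustration).
sex = ["F", "M", "F", "M", "F"]                 # one observation per individual
daily_rain_mm = [1.2, 2.4, 11.3, 0.4, 5.1]      # one observation per day

# The variable is the characteristic; the observations are its recorded values.
for name, observations in [("sex", sex), ("daily_rain_mm", daily_rain_mm)]:
    distinct = sorted(set(observations))
    print(f"{name}: {len(observations)} observations, distinct values: {distinct}")
```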

2. Types of variables

2.1. Quantitative vs categorical

Variables can be quantitative (numerical) or categorical (in categories).

Quantitative

  • A variable is called quantitative if its observations take numerical values corresponding to possible different magnitudes.

Categorical

  • A variable is called categorical if each observation represents one of a set of categories (labels).

In the above examples, ‘sex’ is categorical whereas ‘level of rain’ is quantitative.

A simple way to distinguish between the two types is to ask whether arithmetic operations on the observations, e.g. taking their average, would make sense.

For instance, telephone area codes are categorical, not quantitative, even though their values are expressed as numbers: an ‘average area code’ doesn’t make sense.
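The arithmetic test can be tried out in a short sketch. The observations below are invented for illustration. The point is that software will average any numbers it is given, so it is up to the analyst to decide whether the result means anything.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations, invented for illustration.
area_codes = [212, 415, 212, 305, 415, 212]   # categorical labels written as digits
rain_mm = [1.2, 2.4, 11.3, 0.4, 5.1]          # quantitative measurements

# Python will average the codes without complaint, but the result is not
# an area code and has no interpretation.
print(mean(area_codes))

# Counting how often each label occurs is a meaningful summary for categories.
code_counts = Counter(area_codes)
print(code_counts.most_common(1))   # most frequent label with its count

# For a quantitative variable, the average is a sensible summary.
print(mean(rain_mm))
```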

2.2. Discrete vs continuous

Quantitative variables may be discrete or continuous.

Discrete

  • A quantitative variable is discrete if it takes values in a set of separate numbers, e.g. 0, 1, 2, …

Continuous

  • A quantitative variable is continuous if its possible values form an interval, e.g. [0,∞) or [0,1].

Examples of discrete variables are: the number of children in a family, the number of customers in a queue at the cash desk, or the number of earthquakes in a certain region over one year.

Examples of continuous variables are: a person’s weight, height, age or time to find a job after graduation.

Sometimes the data is reported in discrete values, e.g. age at death in full years, but this may in fact represent a continuous variable that appears discrete only because of rounding.
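The effect of rounding can be illustrated with a small sketch. The exact ages below are invented, and "full years" is taken to mean dropping the fractional part.

```python
import math

# Hypothetical exact ages at death in years (invented values).
exact_ages = [71.92, 64.08, 88.5, 71.31, 90.75]

# Reporting age in full years lived drops the fractional part.
full_years = [math.floor(age) for age in exact_ages]

print(full_years)
# Distinct continuous values can collapse onto the same reported integer.
print(len(set(exact_ages)), "distinct exact ages,",
      len(set(full_years)), "distinct reported ages")
```

The underlying variable is still continuous; only the reported values are discrete.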

2.3. Nominal vs ordinal

Categorical variables may be nominal or ordinal.

Nominal

  • A categorical variable is nominal if the categories are not ordered in any way.

Ordinal

  • A categorical variable is ordinal if the categories can be ordered, e.g. in order of preference.

Examples of nominal variables are sex (‘male’ or ‘female’), ethnic group (‘White’, ‘Black’, ‘Asian’, ‘Mixed’) or blood type (‘A’, ‘B’, ‘AB’ or ‘O’).

Examples of ordinal variables are socio-economic status (‘low income’, ‘middle income’, ‘high income’), education level (‘high school’, ‘BSc’, ‘MSc’, ‘PhD’) or satisfaction rating (‘very negative’, ‘negative’, ‘neutral’, ‘positive’, ‘very positive’).

Ordinal variables are often treated quantitatively by assigning numerical scores to the categories.

For instance, levels of satisfaction can be assigned increasing scores as follows:

Table 1: Numerical scores for satisfaction levels

| Very Negative | Negative | Neutral | Positive | Very Positive |
|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 |

Then the average rating score in a sample of customers would make sense (e.g. 4.3).
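A minimal sketch of this scoring, using the scores from Table 1, is given below. The customer responses are invented for illustration, so the resulting average is not a figure from the text.

```python
from statistics import mean

# Scores from Table 1.
scores = {
    "very negative": 1,
    "negative": 2,
    "neutral": 3,
    "positive": 4,
    "very positive": 5,
}

# Hypothetical customer responses, invented for illustration.
responses = [
    "positive", "very positive", "neutral", "positive", "very positive",
    "positive", "negative", "very positive", "positive", "very positive",
]

# Replace each category by its score, then average the scores.
numeric = [scores[r] for r in responses]
average_rating = mean(numeric)
print(average_rating)
```

Note that scoring the categories 1 to 5 treats the steps between adjacent categories as equal, which is a modelling choice rather than a property of the data.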

3. Frequency tables

Categorical data with a relatively small number of categories can be conveniently presented in a frequency table.

Example 1 (shark attacks) 

Source: Agresti, A., Franklin, C., Klingenberg, B. 2023. Statistics: The Art and Science of Learning from Data, Pearson. p. 65.

A total of 387 shark attacks were reported in the USA between 2004 and 2013. The next table shows the breakdown by state, together with percentages.

Table 2: Shark attacks in the USA in 2004-2013

| US State | Frequency | Percentage |
|---|---|---|
| California | 33 | 8.5% |
| Florida | 203 | 52.5% |
| Hawaii | 51 | 13.2% |
| North Carolina | 23 | 5.9% |
| South Carolina | 34 | 8.8% |
| Texas | 16 | 4.1% |
| Other | 27 | 7.0% |
| Total | 387 | 100% |
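The percentage column can be reproduced from the frequencies as 100 × frequency / total. The sketch below uses the counts from Table 2; the variable names are chosen here for illustration.

```python
# Frequencies from Table 2.
attacks = {
    "California": 33,
    "Florida": 203,
    "Hawaii": 51,
    "North Carolina": 23,
    "South Carolina": 34,
    "Texas": 16,
    "Other": 27,
}

total = sum(attacks.values())

# Percentage = 100 * frequency / total, rounded to one decimal place.
percentages = {state: round(100 * n / total, 1) for state, n in attacks.items()}

for state, n in attacks.items():
    print(f"{state:<15} {n:>4} {percentages[state]:>5.1f}%")
print(f"{'Total':<15} {total:>4}")
```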

4. Grouped data

For some datasets, the number of distinct values is too large for a frequency table to be useful, and each distinct value is likely to occur only once. This is especially true for continuous variables, such as height, temperature or time.

In this situation, the data may be succinctly summarised by aggregating the observed values into suitable intervals, or ‘bins’.

The number of bins is a trade-off: too few bins lose too much information about the actual data values, whereas too many bins leave the frequency in each bin too small for a pattern to be discernible. It is common, although not essential, to choose bins of equal length.

Reporting only frequencies within the chosen bins leads to grouped data.
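The aggregation into equal-length bins can be sketched as follows. The helper `group_into_bins`, the half-open convention [lower, upper) for bin limits and the height measurements are all assumptions made for this illustration.

```python
from collections import Counter

def group_into_bins(values, start, width):
    """Count how many values fall into each bin [lower, lower + width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)   # position of the bin containing v
        lower = start + index * width
        counts[(lower, lower + width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical height measurements in cm, invented for illustration.
heights_cm = [158.2, 171.5, 164.9, 180.1, 169.3, 175.0, 162.4, 177.8, 166.6, 170.0]

grouped = group_into_bins(heights_cm, start=150, width=10)
for (lower, upper), count in grouped.items():
    print(f"[{lower}, {upper}): {count}")
```

In this sketch, ten distinct values are reduced to four bin frequencies. Bins containing no observations are simply not listed.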

Grouped data may also occur due to limited precision of the measurement instrument, or for convenience of reporting.

For instance, demographic life tables report frequencies of human life span in integer years: either the age at death in full years lived, or the age rounded to the nearest birthday.

Example 2 (incandescent lamps)

Source: Ross, S. 2021. Introduction to Probability and Statistics for Engineers and Scientists, 6th ed., Elsevier/Academic Press. p. 17, Table 2.4.

The observed lifetimes (in hours) of 200 incandescent lamps, grouped into intervals of length 100, are shown below:

Table 3: Lifespans (in hours) of incandescent lamps

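| Bin (hours) | Frequency | Percentage |
|---|---|---|
| 500–600 | 2 | 1.0% |
| 600–700 | 5 | 2.5% |
| 700–800 | 12 | 6.0% |
| 800–900 | 25 | 12.5% |
| 900–1000 | 58 | 29.0% |
| 1000–1100 | 41 | 20.5% |
| 1100–1200 | 43 | 21.5% |
| 1200–1300 | 7 | 3.5% |
| 1300–1400 | 6 | 3.0% |
| 1400–1500 | 1 | 0.5% |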

This table conveniently summarises the spread of the data, clearly showing a concentration of typical values around the bin 900–1000 and less frequent values towards the tails.
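As a rough check on this reading, the sketch below recomputes the percentages from the frequencies in Table 3 and locates the most populated bin. The variable names are chosen here for illustration.

```python
# Frequencies from Table 3, keyed by the (lower, upper) bin limits in hours.
lamp_counts = {
    (500, 600): 2,
    (600, 700): 5,
    (700, 800): 12,
    (800, 900): 25,
    (900, 1000): 58,
    (1000, 1100): 41,
    (1100, 1200): 43,
    (1200, 1300): 7,
    (1300, 1400): 6,
    (1400, 1500): 1,
}

total = sum(lamp_counts.values())
percentages = {b: 100 * n / total for b, n in lamp_counts.items()}

# The bin with the highest frequency.
modal_bin = max(lamp_counts, key=lamp_counts.get)

# Share of lamps in the three central bins, 900 to 1200 hours.
central_share = sum(
    pct for (lower, upper), pct in percentages.items()
    if lower >= 900 and upper <= 1200
)

for (lower, upper), n in lamp_counts.items():
    print(f"{lower}–{upper}: {n:>3} {percentages[(lower, upper)]:>5.1f}%")
print("Most populated bin:", modal_bin)
print(f"Share of lamps lasting 900 to 1200 hours: {central_share:.1f}%")
```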

Next steps

Now that you have learned about the different types of data, go to the next steps to review basic tools to summarise data, both graphically and numerically. These steps focus on statistical methods; you will review the corresponding R commands to produce data summaries throughout the rest of the course.

This article is from the free online

Statistical Methods

Created by
FutureLearn - Learning For Life
