
# Terminology we will be using in this course

Find out the meaning behind the terms you will encounter in the course and get access to a handy glossary of terms.

Before we dive into the substance of the course, it is important that we agree on certain basic terminology.

## Data taxonomy

We broadly classify data according to its complexity, that is, how much information a single data point contains. As with most classifications, this is not a mathematically precise scale, and we will always find grey areas and exceptions. It is nonetheless very useful to distinguish between three levels of complexity:

• Simple data consists of single numbers or values, like the height of a person or their hair colour; simple data is also known as primitive data.
• Composite data is an aggregation of a small amount of simple data. Good examples are the measurements of a furniture item (width x height x depth) and GPS coordinates (longitude/latitude); both are composites of simple data types. Another common example of composite data is a bank account record, which is composed of an account number, the customer’s name and the current balance.
• Complex data is everything that does not reasonably fit into the previous two categories. Good examples are image, audio, or video files.
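The three levels above can be sketched in Python. This is only an illustration: the names `AccountRecord`, `gps` and `image_bytes`, and all the values, are made up for the example and do not come from any real dataset.

```python
from dataclasses import dataclass

# Simple (primitive) data: single values.
height_cm = 172.5
hair_colour = "brown"

# Composite data: a small aggregation of simple values.
@dataclass
class AccountRecord:  # hypothetical record layout
    account_number: str
    customer_name: str
    balance: float

record = AccountRecord("12345678", "A. Customer", 250.0)
gps = (-1.5, 52.4)  # (longitude, latitude), illustrative values only

# Complex data: here a tiny 2x2 greyscale "image" stored as raw bytes.
# Real image, audio or video files are far larger and have internal structure.
image_bytes = bytes([0, 255, 255, 0])
```

Note how the composite record is built entirely from simple values, whereas the bytes of the image only become meaningful once we know how to interpret them together.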

Note that this classification is also relative to the dataset we are looking at. A good example of this is textual data, which could fall into any of these categories depending on the context. For example, the word ‘red’ can be treated as simple data when it appears in a dataset among other colour names:

red, orange, yellow, green, blue, indigo, violet

But we should probably treat it as ‘composite’ (here: ‘consisting of individual characters’) if it appears among:

…, record, recording, recover, recovery, red, reduce, reduction, refer, reference, reflect, …
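The two views of the word ‘red’ can be sketched as follows. The helper `shared_prefix` is a hypothetical function written for this illustration; it is one possible way of looking "inside" a word, not a standard method.

```python
colours = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]

# Simple view: each word is an opaque label, so we only compare whole values.
is_red = [colour == "red" for colour in colours]

# Composite view: each word is a sequence of characters we can inspect.
def shared_prefix(a: str, b: str) -> str:
    """Return the leading characters that both words have in common."""
    chars = []
    for x, y in zip(a, b):
        if x != y:
            break
        chars.append(x)
    return "".join(chars)

prefix_with_reduce = shared_prefix("red", "reduce")
prefix_with_record = shared_prefix("red", "record")
```

In the first list nothing is gained by looking at individual letters, while in the second list the letters are what distinguish one entry from its neighbours.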

Given a set of data, we compute statistics to learn what the data contains. A statistic is some form of numerical summary that highlights one specific aspect of the data. For example, we might compute the mean of a numerical dataset to find out what a ‘typical’ value in the set is.
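As a minimal sketch of computing such a statistic, the snippet below uses the `mean` function from Python's standard `statistics` module. The list `heights_cm` contains made-up values chosen only for illustration.

```python
from statistics import mean

# Hypothetical heights in centimetres
heights_cm = [150.0, 160.0, 165.0, 170.0, 180.0]

# The mean summarises the whole dataset with a single 'typical' value.
typical_height = mean(heights_cm)
print(typical_height)
```

The mean highlights only one aspect of the data, its centre; it says nothing about, for example, how widely the values are spread.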

## Data models

Going further, we might want to build a model of the data, meaning a mathematical description that captures certain aspects of the data.

A regression model tells us how the values of certain parts of our data influence other parts of it. For example, we could build a regression model to relate the age of children with their height, meaning that for any given age it gives us a good estimate of how tall a child of that age would be.

A classification model, on the other hand, tells us which combinations of values fit into which category. For example, we may wish to classify a person’s gender based on their weight and height.
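One very simple way to build a classification model is a nearest-centroid classifier, sketched below. This is just one possible technique, chosen here for brevity. The training points and the generic category labels `"A"` and `"B"` are hypothetical.

```python
from math import dist

# Hypothetical training data: category label -> list of (height_cm, weight_kg)
training = {
    "A": [(160.0, 55.0), (165.0, 60.0)],
    "B": [(178.0, 80.0), (182.0, 85.0)],
}

def centroid(points):
    """Average position of a list of 2-D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

# The 'model' is simply one centroid per category.
centroids = {label: centroid(points) for label, points in training.items()}

def classify(point):
    """Assign the category whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))

print(classify((163.0, 58.0)))
print(classify((181.0, 84.0)))
```

A real classifier would be trained on far more data and would normally scale the inputs first, since height and weight are measured in different units.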

We may employ a simple linear model to capture the relationship between age and height. This model can then also predict values: given an input age, it can provide us with a predicted height. Of course, this model is a stark simplification of the data and we always need to keep in mind that every model will contain errors.
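A simple linear model of this kind can be sketched with an ordinary least-squares fit of a straight line. The ages and heights below are made-up numbers, and the names `slope`, `intercept` and `predict_height` are chosen for this illustration only.

```python
# Hypothetical data: ages in years and heights in centimetres
ages = [2.0, 4.0, 6.0, 8.0, 10.0]
heights = [86.0, 102.0, 115.0, 128.0, 139.0]

mean_age = sum(ages) / len(ages)
mean_height = sum(heights) / len(heights)

# Least-squares estimates for the line: height = intercept + slope * age
s_xy = sum((a - mean_age) * (h - mean_height) for a, h in zip(ages, heights))
s_xx = sum((a - mean_age) ** 2 for a in ages)
slope = s_xy / s_xx
intercept = mean_height - slope * mean_age

def predict_height(age):
    """Predicted height (cm) for a given age (years)."""
    return intercept + slope * age

# Residuals: how far the model is from the data it was fitted on.
residuals = [h - predict_height(a) for a, h in zip(ages, heights)]

print(predict_height(7.0))
print(residuals)
```

The residuals are not zero even on the data used to fit the line, which illustrates the point that every model contains errors.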

Thus, data scientists attempt to build the most accurate model they can from the data they have available, so that its predictions generalise when the model is confronted with new, previously unseen data.
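A common way to check whether predictions generalise is to hold back some data when fitting and measure the error on it afterwards. The sketch below does this for a straight-line model of height against age. All numbers are hypothetical, and `fit_line` and `mean_absolute_error` are helper names invented for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

def mean_absolute_error(xs, ys, intercept, slope):
    """Average absolute difference between predictions and true values."""
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data used to fit the model (age in years, height in cm)
train_ages = [2.0, 4.0, 6.0, 8.0, 10.0]
train_heights = [86.0, 102.0, 115.0, 128.0, 139.0]

# Hypothetical data the model has never seen
unseen_ages = [3.0, 7.0, 9.0]
unseen_heights = [95.0, 121.0, 134.0]

intercept, slope = fit_line(train_ages, train_heights)
train_error = mean_absolute_error(train_ages, train_heights, intercept, slope)
unseen_error = mean_absolute_error(unseen_ages, unseen_heights, intercept, slope)

print(train_error, unseen_error)
```

If the error on the unseen data were much larger than the error on the training data, that would suggest the model does not generalise well.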

## Glossary of terms

We have created a glossary of terms that you can refer to during this course.

© Coventry University. CC BY-NC 4.0