Machine Learning Data Sets

Data is the fuel that powers machine learning; through the patterns and features in the data used to train the algorithm, it is also the rule book that the generated algorithms will follow.
© University of York

“Data is the new oil. / Like oil, data is valuable, but if unrefined it cannot really be used.” – Clive Humby (Mathematician) / Michael Palmer (Advertiser)

Data and Machine Learning

Unlike traditional software engineering, where a programmer builds an algorithm that acts on an input to produce the desired output, Machine Learning (ML) takes the inputs and outputs together and seeks to train an algorithm that best maps one to the other.
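As a minimal sketch of this contrast, the Python below compares a hand-written rule with a rule fitted to labelled examples. The braking scenario, the 20 m value, the function names and the training data are all illustrative assumptions, not part of the course material.

```python
# Traditional approach: the programmer writes the rule by hand.
def brake_needed_handwritten(distance_m):
    return distance_m < 20.0  # threshold chosen by the programmer (illustrative)


# ML-style approach: the rule (here a single threshold) is fitted to examples.
def fit_threshold(samples):
    """Return the candidate threshold that misclassifies the fewest samples.

    Each sample is (distance_m, brake_label); the fitted rule predicts
    "brake" whenever distance_m <= threshold.
    """
    candidates = sorted(distance for distance, _ in samples)
    best_threshold, best_errors = candidates[0], len(samples) + 1
    for threshold in candidates:
        errors = sum((distance <= threshold) != label for distance, label in samples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Hypothetical labelled examples: (distance to obstacle in metres, should brake?)
training_data = [(5.0, True), (12.0, True), (18.0, True), (25.0, False), (40.0, False)]
learned_threshold = fit_threshold(training_data)
print(learned_threshold)  # 18.0
```

The fitted rule depends entirely on the examples supplied: with different training data, the same code would produce a different threshold.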

Furthermore, ML components typically do not adapt well to new scenarios. Placed in a scenario they have not encountered before, they do not take the principles learned from old scenarios and apply them innovatively to the new one. Often they simply will not work.

Therefore, it is crucial that the data sets used to train and test the algorithms are carefully designed, curated, and managed, both for the efficient functioning of the ML algorithm and to provide safety assurance. How do we develop requirements for the data to ensure that this happens?

Data Requirements

Data requirements need to specify the characteristics the data sets must have, so that we can ensure that the data captures all relevant safety features and behaviours. We can build these requirements from four principles:

  • Relevance – the datasets match the scenarios the ML will work in
  • Completeness – the datasets cover all the scenarios the ML will work in
  • Accuracy – the datasets accurately identify the features the ML is to classify
  • Balance – the datasets are not biased

The following examples show how these principles inform requirements for data to be used to train and test Autonomous Vehicles (AV).

Relevance Example

If we want an ML component to be used for object detection on an AV, then a relevance requirement could be “data samples should be captured from the same position as the sensor on the vehicle”. Typically, the forward-looking vision system of an AV is behind the rear-view mirror. Therefore, we should avoid using images taken from very low or high angles, such as overhead from an aerial drone, as the ML component will typically see people upright and from around chest height. It is worth noting that this requirement would change depending on the vehicle: think about the different views of people on the road from the windscreen of an articulated truck and from that of a sports car.
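A relevance requirement like this could be checked automatically if each sample carries capture metadata. The sketch below assumes hypothetical metadata fields (camera height and pitch) and illustrative tolerance values; a real project would take these from the actual sensor mounting specification.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    image_id: str
    camera_height_m: float   # height of the camera above the road surface
    camera_pitch_deg: float  # 0 = looking horizontally, -90 = looking straight down


# Hypothetical mounting position of the target vehicle's forward camera.
TARGET_HEIGHT_M = 1.3
HEIGHT_TOLERANCE_M = 0.2
MAX_ABS_PITCH_DEG = 10.0


def is_relevant(sample):
    """True if the sample was captured from roughly the sensor's viewpoint."""
    height_ok = abs(sample.camera_height_m - TARGET_HEIGHT_M) <= HEIGHT_TOLERANCE_M
    pitch_ok = abs(sample.camera_pitch_deg) <= MAX_ABS_PITCH_DEG
    return height_ok and pitch_ok


samples = [
    Sample("car_dashcam_001", 1.25, -2.0),
    Sample("drone_overhead_017", 30.0, -90.0),
    Sample("truck_cab_042", 2.6, -8.0),
]
relevant_ids = [s.image_id for s in samples if is_relevant(s)]
print(relevant_ids)  # ['car_dashcam_001']
```

Note that the truck-cab sample is rejected here only because the assumed target is a car; for a truck, the target height would change and so would the set of relevant samples.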

Completeness Example

If we want an AV to operate at any time, day or night, then the machine learning component must be able to operate effectively in both bright sunshine and low light conditions. Our data requirement for completeness could then be “Data samples should be captured at all times of day and under the following light conditions: bright sunlight, overcast, heavy cloud, twilight, direct sunlight, on-coming headlights, urban street lighting and unlit rural roads.”
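One way to audit a completeness requirement of this kind is to count samples per required condition and report the gaps. The sketch below assumes each sample has already been tagged with one lighting condition; the minimum count per condition and the example tags are illustrative assumptions.

```python
from collections import Counter

# Light conditions listed in the example requirement.
REQUIRED_CONDITIONS = {
    "bright sunlight", "overcast", "heavy cloud", "twilight", "direct sunlight",
    "on-coming headlights", "urban street lighting", "unlit rural roads",
}
MIN_SAMPLES_PER_CONDITION = 2  # illustrative assumption, not from the source


def coverage_gaps(condition_tags, required=REQUIRED_CONDITIONS,
                  minimum=MIN_SAMPLES_PER_CONDITION):
    """Return {condition: count} for every required condition that is under-covered."""
    counts = Counter(condition_tags)
    return {c: counts[c] for c in sorted(required) if counts[c] < minimum}


# Hypothetical tags, one per data sample.
tags = (["bright sunlight"] * 3 + ["overcast"] * 2 + ["heavy cloud"] * 2
        + ["twilight"] + ["direct sunlight"] * 2 + ["urban street lighting"] * 4)
gaps = coverage_gaps(tags)
print(gaps)
```

With these example tags, the report flags twilight (one sample) and the two conditions with no samples at all, on-coming headlights and unlit rural roads.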

When specifying these requirements, it is particularly important to consider the complexity and high-dimensional variability of the natural environment in which the Machine Learning component may operate. In everyday speech, we describe these conditions from the perspective of our human experience, which is difficult to ‘translate’ into the numerical concepts that capture sensor operation, e.g. illuminance, albedo, cross-section, contrast, or signal-to-noise ratio.

Accuracy Example

If we want an AV to pinpoint the location of all detected pedestrians to within 50 cm of their true position, we must consider that humans are not individual points in space or isolated limbs. ML components should accurately identify the position of the whole pedestrian, even if they are partially obscured, e.g. standing between parked cars. We need to locate elements of a person consistently. Our data requirement for accuracy could then be “When labelling data samples, the position of all pedestrians shall be recorded as the extremity of their person closest to the road”.
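The labelling rule and the 50 cm tolerance can be sketched as follows. The flat ground-plane coordinate frame, the road edge lying along y = 0, and the example body points are all assumptions made for illustration.

```python
import math

TOLERANCE_M = 0.5  # the 50 cm figure from the requirement above


def closest_extremity_to_road(body_points, road_edge_y=0.0):
    """Pick the body point with the smallest lateral distance to the road edge.

    Assumes (x, y) ground-plane coordinates in metres, with the road edge
    lying along the line y = road_edge_y.
    """
    return min(body_points, key=lambda p: abs(p[1] - road_edge_y))


def within_tolerance(labelled_xy, reference_xy, tolerance_m=TOLERANCE_M):
    """True if the labelled position is within the tolerance of the reference."""
    return math.dist(labelled_xy, reference_xy) <= tolerance_m


# Hypothetical points on one pedestrian: torso, outstretched arm, far shoulder.
body_points = [(10.0, 1.8), (10.1, 1.2), (9.9, 2.1)]
label = closest_extremity_to_road(body_points)
reference = (10.0, 1.0)  # hypothetical surveyed position of the nearest extremity

print(label)                                   # (10.1, 1.2)
print(within_tolerance(label, reference))      # True
print(within_tolerance((9.9, 2.1), reference)) # False
```

The last line shows why a consistent rule matters: labelling the far shoulder instead of the nearest extremity puts the same pedestrian more than a metre from the reference position.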

Labelling is how humans tell the ML algorithm what the correct result or output is for a given data sample, for example by adding keywords and/or drawing a box around the object of interest. The quality of the labelling has a dramatic effect on the reliability of the algorithm. However, the work is labour intensive and rather dull, so it is often prone to error and bias.

Balance Example

Humans vary significantly in appearance owing to phenotypical variation, sexual dimorphism, health reasons or socio-cultural factors. However, we need an AV to detect all humans, so we need a data requirement for balance: “the dataset shall be balanced for variation in human appearance.”

It is worth considering that ML classifiers are usually designed to group objects into a number of classes. In the case of an AV, relevant classes include cars, buses, trucks, road signs, traffic lights and pedestrians. A data set that is “balanced with respect to class” would have equal numbers of each class. However, there is still a need to avoid bias with respect to features of interest within a class: for example, the data for the ‘pedestrian’ class needs to be balanced with respect to gender, race, etc.
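The difference between balance across classes and balance within a class can be illustrated with a simple ratio. The metric below (largest group count divided by smallest) is just one possible measure, and the class labels and the placeholder group names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest group count divided by smallest; 1.0 means perfectly balanced.

    Caveat: a group with no samples at all does not appear in the counts,
    so missing groups must be checked separately.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical data set: balanced with respect to class...
class_labels = ["car"] * 4 + ["pedestrian"] * 4 + ["bus"] * 4
# ...but the pedestrian class is skewed with respect to a feature of interest.
pedestrian_groups = ["group_a", "group_a", "group_a", "group_b"]

class_ratio = imbalance_ratio(class_labels)
group_ratio = imbalance_ratio(pedestrian_groups)
print(class_ratio)  # 1.0
print(group_ratio)  # 3.0
```

A data set can therefore pass a class-level balance check while still being biased inside one of its classes, which is why both levels need their own requirement.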

This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems.
