
The role of data in machine learning

In this article, learn about the role of data within machine learning, and how data sets are used for training and testing.
© Creative Computing Institute

Let’s look in more depth at the role data plays within machine learning.

Machine learning data sets

In machine learning, data sets are used for training and testing. A training set is a subset of the data used to train a model. A test set is a separate subset of the data used to test the trained model (1). Typically, the training set is much larger than the test set.
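As a rough illustration of the split described above, the following Python sketch shuffles a list of records and divides it into a larger training set and a smaller test set. The function name `split_train_test`, the 80/20 ratio and the fixed seed are assumptions made for this example, not values taken from the article or its references.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of `records` and return (training_set, test_set)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled records form the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: 100 numbered records standing in for real examples.
data = list(range(100))
train, test = split_train_test(data, test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

The two lists share no records, so the model is tested on examples it did not see during training.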

A wide range of datasets and machine learning tools

A wide range of existing datasets and machine learning tools is freely available, for example from Google, Microsoft and IBM (2, 3, 4).

Machine learning training sets typically require a large amount of data. The exact amount is difficult to quantify because it depends on the complexity of the system being built, but expect to need tens of thousands to millions of data points.

It can be difficult to find an existing dataset that contains enough representative data for underrepresented or marginalised demographic groups.
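One simple way to spot this problem is to count how much of a dataset each group accounts for. The sketch below is only illustrative: the `dialect` field, the group names and the 10% threshold are hypothetical, and a suitable threshold depends on the system being built.

```python
from collections import Counter


def group_shares(records, group_key):
    """Return each group's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(record[group_key] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(records, group_key, min_share=0.1):
    """List the groups whose share of the dataset falls below `min_share`."""
    shares = group_shares(records, group_key)
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical voice samples labelled with the speaker's dialect.
samples = (
    [{"dialect": "dialect_a"}] * 90
    + [{"dialect": "dialect_b"}] * 6
    + [{"dialect": "dialect_c"}] * 4
)
print(underrepresented(samples, "dialect"))  # ['dialect_b', 'dialect_c']
```

A check like this only reports on groups that appear in the data at least once. A group that is missing entirely will not be flagged, so the list of expected groups still has to come from knowing who will use the product.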


Google has taken steps towards Inclusive Machine Learning (5), which aims to reduce the likelihood of negative stereotypes, automatic denial of access to services, and product failure for underrepresented groups.

Its AutoML platform allows smaller training sets to be used in machine learning, which can make machine learning more inclusive.

Machine learning influences design

Machine learning influences design because it can affect the usability and accessibility of products. For example, focus groups may reveal that different dialects and colloquialisms need to be considered when training a voice recognition application or chatbot.

Without training sets and, in particular, test sets that reflect these differences, the system won’t be fit for purpose for the people affected.

Inputs and results during testing

In machine learning, the test set is typically a held-out subset of the available data, kept separate from the training set, and may represent a particular demographic. For wider types of products and services, the test data is used to represent expected usage.

These could be values used to complete a form on a website, the images of a face for an image processing algorithm, or messages entered into a chatbot.
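When test inputs are labelled with the group they represent, the results can be broken down per group rather than reported as a single overall score. The following Python sketch assumes each test result is a hypothetical `(group, expected, predicted)` tuple. The group names and values are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_group(results):
    """Return the fraction of correct predictions for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, expected, predicted in results:
        total[group] += 1
        if expected == predicted:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical chatbot test results: (group, expected intent, predicted intent).
results = [
    ("dialect_a", "greeting", "greeting"),
    ("dialect_a", "farewell", "farewell"),
    ("dialect_b", "greeting", "farewell"),
    ("dialect_b", "farewell", "farewell"),
]
print(accuracy_by_group(results))  # {'dialect_a': 1.0, 'dialect_b': 0.5}
```

Here the overall accuracy of 75% would hide the fact that the system works far less well for one group than for the other.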


Analysing the results gathered during testing can reveal that significant changes are needed when creating new products or updating existing ones.

Use the results to inform the design. This may mean adding accessibility options, or changing terminology so that it does not exclude cultures or demographics.

It is important to perform testing iteratively throughout development so that results can influence necessary updates to design.


  1. Training and Test Sets: Splitting Data
  2. Datasets – Google Research
  3. Microsoft Azure Open Datasets
  4. Artificial intelligence Datasets – IBM Developer
  5. Inclusive ML – Google
This article is from the free online course Anti-Racist Approaches in Technology.
