Best Practices for Data Labelling

Data labels can be added in two ways: through automated processes or by people known as labellers. “Labellers” is a generic term that covers various contexts, skill sets, and levels of specialisation.

Ensure labeller pool diversity

To minimise potential biases during data collection and labelling, consider drawing on several groups of labellers. Take AI-based image labelling as an example. Labellers could be your users, who provide “derived” labels within your product through actions like tagging photos; generalists, who add labels to a wide variety of data through crowd-sourcing tools; or trained subject matter experts, who use specialised tools to label things like medical images.

Provide clear instructions

AI can be trained more quickly if labelling conventions are agreed upon early in development. For data that is expected to cause labelling disagreements or that is difficult to coordinate, problems can be pre-empted by agreeing on how it should be labelled early in the development process.

Look at the example of labelling shoe images. Which instruction is clearer and more specific? On the right, the labeller’s subjective definition of “athletic” might or might not include dance shoes. On the left, the phrase “running shoes” reduces possible confusion and helps labellers quickly rule out other athletic shoe types, such as soccer cleats.
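One way to keep an instruction this specific is to store each label's definition together with explicit inclusions and exclusions, and generate the labeller-facing text from it. The following is a minimal sketch under that assumption; `LabelDefinition`, `render_instruction`, and the "include" examples are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelDefinition:
    """A label plus the cases it explicitly covers and rules out (hypothetical structure)."""
    name: str
    includes: tuple
    excludes: tuple


def render_instruction(definition):
    """Turn a definition into a single, specific instruction for labellers."""
    return (
        f"Select images that show {definition.name}. "
        f"Include: {', '.join(definition.includes)}. "
        f"Do not select: {', '.join(definition.excludes)}."
    )


# The exclusions come from the shoe example; the inclusions are illustrative assumptions.
RUNNING_SHOES = LabelDefinition(
    name="running shoes",
    includes=("road running shoes", "trail running shoes"),
    excludes=("soccer cleats", "dance shoes"),
)

print(render_instruction(RUNNING_SHOES))
```

Listing the exclusions next to the label makes the boundary cases visible before labelling starts, rather than leaving them to each labeller's interpretation.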

Value label disagreements

Understanding differences in how labellers interpret and apply labels helps prevent problems later on. These disagreements offer an opportunity to identify deeper data and/or labelling issues that you may need to address to ensure data quality.
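As a rough illustration of how disagreements might be surfaced, the sketch below compares two labellers on the same items. It uses Cohen's kappa, a standard agreement statistic that corrects for chance agreement; the article does not prescribe this measure, and the function names and sample data are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which the two labellers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeller's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(item_ids, labels_a, labels_b):
    """List the items where the two labellers chose different labels."""
    return [(i, a, b) for i, a, b in zip(item_ids, labels_a, labels_b) if a != b]


# Hypothetical sample data based on the shoe example.
item_ids = ["img1", "img2", "img3", "img4"]
labeller_a = ["running", "running", "dance", "soccer"]
labeller_b = ["running", "dance", "dance", "soccer"]

print(cohen_kappa(labeller_a, labeller_b))
print(disagreements(item_ids, labeller_a, labeller_b))
```

The agreement score gives an overall signal, while the list of disputed items is what you would review to find ambiguous instructions or genuinely ambiguous data.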

Design new tools and/or workflows for labelling

Data labelling tools and workflows can be designed from scratch as needed. Simplifying raters’ workflows is essential for efficiently training AI models. Another best practice is to audit labels: verify their accuracy and adjust them as necessary. For building tools for professional labellers, the article First: Raters offers some useful recommendations.
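The audit step could look something like the sketch below: draw a random sample of labelled items, have a reviewer check them, measure how often the original label held up, and apply the corrections. This is one possible workflow, not one described in the article; all names, the 50% sample fraction, and the data are hypothetical.

```python
import random


def sample_for_audit(item_ids, fraction=0.1, seed=0):
    """Pick a reproducible random sample of item ids for a reviewer to check."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = sorted(item_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def audit_accuracy(labels, reviewed):
    """Share of reviewed items whose original label matches the reviewer's label."""
    if not reviewed:
        raise ValueError("no reviewed items")
    matches = sum(labels[item_id] == label for item_id, label in reviewed.items())
    return matches / len(reviewed)


def apply_corrections(labels, reviewed):
    """Return a new label mapping with the reviewer's labels taking precedence."""
    return {**labels, **reviewed}


# Hypothetical labels and reviewer decisions.
labels = {"img1": "running", "img2": "dance", "img3": "soccer", "img4": "running"}
to_review = sample_for_audit(labels, fraction=0.5)
reviewed = {"img1": "running", "img2": "running"}  # what the reviewer decided

print(to_review)
print(audit_accuracy(labels, reviewed))
print(apply_corrections(labels, reviewed))
```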

Use automatic data labelling on large datasets

Automatic data labelling helps reduce the cost and time it takes to label a dataset compared with relying on human labellers alone. For instance, Amazon SageMaker Ground Truth is a machine learning-powered, human-in-the-loop data labelling and annotation service. It combines automated labelling with human-performed labelling to help you label data efficiently and accurately.
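The general human-in-the-loop idea can be sketched as a confidence threshold: labels the model is confident about are accepted automatically, and the rest are sent to people. The code below is a generic illustration under that assumption. It is not the Amazon SageMaker Ground Truth API, and the function name, threshold value, and predictions are hypothetical.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split model predictions into auto-accepted labels and items for human labelling.

    `predictions` maps an item id to a (label, confidence) pair, with confidence in [0, 1].
    """
    auto_labelled = {}
    needs_human = []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= threshold:
            auto_labelled[item_id] = label
        else:
            needs_human.append(item_id)
    return auto_labelled, needs_human


# Hypothetical model output for three images.
predictions = {
    "img1": ("running", 0.97),
    "img2": ("dance", 0.55),
    "img3": ("soccer", 0.91),
}

auto_labelled, needs_human = route_by_confidence(predictions, threshold=0.9)
print(auto_labelled)
print(needs_human)
```

Raising the threshold sends more items to human labellers and lowers the risk of accepting wrong automatic labels; lowering it saves more time and cost at the expense of accuracy.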

This article is from the free online course Designing Human-Centred AI Products and Services

Created by
FutureLearn - Learning For Life
