Synthetic data and generative AI

A short article describing the use of synthetic data and generative AI in expanding datasets.
[Image: a screenshot of software that has produced a 3D synthetic image of several wheat plants growing in soil]
© The University of Nottingham

In the video we thought about how we can use tools like 3D modelling, image generation or style transfer – including generative AI approaches – to help expand datasets.

Why do we need synthetic data? As we have seen, machine learning systems need training data – images, in our case – of whatever it is we want to predict. Perhaps we want to predict a ripeness label from a picture of fruit on a tree. So, we need to collect images of the fruit at various stages of ripeness, and we need to assign a ground-truth ripeness value to each image in order to build a supervised machine learning system.

The images in the training set need to capture the variability we might expect to see when using the system for real. That means we need lots of views of lots of trees, many examples of different ripeness values, and plenty of examples of occluded (partially covered) objects as well. As well as collecting all the images (and not forgetting that we need them collected under different lighting and weather conditions), we also need to manually label every training image. This means a lot of work at the data capture step!

Synthetic data can help here. We can build new, synthetic training images to represent images we don't have enough of in the real training dataset. Suppose, for example, that we collected all our images in the sunshine, and we want to simulate some cloudy or rainy images. Of course, collecting real images is the best possible standard of data here, but it may be extremely costly or even impossible. Instead, we could build a system to apply a style transfer – i.e. simulate different lighting under various weather conditions – to see if we can artificially broaden our dataset. And the best part is, we already have the annotations, so we can do this programmatically.
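A full style-transfer model is beyond the scope of this article, but even a crude photometric transform illustrates the idea of programmatically broadening a dataset. The sketch below (the function name, and the brightness and saturation factors, are illustrative choices, not values from the article) darkens an RGB image and flattens its colours to roughly mimic overcast lighting:

```python
import numpy as np

def simulate_overcast(image, brightness=0.7, saturation=0.6):
    """Crudely simulate overcast lighting on an RGB image.

    `image` is a float array in [0, 1] with shape (H, W, 3).
    The factor values are illustrative defaults only.
    """
    # Reduce overall brightness to mimic lower light levels.
    out = image * brightness
    # Blend each pixel towards its grey value to reduce saturation,
    # mimicking the flatter colours of a cloudy day.
    grey = out.mean(axis=2, keepdims=True)
    out = saturation * out + (1 - saturation) * grey
    return np.clip(out, 0.0, 1.0)

# A "sunny" toy image: a bright red fruit-coloured patch.
sunny = np.ones((4, 4, 3)) * np.array([0.9, 0.2, 0.2])
cloudy = simulate_overcast(sunny)
```

Because the transform changes only pixel values, the original ripeness annotations carry over to the new image unchanged, which is exactly why this kind of augmentation is cheap.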

Likewise, perhaps we want to simulate different levels of occlusion, so that we can recognise or count fruit even when it is partially covered by leaves. We can achieve this by covering real (or even modelled) images with artificial leaves; see, for example, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8990779/.
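A minimal version of this occlusion idea can be sketched by pasting a flat green shape over part of an annotated object. Everything here (the function, the rectangle-shaped "leaf", the `frac` knob) is a hypothetical simplification of the paper's approach, but it shows the key benefit: the existing bounding-box label stays valid, so the occluded training image comes annotated for free.

```python
import numpy as np

def add_occlusion(image, bbox, frac=0.3, leaf_colour=(0.1, 0.5, 0.1)):
    """Paste a flat green rectangle over part of a labelled fruit.

    `image`: float RGB array, shape (H, W, 3).
    `bbox`: (row, col, height, width) of an existing annotation.
    `frac`: fraction of the box height to cover (illustrative knob).
    Returns a new image; the bbox annotation itself is unchanged.
    """
    r, c, h, w = bbox
    occ_h = max(1, int(h * frac))      # cover the top `frac` of the fruit
    out = image.copy()
    out[r:r + occ_h, c:c + w] = leaf_colour
    return out

img = np.full((10, 10, 3), 0.8)        # bright background
img[2:8, 2:8] = (0.9, 0.1, 0.1)        # a labelled "fruit" region
occluded = add_occlusion(img, bbox=(2, 2, 6, 6), frac=0.5)
```

In practice one would use textured, irregularly shaped leaf masks rather than flat rectangles, but the annotation-preserving principle is the same.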

How do we know the synthetic data helps? If you are concerned that we are making up the data we train on, that is a valid worry! What if the synthetic data is inaccurate, or in fact causes the model to perform worse? The gold standard is always to test a trained model on real-world test images, capturing as much of the variability as we would expect to see on a live system. We can therefore compare a model trained with synthetic data against one trained without it, evaluating both on real-world data, and from this understand whether the synthetic data has helped.
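The comparison described above boils down to measuring both models against the same held-out real test set. The sketch below uses entirely made-up predictions for ten hypothetical test images, purely to show the shape of the evaluation; the numbers are not results from any real experiment:

```python
def accuracy(predictions, labels):
    """Fraction of correct predictions on a held-out real-world test set."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical ground-truth labels for ten real test images,
# and predictions from two models: one trained on real data only,
# one trained on real + synthetic data. All values are invented.
real_labels     = ["ripe", "ripe", "unripe", "ripe", "unripe",
                   "unripe", "ripe", "unripe", "ripe", "unripe"]
preds_real_only = ["ripe", "unripe", "unripe", "ripe", "unripe",
                   "unripe", "ripe", "ripe", "unripe", "unripe"]
preds_with_syn  = ["ripe", "ripe", "unripe", "ripe", "unripe",
                   "unripe", "ripe", "unripe", "unripe", "unripe"]

baseline  = accuracy(preds_real_only, real_labels)   # 0.7
augmented = accuracy(preds_with_syn, real_labels)    # 0.9
```

The crucial design point is that neither model ever sees the real test images during training, so any difference in accuracy can be attributed to the training data mix.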

Often, a combination of real, augmented and synthetic data will work best, but the right mix depends on factors such as the domain and the quality of the synthetic data. With the rapid advances in generative AI, this is an exciting area to keep an eye on.

This article is from the free online course Experimental Design for Machine Learning.

Created by
FutureLearn - Learning For Life
