
The role of data in AI

We now know that "data" is one of the four key ingredients for modern AI systems. In this section, we will look specifically at the role data plays.

Humans typically learn from structured information – such as clearly labelled charts, diagrams, or ‘classroom’ examples – but AI systems operate differently. These systems can process vast amounts of unstructured data (raw text, images, audio, and video) without being explicitly told what to look for. They learn by identifying patterns, connections, and recurring structures in the data.

This pattern recognition is at the heart of how AI “learns”. For example, a music streaming platform might analyse millions of listening habits such as what time people press play, which songs they skip, and which genres they return to. While no human could ever sift through all those data, an AI can use these patterns to recommend a new song you’re likely to enjoy.

What Data?

AI systems are trained on data gathered from an enormous variety of sources: books, news websites, Wikipedia, YouTube transcripts, Reddit threads, blog posts, social media platforms, audio recordings, scientific papers, online forums, medical databases, and many more. This includes everything from Shakespeare’s plays to TikTok captions.

TASK: Think about what is on the internet, and how these data may be asymmetrically distributed. 

Use the prompt “Tell me about what is on the internet. What countries the content originates from, the languages it is written in, etc.” to get a response on this topic from an LLM.

What do you think about the geopolitical and economic distribution of data online? Remember, these data are what train AI systems (more on this topic in AI & Governance, Week 2).

Crucially, these data do not need to be manually labelled by humans. For example, LLMs like ChatGPT are trained on raw, unlabelled text by learning to predict what word is most likely to come next in a sentence. Over billions of examples, they develop a highly detailed model of language use, tone, and meaning, without ever “understanding” in the human sense.
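To make the idea of learning from unlabelled text more concrete, here is a minimal sketch of next-word prediction. It is not how ChatGPT or any real LLM is built: it simply counts which word follows which in a tiny, invented corpus. The function names `train_bigram_model` and `predict_next` and the corpus itself are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the raw text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus invented for illustration; no human-written labels are supplied.
corpus = "the cat sat on the mat . the cat chased the mouse . the cat slept"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))    # cat
print(predict_next(model, "zebra"))  # None
```

The only "supervision" here is the text itself: each word acts as the answer for the word before it. Real LLMs replace the counting with a neural network and the toy corpus with billions of examples, but the training signal is of the same kind.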

For image-based AI systems, training often involves massive datasets of pictures sourced from the internet, paired with file names, captions, or tags. These weak labels are enough to let the system learn what visual patterns typically co-occur with terms like “dog”, “sunset”, or “street protest”.

An excellent overview of this topic, The Role of Data in AI: Report for the Data Governance Working Group of the Global Partnership of AI, was published in 2020 by the Digital Curation Centre, University of Edinburgh, and provides a thorough and accessible discussion. 

Historical, Backward-Looking Data

These models learn entirely from historical data, which means they reflect past beliefs, norms, and behaviours – both the good and the bad. This backward-looking nature introduces some notable ethical concerns.

As a result, these models often exhibit biases that can lead to unfair, unethical or discriminatory outcomes. The life cycle of these biases can be broadly separated into cause (the training data, which we have just considered) and effect (harms and outcomes), as illustrated in the following image:

Image showing life cycle of bias in AI systems

The above image is from AI Safety, Ethics and Society, a freely accessible online textbook written by Dan Hendrycks and published in 2025. It is an excellent resource, with videos and podcasts to support further learning on this topic.

Causes of Bias in AI

The biases we observe in AI often originate from three broad sources, each shaping the data AI learns from:

1. Psychological Bias: Humans create and label data, and humans carry cognitive biases. For example, if people label images of “leaders” and mostly choose white middle-class men, the AI will learn the same association. Ambiguities in language and subjective labelling decisions further entrench these biases in AI systems.

2. Historical Bias: Historical patterns of inequality are deeply embedded in the data. If historical hiring data show mostly male applicants for technical roles, or if historical texts contain sexist, ableist, or racist language, AI will absorb these patterns even if they are no longer generally acceptable today. This includes:

  • Over- and Under-Representation: Some groups are mentioned more than others in data. For instance, men are overrepresented in discussions of leadership, while women are more often linked to caregiving roles.
  • Spatial and Temporal Bias: AI trained mostly on Western internet content may misrepresent or ignore non-Western cultures. Older data may reflect outdated or harmful social attitudes (more on this in AI & Governance, Week 2).

3. Social Bias: Social structures and media coverage shape the data AI sees. News articles tend to highlight rare or dramatic events (like terrorism or shark attacks) over the effects of chronic illness or economic disparities, and social media may amplify louder or more extreme voices. These patterns skew the training data and AI’s reference landscape (see more on this in AI & Existential, Week 2).
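The over- and under-representation described above can be measured directly in text. The following is a minimal sketch, assuming a tiny invented corpus and two hypothetical word lists (`GROUP_WORDS` and `ROLE_WORDS`). It counts how often each role word appears in the same sentence as a gendered pronoun. Real audits use far larger corpora and more careful linguistic analysis.

```python
from collections import Counter

# Hypothetical toy corpus; every sentence is invented for illustration.
sentences = [
    "he is a leader",
    "he became the leader",
    "he was a strong leader",
    "she is a leader",
    "she is a nurse",
    "she works as a nurse",
    "she trained as a nurse",
    "he is a nurse",
]

# Hypothetical word lists, deliberately simplistic.
GROUP_WORDS = {"he": "male", "she": "female"}
ROLE_WORDS = {"leader", "nurse"}


def count_cooccurrences(sentences):
    """Count (role, group) pairs that appear together in a sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        groups = {GROUP_WORDS[t] for t in tokens if t in GROUP_WORDS}
        roles = tokens & ROLE_WORDS
        for role in roles:
            for group in groups:
                counts[(role, group)] += 1
    return counts


def share(counts, role, group):
    """Fraction of a role's mentions that co-occur with the given group."""
    total = sum(n for (r, _), n in counts.items() if r == role)
    return counts[(role, group)] / total if total else 0.0


counts = count_cooccurrences(sentences)
print(share(counts, "leader", "male"))   # 0.75
print(share(counts, "nurse", "female"))  # 0.75
```

A model trained on text with this skew has no way of knowing that the imbalance reflects who was written about rather than who can lead or care. It simply learns the association that the counts contain.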

Effects of Bias in AI

Once biases are embedded in training data, they often manifest in real-world outputs. These effects can be subtle or obvious, and their consequences can be far-reaching:

1. Stereotypical Associations: AI might connect certain jobs or traits to specific genders or ethnicities, such as suggesting that engineers are typically men or that caregivers are women.

2. Discriminatory Outcomes: In hiring or admissions scenarios, an AI may favour applicants who resemble those in the majority group, even if protected characteristics (like gender or race) are removed from the data. This happens because AI may detect proxies such as school names, hobbies, or writing style that correlate with identity.

3. Cultural Misrepresentation: AI may generate outputs that misrepresent non-Western cultures or reinforce global power imbalances. For example, a chatbot might describe American sports or holidays as “universal”, or offer inaccurate translations of idioms from underrepresented languages.
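The proxy effect mentioned under Discriminatory Outcomes can be illustrated with a small sketch. Everything here is hypothetical: the records, the school names, and the very simple "model", which only learns a past hire rate per school. Gender is never given to the model, yet its decisions still differ sharply by gender because school attendance is correlated with gender in the invented data.

```python
from collections import defaultdict

# Hypothetical historical records: (gender, school, was_hired).
records = [
    ("m", "Northfield", True),
    ("m", "Northfield", True),
    ("m", "Northfield", True),
    ("f", "Northfield", True),
    ("f", "Southgate", False),
    ("f", "Southgate", False),
    ("f", "Southgate", False),
    ("m", "Southgate", False),
]


def fit_school_rates(records):
    """'Train' on school only: the gender column is deliberately ignored."""
    hired = defaultdict(int)
    seen = defaultdict(int)
    for _gender, school, was_hired in records:
        seen[school] += 1
        hired[school] += int(was_hired)
    return {school: hired[school] / seen[school] for school in seen}


def predict(rates, school, threshold=0.5):
    """Recommend a candidate if their school's past hire rate meets the threshold."""
    return rates.get(school, 0.0) >= threshold


def selection_rate_by_gender(records, rates):
    """Audit step: gender is used only to evaluate outcomes, never to predict."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for gender, school, _was_hired in records:
        total[gender] += 1
        selected[gender] += int(predict(rates, school))
    return {gender: selected[gender] / total[gender] for gender in total}


rates = fit_school_rates(records)
audit = selection_rate_by_gender(records, rates)
print(audit["m"], audit["f"])  # 0.75 0.25
```

Removing the protected column did not remove the disparity, because the school name carried the same information. This is why audits usually compare outcomes across groups rather than only checking which columns a model can see.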

TASK: Get an AI’s perspective (continuing on from the previous task prompt): “How do these asymmetries inherent in data on the internet reflect themselves in the outputs from LLMs?”

Why It Matters

Even without any intention to harm, an AI system can amplify harmful assumptions and inequities. Because these models reflect historical patterns and learn from human-created data, they risk reproducing the worst parts of our collective past. And since these systems often operate at scale in schools, hospitals, courtrooms, and hiring processes, the impact of these biases can be significant and systemic.

Before we move on to exploring some of these biases, you are encouraged to access and explore the resources spotlighted in this section. Another useful perspective on LLM bias, “Bias in Large Language Models: Origin, Evaluation, and Mitigation”, was published on arXiv in 2024 and can be freely accessed.

Exploring Bias

At the level of ‘chatbot’ interactions, LLMs have come a long way in correcting and catching bias…

Response to AI prompt

However, if we go back to some of the more basic LLMs of the last few years, we can catch them performing some peculiar gymnastics to accommodate the biases in their training data.

Moving Forward

Efforts to identify, evaluate and combat the harms that arise from these biased predictions are a significant consideration for individuals, corporations, our public institutions and governments. Though it may seem that the LLMs we use today have implemented safeguards against bias, it is important not to become complacent. The bias is there in the data, and predicting when, where and how it will manifest remains a challenging, moving target. We will cover this topic further in the AI & Governance (Week 2) and AI & Healthcare (Week 3) sections of the course.

This article is from the free online

AI Ethics, Inclusion & Society

Created by
FutureLearn - Learning For Life
