
Your model and the truth

What are the limits of your data model? Find out in this article why it is important to document and explain your data analysis.

We have built three different models for our Tate dataset to figure out how we can, in general, determine whether a painting is a landscape, a portrait, or some other motif.

Of course, given the simplicity of the models and the fact that none of our models actually takes the painting itself into consideration, we would never be tempted to say that our model knows what a landscape and a portrait are.

When it comes to more complicated models and methods, however, it is easy to forget that, in the end, any model only describes a dependence between the training data’s features and its labels. Because we would like to ascribe a certain generality to our model, extending beyond the scope of the training data onto real-world ‘wild’ data, it is all the more difficult to appreciate its limits.

But what, you might ask, is the harm? There are two important points to keep in mind here: 1) most people regard computers as neutral decision-makers, and 2) algorithmic decision-making is becoming ubiquitous. Consequently, many important decisions made about us (whether a credit application is approved, whether we’re eligible for certain social support, whether our CV makes it into the next round of a job application, and so on) may be delegated to automated, inscrutable systems.

A discussion of where biases can be, and are, introduced in these potentially vast systems is far outside the scope of this course, but we should take a closer look at bias in data. For example, the dataset we used this week has a geographical and cultural bias, since it came from a British museum. It will also reflect the preferences of the curators who worked on the collection. We chose to look only at oil paintings, which introduces a selection bias, and we used only a subset of the data to build our model.

All of these little choices can influence the makeup of our data. There is no simple way to obtain neutral data, which is why we have to be cognisant of our choices and make them transparent in our analysis. Data analysis should therefore not consist simply of mathematics and program code; it also needs to be documented and explained. This is one of the reasons why notebooks have become a popular way to communicate such work.
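One way to make such choices transparent is to log every selection step together with its effect on the label distribution. The following is only a minimal sketch of that idea: the records are made-up toy data rather than rows from the Tate dataset, and the field names (`medium`, `motif`) and helper functions (`label_shares`, `filter_with_log`) are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical toy records; these are NOT rows from the Tate dataset.
artworks = [
    {"medium": "oil", "motif": "landscape"},
    {"medium": "oil", "motif": "portrait"},
    {"medium": "oil", "motif": "landscape"},
    {"medium": "watercolour", "motif": "landscape"},
    {"medium": "watercolour", "motif": "other"},
    {"medium": "print", "motif": "portrait"},
]


def label_shares(records):
    """Return the share of each motif label among the given records."""
    counts = Counter(r["motif"] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def filter_with_log(records, predicate, description, log):
    """Apply one selection step and record how it changed the data."""
    kept = [r for r in records if predicate(r)]
    log.append({
        "step": description,
        "rows_before": len(records),
        "rows_after": len(kept),
        "shares_before": label_shares(records),
        "shares_after": label_shares(kept),
    })
    return kept


selection_log = []
oil_only = filter_with_log(
    artworks,
    lambda r: r["medium"] == "oil",
    "keep oil paintings only",
    selection_log,
)

# The log can be printed in a notebook next to the analysis it documents.
for entry in selection_log:
    print(entry)
```

Comparing the shares before and after a step shows at a glance whether a seemingly harmless filter has shifted the balance of labels, which is exactly the kind of information a reader needs in order to judge the limits of the resulting model.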

Your task

Which areas of your life do you feel are influenced by data-driven decisions? Do you think that these decisions are fair?
Please respond to this question in the comments.
Read through a few of your fellow learners’ responses. Can you see any patterns or trends?

Further information

We will be covering ethics in data science in week two, but if you would like to find out more about this topic (as well as data science in general) now, you can check out the following resources:

GOV.UK. (2018, June 13). Data ethics framework.

DataKind. (n.d.). Harnessing the power of data science in the service of humanity.

Caroline Criado Perez. (n.d.). Books.

Towards Data Science. (n.d.). A Medium publication sharing concepts, ideas, and codes.

© Coventry University. CC BY-NC 4.0
This article is from the free online course Applied Data Science.
