
Basic visualisation

Find out how visualisation helps us to spot patterns, and practise plotting so that you can do this yourself.
© Coventry University. CC BY-NC 4.0

Before we apply any of the powerful data analysis machinery that is at our disposal, we should first have a good look at our data. We have already seen parts of the data in tabular form, but of course, we cannot scroll through thousands of rows of data (approximately 3,000 in this case).

To get a better understanding of the data, we need a way to visualise it and get a feeling for how it is distributed: intuitively, this tells us how the data is spread across its possible values and which values are more likely to occur. As a first step, we simply look at each column individually.

Preparing a histogram

For numerical data, the plot of choice is a histogram, which displays how many data points fall within certain ranges called bins. If we want to see how many paintings are in the collection and from which time periods, those bins could be single years, decades, or non-uniform ranges corresponding to certain art periods. We will stick to uniform ranges and plot what our data looks like with one-year bins, 50-year bins and 10-year bins:

"Three histogram plots of the number of paintings produced each year, with each plot produced from different bin sizes of one-year, fifty-year, and ten-year bins, showing the differences in granularity."

We can see that the choice of granularity matters: the first plot tells us very little about the overall shape of the distribution, while the second plot hides too much of the detail, such as the dip in the number of artworks around 1850. The third histogram sums up the temporal distribution of the data rather well: the Tate collection contains oil paintings painted primarily in the range 1750–1975, and the distribution has two peaks, around 1825 and 1925; in technical terms, it is bimodal. If we compare this distribution to historical art periods, we might attribute the first peak to Romanticism (1780–1850) and the second peak to the various movements in Modern art (1860–1970). However, without inspecting the actual images or artists, we cannot be sure that the distribution is not simply an artefact of the Tate collection’s history.
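The effect of bin width can be tried out with a small sketch. The data here is synthetic (two made-up clusters of years, not the Tate data), and the names `bin_counts` and `plot_histograms` are hypothetical helpers introduced only for illustration. The counting part uses the standard library alone; the plotting function assumes Matplotlib is installed.

```python
import random
from collections import Counter


def bin_counts(values, bin_width):
    """Count values per bin; each bin is keyed by its left edge.

    With bin_width=10, the years 1820..1829 all land in the bin keyed 1820.
    """
    return Counter((v // bin_width) * bin_width for v in values)


def plot_histograms(years, bin_widths=(1, 50, 10)):
    """Draw one histogram per bin width, side by side (requires Matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so bin_counts works without it

    fig, axes = plt.subplots(1, len(bin_widths), squeeze=False,
                             figsize=(4 * len(bin_widths), 3))
    for ax, width in zip(axes[0], bin_widths):
        first_edge = (min(years) // width) * width
        # Bin edges at multiples of `width`, extending just past the latest year.
        edges = list(range(first_edge, max(years) + width + 1, width))
        ax.hist(years, bins=edges)
        ax.set_title(f"{width}-year bins")
        ax.set_xlabel("year")
        ax.set_ylabel("number of paintings")
    fig.tight_layout()
    return fig


# Synthetic stand-in data (NOT the Tate data): two clusters of years.
random.seed(0)
years = [round(random.gauss(1825, 25)) for _ in range(1500)]
years += [round(random.gauss(1925, 25)) for _ in range(1500)]

for width in (1, 50, 10):
    print(width, "year bins ->", len(bin_counts(years, width)), "non-empty bins")
```

Calling `plot_histograms(years)` in a notebook should draw three panels in the same order as the figure above. With real data, `years` would be replaced by the year column of the table.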

Let us turn our attention to the columns ‘width’ and ‘height’. As with the distribution of artworks over time, we can find out what the distribution of sizes looks like by plotting two histograms:

"Two histogram plots of the number of paintings plotted against the width and height of the painting respectively."

However, these plots hide one important fact: every painting, of course, has both a height and a width, and those two measures are probably related to each other. After all, a painting that is several metres wide is unlikely to be only a few centimetres high. Here we see the limitation of looking at columns individually: we cannot see whether there is a pattern that connects values in one column to values in another column.

The scatter plot

More concretely, the histograms above show us that the width of paintings has a much larger range than the height, but both distributions are positively skewed: each tends to be centred around 500–1,000 mm with a declining tail from 1,000–4,000 mm. Our intuition tells us that most paintings should therefore have a width and height in the range of 500–1,000 mm, but we have to be careful: these plots only show the distribution of the individual columns. We do not yet know how these values pair up, ie, what their joint distribution is.
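A small experiment illustrates why the individual histograms cannot answer this question. In the sketch below, which uses synthetic numbers rather than the Tate data, the heights are shuffled between rows: the height histogram stays exactly the same, yet the relationship between width and height disappears. The helper `pearson` and the chosen distribution parameters are assumptions made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Synthetic stand-in data (NOT the Tate data), in millimetres.
random.seed(1)
widths = [random.lognormvariate(6.5, 0.6) for _ in range(2000)]
heights = [w * random.uniform(0.6, 1.4) for w in widths]  # height tied to width

# Shuffling one column keeps its histogram but breaks the pairing of rows.
shuffled_heights = heights[:]
random.shuffle(shuffled_heights)

print("same height values:", sorted(heights) == sorted(shuffled_heights))
print("correlation, original pairing:", round(pearson(widths, heights), 2))
print("correlation, shuffled pairing:", round(pearson(widths, shuffled_heights), 2))
```

Both versions of the data would produce identical width and height histograms, so only a plot of the pairs can tell them apart.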

This brings us to the second type of plot: what if we would like to see the joint distribution of width and height, ie, which height measurements are paired with which width measurements? The plot of choice for such a situation is a (two-dimensional) scatter plot, in which we draw one point per row with x-coordinate equal to the first dimension (width) and y-coordinate equal to the second dimension (height). Here we have drawn each point with some transparency by adjusting the alpha channel, so that regions where many data points overlap appear as denser colours in the chart:

"Two scatter plots comparing the height with the width of the painting. The left plot shows the points moving diagonally upwards with the number of points decreasing as both height and width are increased in magnitude. The right plot is a logarithmic plot (or log-log plot), which reveals two diverging diagonal lines."
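A figure of this kind could be sketched as follows, assuming Matplotlib is installed. The data generator `synthetic_paintings` is hypothetical and builds two groups into the data by construction, so it only imitates the shape of the real plot; `scatter_linear_and_loglog` is likewise an illustrative name. The `alpha` argument of `scatter` controls the transparency of each point.

```python
import random


def synthetic_paintings(n, seed=0):
    """Return (widths, heights) in mm for n made-up paintings.

    Each painting gets one of two hypothetical height-to-width ratios,
    with a little noise, so that two groups exist by construction.
    """
    rng = random.Random(seed)
    widths, heights = [], []
    for _ in range(n):
        w = rng.lognormvariate(6.5, 0.6)
        ratio = rng.choice([0.75, 1.3]) * rng.uniform(0.9, 1.1)
        widths.append(w)
        heights.append(w * ratio)
    return widths, heights


def scatter_linear_and_loglog(widths, heights, alpha=0.2):
    """Draw the same points on linear axes (left) and log-log axes (right)."""
    import matplotlib.pyplot as plt  # imported here so the generator works without it

    fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
    for ax in (ax_lin, ax_log):
        # Low alpha makes single points faint and overlapping points darker.
        ax.scatter(widths, heights, alpha=alpha, s=8)
        ax.set_xlabel("width (mm)")
        ax.set_ylabel("height (mm)")
    ax_log.set_xscale("log")
    ax_log.set_yscale("log")
    ax_lin.set_title("linear axes")
    ax_log.set_title("log-log axes")
    fig.tight_layout()
    return fig


widths, heights = synthetic_paintings(3000)
```

Calling `scatter_linear_and_loglog(widths, heights)` in a notebook should produce two panels comparable to the ones described in the caption above.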

The left scatter plot shows our data in a coordinate system that you will probably find familiar. We see that a majority of the data is concentrated or (more technically) clustered in the bottom left corner, which makes it hard to discern any potential patterns. You might, however, be able to make out one prominent pattern: the data points cluster around two straight lines that diverge from each other.

We can make this pattern more visible by plotting our data in a log-log coordinate system as shown on the right, ie, both the x and y axes undergo a logarithmic transformation. This means that the axis values do not increase by a fixed additive increment per tick (eg 2,000 mm per tick as on the x-axis on the left); instead, they increase by a constant multiplicative factor (eg tenfold per major tick). The result is that our data appears more spread out and we can now clearly see a pattern in the form of two parallel lines in the log-log plot on the right. This is what we would expect if each group of paintings has a roughly constant ratio of height to width: on log-log axes, a constant ratio becomes a constant vertical offset.
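The step from diverging lines to parallel lines can be checked numerically. If height = a × width for some ratio a, then log(height) = log(a) + log(width), which is a line of slope 1 whose vertical position depends only on a. The sketch below uses two hypothetical ratios (0.75 and 1.3, chosen for illustration and not measured from the data) and the illustrative helper `log_slope`.

```python
import math


def log_slope(w1, h1, w2, h2):
    """Slope of the segment between two points after taking log10 of both axes."""
    return (math.log10(h2) - math.log10(h1)) / (math.log10(w2) - math.log10(w1))


# Hypothetical height-to-width ratios for the two groups (illustrative values).
ratios = {"wider than tall": 0.75, "taller than wide": 1.3}
sample_widths = [100, 1000, 4000]  # mm

for name, a in ratios.items():
    slope = log_slope(100, a * 100, 4000, a * 4000)
    print(f"{name}: slope on log-log axes = {slope:.3f}")

# Vertical gap between the two lines at each sample width.
gaps_linear = [1.3 * w - 0.75 * w for w in sample_widths]
gaps_log = [math.log10(1.3 * w) - math.log10(0.75 * w) for w in sample_widths]
print("gap on linear axes (mm):", [round(g) for g in gaps_linear])
print("gap on log axes:", [round(g, 3) for g in gaps_log])
```

On linear axes the gap grows with the width, so the lines diverge; on log axes the gap is the same at every width, so the lines run in parallel.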

This is a promising lead in our effort to classify paintings: we just found that according to their dimensions, there seem to be two large groups of paintings and we can hope that these groups roughly correspond to landscapes and portraits.
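As a first, deliberately naive sketch of such a grouping, each painting could be labelled by comparing its two dimensions. The function name `orientation` and the sample measurements are hypothetical, and the labels describe only the format of the canvas; whether they match the genre of the painting is exactly the hope expressed above and would need checking against the images.

```python
def orientation(width_mm, height_mm):
    """Label a painting by format: taller than wide, wider than tall, or equal."""
    if height_mm > width_mm:
        return "portrait"
    if width_mm > height_mm:
        return "landscape"
    return "square"


# Hypothetical (width, height) pairs in millimetres.
paintings = [(1200, 800), (600, 900), (500, 500)]
labels = [orientation(w, h) for w, h in paintings]
print(labels)
```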

Your task

Interact with various Matplotlib plots. (30 mins)
In this task you will be introduced to the Matplotlib Python library for producing plots and visualisations of the data. You will learn the basics of plotting and explore common plots, including histograms, scatter plots, bar charts and custom visualisations.
At the end of the task, you will be familiar with the basics of visualisation and plotting to aid in identifying patterns in our data.
This article is from the free online course Applied Data Science.
