# Basic visualisation

Find out how visualisation helps us to spot patterns, and practise plotting so that you can do this yourself.
© Coventry University. CC BY-NC 4.0

Before we apply any of the powerful data analysis machinery that is at our disposal, we should first have a good look at our data. We have already seen parts of the data in tabular form, but of course, we cannot scroll through thousands or more rows of data (approximately 3,000 in this case).

To get a better understanding of the data, we need a way to visualise it and get a feeling for how it is distributed: intuitively, the distribution tells us how the data is spread across its possible values and which values are more likely to occur. As a first step, we simply look at each column individually.

## Preparing a histogram

For numerical data, the plot of choice is a histogram, which displays how many data points fall within certain ranges called bins. If we want to see how many paintings are in the collection and from which time periods, those bins could be single years, decades, or non-uniform ranges corresponding to certain art periods. We will stick to uniform ranges and plot what our data looks like with one-year bins, 50-year bins and 10-year bins:
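The effect of the bin width can be tried out with a minimal Matplotlib sketch. The collection data itself is not reproduced here, so the `years` array below is synthetic stand-in data (an assumption made purely for illustration), and the helper names `year_bins` and `plot_year_histograms` are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a 'year' column (NOT the real collection data):
# two overlapping bumps, 3,000 values in total.
rng = np.random.default_rng(seed=0)
years = np.concatenate([
    rng.normal(1825, 25, size=1500),
    rng.normal(1925, 25, size=1500),
]).round().astype(int)


def year_bins(values, bin_width):
    """Return uniform bin edges of the given width that cover all values."""
    start = (values.min() // bin_width) * bin_width
    stop = (values.max() // bin_width + 1) * bin_width
    return np.arange(start, stop + bin_width, bin_width)


def plot_year_histograms(values, bin_widths=(1, 50, 10)):
    """Draw one histogram per bin width, side by side."""
    fig, axes = plt.subplots(1, len(bin_widths), figsize=(15, 4))
    for ax, width in zip(axes, bin_widths):
        ax.hist(values, bins=year_bins(values, width))
        ax.set_title(f"{width}-year bins")
        ax.set_xlabel("year")
        ax.set_ylabel("number of paintings")
    fig.tight_layout()
    return fig


fig = plot_year_histograms(years)
```

Passing explicit bin edges to `hist`, rather than only a bin count, keeps the bins aligned to round years. Calling `plt.show()` afterwards displays the figure in an interactive session.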

We can see that the choice of granularity matters: the first plot tells us very little about the overall shape of the distribution, while the second plot hides too much of the detail, such as the dip in the number of artworks around 1850. The third histogram sums up the temporal distribution of the data rather well: the Tate collection contains oil paintings painted primarily in the range 1750–1975, and the distribution has two peaks, around 1825 and 1925; in technical terms, it is bimodal. If we compare this distribution to historical art periods, we might attribute the first peak to Romanticism (1780–1850) and the second peak to the various movements in Modern art (1860–1970). However, without inspecting the actual images or artists, we cannot be sure that the distribution is not simply an artefact of the Tate collection’s history.

Let us turn our attention to the columns ‘width’ and ‘height’. As with the distribution of artworks over time, we can find out what the distribution of sizes looks like by plotting two histograms:
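A pair of histograms like this could be drawn as in the following sketch. The `width` and `height` arrays are synthetic stand-ins in millimetres (not the real columns), and the 100 mm bin width is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in sizes in mm (NOT the real 'width' and 'height' columns).
rng = np.random.default_rng(seed=2)
width = rng.lognormal(mean=6.7, sigma=0.6, size=3000)
height = rng.lognormal(mean=6.6, sigma=0.5, size=3000)

# Shared bin edges (100 mm wide) so the two panels are directly comparable.
upper = np.ceil(max(width.max(), height.max()) / 100) * 100
bins = np.arange(0, upper + 100, 100)

fig, (ax_w, ax_h) = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True)
ax_w.hist(width, bins=bins)
ax_w.set_title("width")
ax_h.hist(height, bins=bins)
ax_h.set_title("height")
for ax in (ax_w, ax_h):
    ax.set_xlabel("size (mm)")
ax_w.set_ylabel("number of paintings")
fig.tight_layout()
```

Using the same bins and shared axes for both panels makes it easier to compare the ranges of the two columns at a glance.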

However, these plots hide one important fact: every painting, of course, has both a height and a width, and those two measures are probably related to each other. After all, a painting that is several metres wide is unlikely to be only a few centimetres high. Here we see the limitation of looking at columns individually: we cannot see whether there is a pattern that connects values in one column to values in another column.

## The scatter plot

More concretely, the histograms above show us that the width of paintings has a much larger range than the height, but both distributions are positively skewed: each tends to be centred around 500–1,000 mm, with a declining tail from 1,000–4,000 mm. Our intuition tells us that most paintings should therefore have a width and height in the range of 500–1,000 mm, but we have to be careful. These plots only show the distribution of the individual columns; we do not yet know how these values pair up, ie, what their joint distribution is.
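To see why individual histograms cannot tell us how values pair up, consider the following small experiment on synthetic stand-in data (not the real collection). Shuffling one column leaves its histogram exactly the same but destroys the relationship to the other column. The two aspect ratios used to generate the data are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic stand-in sizes in mm (NOT the real data): heights are tied
# to widths through a randomly chosen aspect ratio, so the columns are related.
width = rng.lognormal(mean=6.6, sigma=0.6, size=3000)
height = width * rng.choice([0.75, 1.3], size=3000)

# Shuffling one column keeps its histogram identical but breaks the pairing.
height_shuffled = rng.permutation(height)

bins = np.linspace(0, height.max(), 41)
same_marginal = np.array_equal(
    np.histogram(height, bins=bins)[0],
    np.histogram(height_shuffled, bins=bins)[0],
)

corr_paired = np.corrcoef(width, height)[0, 1]
corr_shuffled = np.corrcoef(width, height_shuffled)[0, 1]

print("identical height histograms:", same_marginal)
print(f"correlation, original pairing: {corr_paired:.2f}")
print(f"correlation, shuffled pairing: {corr_shuffled:.2f}")
```

Both versions of the data would produce the same pair of histograms, yet only the original pairing shows a strong relationship between width and height.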

This brings us to the second type of plot: what if we would like to see the joint distribution of width and height, ie, which height measurements occur together with which width measurements? The plot of choice for such a situation is a (two-dimensional) scatter plot, in which we draw one point per row with x-coordinate equal to the first dimension (width) and y-coordinate equal to the second dimension (height). Here we have drawn each point with some transparency by adjusting the alpha channel of the data points, so that regions where many data points overlap appear as denser colours in the chart:
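A scatter plot with transparency could be produced roughly as follows. As before, the data is a synthetic stand-in (not the real columns), and the marker size and `alpha=0.2` are arbitrary values chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in sizes in mm (NOT the real data).
rng = np.random.default_rng(seed=3)
width = rng.lognormal(mean=6.6, sigma=0.6, size=3000)
ratio = rng.choice([0.75, 1.3], size=3000) * rng.normal(1.0, 0.05, size=3000)
height = width * ratio

fig, ax = plt.subplots(figsize=(6, 6))
# alpha < 1 makes each marker semi-transparent, so overlapping points
# add up to a darker colour where the data is dense.
points = ax.scatter(width, height, s=10, alpha=0.2)
ax.set_xlabel("width (mm)")
ax.set_ylabel("height (mm)")
ax.set_title("width vs height")
```

Lower `alpha` values help when many points overlap, while higher values keep isolated points visible, so the setting usually needs some experimentation.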

The left scatter plot shows our data in a coordinate system that you will probably find familiar. We see that a majority of the data is concentrated or (more technically) clustered in the bottom left corner, which makes it hard to discern any potential patterns. You might, however, be able to make out one prominent pattern: the data points cluster around two straight lines that diverge from each other.

We can make this pattern more visible by plotting our data in a log-log coordinate system as shown on the right, ie, both the x and y axes undergo a logarithmic transformation. This means that the axis values do not increase by a fixed additive increment per tick (eg 2,000 mm per tick, as on the x-axis on the left) but by a fixed multiplicative factor (eg a factor of 10 per tick). The result is that our data appears more spread out, and we can now clearly see a pattern in the form of two parallel lines in the log-log plot on the right.
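One way to see why diverging lines become parallel is the following. If a group of paintings shares a roughly constant aspect ratio r, then height = r × width, and on logarithmic axes this becomes log(height) = log(width) + log(r): a line with slope 1 whose offset depends only on r. The sketch below illustrates this on synthetic stand-in data; the two ratios (1.3 and 0.75) and the group labels are assumptions of the sketch, not values taken from the collection.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in sizes in mm (NOT the real data): two groups with
# different, roughly constant aspect ratios (hypothetical values).
rng = np.random.default_rng(seed=4)
width = rng.lognormal(mean=6.6, sigma=0.6, size=3000)
is_tall = rng.random(3000) < 0.5
ratio = np.where(is_tall, 1.3, 0.75) * rng.normal(1.0, 0.05, size=3000)
height = width * ratio

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(12, 6))
for ax in (ax_lin, ax_log):
    ax.scatter(width, height, s=10, alpha=0.2)
    ax.set_xlabel("width (mm)")
    ax.set_ylabel("height (mm)")
ax_lin.set_title("linear axes")
ax_log.set_xscale("log")
ax_log.set_yscale("log")
ax_log.set_title("log-log axes")

# In log space each group follows log(height) = log(width) + log(ratio):
# a straight line with slope 1 whose offset depends only on the ratio.
fits = {}
for name, mask in (("tall", is_tall), ("wide", ~is_tall)):
    slope, offset = np.polyfit(np.log10(width[mask]), np.log10(height[mask]), 1)
    fits[name] = (slope, offset)
    print(f"{name}: slope {slope:.2f}, offset {offset:.2f}")
```

Both fitted slopes come out close to 1 while the offsets differ, which is what two parallel lines in a log-log plot look like numerically.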

This is a promising lead in our effort to classify paintings: we just found that according to their dimensions, there seem to be two large groups of paintings and we can hope that these groups roughly correspond to landscapes and portraits.
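As a first, deliberately crude sketch of this idea, we could label each painting by comparing its width and height. The function name `orientation`, the labels and the example sizes below are all hypothetical, and whether the resulting groups really correspond to landscapes and portraits would still need to be checked against the actual artworks.

```python
import numpy as np


def orientation(width, height):
    """Label each painting 'wide' or 'tall' by comparing width and height.

    Hypothetical rule for illustration; squares are arbitrarily labelled 'tall'.
    """
    width = np.asarray(width, dtype=float)
    height = np.asarray(height, dtype=float)
    return np.where(width > height, "wide", "tall")


# Made-up example sizes in mm.
widths = [1200, 600, 900]
heights = [800, 900, 900]
labels = orientation(widths, heights)
print(labels)
```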

Interact with various Matplotlib plots. (30 mins)
In this task, you will be introduced to Matplotlib, a Python library for producing plots and visualisations of data. You will learn the basics of plotting and explore common plots, including histograms, scatter plots, bar charts and custom visualisations.
At the end of the task, you will be familiar with the basics of visualisation and plotting to aid in identifying patterns in our data.