
# Convolutional layers

A short article describing convolutional layers in convolutional neural networks

As the name suggests, convolutional layers are the building blocks at the heart of convolutional neural networks. But what are they, and how do they work?

## Convolution

If you completed one of our preceding courses, Introduction to Image Analysis for Plant Phenotyping, you will be familiar with the idea of image convolution via the use of filters.

In summary, an image filter looks at every pixel in an image in turn, along with a fixed set of its neighbours, and performs a simple calculation to produce a new number representing that pixel in a new image. Most commonly, the set of pixels used for each pixel is the pixel itself, along with all its immediate neighbours in a 3 x 3 grid.

The dimensions of this set of pixels are known as the kernel size. Though 3 x 3 is probably the most common, other kernel sizes may be used. Generally the dimensions will be odd numbers, so that the pixel of interest always sits at the centre of the kernel (e.g. 5 x 5, 7 x 7, etc.).

Once the kernel size is set, the result of the convolutional filter is found by multiplying each value in the kernel of a given pixel by some predetermined weight and adding them up.
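As a minimal sketch (our own illustration, not from the course materials), this weighted-sum calculation for a single output pixel can be written in plain Python, where `patch` is the k x k grid of pixel values around the pixel of interest and `kernel` holds the predetermined weights:

```python
def convolve_at(patch, kernel):
    """Weighted sum over a k x k patch of pixel values: multiply each
    pixel by the matching kernel weight and add the results."""
    return sum(
        patch[r][c] * kernel[r][c]
        for r in range(len(kernel))
        for c in range(len(kernel[0]))
    )

# An 'identity' kernel simply returns the centre pixel unchanged.
patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
identity = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
print(convolve_at(patch, identity))  # 5
```

A real filter would slide this calculation over every pixel in the image; here we only show the calculation at one position.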

In the video the Sobel filter was given as an example of a convolutional filter, which is used to detect edges in image data. There are two Sobel filters, one to detect edges in the X direction:

```
 1   0  -1
 2   0  -2
 1   0  -1
```

And one to detect edges in the Y direction:

```
 1   2   1
 0   0   0
-1  -2  -1
```

So if, for example, the pixel values in the kernel for a particular pixel were:

```
20   8   4
15   6   2
10   3   1
```

then applying the Sobel X filter would give a value of:

[(1 × 20) + (0 × 8) + (-1 × 4)] + [(2 × 15) + (0 × 6) + (-2 × 2)] + [(1 × 10) + (0 × 3) + (-1 × 1)] = 51

for the centre pixel in the new convolved image.

Similarly, applying the Sobel Y filter would give a value of:

[(1 × 20) + (2 × 8) + (1 × 4)] + [(0 × 15) + (0 × 6) + (0 × 2)] + [(-1 × 10) + (-2 × 3) + (-1 × 1)] = 23

for the centre pixel in the image produced by the Sobel Y filter.

So in this case the filters have detected a stronger edge in the X direction than in the Y direction. In practice the results of the image filters will be normalised to keep the values within some fixed range (e.g. between 0 and 255, so they can be displayed as a grayscale image).
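The two worked Sobel examples above can be checked in plain Python (a sketch, with the pixel values and kernel weights taken directly from the example):

```python
# The 3 x 3 patch of pixel values from the worked example.
patch = [[20, 8, 4],
         [15, 6, 2],
         [10, 3, 1]]

# The two standard Sobel kernels.
sobel_x = [[1, 0, -1],
           [2, 0, -2],
           [1, 0, -1]]
sobel_y = [[ 1,  2,  1],
           [ 0,  0,  0],
           [-1, -2, -1]]

def apply_kernel(patch, kernel):
    """Multiply each pixel by the matching kernel weight and sum."""
    return sum(p * w
               for prow, krow in zip(patch, kernel)
               for p, w in zip(prow, krow))

print(apply_kernel(patch, sobel_x))  # 51
print(apply_kernel(patch, sobel_y))  # 23
```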

In the Sobel example above we took a single number for each pixel, as in a grayscale image, and used nine weights for the 3 x 3 kernel to produce a single output number for each pixel. However, you can also apply convolutional filters to three-channel RGB data; you will just need to define more weights. So for a 3 x 3 kernel with 3-channel input data you will need to define 3 x 3 x 3 = 27 weights.
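In plain Python, that weight count works out as:

```python
# One weight per kernel position per input channel.
kernel_size = 3
in_channels = 3  # RGB
weights_per_filter = in_channels * kernel_size * kernel_size
print(weights_per_filter)  # 27
```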

## Convolutional layers

Convolutional layers, then, are a powerful way to combine the two ideas we have discussed already – artificial neural network layers and convolutional filters.

In the image filters we have talked about up until now the weights in the filters are predetermined to pick out specific features in an image. So the Sobel operator aims to pick out the edges in an image, and the result can then be used to train a multilayer perceptron or other machine learning model. But what if the weights of the image filters could themselves be trained to get the best results? This is the main idea behind convolutional layers.

In convolutional layers, we can take image data as an input (this can be one-channel grayscale or three-channel RGB) and output several different versions of that data, each processed with a different convolutional filter. However, the weights in those convolutional filters are trainable parameters in the network, rather than being predetermined like those in a Sobel filter.

So in the simplest case, we might have a convolutional layer that takes one-channel grayscale data and outputs a single channel using a 3 x 3 kernel. In this case there would be 1 (input channel) x 3 x 3 (kernel) x 1 (output channel) = 9 trainable weights in the layer.

More commonly though, to maximise our chances of learning something useful, there are many more output convolution channels, and we could also be using 3-channel RGB colour image data.

Let’s say we have a convolutional layer with 8 output channels working on RGB data, as shown in the image above, using a 3 x 3 kernel. Then for this layer we would have 3 x 3 x 3 x 8 = 216 trainable weights in the layer (plus a constant ‘bias’ weight for each output channel).

Rather than 3-channel image data, the output of this layer is effectively 8-channel image data, nearly three times as large. What’s more, a key feature of deep learning is lots of layers (hence the ‘deep’ part). If, for example, we add another convolutional layer, this one taking the 8 channels produced by the first layer but outputting 64 channels using a 3 x 3 kernel, we will now have 8 x 3 x 3 x 64 = 4608 trainable weights in this new layer.
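The parameter counts for the two layers above can be computed with a short helper (the function name is our own, not from any library):

```python
def conv_layer_params(in_channels, out_channels, kernel_size):
    """Count trainable parameters in a 2D convolutional layer.

    Each output channel has one kernel spanning every input channel
    (in_channels * k * k weights) plus a single bias term.
    """
    weights = in_channels * kernel_size * kernel_size * out_channels
    biases = out_channels
    return weights, biases

# The two layers described above:
print(conv_layer_params(3, 8, 3))    # (216, 8)   RGB in, 8 channels out
print(conv_layer_params(8, 64, 3))   # (4608, 64) 8 channels in, 64 out
```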

As you can see, the number of trainable parameters and the amount of data flowing through a CNN can grow very rapidly, which is why CNNs are very computationally expensive to train.

We’ll show you how to make convolutional layers in PyTorch later in this week’s course.

Images (c) The University of Nottingham