
What are Convolutional Neural Networks?

A Convolutional Neural Network (ConvNet / CNN) is a Deep Learning algorithm that, given an input image, can assign importance to various aspects or objects in the image and differentiate one from another. The pre-processing required by a CNN is much lower than for other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, ConvNets or CNNs can learn these filters/characteristics themselves.

Here we will look at Convolutional Neural Networks to see how they work in more detail. The building blocks of CNNs are convolution layers.

Tensors

The input to a convolution layer is a 3-dimensional array. In mathematics this is called an order-3 tensor. A tensor has two spatial dimensions (the height, H, and width, W) and a feature dimension comprising one or more (C) feature channels for an overall size of H x W x C. For a colour input image, the tensor would have size H x W x 3, with the C=3 channels storing the red, green and blue colours. For a grayscale input image, the size would be H x W x 1.
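These shapes are easy to see in code. A minimal sketch in NumPy (the 224x224 image size here is just an illustrative choice, not anything from the text):

```python
import numpy as np

# Hypothetical sizes: a 224x224 image as an order-3 tensor of size H x W x C.
H, W = 224, 224
colour = np.zeros((H, W, 3))      # C=3 channels: red, green, blue
grayscale = np.zeros((H, W, 1))   # C=1 channel

print(colour.shape)     # (224, 224, 3)
print(grayscale.shape)  # (224, 224, 1)
```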

Inside the network, we will usually derive many features so the number of channels at a deeper convolution layer might be much more than 3.

Inside a convolution layer, we store some weights. Unlike a neuron in an MLP, however, we do not need a different weight for every input. Instead, the weights are stored in small tensors called filters or kernels. These have different spatial dimensions to the input but must have the same number of channels as the input. We'll call the filter dimensions (H_f) x (W_f) x (C). Usually the spatial dimensions of the filter are much smaller than those of the input.

Convolution

We now convolve the input tensor with each of the filters. This involves centering the filter over a pixel in the input image and then combining the values in the filter with the values in the image in the region covered by the filter. To combine the values, we simply multiply values at corresponding positions and add all of those products up. Hence, the convolution at one location produces a single output value.

We repeat this process, sliding the filter over the image left to right, top to bottom. We get an output value for each pixel location, giving us two spatial dimensions in the output. We also repeat this whole process for each filter, giving us an output channel for each of the F filters in the layer.
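The multiply-and-sum operation described above can be sketched in plain NumPy. This is a minimal, unoptimised version for a single filter with stride 1 and no padding; note that what CNNs call "convolution" is, strictly, cross-correlation (the filter is not flipped), and that is what this sketch computes:

```python
import numpy as np

def convolve_single(x, f):
    """Naive CNN-style convolution of one filter over one input tensor.

    x: input tensor of shape (H, W, C)
    f: filter of shape (H_f, W_f, C) -- same channel count as the input
    Returns an output of shape (H - H_f + 1, W - W_f + 1): one value
    per position the filter fully fits over.
    """
    H, W, C = x.shape
    H_f, W_f, _ = f.shape
    out = np.zeros((H - H_f + 1, W - W_f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply values at corresponding positions and sum the products.
            out[i, j] = np.sum(x[i:i + H_f, j:j + W_f, :] * f)
    return out

x = np.random.rand(8, 8, 3)   # hypothetical 8x8 colour input
f = np.random.rand(3, 3, 3)   # one 3x3 filter with matching C=3
print(convolve_single(x, f).shape)  # (6, 6)
```

Repeating this for each of the F filters, and stacking the results along the channel dimension, would give the full F-channel output of the layer.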

There are a number of design decisions to make when specifying a convolution layer. These determine the size of the output tensor:

1. How much should we move the filter each time we slide it over the image? This is called the stride. Usually, we move it one pixel at a time, i.e. stride=1.
2. How should we handle the boundary? If we wish to centre the filter over pixels on or near the boundary, some of the filter may lie outside the input tensor. This is handled by padding, which requires us to decide what values to use for the missing pixels. Usually, we just use zeros (called zero padding) or sometimes replicate the value closest to the boundary.
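Together, the filter size, stride and padding determine the output's spatial size via the standard formula floor((n + 2 * padding - filter_size) / stride) + 1, applied to each spatial dimension. A small sketch:

```python
def conv_output_size(n, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension.

    Standard formula: floor((n + 2*padding - filter_size) / stride) + 1.
    """
    return (n + 2 * padding - filter_size) // stride + 1

# A 3x3 filter with stride 1 and zero padding 1 preserves the input size.
print(conv_output_size(224, 3, stride=1, padding=1))  # 224
# The same filter with no padding shrinks the output slightly.
print(conv_output_size(224, 3, stride=1, padding=0))  # 222
# A stride of 2 roughly halves the output.
print(conv_output_size(224, 3, stride=2, padding=1))  # 112
```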

A very common setup for a convolution layer is (H_f = W_f = 3), stride=1 and padding=1 with zero padding. Together, these choices mean that the output has the same spatial dimensions as the input. Specifically, the output has size H x W x F.

Like in an MLP, a nonlinear activation function (most commonly ReLU) is applied to every convolution output. This is usually considered part of the convolution layer.
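As a quick sketch, ReLU is simply max(0, x) applied elementwise, so negative convolution outputs are clamped to zero while positive ones pass through unchanged:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])  # hypothetical convolution outputs
relu = np.maximum(x, 0)               # elementwise: negatives become 0
print(relu)                           # only the 1.5 survives unchanged
```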

Pooling

As an image passes through a CNN, the features become more abstract. At the input layer, the features are simply pixel colour values. At the output layer, we provide a high-level description such as the class of object in the image. For this reason, the spatial resolution generally decreases as we get deeper into the network. This is achieved by a process called pooling. We simply replace a block of pixels (usually a 2×2 block) with a single value, usually either the mean or the maximum of the feature values within the block. Using the maximum (max pooling) gives us some spatial invariance: we’re saying that we don’t really mind precisely where we saw a feature, we just want to know if it was present.
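Max pooling over 2×2 blocks can be sketched with a reshape trick in NumPy (a minimal version that assumes H and W are even):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on an (H, W, C) tensor; H and W assumed even."""
    H, W, C = x.shape
    # Group the pixels into 2x2 blocks, then take the max within each block.
    blocks = x.reshape(H // 2, 2, W // 2, 2, C)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool_2x2(x)[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
```

Each 2×2 block collapses to its largest value, halving both spatial dimensions while leaving the channel count unchanged.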

Convolutional Neural Networks

A CNN is constructed by combining many such convolution layers, one after another. These layers are often followed by a number of fully connected layers with ReLU activation until the final layer which has the desired number of output neurons and usually no ReLU activation.

As a concrete example, consider a very famous architecture, VGG-16, from 2015 (Simonyan and Zisserman; see References). It has 13 convolution layers followed by 3 fully connected layers, the last of which is an output layer with 1,000 neurons (for object recognition with 1,000 possible classes).
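Since VGG-16's convolutions all use the size-preserving 3×3, stride 1, padding 1 setup described above, the spatial size only changes at the 2×2 max pool that ends each of its five convolutional stages. Tracing the shapes for a 224×224 input (stage layout and channel counts as in the VGG paper):

```python
def vgg16_feature_shapes(h=224, w=224):
    """Trace the feature tensor's shape through VGG-16's conv stages.

    Each stage is some number of size-preserving 3x3 conv layers
    followed by one 2x2 max pool, which halves H and W.
    """
    stages = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]  # (convs, channels)
    shapes = []
    for n_convs, channels in stages:
        h, w = h // 2, w // 2   # the 2x2 max pool after each stage
        shapes.append((h, w, channels))
    return shapes

print(vgg16_feature_shapes())
# [(112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512), (7, 7, 512)]
```

The final 7×7×512 tensor is what gets flattened and fed into the three fully connected layers. Note the stage counts sum to the 13 convolution layers mentioned above.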

References

Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of ICLR. 2015.