Deep learning on images

An image is simply a rectangular array of pixels. Dr Kofi Appiah explains how we can process and understand that data.
An image is simply a rectangular array of pixels. At each pixel we store either a single grayscale value or three values for an RGB colour image. Even a small image contains tens of thousands of pixels, and an image taken by the camera on your phone will contain millions. This presents a challenge for building neural networks with the MLP architecture you’ve seen so far. Imagine that an input image contains 1 million grayscale pixels. We could ignore the fact that it is an image, treat it as a vector of 1 million values, and use that as input to an MLP. However, this would cause two problems.
First, each node in the first layer of our MLP would require 1 million weights. For problems of typical complexity, we might want 1000 nodes per hidden layer, so we would need 1 billion weights for the first layer alone! This quickly becomes infeasible to train: we have far too many parameters to learn and will overfit to our training data. The second problem is that some changes to an image should not affect the result we want the network to predict. For example, if you are classifying cat images, you don’t care where the cat appears in the image; the correct label is still cat.
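The arithmetic behind that parameter count can be checked directly. This is a minimal sketch, assuming the sizes used above (a 1-megapixel grayscale image and 1000 hidden nodes):

```python
# Parameter count for the fully connected (MLP) approach described above.
pixels = 1_000_000        # input vector length: one value per grayscale pixel
hidden_nodes = 1_000      # assumed size of the first hidden layer

# Every hidden node needs one weight per input pixel, plus one bias term.
dense_weights = pixels * hidden_nodes
dense_params = dense_weights + hidden_nodes

print(f"first-layer weights:    {dense_weights:,}")   # 1,000,000,000
print(f"first-layer parameters: {dense_params:,}")    # 1,000,001,000
```

A billion parameters in a single layer is far more than typical image datasets can constrain, which is why this layer would overfit.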
But if you do move the cat around, the vector input to the MLP changes completely, and the network would need to see many such examples to learn that this change does not matter. Both of these problems are addressed by two observations. First, we can start by looking at small regions of the image to extract what we call local features. As we go deeper into the network, these local features can be aggregated into increasingly global features. Second, we probably want to look for the same features in multiple parts of the image. For example, if we’re learning to classify cat images, we might learn an eye feature and use it to detect both of the cat’s eyes.
Or we might learn more basic features, such as lines or edges, that can be encountered anywhere in the image. Based on these two observations, we arrive at the standard method for doing deep learning on images. First, instead of connecting every pixel to every node in the next layer, we only use the pixels in a small local region around each pixel. Second, we use a concept called weight sharing: the same weights are used for every local region. You can think of this as sliding a small window of weights around on our image. Each small window of weights, called a filter, requires only a very small number of parameters, since filters are usually just 3 pixels by 3 pixels in size.
This process is called convolution, and when we stack multiple such layers we call the result a convolutional neural network (CNN).
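The sliding-window idea above can be sketched in a few lines of NumPy. This is an illustrative implementation, not a production one; the specific filter values are a hypothetical edge-detecting example:

```python
import numpy as np

def convolve2d(image, filt):
    """Slide a small filter over the image, reusing the same weights at
    every position (weight sharing). 'Valid' convolution: the output
    shrinks by (filter size - 1) in each dimension."""
    fh, fw = filt.shape
    oh = image.shape[0] - fh + 1
    ow = image.shape[1] - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Weighted sum of the local region under the window.
            out[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt)
    return out

# A hypothetical 3x3 vertical-edge filter: only 9 shared weights,
# no matter how large the image is.
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

image = np.random.rand(100, 100)          # stand-in for a grayscale image
features = convolve2d(image, edge_filter)
print(features.shape)                      # (98, 98)
```

Note the contrast with the MLP: this layer has 9 weights rather than a million per node, and moving the cat around the image simply moves its features around the output, rather than scrambling the input vector.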

This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems.

Created by
FutureLearn - Learning For Life
