Skip main navigation

Deep learning on images

An image is simply a rectangular array of pixels. Dr Kofi Appiah explains how we can process and understand that data.
5.6
An image is simply a rectangular array of pixels. At each pixel we either store a single grayscale value or three values for an RGB colour image. Even a small image contains 10s of thousands of pixels. An image taken by the camera on your phone will contain millions of pixels. This presents a challenge for building neural networks using the MLP architecture you’ve seen so far. Imagine that an input image contains 1 million grayscale pixels. We could ignore the fact that it is an image and just treat this as a vector of 1 million values and use it as input to an MLP. However, this would cause two problems.
51.6
First, each node in the first layer of our MLP would require 1 million weights. For problems of typical complexity, we might want 1000 nodes per hidden layer. We would therefore need 1 billion weights just for the first layer! This quickly becomes infeasible to train - we have too many parameters to learn and will overfit to our training data. The second problem is that there are some ways to change an image that might not affect the result we want the network to predict. For example, if you are classifying cat images, you don’t care if the cat is moved around in the image - the correct label is still cat.
96.2
But if you do move the cat around, the vector input to the MLP would completely change and the network would need to see many such examples to learn that this change does not matter. Both of these problems are fixed by the following observations. First, we can start by looking at small regions of the image to extract what we call local features. As we go deeper into the network, these local features can be aggregated to produce increasingly global features. Second, we probably want to look for the same features in multiple parts of the image. For example, if we’re learning to classify cat images, we might learn an eye feature and use it to detect both of the cat’s eyes.
146.3
Or we might learn more basic features such as lines or edges that might be encountered anywhere in the image. Based on these two observations, we arrive at the standard method for doing deep learning on images. First, instead of connecting every pixel to every node in the next layer, we only use pixels in a small local region around each pixel. Second, we use a concept called weight sharing - the same weights are used for every local region. You can think of sliding this small window of weights around on our image. Each small window of weights - called a filter - only requires a very small number of parameters since they are usually only 3 pixels by 3 pixels in size.
194
This process is called convolution and when we stack multiple such layers we call it a convolutional neural network.

An image is simply a rectangular array of pixels.

Dr Kofi Appiah explains how we can process and understand that data.

This article is from the free online

Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems

Created by
FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education