
# Backpropagation and SGD

Dr Will Smith explains backpropagation and stochastic gradient descent.

We have seen that, if we can compute the gradient of the loss function with respect to the parameters (weights and biases) of our MLP, we can perform gradient descent and train our network. Remember, the MLP itself is nothing but a rather complicated mathematical function. It is computed by applying lots of functions one after another (composition). How do we compute the gradients of such functions? The answer is something you might already have come across in your mathematical studies: the chain rule!

Let's start with a simple example. Define a function of two variables:

$u = f(x,y) = x^2 + xy + y^2$

Then let’s define another function:

$z = g(u) = u^2 + u$

So that we can write z as a composition of one function followed by the other:

$z = g(f(x,y))$

Now, how would we compute the partial derivatives $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ that we need to do gradient descent? The answer is wonderfully simple and is given by the chain rule:

$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u}\frac{\partial u}{\partial x}$

Each of the two terms on the right is easy to compute using basic differentiation:

$\frac{\partial z}{\partial u} = 2u + 1$

$\frac{\partial u}{\partial x} = 2x + y$

Finally, we simply multiply the two together, substituting $u = f(x,y)$, and we're done. We can use the same process for the other partial derivative.
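This chain rule calculation is easy to check numerically. The sketch below uses the same $f$ and $g$ as above and compares the analytic derivative $\frac{\partial z}{\partial x}$ with a finite-difference estimate (the example values of $x$ and $y$ are arbitrary):

```python
# Chain rule example: z = g(f(x, y)) with f(x,y) = x^2 + xy + y^2 and g(u) = u^2 + u.
def f(x, y):
    return x**2 + x*y + y**2

def g(u):
    return u**2 + u

def dz_dx(x, y):
    u = f(x, y)
    dz_du = 2*u + 1       # derivative of g with respect to u
    du_dx = 2*x + y       # partial derivative of f with respect to x
    return dz_du * du_dx  # chain rule: multiply the two together

# Check against a central finite-difference approximation:
x, y, h = 1.0, 2.0, 1e-6
numeric = (g(f(x + h, y)) - g(f(x - h, y))) / (2 * h)
print(dz_dx(x, y), numeric)  # the two values should agree closely
```

The same pattern (differentiate each piece, then multiply) is exactly what backpropagation automates for much deeper compositions.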

### Backpropagation

So how do we use this approach in the much more complicated composition of functions we encounter in an MLP? The answer is repeated application of the chain rule, working backwards from the loss function through the MLP. We start by taking our input and running it forwards through the network; this is called a forward pass. It tells us the loss value for the current network parameters. Then we work function by function backwards through the network, computing the derivative of the loss with respect to each intermediate value and parameter. This is called a backwards pass. Whenever we compute a derivative with respect to a weight or bias, we can take a gradient descent step for that parameter, i.e. update the parameter, before continuing with the backwards pass.

This process is called backpropagation or backprop. So long as we know how to differentiate each of the component functions within our MLP and loss function, then we can always use the chain rule to work our way backwards through the network, computing the derivatives we need for gradient descent.

There is one final important detail to know about how we train neural networks. Until now, we’ve shown that we sum the loss over all training samples. This means that to perform one iteration of gradient descent we need to pass our entire dataset through the network and then do the chain rule starting with a loss function that is summed over the entire dataset. This causes two problems:

1. Our dataset is likely to be huge. If we try to process the entire dataset in one go, we’re going to run out of memory.
2. It’s likely that the loss summed over all training samples will contain bad local minima. Intuitively, when we consider the entire dataset at once, any improvement we can make with respect to one training sample is likely to make performance worse for another.

We can crack both of these problems in one go using a technique called stochastic gradient descent (SGD). The idea is that for each iteration of gradient descent, we do not use all training examples. Instead, we randomly select a small subset of the training examples, called a mini-batch. For example, our dataset might contain one million images but a mini-batch might only contain 8 images.

We run one iteration of gradient descent using only this mini-batch of training examples to compute the loss. You can think of the gradient we compute from a single mini-batch as a noisy estimate of the true gradient we would have got if we’d used the whole dataset.

Then we select a new random mini-batch and repeat the process. We keep doing this, choosing training examples that haven’t been used yet. Once we’ve run through the whole dataset, we have completed what’s called an epoch. For the next epoch, we will choose different random mini-batches.
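The epoch structure described above can be sketched on a toy problem. Here the model $y = wx$ is fitted by SGD; the data, learning rate, and batch size are all illustrative assumptions:

```python
import random

# SGD sketch: each iteration uses only one random mini-batch, and one epoch
# visits every training example exactly once before reshuffling.
random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 101)]  # 100 toy samples of y = 3x
w, lr, batch_size = 0.0, 1e-5, 8              # illustrative hyperparameters

for epoch in range(10):
    random.shuffle(data)                      # new random mini-batches each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]        # one mini-batch
        # Gradient of the mean squared loss over this mini-batch only --
        # a noisy estimate of the full-dataset gradient:
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                        # one gradient descent step

print(w)  # approaches the true value 3.0
```

Only `batch_size` samples are ever held in memory per step, which is the fix for problem number one.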

It’s obvious that this fixes problem number one: the data used for a single iteration is now tiny, so we won’t run out of memory. But how does it help with problem number two? The answer is that the local minima in the loss function for one mini-batch are probably totally different from those for another mini-batch. So, even if we would get stuck in a local minimum using only one mini-batch, we can escape it by continuously changing our training data.