£199.99 £139.99 for one year of Unlimited learning. Offer ends on 28 February 2023 at 23:59 (UTC). T&Cs apply

Gradient Descent is an optimising algorithm used in Deep Learning algorithms. The goal of Gradient Descent is to minimise the objective convex function f(x) using iteration.

Gradient Descent is an optimising algorithm used in Deep Learning algorithms. The goal of Gradient Descent is to minimise the objective convex function f(x) using iteration.

It’s time to get our hands dirty and dive into the theoretical detail of how we actually optimise parameters in order to learn from our training data. This is very challenging from a mathematical perspective. But the good news is, it is based entirely on A-level standard maths and might provide you with some inspiration that the pure mathematics you may have studied actually has some hugely important practical applications!

First, let’s remember what we’re trying to do. Our MLP has lots of parameters (the weights and biases of every neuron). We’ll call all of these together (mathbf{w}) (you can think of writing all the parameters in a long list, i.e. (mathbf{w}) is the vector of all of the parameters). Now, recall that the goal is to solve the following optimisation problem:

Find the (mathbf{w}) minimises (L(mathbf{w})) where:

[L(mathbf{w}) = sum_{i=1}^n E(f_{mathbf{w}}(x_i) ,y_i)]

With (E) being our loss function, (L) the total loss, (y_i) a label, (x_i) an input and (f_{mathbf{w}}) representing our parametric function, for example an MLP. We’re going to make an initial guess for (mathbf{w}) by using random values. It turns out that the precise way you choose those random values is important and there are now standard tricks for doing it, but the details of these don’t matter to us for now.

Now, we need to know how to change our current estimate of (mathbf{w}) so that the loss reduces. If we can keep doing this, we’ll gradually improve the performance of our MLP. An adjustment to w that reduces the loss is called a descent direction. We’ll now see the simplest way to compute such a direction.

We will use as our descent direction the gradient of our loss with respect to the parameters of our MLP. In case you’ve never heard of a gradient, this is a generalisation of the derivative when you have more than one input.

[z(x,y) = x^2 + xy + y^2]

This function has two inputs: x and y. We can pretend it only has one input (let’s say (x)), treat the other input as if it were a constant value and then do standard differentiation. We call this a partial derivative and write it using a symbol that looks a bit like a d:

[frac{partial z}{partial x}(x,y) = 2x + y]

Hopefully that looks easy to you from differentiation you may have studied before. Now let’s write down the other partial derivative (with respect to y this time):

[frac{partial z}{partial y}(x,y) = 2y + x]

Finally, we can put these partial derivatives together into a vector that we call the gradient and denote by an upside down triangle:

[nabla z(x,y) = left[ frac{partial z}{partial x}(x,y), frac{partial z}{partial y}(x,y) right] = [2x + y, 2y +x]]

Now, here’s the clever part. The gradient effectively tells us which direction is “uphill” at any given point. In other words, what small change would we need to make to our parameters to increase the function value. Since we want to reduce loss, we will go in the opposite direction, i.e. the negative of the gradient. Repeatedly stepping in the negative gradient direction is called gradient descent.

Let’s say after t iterations, we have estimates of the parameters (x_t) and (y_t). To get the estimates at the next iteration (t+1), which will have a smaller function value, we do:

[left[x_{t+1}, y_{t+1}right] = left[x_t, y_tright] – gammanabla z(x_t,y_t)]

The value (gamma) is called the step-size or (more commonly in machine learning) the learning rate. This will determine how fast our optimisation reduces the loss. Learning rate is a hyperparameter – a value that we need to choose manually that will affect the training of our system. Too big and we’ll overshoot. Too small and it might take a very long time to reach a good solution. If we find a place where the gradient is all zeros we can stop as we have found a local minimum. This is a place where the loss can’t be reduced further by moving a small distance in any direction.

So, we now know how to do gradient descent for a very simple function.