
Optimisation and gradient descent

Optimising loss functions is the key to machine learning. Watch Dr Jenn Chubb explain more.
We’ve seen that deep neural networks provide a powerful way of representing functions for machine learning. We’ve also seen that the weights and biases in each neuron allow us to control the behaviour of the function, and that this behaviour determines the loss with respect to our training data. As we change our parameters, the loss of our network will change. You can think of this as a loss landscape. Imagine we only had two parameters. Then we could visualise this loss landscape as a 3D surface: hills represent bad parameter settings with high loss; valleys represent good solutions with lower loss. The process of learning is to find parameters that minimise the value of our loss function. So how do we actually do this?
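To make the loss-landscape picture concrete, here is a minimal sketch (the data, model, and grid are illustrative assumptions, not from the video): we fit a hypothetical two-parameter linear model and sample its mean-squared-error loss on a grid of parameter settings, so each grid point is one "location" in the landscape and its loss is the surface height there.

```python
import numpy as np

# Toy training data for a one-parameter-pair model y = w * x + b.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # generated by w = 2, b = 1

def loss(w, b):
    """Mean squared error of the model y = w*x + b on the training data."""
    pred = w * x + b
    return np.mean((pred - y) ** 2)

# Sample the loss landscape over a grid of (w, b) settings:
# high values are the "hills", low values the "valleys".
ws = np.linspace(-1.0, 5.0, 61)
bs = np.linspace(-2.0, 4.0, 61)
surface = np.array([[loss(w, b) for b in bs] for w in ws])

# The valley bottom of this surface sits at the true parameters.
i, j = np.unravel_index(surface.argmin(), surface.shape)
print(ws[i], bs[j])  # close to 2.0 and 1.0
```

Plotting `surface` as a 3D surface or contour map would give exactly the hills-and-valleys picture described above.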
This is done by a process called gradient descent. We look at the loss value with our current set of parameters and then ask: what small change would we need to make to the parameters of the network in order to slightly reduce the loss value? These small changes are called a descent direction. The most straightforward way to find a descent direction is to compute the gradient of the loss function with respect to our parameters. You may be familiar with the concept of the gradient of a function from your mathematical studies. If this process is repeatedly applied, we will gradually walk downhill until we cannot find a direction that reduces the loss any further. At this point, we say that we have converged to a minimum and we stop training.
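The update loop described above can be sketched in a few lines. This is an illustrative example on a hypothetical one-parameter loss, not the network training code from the course: we repeatedly step in the direction of the negative gradient until the parameter settles at the minimum.

```python
# Gradient descent on the toy loss f(p) = (p - 3)^2,
# whose gradient is f'(p) = 2 * (p - 3).
def grad(p):
    return 2.0 * (p - 3.0)

p = 0.0    # starting parameter value
lr = 0.1   # learning rate: how large each downhill step is

for _ in range(100):
    p -= lr * grad(p)  # step in the descent direction (negative gradient)

print(p)  # converges towards the minimum at p = 3
```

For a real network the same loop applies, except `p` is a vector of all weights and biases and the gradient is computed by backpropagation.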
However, we can’t say this is necessarily the best possible solution, which would be called the global minimum. Instead, all we can say is that we’ve reached a local minimum. The local minimum we end up in depends upon where we started, i.e. how we initialised our parameters. Perhaps surprisingly, the preferred way to initialise the parameters of a neural network is to use random values. It’s important that each neuron is initialised with different weights to encourage them to specialise in different ways, and randomness helps with this. One way that we attempt to overcome the problem of becoming stuck in suboptimal local minima is to also introduce randomness into our descent direction.
Instead of computing the gradient of the loss using all of our training data, we randomly select a small subset and only use this. This reduces the amount of computation needed and so is more efficient but also, in practice, leads to convergence to better solutions.

A simple optimisation algorithm is gradient descent.

This article is from the free online

Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems
