


Optimisation and gradient descent

Optimising loss functions is the key to machine learning. Watch Dr Jenn Chubb explain more.
We’ve seen that deep neural networks provide a powerful way of representing functions for machine learning. We’ve also seen that the weights and biases in each neuron allow us to control the behaviour of the function, and that this behaviour determines the loss with respect to our training data. As we change our parameters, the loss of our network will change. You can think of this as a loss landscape. Imagine we only had two parameters. Then we could visualise this loss landscape as a 3D surface: hills represent bad parameter settings with high loss; valleys represent good solutions with lower loss. The process of learning is to find parameters that minimise the value of our loss function. So how do we actually do this?
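To make the landscape idea concrete, here is a minimal sketch that evaluates an illustrative two-parameter quadratic loss (a stand-in, not a real network) over a coarse grid of parameter settings:

```python
import numpy as np

# Illustrative two-parameter loss with a single valley near (w, b) = (1, -2).
# This is a stand-in for a real network's loss, chosen so the landscape is easy to see.
def loss(w, b):
    return (w - 1.0) ** 2 + (b + 2.0) ** 2

ws = np.linspace(-4, 4, 5)   # candidate values for the first parameter
bs = np.linspace(-4, 4, 5)   # candidate values for the second parameter

# Each entry of this grid is the "height" of the landscape at one parameter setting.
surface = np.array([[loss(w, b) for b in bs] for w in ws])

print(surface.min())  # lowest point on this coarse grid → 1.0
```

In a real network there are millions of parameters, so the landscape cannot be plotted, but the same picture applies: each parameter setting has a loss "height", and learning means moving downhill.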
This is done by a process called gradient descent. We look at the loss value with our current set of parameters and then ask: what small change would I need to make to the parameters of my network in order to slightly reduce the loss value? Such a small change is called a descent direction. The most straightforward way to find a descent direction is to compute the gradient of the loss function with respect to our parameters. You may be familiar with the concept of the gradient of a function from your mathematical studies. If this process is repeatedly applied, we will gradually walk downhill until we cannot find a direction that reduces the loss any further. At this point, we say that we have converged to a minimum and we stop training.
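The loop described above can be sketched as follows, using an illustrative quadratic loss (not the course’s network) whose gradient we can write down by hand; the descent direction is the negative gradient:

```python
import numpy as np

# Illustrative "bowl"-shaped loss over two parameters, minimised at (3, -1).
def loss(params):
    w, b = params
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

# Its gradient, written by hand for this simple example.
def gradient(params):
    w, b = params
    return np.array([2.0 * (w - 3.0), 2.0 * (b + 1.0)])

def gradient_descent(params, lr=0.1, tol=1e-8, max_steps=10_000):
    """Repeatedly step opposite the gradient until the steps become negligible."""
    for _ in range(max_steps):
        step = lr * gradient(params)
        params = params - step          # descent direction = negative gradient
        if np.linalg.norm(step) < tol:  # converged: no direction reduces the loss further
            break
    return params

params = gradient_descent(np.array([0.0, 0.0]))
print(params)  # approaches the minimum at (3, -1)
```

The learning rate `lr` controls how large each downhill step is; too large and the walk overshoots the valley, too small and convergence is slow.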
However, we can’t say this is necessarily the best solution - which would be called the global minimum. Instead, all we can say is that we’ve reached a local minimum. The local minimum we end up in depends upon where we started, i.e. how we initialised our parameters. Perhaps surprisingly, the preferred way to initialise the parameters of a neural network is to use random values. It’s important that each neuron is initialised with different weights to encourage them to specialise in different ways and randomness helps with this. One way that we attempt to overcome the problem of becoming stuck in suboptimal local minima is to also introduce randomness into our descent direction.
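To see how initialisation determines which minimum we reach, here is a small sketch using an illustrative one-dimensional “double-well” loss with two minima; gradient descent from different starting points ends up in different valleys:

```python
# Illustrative double-well loss with two minima (not a real network's loss).
def f(x):
    return (x**2 - 1)**2 + 0.2 * x

# Its derivative, written by hand.
def df(x):
    return 4 * x * (x**2 - 1) + 0.2

def descend(x, lr=0.01, steps=5000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= lr * df(x)
    return x

# Different starting points converge to different minima:
print(descend(1.5))   # settles in the local minimum near x ≈ +0.97
print(descend(-1.5))  # settles in the global minimum near x ≈ -1.02
```

Neither run can “see” the other valley: once in a basin, plain gradient descent stays there, which is why initialisation matters.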
Instead of computing the gradient of the loss using all of our training data, we randomly select a small subset and use only this. This is known as stochastic gradient descent. It reduces the amount of computation needed, and so is more efficient, but in practice it also leads to convergence to better solutions.
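The minibatch idea can be sketched with a toy linear-regression problem (the synthetic data and parameter names here are illustrative): each update uses the gradient of the loss on a small random subset of the training data only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = 2x + 1 plus a little noise (illustrative).
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.standard_normal(200)

w, b = 0.0, 0.0          # randomly/zero-initialised parameters
lr, batch_size = 0.1, 16

for epoch in range(200):
    idx = rng.permutation(len(X))  # shuffle, then walk through small random subsets
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb
        # Gradients of the mean squared error on this minibatch only
        w -= lr * 2 * np.mean(err * xb)
        b -= lr * 2 * np.mean(err)

print(w, b)  # close to the true values 2.0 and 1.0
```

Each minibatch gradient is a noisy estimate of the full gradient, and that noise is exactly the randomness in the descent direction described above: it is cheap to compute and can jog the parameters out of poor local minima.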

A simple optimisation algorithm is gradient descent.

This article is from the free online

Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems

Created by
FutureLearn - Learning For Life
