
Loss Functions in Machine Learning

One of the key factors in machine learning is the loss function. This tells the machine learning algorithm how well the trained system is currently performing. The goal of learning is to reduce the value of this loss function, i.e. to make our machine perform better.

Supervised Learning

In supervised learning, our training data provides us with the correct or desired output – known as the label – for each corresponding input. The loss function compares the label against the output that our system currently predicts. It gives us back a non-negative number indicating the disagreement or error between the desired and predicted outputs. A loss value of zero means perfect performance.

Squared Error

Loss functions come in many flavours depending on the task being solved. Perhaps the simplest loss function is the squared error. For example, consider a system that is learning to predict the age of a person from a photograph. It outputs its current guess in years (21.3), and we compare this against the person’s known age in years (39.9). We take the difference (21.3 - 39.9 = -18.6) and square it (345.96). The machine learning algorithm now knows that the system is not performing perfectly for this input and will try to adapt the machine to improve performance. We’ll start to understand how this is done below.
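The arithmetic in the age example can be checked with a few lines of Python. This is only a minimal sketch: the function name `squared_error` and the variable names are chosen here for illustration and do not come from any particular library.

```python
def squared_error(predicted: float, actual: float) -> float:
    """Return the squared difference between a prediction and its label."""
    difference = predicted - actual
    return difference ** 2


predicted_age = 21.3  # the system's current guess, in years
actual_age = 39.9     # the label: the person's known age, in years

loss = squared_error(predicted_age, actual_age)

print(f"difference: {predicted_age - actual_age:.1f}")  # difference: -18.6
print(f"squared error: {loss:.2f}")                     # squared error: 345.96
```

Because the difference is squared, the loss is never negative, and it is zero only when the prediction matches the label exactly.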

In practice, of course, we don’t have just one training example; we have lots (usually tens of thousands or even millions for a computer vision problem). So we need to combine the loss values for all training examples. Typically, this means either taking the average (giving us the mean squared error in this case) or summing them up (giving the sum of squared errors).
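As a rough illustration of the two ways of combining losses, the sketch below computes both the sum of squared errors and the mean squared error. The three predicted/actual age pairs are made up for this example (the first pair reuses the numbers from the age example above).

```python
# Hypothetical predictions and labels (ages in years), for illustration only.
predicted_ages = [21.3, 30.0, 52.5]
actual_ages = [39.9, 28.0, 52.5]

# One squared error per training example.
squared_errors = [(p - a) ** 2 for p, a in zip(predicted_ages, actual_ages)]

# Two common ways to combine them into a single number.
sum_of_squared_errors = sum(squared_errors)
mean_squared_error = sum_of_squared_errors / len(squared_errors)

print(f"sum of squared errors: {sum_of_squared_errors:.2f}")
print(f"mean squared error:    {mean_squared_error:.2f}")
```

The two quantities differ only by the constant factor of the number of examples, so the same parameters minimise both.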

Parametric Machine Learning

The “function we’re trying to learn” is the mapping from input to output. We’ll call this \(f\). This could be any function of any complexity. You may have studied linear functions or quadratic functions in mathematics. These would be possible choices, though much too simple to work well on any serious problem. You will have seen many more complicated mathematical functions such as trigonometric functions, exponentials or higher order polynomials. Again, these might work as part of a larger, more complicated function. The problem is that the set of possible functions is infinite. How can we choose a suitable one?

Instead, most machine learning methods use a function of fixed form with a fixed number of parameters. The behaviour of the function is then determined by the value of those parameters. So now, instead of trying to choose a function, we reduce the problem to adapting the value of the parameters. This is called parametric machine learning. Let’s take a very simple example in which the input to our machine is a single number \(x\), and the output is also a single number obtained by applying the following function:

\[f(x) = w_1 x + w_2\]

In this case, the output is a linear function of the input and the function itself depends on two parameters: \(w_1\) and \(w_2\). For real problems, many more parameters would be required. For example, it’s not unusual for an image classification network to have millions, even tens of millions of parameters. We’ll call the set of all of the parameters of the network \(\mathbf{w}\). And we’ll write the function that depends on \(\mathbf{w}\) as \(f_{\mathbf{w}}\).
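To make the idea of a parametric function concrete, here is a minimal sketch of the two-parameter linear function in Python. Passing the parameters as a tuple `w` is a choice made for this example; the specific parameter values are arbitrary.

```python
from typing import Tuple


def f(x: float, w: Tuple[float, float]) -> float:
    """Linear function of fixed form; its behaviour is set by the parameters w."""
    w_1, w_2 = w
    return w_1 * x + w_2


# Same input, same functional form, different parameter values:
print(f(3.0, (2.0, 1.0)))   # 7.0
print(f(3.0, (0.5, -1.0)))  # 0.5
```

The code for `f` never changes; only the numbers in `w` do. That is what makes the learning problem one of adjusting parameters rather than choosing among functions.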

Learning = Optimisation

Now we come to a slightly anticlimactic discovery. When we talk about machine learning, all we really mean is adjusting the parameters of our function in order to reduce the loss, averaged over our whole set of training data, until we reach a point where it can’t be reduced any further. Adjusting parameters to find the minimum of a function is called optimisation.

We can start to write this down mathematically. Let’s say we have lots of training examples, each comprising an input \(x_i\) and a corresponding desired output \(y_i\) (i.e. our label for that input). The prediction our machine makes for input \(x_i\) is \(f_{\mathbf{w}}(x_i)\). Therefore our loss function should compare \(f_{\mathbf{w}}(x_i)\) and \(y_i\). We’ll write this as \(E(f_{\mathbf{w}}(x_i), y_i)\), where \(E\) is our loss function, i.e. “error”. But remember, we have lots of training examples, not just one input/label pair. Let’s do the simplest thing possible and just add up the loss over all of our \(n\) training examples:

\[\sum_{i=1}^n E(f_{\mathbf{w}}(x_i), y_i)\]

You should have seen the “sigma notation” used here in your maths before, but in case you haven’t, it’s just a compact way of writing:

\[E(f_{\mathbf{w}}(x_1), y_1) + E(f_{\mathbf{w}}(x_2), y_2) + \dots + E(f_{\mathbf{w}}(x_n), y_n)\]
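In code, the sum is simply a loop over the training examples. The sketch below assumes a squared error for the loss and a linear function with fixed parameters for the predictor; both are stand-ins chosen for illustration, and the data values are made up.

```python
from typing import Callable, Sequence


def total_loss(
    f_w: Callable[[float], float],
    E: Callable[[float, float], float],
    xs: Sequence[float],
    ys: Sequence[float],
) -> float:
    """Add up the loss E(f_w(x_i), y_i) over every training example."""
    return sum(E(f_w(x), y) for x, y in zip(xs, ys))


# Hypothetical choices, for illustration only:
f_w = lambda x: 2.0 * x + 1.0                          # predictor with fixed parameters
E = lambda predicted, label: (predicted - label) ** 2  # squared error

xs = [0.0, 1.0, 2.0]  # inputs
ys = [1.0, 3.0, 6.0]  # labels

print(total_loss(f_w, E, xs, ys))  # 1.0
```

Here the first two predictions match their labels exactly, so only the third example (prediction 5.0, label 6.0) contributes to the total.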

Finally, we can write down the goal of machine learning:

Find the \(\mathbf{w}\) that makes the following as small as possible:

\[\sum_{i=1}^n E(f_{\mathbf{w}}(x_i), y_i)\]
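To see what “find the parameters that make the total loss as small as possible” means in the most literal sense, the sketch below tries every pair of parameter values on a coarse grid and keeps the best one. This brute-force search is an illustration only, not how real systems are trained: it would be hopeless with millions of parameters. The data, the grid and the helper names are all assumptions made for this example.

```python
import itertools


def f(x, w):
    """Two-parameter linear function: w_1 * x + w_2."""
    w_1, w_2 = w
    return w_1 * x + w_2


def total_squared_error(w, xs, ys):
    """Sum of squared errors over all training examples for parameters w."""
    return sum((f(x, w) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical training data, generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Candidate values for each parameter: -5.0, -4.5, ..., 5.0.
candidates = [i * 0.5 for i in range(-10, 11)]

# Evaluate the loss for every (w_1, w_2) pair and keep the smallest.
best_w = min(
    itertools.product(candidates, candidates),
    key=lambda w: total_squared_error(w, xs, ys),
)

print(best_w, total_squared_error(best_w, xs, ys))  # (2.0, 1.0) 0.0
```

The search recovers the parameters that generated the data, and the loss there is zero, which by the definition of the loss means perfect performance on these training examples.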

As machine learning engineers, it’s our job to choose the loss function and the form of \(f\), and to decide what to use as our training data (inputs and labels).

© University of York