Classification versus regression

© University of York

So far, we’ve seen how MLPs can be used for parametric machine learning. The MLP outputs one or more values, we pass these to a loss function, and we use SGD to update the parameters of the network to reduce the loss. We call this learning.

If we are solving a regression problem, this is straightforward. Our network predicts a value and we want to make this as close as possible to the desired value. In this case, something like mean squared error will work fine as a loss function. But what if we want to solve a classification problem, where the network must choose one from a discrete set of possibilities, e.g. “cat”, “dog” or “person”? How do we make a function that outputs one of those three options? And how would we compute a loss when the label says “cat” and the network says “dog”?

There is a very nice trick we can use to do this and it involves turning a classification problem into a regression problem.

The idea is that we will ask the network to predict probabilities for each of the possible output classes conditional on the input to the network. In other words, it won’t make a hard decision about the class, but rather give an indication of how likely it thinks each class is. So, if we are given input \(x\), the network will predict \(p(\text{cat} \mid x)\), \(p(\text{dog} \mid x)\) and \(p(\text{person} \mid x)\). These probabilities should form a valid discrete probability distribution. This means each probability should be \(\geq 0\) and they should sum to one. So now the question becomes: how can we make our network output a valid probability distribution?


The answer lies in a function called softmax. If our MLP is to predict one of \(K\) classes, then we design it to have \(K\) output neurons without the ReLU nonlinearity. This means that the values of these neurons could be anything: positive or negative, and possibly greater than 1. The softmax function takes these \(K\) values and maps them to another \(K\) values such that they are positive and sum to 1. Specifically, if we call the raw network output values \(z_i\), where \(i\) ranges from 1 to \(K\), then the corresponding softmax value for the \(i\)th element is given by:

\[\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\]

For the example described above, the raw network output and the values after softmax might look like:

Figure: raw network outputs and the corresponding values after softmax
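
The softmax mapping can be sketched in a few lines of Python. This is a minimal illustration rather than the course's own code: the raw output values are made up for the three classes, and subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result.

```python
import math

def softmax(z):
    """Map K raw network outputs to K positive values that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids
    # overflow in exp() for large inputs (a safeguard assumed here,
    # not something the article requires).
    z_max = max(z)
    exps = [math.exp(z_i - z_max) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for the classes "cat", "dog" and "person".
raw_outputs = [1.0, -2.0, 3.0]
probabilities = softmax(raw_outputs)

for name, p in zip(["cat", "dog", "person"], probabilities):
    print(f"{name}: {p:.3f}")
```

With these made-up inputs, the largest raw value ("person") receives the largest probability, the negative raw value still maps to a small positive probability, and the three outputs sum to 1.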

When we use our trained network to infer the class of an input, we can simply choose the class that is assigned the highest probability.
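
Choosing the most probable class amounts to taking the index of the largest probability. The sketch below assumes a hypothetical helper name, `predict_class`, and made-up probabilities for a single input.

```python
def predict_class(probabilities, class_names):
    """Return the class name with the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best_index]

# Hypothetical softmax outputs for one input.
class_names = ["cat", "dog", "person"]
probabilities = [0.12, 0.01, 0.87]

print(predict_class(probabilities, class_names))  # person
```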

Classification loss

The last thing we need to know is how to compute a loss to train our network. For a classification problem the label is the correct class for each input. Another way to think about this is that the label is a probability distribution which has 1 for the correct class and zero for all others. Let’s take the example above and say that the correct class is person. We therefore want a way to compare the following two probability distributions:

Figure: estimated and correct probability distributions
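
Treating the label as a probability distribution can be sketched as follows. The helper name `one_hot` is an assumption made for illustration; it puts 1 at the correct class and 0 everywhere else.

```python
def one_hot(correct_class, class_names):
    """Build a label distribution: 1 for the correct class, 0 for all others."""
    return [1.0 if name == correct_class else 0.0 for name in class_names]

class_names = ["cat", "dog", "person"]
T = one_hot("person", class_names)

print(T)  # [0.0, 0.0, 1.0]
```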

Let’s call the estimated probabilities \(S_i\) and the correct probabilities \(T_i\). The standard way to compare the two is using something called the cross entropy loss. We don’t have time in this course to delve into where this loss comes from but we can see how to compute it and how it behaves. The formula for the loss is given by:

\[L(S) = -\sum_{i=1}^{K} T_i \log(S_i)\]

Since only one of the \(T_i\) will be 1 and all others zero, this simplifies to \(-\log(S_i)\) for the \(i\) corresponding to the correct class. We can have a look at how this loss varies as \(S_i\) varies between 0 and 1:

Figure: variation of the loss as \(S_i\) varies between 0 and 1
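
The cross entropy computation and its simplification can also be checked numerically. This is a minimal sketch: the probabilities are made up, "person" is assumed to be the correct class, and the natural logarithm is assumed.

```python
import math

def cross_entropy(S, T):
    """Cross entropy L(S) = -sum_i T_i * log(S_i)."""
    # Terms with T_i == 0 contribute nothing, so they are skipped; this also
    # avoids evaluating log(0) for classes that are not the correct one.
    return -sum(t * math.log(s) for s, t in zip(S, T) if t > 0)

# Hypothetical estimated probabilities for "cat", "dog", "person",
# with "person" as the correct class.
S = [0.12, 0.01, 0.87]
T = [0.0, 0.0, 1.0]

loss = cross_entropy(S, T)
# With a one-hot label, the loss reduces to -log of the correct-class probability.
assert math.isclose(loss, -math.log(S[2]))

# How the loss changes as the correct-class probability varies.
for s in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"S_i = {s:<4}  loss = {-math.log(s):.3f}")
```

The printed values grow slowly while the correct-class probability is high and sharply once it approaches zero.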

Things to note:

  1. When \(S_i=1\), we are performing perfectly and the loss is zero, as we expect.
  2. As \(S_i\) (i.e. our confidence in the correct label) decreases, the loss increases. Again, this is what we expect.
  3. Finally, the steepness of the curve increases as \(S_i\) decreases. Thinking about gradient descent, this means that when we’re making a really bad prediction, there will be a strong gradient to tell the network to improve its classification for this training sample. Once we’re above 50% confidence, the gradient flattens out since we’re now predicting the right class.
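
The third point can be illustrated by estimating the slope of \(-\log(S_i)\) at a few values. The sketch below uses a central finite difference and compares it with the analytic derivative \(-1/S_i\). This is only the slope with respect to \(S_i\), which is one factor in the gradient that reaches the network parameters; the sample values are arbitrary.

```python
import math

def loss(s):
    """Cross entropy for a one-hot label: -log of the correct-class probability."""
    return -math.log(s)

def slope(s, h=1e-6):
    """Central finite-difference estimate of d(loss)/d(s)."""
    return (loss(s + h) - loss(s - h)) / (2 * h)

for s in [0.05, 0.2, 0.5, 0.9]:
    # The analytic derivative of -log(s) is -1/s.
    print(f"S_i = {s:<4}  numerical slope = {slope(s):8.3f}  -1/S_i = {-1 / s:8.3f}")
```

The magnitude of the slope shrinks as the correct-class probability rises, which matches the flattening of the curve described above.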
This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems.
