
# Classification versus regression


So far, we’ve seen how MLPs can be used for parametric machine learning. The MLP outputs one or more values; we pass these to a loss function and use SGD to update the parameters of the network to reduce the loss. We call this learning.

If we are solving a regression problem, this is straightforward. Our network predicts a value and we want to make this as close as possible to the desired value. In this case, something like mean squared error will work fine as a loss function. But what if we want to solve a classification problem, where we want the network to choose one from a discrete set of possibilities, e.g. “cat”, “dog” or “person”? How do we make a function that outputs one of those three options? And how would we compute a loss when the label says “cat” and the network says “dog”?

There is a very nice trick we can use to do this and it involves turning a classification problem into a regression problem.

The idea is that we will ask the network to predict probabilities for each of the possible output classes, conditional on the input to the network. In other words, it won’t make a hard decision about the class, but rather give an indication of how likely it thinks each class is. So, if we are given input \(x\), the network will predict \(p(\text{cat} \mid x)\), \(p(\text{dog} \mid x)\) and \(p(\text{person} \mid x)\). These probabilities should form a valid discrete probability distribution. This means each probability should be \(\geq 0\) and they should sum to one. So now the question becomes: how can we make our network output a valid probability distribution?

### Softmax

The answer lies in a function called softmax. If our MLP is to predict one of \(K\) classes, then we design it to have \(K\) output neurons without the ReLU nonlinearity. This means that the values of these neurons could be any real numbers: positive or negative, larger or smaller than 1. The softmax function takes these \(K\) values and maps them to another \(K\) values such that they are positive and sum to 1. Specifically, if we call the raw network output values \(z_i\), where \(i\) ranges from 1 to \(K\), then the corresponding softmax value for the \(i\)th element is given by:

\[\frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\]

For the example described above, the raw network output and the values after softmax might look like:
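As a concrete sketch, here is softmax in plain Python; the raw output values below are made up purely for illustration, not taken from a real trained network:

```python
import math

def softmax(z):
    # Subtract the max for numerical stability; this does not change the
    # result, since softmax is invariant to adding a constant to every z_i.
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw outputs for the three classes (cat, dog, person).
z = [2.0, 1.0, 0.1]
probs = softmax(z)

print(probs)       # every value is positive
print(sum(probs))  # and together they sum to 1
```

Note that softmax preserves the ordering of the raw values: the largest \(z_i\) always gets the largest probability.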

When we use our trained network to infer the class of an input, we can simply choose the class that is assigned the highest probability.
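A minimal sketch of that inference step, again with made-up probabilities:

```python
# Suppose softmax produced these probabilities for (cat, dog, person);
# the numbers are illustrative, not from a real trained network.
classes = ["cat", "dog", "person"]
probs = [0.66, 0.24, 0.10]

# Inference: choose the class with the highest predicted probability.
predicted = classes[probs.index(max(probs))]
print(predicted)  # -> cat
```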

### Classification loss

The last thing we need to know is how to compute a loss to train our network. For a classification problem the label is the correct class for each input. Another way to think about this is that the label is a probability distribution which has 1 for the correct class and zero for all others. Let’s take the example above and say that the correct class is person. We therefore want a way to compare the following two probability distributions:
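In code, the two distributions for this example might look like the following (the predicted values are illustrative):

```python
# One-hot target distribution when the correct class is "person"
# (class order: cat, dog, person).
T = [0.0, 0.0, 1.0]

# Illustrative predicted probabilities from the network after softmax.
S = [0.66, 0.24, 0.10]
```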

Let’s call the estimated probabilities \(S_i\) and the correct probabilities \(T_i\). The standard way to compare the two is using something called the cross-entropy loss. We don’t have time in this course to delve into where this loss comes from, but we can see how to compute it and how it behaves. The formula for the loss is given by:

\[L(S) = -\sum_{i=1}^{K} T_i \log(S_i)\]

Since only one of the \(T_i\) is 1 and all the others are zero, this simplifies to \(-\log(S_i)\) for the \(i\) corresponding to the correct class. We can look at how this loss varies as \(S_i\) varies between 0 and 1:
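A short sketch of the loss computation, and of how \(-\log(S_i)\) grows as the predicted probability of the correct class falls (the distributions here are illustrative):

```python
import math

def cross_entropy(T, S):
    # Cross-entropy between target distribution T and prediction S.
    # Terms where T_i = 0 contribute nothing, so we skip them.
    return -sum(t * math.log(s) for t, s in zip(T, S) if t > 0)

T = [0.0, 0.0, 1.0]     # one-hot target: the correct class is "person"
S = [0.66, 0.24, 0.10]  # illustrative predicted probabilities

print(cross_entropy(T, S))  # equals -log(0.10), roughly 2.30

# How the loss -log(S_i) behaves as confidence in the correct class varies:
for s in [0.01, 0.1, 0.5, 0.9, 1.0]:
    print(f"S_i = {s:.2f}  loss = {-math.log(s):.3f}")
```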

Things to note:

1. When \(S_i = 1\), we are performing perfectly and the loss is zero, as we expect.
2. As \(S_i\) decreases (i.e. as our confidence in the correct label falls), the loss increases. Again, this is what we expect.
3. Finally, the steepness of the curve increases as \(S_i\) decreases. Thinking about gradient descent, this means that when we’re making a really bad prediction, there will be a strong gradient telling the network to improve its classification for this training sample. Once the probability of the correct class is above 50%, the gradient flattens out, since we are already predicting the right class.
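To see the third point numerically: the derivative of \(-\log(S)\) with respect to \(S\) is \(-1/S\), so the gradient magnitude grows rapidly as \(S\) shrinks:

```python
# The derivative of -log(S) with respect to S is -1/S, so the gradient
# magnitude 1/S blows up for small S (bad predictions) and shrinks
# towards 1 as S approaches 1 (confident, correct predictions).
for s in [0.01, 0.1, 0.5, 0.9]:
    print(f"S = {s:.2f}  |dL/dS| = {1.0 / s:.1f}")
```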