Network output and loss functions
To train a network, we need some way of evaluating its performance. So we need to consider what the network will output, and how we will compare that output to what we expect to see.
So far, we have discussed both the data going into networks, e.g. digital images, and the various layers and functions that data might pass through as it travels through the network. In this article, we will focus on the end point of the network (what does the last layer output?), and how we evaluate that output against a known or desired output using what are known as loss functions.
Network output
Broadly speaking, as with other artificial neural networks, the output of a CNN should reflect the question you are asking. Is it a regression problem? Is it classification you are after? If so, how many classes are there?
As we saw in the exercise last week, if we are building a classifier with, e.g., ten classes, we will need ten output nodes in the final layer. If on the other hand we just want a yes or no answer (e.g. is this plant diseased or not?), we would use just one node, and train its output to be above some value for YES and below that value for NO.
For a regression task we would again use a single node, but here the value output by the network should be trained to be near or equal to some target value, or some function of the output value should match the target. We will look more at regression in week 4.
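As a rough illustration of how the question shapes the final layer, here is a minimal sketch. The feature size (32), the batch size (8), the variable names, and the choice of a sigmoid with a 0.5 threshold for the yes/no case are all assumptions made for this example, not part of the course network.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 8 samples, each already reduced to 32 features
features = torch.randn(8, 32)

ten_class_head = nn.Linear(32, 10)  # classifier: one output node per class
binary_head = nn.Linear(32, 1)      # yes/no question: a single output node
regression_head = nn.Linear(32, 1)  # regression: a single real-valued output

class_scores = ten_class_head(features)
# Assumed decision rule: squash to (0, 1) and call anything above 0.5 a YES
yes_no = torch.sigmoid(binary_head(features)) > 0.5
prediction = regression_head(features)

print(class_scores.shape)  # torch.Size([8, 10])
print(yes_no.shape)        # torch.Size([8, 1])
print(prediction.shape)    # torch.Size([8, 1])
```

The binary and regression heads have the same shape; what differs is how the single output is interpreted and trained.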
We may even want to answer more than one question with the network, for example: how many flowers are there in this image, and what species are they? That requires multitask learning, which we will cover in week 6.
For now, we will keep our focus on the classifier we started to build last week, where we used the log softmax function on our final layer of ten nodes, a common choice to interpret the output of networks designed for classification.
Softmax and Log Softmax
Softmax is slightly different to ReLU and other activation functions like tanh and sigmoid: rather than working on each value independently, it works on a set of values together.
It’s commonly used at the end of networks to make a decision based on the output of a final set of neurons. It converts that output into a probability between 0 and 1 for each decision neuron, i.e. the probability that the input belongs to the class represented by that neuron.
For example, if we have a network with five output nodes, softmax will scale the output so that each of the five nodes outputs a probability between 0 and 1, and the five probabilities sum to 1.
Mathematically the function is defined as follows, for a vector of \(n\) values \(\mathbf{x}=(x_1,x_2,\ldots,x_n)\):
\[\sigma(\mathbf{x})_i=\frac{e^{x_i}}{\sum_{j=1}^n{e^{x_j}}}\]
So in words, for a given node it’s just the exponential of the value at that node, divided by the sum of the exponentials of all the nodes.
To use it in PyTorch, we need the function torch.nn.functional.softmax (the module is often imported as F, giving F.softmax; the full name is used here):
import torch
eg = torch.tensor([-1, 0.1, 0.032, 0.25, 9])
print('softmax', torch.nn.functional.softmax(eg, dim=0))
# The dim keyword determines which dimension of
# the tensor we want to compare values along.
# We only have the '0' dimension here
softmax tensor([4.5379e-05, 1.3633e-04, 1.2736e-04, 1.5839e-04, 9.9953e-01])
In the example above, the final neuron is assigned a probability of more than 0.9995, so it would be the one selected.
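To connect the formula to the PyTorch call, the following minimal sketch computes softmax by hand and compares it with the built-in function. The input values are illustrative and are not taken from the example above.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([2.0, 1.0, 0.1])  # illustrative values only

# Exponential of each value, divided by the sum of all the exponentials
manual = torch.exp(x) / torch.sum(torch.exp(x))
builtin = F.softmax(x, dim=0)

print(manual)
print(builtin)
print(torch.allclose(manual, builtin))  # the two approaches agree
print(manual.sum())                     # the probabilities sum to 1
```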
Commonly, log softmax is used rather than softmax. It outputs the natural log of each probability that softmax would give, and computing it directly is faster and more numerically stable than computing softmax and then taking the log.
It’s used in the same way as softmax, but via torch.nn.functional.log_softmax.
Returning to the example above:
log_sm = torch.nn.functional.log_softmax(eg,dim=0) # find the log softmax
print('log softmax:',log_sm)
exp_log_sm = torch.exp(log_sm) # exponentiate to recover the softmax probabilities
print('exponential:',exp_log_sm)
sum_softmax = torch.sum(exp_log_sm)
print('sum:',sum_softmax)
log softmax: tensor([-1.0000e+01, -8.9005e+00, -8.9685e+00, -8.7505e+00, -4.6755e-04])
exponential: tensor([4.5379e-05, 1.3633e-04, 1.2736e-04, 1.5839e-04, 9.9953e-01])
sum: tensor(1.)
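The stability benefit of log softmax can be seen with a deliberately extreme input. This is a small sketch with made-up values: taking the log of the softmax output underflows for the unlikely class, while log_softmax keeps a finite value.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.0, 1000.0])  # illustrative, deliberately extreme values

# Two-step version: the first probability underflows to 0, so its log is -inf
naive = torch.log(F.softmax(x, dim=0))

# Direct version: stays finite
stable = F.log_softmax(x, dim=0)

print('log of softmax:', naive)
print('log softmax:   ', stable)
```

An infinite value in the loss would break training, which is one reason to prefer the direct function.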
Loss functions
Once our network outputs the answer to our question in the way we want it, we need some way of evaluating it against our known values, our target values.
In the case of a regression this can be done by taking the sum of squared differences between the predicted value and the true value, as we have seen elsewhere in machine learning approaches. Effectively what we are doing is measuring the distance between the network output and the truth, and describing it with a single number.
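As a minimal sketch of the regression case, the sum of squared differences can be computed by hand or with torch.nn.functional.mse_loss using reduction='sum'. The predicted and target values below are invented for illustration.

```python
import torch
import torch.nn.functional as F

predicted = torch.tensor([2.5, 0.0, 2.0])  # illustrative network outputs
target = torch.tensor([3.0, -0.5, 2.0])    # illustrative true values

# Sum of squared differences: 0.25 + 0.25 + 0 = 0.5
manual = torch.sum((predicted - target) ** 2)

# The same quantity using the built-in loss
builtin = F.mse_loss(predicted, target, reduction='sum')

print(manual)
print(builtin)
```

Note that mse_loss averages by default; reduction='sum' is needed to match the sum described above.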
For a classifier, the aim is the same. We have a set of predicted probabilities, and a known value, and we want to measure the distance between the two in terms of a single number somehow. A common way of doing this is by calculating the cross entropy.
Cross entropy
We won’t go into detail on how cross entropy is calculated, as it’s a little complicated. The important thing is how we will use it in PyTorch, and what it means.
Conveniently, one of the inputs we will need is the log softmax of the output layer, which we have already seen how to calculate. Remember, this returns the log of a probability between 0 and 1 that the true value corresponds to each node in the output layer. Since we take the log of those probabilities, we end up with negative values: values close to zero correspond to the most likely classes, and large negative values to the most unlikely ones.
To calculate the cross entropy on a set of outputs all we need is a tensor containing an integer target value that corresponds to the node we expect to be the most likely output value from the network.
Returning to the example above:
import torch
eg = torch.tensor([-1, 0.1, 0.032, 0.25, 9])
output = torch.nn.functional.log_softmax(eg, dim=0)
print(output)
tensor([-1.0000e+01, -8.9005e+00, -8.9685e+00, -8.7505e+00, -4.6755e-04])
We see that the network predicts node 4 (counting from zero) as the most likely class value.
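Suppose the network is correct and our target value is indeed 4. To calculate the cross entropy, we pass that target value to the torch.nn.functional.cross_entropy function along with the log softmax output values. (The function applies log softmax to its input internally, but applying log softmax to values that are already log probabilities leaves them unchanged, so the result is the same.)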
target=torch.tensor(4)
loss = torch.nn.functional.cross_entropy(output, target)
print(loss)
tensor(0.0005)
Since the network is predicting the correct output value the cross entropy is relatively low.
If the true value is another class, say 3, so that the network is predicting the wrong value:
target=torch.tensor(3)
loss = torch.nn.functional.cross_entropy(output, target)
print(loss)
tensor(8.7505)
we see that the cross entropy loss is much higher.
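The two losses above are simply the negative of the log softmax value at the target node. The sketch below checks this on made-up scores, using a batch of one sample; the variable names are chosen for this example only. It also shows torch.nn.functional.nll_loss, which expects log softmax values as its input.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 0.1]])  # one sample, three classes (illustrative)
target = torch.tensor([0])                # index of the true class

log_probs = F.log_softmax(scores, dim=1)

# Negative log probability assigned to the true class
manual = -log_probs[0, target[0]]

from_scores = F.cross_entropy(scores, target)      # raw scores in
from_log_probs = F.nll_loss(log_probs, target)     # log softmax values in
twice = F.cross_entropy(log_probs, target)         # log softmax applied twice

print(manual, from_scores, from_log_probs, twice)  # all four agree
```

The closer the predicted probability of the true class is to 1, the closer its log is to zero, and the smaller the loss.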
If we pass multiple output and target values to the cross entropy function in a batch, it will (by default) return the average loss over the batch.
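A short sketch of the batch behaviour, using invented scores for two samples and three classes: reduction='none' returns one loss per sample, and the default call returns their mean.

```python
import torch
import torch.nn.functional as F

# Illustrative batch: 2 samples, 3 classes each
scores = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
targets = torch.tensor([0, 2])  # true class index for each sample

per_sample = F.cross_entropy(scores, targets, reduction='none')
mean_loss = F.cross_entropy(scores, targets)  # default reduction is 'mean'

print(per_sample)         # one loss value per sample
print(per_sample.mean())  # matches the default output below
print(mean_loss)
```

For a batch, the scores have shape (batch size, number of classes) and the targets have shape (batch size,).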
Time to train a network
Now that we have a way to convert our network output into a single number representing how well it matches the target data, we are ready to make a training loop and begin training a network, as we will see in the following article.