
Training

We have seen that a basic neural network is a series of feature transformations culminating in either linear regression (for regression problems) or logistic regression (for classification problems). Both these transformations and the final linear/logistic regression output are parameterized by the weights in the network. The task of training the network is one of optimizing these weights given the data.

As we saw in the first module, this process involves specifying a loss function which is then minimized. Depending on the loss function, this may be achieved in different ways. For neural networks, by far the most common approach is to use some form of gradient descent.

Common loss functions are mean square error for regression, and cross-entropy for classification:

\[ L_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i\bigr)^2 \]

\[ L_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\bigr] \]

In the cross-entropy loss, the labels \(y_i\) are assumed to be binary, taking the values 0 and 1, and \(\hat{y}_i\) is the predicted probability that \(y_i = 1\). Cross-entropy is very popular for training deep neural networks on classification tasks.
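These losses are straightforward to compute directly. A minimal sketch in Python with NumPy (the function names, and the small clipping constant used to avoid taking log(0), are our own illustrative choices):

```python
import numpy as np

def mean_square_error(y_true, y_pred):
    """Mean square error for regression."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; y_true holds 0/1 labels, y_pred predicted probabilities."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep log() away from 0
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(mean_square_error([1.0, 2.0], [1.1, 1.8]))   # 0.025
print(cross_entropy([1, 0, 1], [0.9, 0.2, 0.8]))   # ~0.184
```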


Aside: Cross-entropy

Minimizing the cross-entropy seeks to minimize the Kullback–Leibler divergence (a measure of the divergence between two probability distributions) between the empirical distribution (from the data) and the predicted distribution (from the model). Entropy, cross-entropy and the Kullback–Leibler divergence are concepts from information theory.
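Concretely, for an empirical distribution \(p\) and a model distribution \(q\), the three quantities are related by a standard identity (stated here for reference; it is not derived in the course):

\[ H(p, q) = -\sum_x p(x)\log q(x) = \underbrace{-\sum_x p(x)\log p(x)}_{H(p)} + \underbrace{\sum_x p(x)\log\frac{p(x)}{q(x)}}_{D_{\mathrm{KL}}(p\,\|\,q)} \]

Since the entropy \(H(p)\) is fixed by the data, minimizing the cross-entropy \(H(p, q)\) over the model is equivalent to minimizing \(D_{\mathrm{KL}}(p\,\|\,q)\).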


Given a loss function, the basic gradient descent algorithm is the same as that given in week one. Before beginning, we specify either a constant learning rate, R, or a learning rate schedule, R(i), where i is the step of the algorithm. We assume a constant rate for simplicity.

  1. Assign random initial values to the weights. These should be close to, but not precisely, zero. The weights cannot all be assigned zero because the weight gradients would then all be zero and gradient descent would not go anywhere!

  2. Repeat until convergence:

    i. For each weight, \(w_j\), calculate the partial derivative of the loss function with respect to it, \(\frac{\partial L}{\partial w_j}\).

    ii. Update each weight using its partial derivative and the learning rate:

\[ w_j \leftarrow w_j - R \, \frac{\partial L}{\partial w_j} \]
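As an illustration, here is a minimal sketch of this loop for linear regression under the mean square error loss (the data, learning rate, tolerance and convergence test are all illustrative choices, not prescribed by the course):

```python
import numpy as np

def gradient_descent(X, y, R=0.1, tol=1e-6, max_iter=10_000):
    """Minimize mean square error for a linear model y ≈ X @ w."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])  # step 1: small random init
    for _ in range(max_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)    # step 2.i: ∂L/∂w for the MSE loss
        w_new = w - R * grad                     # step 2.ii: update the weights
        if np.linalg.norm(w_new - w) < tol:      # crude convergence test
            return w_new
        w = w_new
    return w

X = np.c_[np.ones(5), np.arange(5.0)]            # intercept column + one feature
y = 3.0 + 2.0 * np.arange(5.0)                   # data on an exact line
print(gradient_descent(X, y))                    # ≈ [3. 2.]
```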


Aside: Backpropagation

You may have heard of backpropagation. This is a means of efficiently calculating the partial derivatives of the weights in the network. The name comes from the fact that the partial derivatives are calculated backwards through the network using the chain rule of differentiation.
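As an illustration only, here is the backward sweep written out by hand for a tiny one-hidden-layer network with sigmoid activations and cross-entropy loss (the architecture and all variable names are our own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: x -> hidden layer (W1, b1, sigmoid) -> output (W2, b2, sigmoid)
rng = np.random.default_rng(1)
W1, b1 = rng.normal(scale=0.1, size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(scale=0.1, size=(1, 3)), np.zeros(1)
x, y = np.array([0.5, -1.0]), np.array([1.0])

# Forward pass, keeping intermediate values for the backward pass
h = sigmoid(W1 @ x + b1)                 # hidden activations
p = sigmoid(W2 @ h + b2)                 # predicted probability

# Backward pass: chain rule applied from the output back towards the input
dz2 = p - y                              # ∂L/∂(output pre-activation) for cross-entropy
dW2, db2 = np.outer(dz2, h), dz2         # ∂L/∂W2 and ∂L/∂b2
dh = W2.T @ dz2                          # propagate backwards through W2
dz1 = dh * h * (1 - h)                   # backwards through the hidden sigmoid
dW1, db1 = np.outer(dz1, x), dz1         # ∂L/∂W1 and ∂L/∂b1
```

Each backward line reuses quantities saved during the forward pass, which is what makes computing all the partial derivatives in a single backward sweep efficient.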


Stochastic & Mini-Batch Gradient Descent

Running this simple algorithm is problematic. The loss functions as defined are functions of the whole data set. This means that every iteration of the algorithm (every update to the weights) requires calculating the estimated value of the target variable for every datum. Since there will be many iterations of the algorithm, this can be a very long process.

Stochastic and mini-batch gradient descent are two methods of responding to this problem. Rather than calculating the loss function over the entire training data at each iteration, the loss function is calculated over some subset, known as a batch. This is mini-batch gradient descent. The extreme case when the loss function is calculated only for a single datum each iteration is known as stochastic gradient descent. Accordingly, when you train a neural network you are often asked to specify a batch size to be used in this way.
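A minimal sketch of how a data set might be cut into batches (iterate_batches is a hypothetical helper, not a library function):

```python
import numpy as np

def iterate_batches(X, y, batch_size):
    """Yield successive (X_batch, y_batch) pairs that together cover the data once."""
    for start in range(0, len(y), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

X = np.arange(20.0).reshape(10, 2)   # 10 toy data points with 2 features
y = np.arange(10.0)

# batch_size = len(y) recovers full gradient descent;
# batch_size = 1 gives stochastic gradient descent.
for X_b, y_b in iterate_batches(X, y, batch_size=4):
    pass  # compute the loss and gradient on (X_b, y_b) only, then update the weights
```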

Choosing a batch size, however, involves a trade-off. The smaller the batch size, the quicker each gradient descent iteration is to run. But equally, the smaller the batch size, the noisier the minimization progress will be. While the derivatives of the loss function calculated on the entire data set are certain to point in a direction that reduces the loss on the data, the derivatives calculated on an individual datum or a small subset of the data may be inaccurate, as seen in the diagram below.

Figure: Comparative visualizations of stochastic, mini-batch and full gradient descent, showing the progress of the three methods in terms of the values assigned to two parameters. Full gradient descent progresses smoothly to the local optimum in the smallest number of steps. Mini-batch gradient descent is somewhat noisier and takes more steps. Stochastic gradient descent is much noisier and takes many more steps.

Epochs

Notice that the gradient descent algorithm looks at the data repeatedly. In the basic algorithm specified above, we continue until convergence is reached. If each step uses only one of the \(N\) data points, there is no guarantee that convergence will be reached within \(N\) steps, so we will have to cycle through the data more than once. When training networks, it is common to be asked for the maximum number of times you want to cycle through the data. This is known as the number of epochs. When proceeding from one epoch to the next, the data is normally shuffled (put into a new random order), leading to different subsets being used as batches, as in the sketch below.
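Putting epochs, shuffling and batches together, the outer training loop might be sketched as follows (again illustrative, with the weight update itself left as a comment):

```python
import numpy as np

X = np.arange(20.0).reshape(10, 2)               # toy data as before
y = np.arange(10.0)
n_epochs, batch_size = 5, 4
rng = np.random.default_rng(0)

for epoch in range(n_epochs):
    order = rng.permutation(len(y))              # reshuffle at the start of each epoch
    X_shuffled, y_shuffled = X[order], y[order]
    for start in range(0, len(y), batch_size):   # one full pass over the data = one epoch
        X_b = X_shuffled[start:start + batch_size]
        y_b = y_shuffled[start:start + batch_size]
        # ...compute the gradient on (X_b, y_b) and update the weights...
```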

Learning Rate

We note that before beginning the gradient descent algorithm we should choose a learning rate or learning schedule. Here we discuss what a learning schedule is and why it can be useful.

Consider that the loss function for many datasets will contain (i) large areas of almost flat gradient corresponding to poor values for the model parameters; and (ii) high quality local optima at the ‘bottom’ of steep inclines. We are likely to start our optimization in an area of the first sort, and wish to end it in an area of the second sort.

Working with a slow learning rate, it will take a long time to traverse the almost flat, low quality areas. This is because the gradient in these areas is small, and each step is the product of the gradient and the learning rate. A fast learning rate in such areas would be useful for avoiding wasted time, and might also allow us to jump over any shallow, low quality local optima that exist there.

But working with a fast learning rate can cause difficulties in small areas of steep gradient. At worst, the gradient descent algorithm may jump right out of such an area of interest. Even if that does not occur, it may well keep jumping back and forth over the minimum of the indentation. This may lead to the learning algorithm taking a long time to converge, or failing to converge at all.

Since we expect a fast learning rate to be advantageous in early iterations, and a slow one to be advantageous in later iterations, it makes sense to have a learning rate that slows down as the algorithm progresses. We can do this with a learning schedule.
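One simple schedule decays the rate geometrically with the step number, using the R(i) notation introduced above (the initial rate and decay factor here are arbitrary illustrations):

```python
def R(i, R0=0.1, decay=0.99):
    """Learning rate schedule R(i): geometric decay with the step number i."""
    return R0 * decay ** i

print(R(0))    # 0.1     -- large steps while crossing flat, low quality areas
print(R(500))  # ~0.0007 -- small steps once near a good optimum
```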

Figure: Gradient descent undertaken with a small learning rate. Progress is extremely slow, with many steps required to move into an area of interest.

Figure: Gradient descent with a large learning rate, which is unable to converge into the local optimum. Instead it jumps from side to side of the optimum and begins to diverge *away* from it.

Figure: An idealized graph of the effect of 'very high', 'high', 'good' and 'low' learning rates on the loss over time. The good learning rate reduces the loss relatively quickly early on, then slowly thereafter, and reaches the lowest loss of any of the options. The high learning rate initially out-performs the good rate, but quickly levels off at a much higher loss value. The very high learning rate makes some small initial gains and then rapidly starts getting worse (essentially, a *very* high learning rate will just randomly jump around the loss surface). The low learning rate decreases the loss much more slowly than the good or high rates; it eventually produces lower loss values than the high rate, and might eventually equal the good rate, but it takes too long to be practical.

In addition to specifying a learning schedule, other ways of adjusting the learning rate are sometimes used. An example is momentum, which adjusts the size of the weight update at each step of the learning algorithm based on the size of the previous step. There are also various approaches that use an adaptive learning rate, adjusting the learning rate for individual parameters. See here for a simple overview of some of these approaches.
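For instance, classical momentum might be sketched as follows: each update blends the current gradient with the previous step, so the algorithm builds up 'velocity' in directions that are consistent across steps (the coefficients, and the toy loss L(w) = w², are our own illustrative choices):

```python
def momentum_step(w, velocity, grad, R=0.01, beta=0.9):
    """One update: mix the current gradient with a fraction of the previous step."""
    velocity = beta * velocity - R * grad
    return w + velocity, velocity

w, v = 5.0, 0.0
for _ in range(100):
    grad = 2 * w               # gradient of the toy loss L(w) = w ** 2
    w, v = momentum_step(w, v, grad)
print(w)                       # approaches the minimum at w = 0
```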

Initialization of Weights

There are a number of ways to randomly initialize the weights in a network, from simple approaches that use Gaussian distributions with 0 mean and small variance, to more complicated initialization schemes such as Xavier initialization, which seeks to keep the amount of variance equal in different layers. The interested student can read more about Xavier initialization here.
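A sketch of both schemes for a single weight matrix (the layer sizes are arbitrary; the variance 2 / (n_in + n_out) is the standard Xavier/Glorot recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 64, 32  # arbitrary layer sizes

# Simple scheme: zero-mean Gaussian with a small, fixed standard deviation
W_simple = rng.normal(loc=0.0, scale=0.01, size=(n_out, n_in))

# Xavier scheme: variance 2 / (n_in + n_out), keeping the variance of
# activations roughly equal from layer to layer
W_xavier = rng.normal(loc=0.0, scale=np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))
```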

Stopping Training

The basic specification of the gradient descent algorithm continues until convergence. In fact, doing so will often lead to overfitting, even when regularization is performed (see next step). Accordingly, it is normal to periodically test the performance of a network being trained on hold-out validation data. Most quality neural network implementations will allow you to save a copy of the network parameters periodically during the learning process. This means that you can examine the performance of the network on the validation data, and recover the values of the parameters when this performance peaked.

Choosing a maximum number of epochs will, of course, also provide a termination criterion for the algorithm. Given the above, you want to make sure you choose enough epochs that you are able to find the point at which performance peaks on the validation data.
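A sketch of this checkpoint-and-recover pattern on a toy problem (train_one_epoch and validation_loss are hypothetical stand-ins for your own training and evaluation code):

```python
import numpy as np

def train_one_epoch(w):       # hypothetical stand-in: one pass of training
    return w - 0.1 * 2 * w    # toy update on the loss L(w) = w ** 2

def validation_loss(w):       # hypothetical stand-in: loss on hold-out data
    return float(np.sum(w ** 2))

weights, max_epochs = np.array([5.0]), 50
best_loss, best_weights = float("inf"), None

for epoch in range(max_epochs):
    weights = train_one_epoch(weights)
    val_loss = validation_loss(weights)
    if val_loss < best_loss:                        # performance peaked: save a copy
        best_loss, best_weights = val_loss, weights.copy()

weights = best_weights                              # recover the peak parameters
```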


This article is from the free online course:

Advanced Machine Learning

The Open University
