Advanced Machine Learning: Training Basic Neural Networks
Aside: Cross-entropy
Minimizing the cross entropy seeks to minimize the Kullback Liebler divergence (a measure of divergence between two probability distributions) between the empirical distribution (from the data) and the predicted distribution (from the model). Entropy, cross-entropy and the Kullback Liebler divergence and concepts from information theory.Given a loss function, the basic gradient descent algorithm is the same as that given in week one. Before beginning, we specify a constant learning rate, R, or a learning rate schedule, R(i), where i is the step of the algorithm is specified. We assume a constant rate for simplicity.
- Assign random initial values to the weights. These should be close to, but not precisely, zero. The weight cannot be uniformly assigned zero because the weight gradients will all be zero at this point and gradient descent will not go anywhere!
- Repeat until convergence:i. For each weight, \(w_*\), calculate its partial derivative with respect to the Loss function, \(\frac{\partial L}{\partial w_*}\).ii. Update each weight given its partial derivative and the learning rate: \(w_* = w_* – \frac{\partial L}{\partial w_*} \cdot R\)
Aside: Backpropagation
You may have heard of backpropagation. This is a means of efficiently calculating the partial derivatives of the weights in the network. The name comes from the fact that the partial derivatives are calculated backwards through the network using the chain rule of derivation.Stochastic & Mini-Batch Gradient Descent
Running the simple algorithm is problematic. The loss functions as defined are functions of the whole data set. This means that every iteration of the algorithm (every update to the weights) requires calculating the estimated value for the target variable for every datum. Since there will be many iterations of the algorithm this could be a very long process.Stochastic and mini-batch gradient descent are two methods of responding to this problem. Rather than calculating the loss function over the entire training data at each iteration, the loss function is calculated over some subset, known as a batch. This is mini-batch gradient descent. The extreme case when the loss function is calculated only for a single datum each iteration is known as stochastic gradient descent. Accordingly, when you train a neural network you are often asked to specify a batch size to be used in this way.Chosing a batch size, however, involves a trade off. The smaller the batch size the quicker the gradient descent iterations are to run. But equally, the smaller the batch size the more noisy the minimization progress will be. While following the derivatives for the loss function calculated on the entire data set will be sure to point in a direction reducing the loss function on the data, the derivation calculated on an individual datum or small subset of the data may be inaccurate, as seen in the diagram below.
Epochs
Notice that the gradient descent algorithm looks at the data repeatedly. In the basic algorithm specified above, we continue until convergence is reached. If each step using one \(\frac{1}{n}\) of the data, there is no guarantee that convergence will be reached in \(n\) steps. So we will have to cycle through the data more than once. When training networks, it is common to be asked for the maximum number of times that you want to cycle through the data. This is known as the number of epochs. When proceeding from one epoch to another, the data is normally shuffled (reordered in a random order) leading to the use of different subsets as batches.Learning Rate
We note that before beginning the gradient descent algorithm we should choose a learning rate or learning schedule. Here we discuss what a learning schedule is and why it can be useful.Consider that the loss function for many datasets will contain (i) large areas of almost flat gradient corresponding to poor values for the model parameters; and (ii) high quality local optima at the ‘bottom’ of steep inclines. We are likely to start our optimization in an area of the first sort, and wish to end it in an area of the second sort.Working with a slow learning rate, it will take a long time to tranverse the almost flat, low quality areas. This is because the gradient of these areas is small, and the iterations are the product of the gradient and learning rate. Having a fast learning rate in this area would be useful in avoiding wasting time. We might also be able to jump over shallow, low quality local optima that might exist in such an area.But working with a fast learning rate can cause difficulties in small areas of steep gradient. At worst, the gradient descent algorithm may jump right out of the small areas of interest. If that does not occur, it may well still keep jumping over the nadir of the indentation. This may lead to the learning algorithm taking a long time to converge or failing to converge at all.Since we expect a fast learning rate to be advantageous in early iterations, and a slow one to be advantageous in later iterations, it makes sense to have a learning rate that slows down as the algorithm progresses. We can do this with a learning schedule.


Initialization of Weights
There are a number of ways to randomly initialize the weights in a network, from simple approaches that use Gaussian distributions with 0 mean and small variance, to more complicated initialization schemes such as Xavier initialization, which seeks to keep the amount of variance equal in different layers. The interested student can read more about Xavier initialization here.Stopping Training
The basic specification of the gradient descent algorithm continues until convergence. In fact, doing so will often lead to overfitting, even when regularization is performed (see next step). Accordingly, it is normal to periodically test the performance of a network being trained on hold-out validation data. Most quality neural network implementations will allow you to save a copy of the network parameters periodically during the learning process. This means that you can examine the performance of the network on the validation data, and recover the values of the parameters when this performance peaked.Choosing a maximum number of epochs will, of course, also provide a termination criterion for the algorithm. Given the above, you want to make sure you choose enough epochs that you are able to find the point at which performance peaks on the validation data.Our purpose is to transform access to education.
We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.
We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.
Learn more about how FutureLearn is transforming access to education