# Regularization

In module 1 we discussed how important regularization is: it permits us to work with sophisticated statistical models with many parameters while controlling their complexity to manage overfitting. With their large number of parameters, neural networks (and particularly deep neural networks) are prone to serious overfitting. It is therefore essential to work with some form of regularization. Here we examine some of the most common regularization techniques for use with neural networks: L1 and L2 regularization, early stopping, noise injection and dropout.

## L1 and L2 regularization

We discussed L1 and L2 regularization in some detail in module 1, and you may wish to review that material. As a short recap, L1 and L2 regularization introduce a second term into the function minimized during training, thereby imposing a penalty based on the size of the parameters in the model. L1 imposes a penalty based on the sum of the absolute values of the parameters, and L2 does so based on the sum of the squared parameters. In both cases, this imposes a trade-off in the optimization problem between minimizing the original loss function, such as mean squared error (MSE) or cross-entropy (CE), and keeping the weights small to minimize the regularization penalty. The relative importance of the two components is governed by the regularization hyper-parameter, \(\lambda\). We reproduce the defining equations here, where \(L\) represents the basic loss function and \(R\) the regularization penalty:

\[\hat\beta = \operatorname{argmin}_\beta L(\beta,X,Y) + \lambda R(\beta)\]

\[R_{L1}(\beta) = \sum_{i=0}^m |\beta_i|\]

\[R_{L2}(\beta) = \sum_{i=0}^m \beta_i^2\]

Both techniques are commonly used with neural networks, and are available in many implementations. It is worth noting that in the neural network context, L2 regularization often goes by the name of *weight decay*. Accordingly, when you are asked to provide an optional weight decay parameter, this usually corresponds to the \(\lambda\) hyper-parameter governing the L2 regularization penalty.
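As a minimal sketch of these definitions in Python, the two penalties and the regularized objective can be written as follows. The function names, the example weights and the base loss value of 0.8 are illustrative assumptions, not part of any particular library.

```python
def l1_penalty(weights):
    """R_L1: sum of the absolute values of the parameters."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """R_L2: sum of the squared parameters."""
    return sum(w * w for w in weights)


def regularized_loss(base_loss, weights, lam, penalty):
    """Base loss L plus lambda times the chosen penalty R."""
    return base_loss + lam * penalty(weights)


weights = [0.5, -2.0, 0.0, 1.5]  # illustrative parameter values
base_loss = 0.8                  # illustrative value of L(beta, X, Y)

print(regularized_loss(base_loss, weights, lam=0.1, penalty=l1_penalty))
print(regularized_loss(base_loss, weights, lam=0.1, penalty=l2_penalty))
```

Note that the large weight (-2.0) contributes 2.0 to the L1 penalty but 4.0 to the L2 penalty, which illustrates how L2 punishes large individual weights more heavily.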

When selecting values of \(\lambda\) to try, it is worth checking an implementation’s documentation. In some implementations \(\lambda\) is not used directly as the weight on the regularization term, but is first scaled by the number of rows, \(N\), in the data. You may, for example, find cases where \(\lambda\) enters into the L2 optimization function as:

\[\hat\beta = \operatorname{argmin}_\beta L(\beta,X,Y) + \frac{\lambda}{2N} R_{L2}(\beta)\]

And into the L1 optimization function as:

\[\hat\beta = \operatorname{argmin}_\beta L(\beta,X,Y) + \frac{\lambda}{N} R_{L1}(\beta)\]

The 2 in the denominator for L2 regularization simply cancels the factor of 2 produced by differentiating the squared weights, which simplifies the partial derivatives used in gradient descent. In general, the use of \(N\) reflects the fact that more regularization is required when working with less data.
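To make the role of the 2 concrete, here is the derivative of the scaled L2 penalty with respect to a single weight \(\beta_j\), followed by the resulting gradient descent update. The learning rate \(\eta\) is a symbol introduced here for illustration; it does not appear in the equations above.

\[\frac{\partial}{\partial \beta_j}\left[\frac{\lambda}{2N}\sum_{i=0}^m \beta_i^2\right] = \frac{\lambda}{N}\beta_j\]

\[\beta_j \leftarrow \beta_j - \eta\left(\frac{\partial L}{\partial \beta_j} + \frac{\lambda}{N}\beta_j\right) = \left(1 - \frac{\eta\lambda}{N}\right)\beta_j - \eta\frac{\partial L}{\partial \beta_j}\]

Under plain gradient descent, each update therefore shrinks the weight by a constant factor before applying the usual loss gradient, which is one way to read the name *weight decay*.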

## Early Stopping

Early stopping was once the most common method of regularizing neural networks. The idea behind it is simple: the parameters of the model are being optimized to minimize some loss function on the training data. Typically there are a large number of parameters, and if they are optimized until convergence the result will be a model that severely overfits: it will be customized to the peculiarities of the training data rather than capturing patterns that generalize to the wider population. So if we stop the optimization process early, we can avoid this overfitting.

If we take the early stopping procedure to mean that we arbitrarily decide upon some maximum number of iterations of the gradient descent training algorithm and stop at that point, this is a poor method of regularization and should be avoided. It is ad hoc, and since the training algorithm produces different results each time we run it (due to different randomly assigned initial weights), we cannot hope to build a sequence of such models with different degrees of ‘regularization’.

However, we still use early stopping in a more sophisticated way. As we discussed in the last step, during training we periodically check the performance of the model on a hold-out validation data set. The parameter values at each of these points are saved, and eventually we choose the parameter values that correspond to the best performance on the validation data. This should be done even when other forms of regularization are used.
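The checkpointing procedure described above can be sketched in Python as follows. The `train_step` and `validation_loss` callables are hypothetical stand-ins for a real training routine and a real validation check, and the `patience` cut-off (stop after several checks without improvement) is a common convention assumed here rather than something prescribed in the text.

```python
import copy


def train_with_early_stopping(params, train_step, validation_loss,
                              max_checks=100, patience=5):
    """Train, keeping the parameters with the best validation loss seen."""
    best_params = copy.deepcopy(params)
    best_loss = validation_loss(params)
    checks_since_best = 0
    for _ in range(max_checks):
        params = train_step(params)        # train until the next check point
        loss = validation_loss(params)     # evaluate on the hold-out data
        if loss < best_loss:
            best_loss = loss
            best_params = copy.deepcopy(params)  # save this snapshot
            checks_since_best = 0
        else:
            checks_since_best += 1
            if checks_since_best >= patience:    # assumed stopping rule
                break
    return best_params, best_loss


# Toy stand-ins: training pushes the single weight towards 3.0,
# while the validation data happens to prefer a weight of 2.0.
def toy_train_step(params):
    return [min(w + 0.5, 3.0) for w in params]


def toy_validation_loss(params):
    return (params[0] - 2.0) ** 2


best_params, best_loss = train_with_early_stopping(
    [0.0], toy_train_step, toy_validation_loss, patience=3)
print(best_params, best_loss)
```

In the toy run, training would continue on to a weight of 3.0, but the returned parameters are the snapshot taken when the validation loss was lowest.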

## Noise Injection

A third form of ‘regularization’ is the addition of noise to the training data. Typically, this takes the form of independent Gaussian noise added to each input feature every time a datum is used in the training algorithm. In general, this prevents the algorithm from memorizing the training data, since the cases are slightly different each time. It also forces the algorithm to learn locally smooth functions: in each epoch the same datum is presented with slightly different input values but the same target value, which trains the network to map all vectors in a region around each feature vector to the same output value. Sometimes noise is added not only to the input features but also to the hidden nodes.
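A minimal sketch of input noise injection in Python is given below. The function name, the noise scale `sigma=0.1` and the tiny data set are illustrative assumptions; in practice the noise scale is a hyper-parameter to be tuned.

```python
import random


def add_gaussian_noise(features, sigma, rng):
    """Return a copy of the features with independent Gaussian noise added."""
    return [x + rng.gauss(0.0, sigma) for x in features]


rng = random.Random(0)  # seeded so that runs are repeatable
training_set = [([1.0, 2.0], 0), ([3.0, 4.0], 1)]  # (features, target) pairs

for epoch in range(3):
    for features, target in training_set:
        # Fresh noise is drawn every time a datum is used; the target is unchanged.
        noisy_features = add_gaussian_noise(features, sigma=0.1, rng=rng)
        # A real training step would consume (noisy_features, target) here.
        print(epoch, noisy_features, target)
```

The stored training data is never modified, so each epoch sees a different perturbation of the same underlying examples.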

#### Aside: Denoising Autoencoder

Adding noise can be used to train neural network models that denoise signals, such as images or audio, through the use of autoencoders. The network is trained to reproduce its input features, so the target variables are simply the input features themselves. During training, noise is added to the features when they are used as inputs, while the target values remain the un-noised feature values.
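The way training pairs are built for such a network can be sketched as follows. This shows only the data preparation, not the autoencoder itself, and the function name, the example signal and the noise scale are illustrative assumptions.

```python
import random


def make_denoising_pair(features, sigma, rng):
    """Return (noisy_input, clean_target) for one training example."""
    noisy_input = [x + rng.gauss(0.0, sigma) for x in features]
    clean_target = list(features)  # the target is the un-noised signal
    return noisy_input, clean_target


rng = random.Random(0)
signal = [0.2, 0.4, 0.6, 0.8]  # stand-in for pixel or audio sample values
noisy_input, clean_target = make_denoising_pair(signal, sigma=0.05, rng=rng)
print(noisy_input)
print(clean_target)
```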

## Drop-out

The final form of regularization that is commonly used in neural networks is known as dropout regularization. This works by randomly removing a certain percentage of hidden nodes from the network at each iteration of the training algorithm. Dropout regularization works very well and is particularly popular when training deep neural networks.

A complete theoretical explanation of why dropout works so well is complicated. Using it can be seen as training a number of weaker neural networks and then averaging over them when applying the network to predict new data. At each training step, the dropout procedure creates a simpler network by randomly removing some components of the network. This simpler network is trained on only one batch of data, and its initial weights are inherited from the previous networks in the sequence. In this way we train many simple networks and have them share parameters. Under such an interpretation, dropout can be viewed as a form of *ensemble learning*.

A simpler, if incomplete, explanation is that by randomly removing components of the network at different times during training, we inhibit the ability of the network to learn complex functions based on the interaction of multiple nodes.

Typically some or all hidden layers of a network are subjected to dropout, with a common choice being to drop out half of the nodes at each learning step.
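A minimal sketch of dropout applied to the activations of one hidden layer is given below. The function name is hypothetical, and the rescaling of surviving activations by `1 / keep_prob` (often called inverted dropout) is a common convention assumed here so that no adjustment is needed at prediction time; the text above does not specify how the scaling is handled.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero hidden activations during training (inverted dropout)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # at prediction time all nodes are kept
    keep_prob = 1.0 - drop_prob
    # Each node is kept with probability keep_prob; survivors are rescaled
    # so that the expected activation is unchanged (an assumed convention).
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
hidden = [0.3, 1.2, 0.7, 0.9, 0.1, 0.5]  # illustrative hidden activations

print(dropout(hidden, drop_prob=0.5, rng=rng))                  # training step
print(dropout(hidden, drop_prob=0.5, rng=rng, training=False))  # prediction
```

A fresh mask is drawn on every call, so a different subset of nodes is removed at each learning step.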

An alternative method is to randomly remove edges into hidden nodes, rather than whole nodes, at each iteration of the training algorithm. This is known as drop-connection (often written DropConnect).
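The difference from dropout can be sketched by masking individual weights instead of whole activations. The function names and the layout of the weight matrix (`weights[j][i]` is the weight on the edge from input `i` into hidden node `j`) are assumptions made for illustration, and no rescaling of the surviving weights is attempted in this sketch.

```python
import random


def drop_connections(weights, drop_prob, rng):
    """Return a copy of the weight matrix with random edges set to zero."""
    return [[0.0 if rng.random() < drop_prob else w for w in row]
            for row in weights]


def hidden_pre_activations(weights, inputs):
    """Weighted sum of the inputs arriving at each hidden node."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


rng = random.Random(0)
weights = [[0.5, -1.0, 0.25],   # edges into hidden node 0
           [1.5, 0.75, -0.5]]   # edges into hidden node 1
inputs = [1.0, 2.0, 3.0]

masked = drop_connections(weights, drop_prob=0.5, rng=rng)
print(masked)
print(hidden_pre_activations(masked, inputs))
```

Every hidden node still produces an output here; only some of the edges feeding into it are removed for the current training iteration.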

© Dr Michael Ashcroft