# Regularisation

An article on regularisation techniques, including normalisation, standardisation, Lasso and Ridge regression, and the AIC/BIC information criteria.

In the videos we briefly mentioned some regularisation techniques such as Lasso regression, which we will discuss in more detail in this article.

Before we start our discussion on regularisation though, it’s worth mentioning two similar-sounding, related terms that are useful to understand: normalisation and standardisation.

## Normalisation

In general, normalisation rescales numerical data so that different features fall in a similar range, typically 0 to 1. For non-negative data this is usually done either by dividing every value in a feature by the maximum value observed, or by dividing every value by some theoretical maximum value. For example, 8-bit pixel values in image data are constrained between 0 and 255, so you might normalise by dividing each value by 255, even if the number 255 does not appear in your data.

The reason for normalising your data is so that some features are not given more weight in machine learning models just because the absolute numbers themselves are larger in those features. By normalising your data, features can be weighted more equally. Normalisation is particularly important for algorithms where some measure of distance between data points is taken, such as K-nearest neighbour.
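As a minimal sketch of the two options described above, the snippet below normalises a handful of made-up pixel values with NumPy. The array contents and variable names are illustrative assumptions, not data from the course.

```python
import numpy as np

# Hypothetical 8-bit pixel values; note that 255 never appears in this sample.
pixels = np.array([0, 64, 128, 200], dtype=float)

# Option 1: divide by the largest value actually observed in the feature.
by_observed_max = pixels / pixels.max()

# Option 2: divide by the theoretical maximum for 8-bit data.
by_theoretical_max = pixels / 255.0

print(by_observed_max)     # largest entry is exactly 1.0
print(by_theoretical_max)  # largest entry is 200/255, a little below 1.0
```

Dividing by the theoretical maximum has the advantage that new data is scaled the same way even if it contains values larger than any seen during training.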

## Standardisation

In some cases, it might be desirable to go a step further than normalisation and perform standardisation instead. This is done by subtracting the mean of a feature from every value in that feature, and then dividing each result by the feature's standard deviation. The values of that feature are then centred at zero, with positive and negative values either side, and have a variance of one.

Standardisation is particularly useful if you have good reason to believe the data in your features is normally distributed, or if the machine learning model itself makes that assumption (e.g. Gaussian Naive Bayes). It’s also important to standardise your features before performing Principal Component Analysis (PCA), because otherwise the features with the largest variance dominate the components.
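The calculation can be sketched by hand with NumPy. The feature values below are invented for illustration, and the population standard deviation (`ddof=0`, NumPy's default) is an assumption about which definition to use.

```python
import numpy as np

# Hypothetical feature values (e.g. heights in cm).
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

mean = heights.mean()
std = heights.std()  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation.
standardised = (heights - mean) / std

print(standardised)
print(standardised.mean())  # approximately 0
print(standardised.std())   # approximately 1
```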

## Reversing normalisation and standardisation

Of course, if you are using your model to make regression predictions, and you normalise or standardise your target data, you need to remember to transform your predictions back to the original scale. For normalisation this just means multiplying by the maximum value you originally divided by, and for standardisation it means multiplying by the original standard deviation and adding back the original mean.

As usual, scikit-learn has tools that look after the details of this for you. See StandardScaler, for example.
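A minimal sketch of that round trip, assuming scikit-learn is installed, might look like the following. The target values are made up, and the "predictions" are simply a copy of the scaled targets standing in for real model output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical regression targets; scalers expect a 2D array (rows, columns).
y = np.array([[200.0], [250.0], [300.0], [450.0]])

scaler = StandardScaler()
y_scaled = scaler.fit_transform(y)  # subtract the mean, divide by the standard deviation

# Stand-in for model output: pretend the model predicted these scaled values.
scaled_predictions = y_scaled.copy()

# Multiply by the stored standard deviation and add back the stored mean.
predictions = scaler.inverse_transform(scaled_predictions)

print(predictions.ravel())
```

Other scalers in `sklearn.preprocessing`, such as `MinMaxScaler`, expose the same `fit_transform` and `inverse_transform` pair.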

## Lasso and Ridge regression

Two common regularisation methods that aim to reduce model overfitting are Lasso and Ridge regression. In a nutshell, both aim to limit the magnitude of individual parameters in a regression by adding a penalty for large parameter values.

In a regular linear regression with \(n\) features \(x_1, x_2, \dots, x_n\), the model is a linear combination of the features and is used to predict some target value \(y\). If we label the weights or parameters in the model as \(p_0, p_1, \dots, p_n\), then the model is:

\[y = p_0 + p_1x_1 + p_2x_2 + \dots + p_nx_n\]

The cost function the algorithm is looking to minimise (let’s call it \(F\)) is just the sum of squared differences between the actual data values for your target vector (let’s call them capital \(Y\)) and the values predicted by the model, \(y\). If we have \(m\) observations in our training data, labelled \(1\) to \(m\), we have:

\[F = \sum_{i=1}^m (Y_i - y_i)^2\]
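To make the notation concrete, here is a small worked sketch in NumPy with 4 observations and 2 features. The data and the weight values are invented for illustration; in practice the weights would be found by the fitting algorithm rather than chosen by hand.

```python
import numpy as np

# Hypothetical training data: m = 4 observations, n = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
Y = np.array([9.5, 5.0, 9.0, 18.0])  # actual target values

p0 = 1.0                  # intercept p_0 (illustrative value)
p = np.array([2.0, 3.0])  # weights p_1 and p_2 (illustrative values)

y = p0 + X @ p            # model predictions: [9, 5, 10, 18]
F = np.sum((Y - y) ** 2)  # sum of squared differences

print(y)
print(F)  # 0.5**2 + 0**2 + (-1)**2 + 0**2 = 1.25
```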

All Lasso and Ridge regression do is add an extra term to the cost function. This term depends only on the values of the weights \(p_1, \dots, p_n\) (the intercept \(p_0\) is left out of the penalty sums below) and on an additional hyperparameter called \(\lambda\), which determines how heavily we want to penalise large weights.

### Lasso regression

In Lasso regression the penalty term is the sum of the absolute values (i.e. ignoring any negative signs) of all the weights, multiplied by \(\lambda\), so that:

\[F = \sum_{i=1}^m (Y_i - y_i)^2 + \lambda \sum_{i=1}^n |p_i|\]

If \(\lambda\) is large, the penalty term has a strong effect and most parameters will be at or close to zero, so the model risks underfitting. On the other hand, if \(\lambda\) is very small, or zero as in the case of regular linear regression, you may risk overfitting. To select a good value of \(\lambda\) you can use the techniques of splitting data into training, validation, and test sets, and using cross-validation, as discussed in the preceding activities.
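One possible way to pick the penalty strength by cross-validation is sketched below with scikit-learn's `LassoCV`. The synthetic data, the list of candidate values, and the choice of 5 folds are all assumptions made for illustration. Note that scikit-learn calls the hyperparameter `alpha` and scales the squared-error term by the number of observations, so its values are not numerically identical to the penalty weight in the formula above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic data: 10 features, only the first 3 influence the target.
X = rng.normal(size=(200, 10))
true_weights = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
Y = X @ true_weights + rng.normal(scale=0.5, size=200)

# Candidate penalty strengths to compare (illustrative choices).
candidates = [0.001, 0.01, 0.1, 1.0, 10.0]

# Fit with 5-fold cross-validation and keep the best-scoring candidate.
model = LassoCV(alphas=candidates, cv=5).fit(X, Y)

print(model.alpha_)  # the candidate with the lowest cross-validated error
print(model.coef_)   # fitted weights at that penalty strength
```

A final, untouched test set would still be needed to estimate how well the chosen model generalises.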

### Ridge regression

Ridge regression is very similar to Lasso, except that the square of each weight is added up, rather than the absolute value:

\[F = \sum_{i=1}^m (Y_i - y_i)^2 + \lambda \sum_{i=1}^n p_i^2\]

As before, large values of \(\lambda\) result in a larger penalty and a stronger regularisation effect.
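The two cost functions can be written out directly in NumPy to see how the penalties differ. This is a sketch under stated assumptions: the function names (`squared_error`, `lasso_cost`, `ridge_cost`) and the data are hypothetical, and the weights are chosen so that the model fits perfectly, leaving only the penalty term.

```python
import numpy as np

def squared_error(X, Y, p0, p):
    """Sum of squared differences between targets Y and predictions p0 + X @ p."""
    return np.sum((Y - (p0 + X @ p)) ** 2)

def lasso_cost(X, Y, p0, p, lam):
    """Squared error plus lam times the sum of absolute weights."""
    return squared_error(X, Y, p0, p) + lam * np.sum(np.abs(p))

def ridge_cost(X, Y, p0, p, lam):
    """Squared error plus lam times the sum of squared weights."""
    return squared_error(X, Y, p0, p) + lam * np.sum(p ** 2)

# Hypothetical data and weights, chosen so the squared error is exactly zero.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
p0, p = 0.5, np.array([3.0, -0.5])
Y = p0 + X @ p

for lam in (0.0, 1.0, 10.0):
    print(lam, lasso_cost(X, Y, p0, p, lam), ridge_cost(X, Y, p0, p, lam))
```

With these weights the Lasso penalty is 3.5 times the penalty strength and the Ridge penalty is 9.25 times it: squaring punishes the large weight (3.0) more heavily, while the small weight (-0.5) contributes less under Ridge (0.25) than under Lasso (0.5).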

One disadvantage of Ridge regression relative to Lasso is that while Lasso can set some parameter values to exactly zero, so reducing the complexity of your model, Ridge regression does not do this. Instead, the less important parameters are shrunk towards zero but generally remain non-zero.
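This difference can be observed on synthetic data with scikit-learn's `Lasso` and `Ridge` estimators. The sketch below is illustrative only: the data, the random seed, and the penalty strength of 0.5 are assumptions, and the two estimators scale their objectives differently, so the same `alpha` does not represent an equally strong penalty in each.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)

# Synthetic data: 8 features, only the first 2 influence the target.
X = rng.normal(size=(100, 8))
Y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, Y)
ridge = Ridge(alpha=0.5).fit(X, Y)

# Count how many fitted weights are exactly zero in each model.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))

print("Lasso weights exactly zero:", lasso_zeros)
print("Ridge weights exactly zero:", ridge_zeros)
print("Ridge weights:", np.round(ridge.coef_, 3))
```

Lasso should drop most or all of the irrelevant features outright, whereas the Ridge weights for those features should be small but non-zero.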

## AIC and BIC

The final regularisation techniques we will mention are methods to evaluate whether the complexity of a model is appropriate for a given dataset. These are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

They do this by weighing the number of fitted parameters in a model (\(k\)) against the maximised value of the likelihood function (\(L\)).

\(L\) is a measure of how well a given model fits the data, with higher values being better. For discrete data it is a probability, so it lies between 0 and 1; for continuous data it is a probability density, which can exceed 1. An overfitted model might have a high likelihood but a lot of parameters, while an underfitted model might have fewer parameters but a lower likelihood. AIC and BIC both aim to balance this trade-off.

AIC is calculated as follows:

\[\mathrm{AIC} = 2k - 2\ln(L)\]

BIC is similar, but also takes into account the number of observations in your dataset (\(n\), not to be confused with the number of features in the regression formulas above), and is calculated as:

\[\mathrm{BIC} = k\ln(n) - 2\ln(L)\]

For both AIC and BIC, lower values are better: when comparing candidate models fitted to the same dataset, the model with the lowest value strikes the best balance between complexity and fit.
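As a rough sketch of how these formulas might be applied, the snippet below compares polynomial fits of different degrees on synthetic data that is truly quadratic. It assumes independent Gaussian errors, for which the maximised log-likelihood has a simple closed form; the helper names (`aic`, `bic`, `gaussian_log_likelihood`) and the data are hypothetical.

```python
import numpy as np

def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln(L), with n the number of observations."""
    return k * np.log(n) - 2 * log_likelihood

def gaussian_log_likelihood(residuals):
    """Maximised ln(L), assuming independent Gaussian errors."""
    n = residuals.size
    sigma2 = np.mean(residuals ** 2)  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
Y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

scores = {}
for degree in (1, 2, 8):
    coeffs = np.polyfit(x, Y, degree)
    residuals = Y - np.polyval(coeffs, x)
    k = degree + 2  # polynomial coefficients plus the error variance
    ll = gaussian_log_likelihood(residuals)
    scores[degree] = (aic(k, ll), bic(k, x.size, ll))
    print(degree, round(scores[degree][0], 1), round(scores[degree][1], 1))
```

The straight line (degree 1) should score worse than the quadratic because it fits poorly, while the degree 8 polynomial fits slightly better than the quadratic but pays for its extra parameters. BIC charges more per parameter than AIC whenever the log of the number of observations exceeds 2, which is the case for 8 or more observations.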