Expected Loss, Bias-Variance and Overfitting
Before considering what the desire to minimize expected loss on new data means for us when training a model, let us first analyze the expected loss, hereafter the expected error, of a model on data. To do this, we introduce three concepts:
- The Irreducible Error. This is the inherent randomness in the system being modelled.
- The Bias of the Model. This is the extent to which the expected value of the system differs from the expected value of the model.
- The Variance of the Model. This is just what is stated: the variance of the model function. Intuitively it can be understood as the extent to which the fitted model moves around its mean as it is trained on different samples of training data.
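Under squared-error loss, these three quantities add up to the expected error at a point \(x_0\): \(E[(y-\hat f(x_0))^2] = \sigma^2 + \text{Bias}^2 + \text{Variance}\), where \(\sigma^2\) is the irreducible error. The following is a minimal simulation sketch of that identity, not part of the course material. The true function `true_f`, the noise level, the fixed training inputs and the evaluation point are all assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

NOISE_SD = 1.0                          # assumed noise level (irreducible error = NOISE_SD**2)
X_TRAIN = np.linspace(-1.0, 1.0, 10)    # assumed fixed training inputs
X0 = 0.5                                # point at which the error is decomposed
N_REPEATS = 5000                        # number of simulated training sets


def true_f(x):
    # Hypothetical mean of the system being modelled (illustration only).
    return np.sin(2.0 * x)


def decompose(degree):
    """Estimate bias^2, variance and expected squared error at X0."""
    preds = np.empty(N_REPEATS)
    sq_errors = np.empty(N_REPEATS)
    for i in range(N_REPEATS):
        # Draw a fresh noisy training set and fit a polynomial by least squares.
        y_train = true_f(X_TRAIN) + rng.normal(0.0, NOISE_SD, X_TRAIN.size)
        coefs = np.polyfit(X_TRAIN, y_train, degree)
        preds[i] = np.polyval(coefs, X0)
        # Compare the prediction with a fresh noisy observation at X0.
        y_new = true_f(X0) + rng.normal(0.0, NOISE_SD)
        sq_errors[i] = (y_new - preds[i]) ** 2
    bias_sq = (preds.mean() - true_f(X0)) ** 2   # (mean of model - mean of system)^2
    variance = preds.var()                       # spread of the model around its mean
    return bias_sq, variance, sq_errors.mean()


summary = {}
for degree in (1, 3):
    bias_sq, variance, expected_error = decompose(degree)
    total = NOISE_SD ** 2 + bias_sq + variance
    summary[degree] = (bias_sq, variance, expected_error, total)
    print(f"degree {degree}: bias^2={bias_sq:.3f}  variance={variance:.3f}  "
          f"noise+bias^2+variance={total:.3f}  simulated error={expected_error:.3f}")
```

In this setup the simulated error and the sum of the three components should agree up to simulation noise, with the more flexible cubic model trading lower bias for higher variance.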
The Open University online course,
Advanced Machine Learning
Overfitting
When we use a model that is too complex for the function being modelled (given the amount of training data we have available), the variance component of the expected error increases. In popular parlance this is known as overfitting.
It is important to note that increasing the complexity of a model fit by minimizing the training loss will always lead to performance on the training data that is at least as good, and usually better. Essentially, by increasing complexity (increasing the number of parameters that can be fit to the training data) we increase the ability of the model to customize itself to the training data. This is great if the model is fitting itself to patterns in the training data that are present in the entire population (i.e. in all data). But it is not good if the model is actually fitting itself to random noise present only in the training data. An example is given in the following graph:
[Graph: a linear model (blue) and a third-order polynomial model (red) fit to 4 training points]
The blue line is a linear regression (OLS) model. The red is a polynomial regression model of third order. The form of these models (written more clearly than in the legend) is:
Linear Regression Model: \(\hat y=\beta_0 + \beta_1x\)
Polynomial Regression Model: \(\hat y=\beta_0 + \beta_1x+ \beta_2x^2+ \beta_3x^3\)
Note that the linear model has two parameters and the polynomial model has four. These models have been fit to minimize the MSE loss function on the training data. Clearly the polynomial model has a lower MSE on the training data. In fact its MSE is zero: with four parameters it passes exactly through all four training points. But looking at the true function we know that it will perform worse on new data. It will not generalize well. It has overfitted the training data.
To emphasise that whether complexity causes overfitting depends on the amount of training data, examine the effect of training the two models on 100 points instead of 4:
[Graph: the same two models fit to 100 training points]
Overfitting is one of the most persistent dangers of advanced statistical machine learning. The models we work with tend to be very complex, and so are able to overfit even when trained on reasonably large sets of training data.
How do we choose models that have low expected error? We will discuss this in detail in week 2. But already it should be obvious that one thing we can do is see how models perform on data that they have not been trained on. This is the basis of validation techniques for model evaluation and selection.
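The idea of checking models on data they were not trained on can be sketched as follows. This is an illustrative simulation rather than the course's own experiment: the true function `true_f`, the noise level, the evenly spaced training inputs and the number of repeats are assumptions, since the data behind the course's graphs is not given. It fits the linear and third-order polynomial models to 4 and to 100 training points and compares the training MSE with the MSE on held-out validation data.

```python
import numpy as np

rng = np.random.default_rng(1)

NOISE_SD = 1.0      # assumed noise level
N_REPEATS = 200     # training sets averaged over, per training-set size
N_VAL = 2000        # size of each held-out validation set


def true_f(x):
    # Hypothetical true function (illustration only).
    return np.sin(2.0 * x)


def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on the data (x, y)."""
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))


def average_mse(n_train):
    """Average (training MSE, validation MSE) per polynomial degree."""
    totals = {1: np.zeros(2), 3: np.zeros(2)}
    x_train = np.linspace(-1.0, 1.0, n_train)
    for _ in range(N_REPEATS):
        y_train = true_f(x_train) + rng.normal(0.0, NOISE_SD, n_train)
        # Validation data: new points the models have not been trained on.
        x_val = rng.uniform(-1.0, 1.0, N_VAL)
        y_val = true_f(x_val) + rng.normal(0.0, NOISE_SD, N_VAL)
        for degree in totals:
            # Both models are fit to the same training set by least squares.
            coefs = np.polyfit(x_train, y_train, degree)
            totals[degree] += (mse(coefs, x_train, y_train),
                               mse(coefs, x_val, y_val))
    return {degree: tuple(t / N_REPEATS) for degree, t in totals.items()}


results = {n_train: average_mse(n_train) for n_train in (4, 100)}
for n_train, by_degree in results.items():
    for degree, (train_mse, val_mse) in by_degree.items():
        print(f"n={n_train:3d}  degree={degree}  "
              f"train MSE={train_mse:.3f}  validation MSE={val_mse:.3f}")
```

With 4 training points the cubic model's training MSE should be essentially zero while its validation MSE is clearly higher than the linear model's. With 100 points the gap between training and validation MSE should shrink considerably for both models.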