
This content is taken from The Open University & Persontyle's online course, Advanced Machine Learning. Join the course to learn more.

Skip to 0 minutes and 1 second So here we’re going to be looking at the Tidbits Bias Variance.R script. What we’re looking at is a collection of examples and graphs that give some visual impressions of the bias-variance decomposition. You can, of course, go through this code, but it is not exercise code: it’s there if you want to see how the graphs were produced, and there’s no need for you to read it. We start by generating this graph, which is an idealised bias-variance graph. The vertical axis shows the expected error, and the horizontal axis shows the model complexity.

Skip to 0 minutes and 56 seconds The expected error has essentially three components. The error due to bias (the blue line) starts off high for a simple model and decreases as the model gets more complex. The error due to variance starts off low for a simple model and increases as the model gets more complex. The irreducible error stays the same throughout. The purple line is the total error, the sum of the three. When we’re trying to find the optimal level of complexity for a statistical model, what we want is the minimal total error. We want the complexity of the model to hit the sweet spot, where the sum of the error due to bias and the error due to variance is smallest.
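The shape of the idealised graph can be reproduced with a few lines of Python. This is a minimal sketch, not the course's R script: the curve formulas (`4 / c` for squared bias, `0.25 * c` for variance) and the constant `IRREDUCIBLE` are assumptions chosen only to give a falling curve, a rising curve and a flat line.

```python
# Idealised bias-variance curves. The functional forms below are assumed
# for illustration; they are not taken from the course script.

IRREDUCIBLE = 0.5  # assumed constant noise floor


def bias_sq(c):
    """Assumed squared-bias curve: high for simple models, falling with complexity."""
    return 4.0 / c


def variance(c):
    """Assumed variance curve: low for simple models, rising with complexity."""
    return 0.25 * c


def total_error(c):
    """Total expected error: squared bias + variance + irreducible error."""
    return bias_sq(c) + variance(c) + IRREDUCIBLE


complexities = range(1, 21)
best = min(complexities, key=total_error)  # complexity with the smallest total error

for c in complexities:
    marker = "  <- minimum" if c == best else ""
    print(f"{c:2d}  bias^2={bias_sq(c):5.2f}  var={variance(c):5.2f}  "
          f"total={total_error(c):5.2f}{marker}")
```

With these assumed curves the minimum falls at complexity 4, where squared bias and variance are equal. Changing either formula moves the sweet spot, but the irreducible term only shifts the total curve up or down.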

Skip to 1 minute and 45 seconds In this case, we get this horizontal black line, and we see where it is.

Skip to 1 minute and 51 seconds What’s important is that in this graphic we’re holding the amount of data fixed. We can change the complexity of the model, but this involves a trade-off: we can reduce bias only by increasing variance, and vice versa. We can, however, reduce bias and variance together if we manage to get more data. So here we can see what would be expected to occur with more data. The previous graph is now shown in dashed lines, and the solid lines show the curves after the increase in data. As we would expect, when the amount of data increases, the optimal model becomes more complex.

Skip to 2 minutes and 45 seconds We also see that the total error has reduced as we get more data, and that the model with the minimal total error is more complex. If we think of training as aiming for the minimal total error, then ending up with a model to the left of the optimal location (that is, a simpler model with higher bias) means we’re underfitting the data.
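The effect of more data can be sketched in Python by letting the variance term shrink with the sample size `n`. This is an illustrative toy model, not the course's R script: the formulas `4 / c` and `25 * c / n` and the constant `NOISE` are assumptions, chosen so that the variance curve flattens as `n` grows while the bias curve stays put.

```python
# Toy model of how the optimal complexity moves as the sample size grows.
# All formulas and constants are assumed for illustration.

NOISE = 0.5  # assumed irreducible error, unaffected by sample size


def total_error(c, n):
    """Total expected error at complexity c with n training cases."""
    bias_sq = 4.0 / c        # assumed: depends on the model form only
    variance = 25.0 * c / n  # assumed: shrinks as the sample grows
    return bias_sq + variance + NOISE


def best_complexity(n, max_c=50):
    """Complexity in 1..max_c with the smallest total error for sample size n."""
    return min(range(1, max_c + 1), key=lambda c: total_error(c, n))


for n in (100, 400, 1600):
    c = best_complexity(n)
    print(f"n={n:5d}  optimal complexity={c:2d}  total error={total_error(c, n):.3f}")
```

Under these assumptions the optimal complexity rises from 4 to 8 to 16 as `n` goes from 100 to 400 to 1600, and the total error at the optimum falls each time. At each new optimum both the squared bias and the variance are lower than at the previous one, which is the sense in which more data reduces the two together.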

Skip to 3 minutes and 24 seconds The form of the model we’re working with is not complex enough to capture the patterns we’re seeing in the data, the patterns that represent the real-world function we’re trying to model. On the other hand, if we end up with a model to the right of the optimal model (that is, a more complex model with higher variance), we’re overfitting the data. The model ends up fitting the noise in the training data, rather than patterns that are representative of the whole population and so generalise to new data cases.
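Underfitting and overfitting can be seen by comparing training error with error on fresh data. The sketch below is in Python and is not the course's R script. It swaps in k-nearest-neighbour regression as the model family, because its flexibility is easy to dial (a smaller `k` gives a more flexible fit). The "true" function `sin(2*pi*x)`, the noise level 0.3 and the sample sizes are all assumptions.

```python
import math
import random


def make_data(n, rng):
    """Noisy samples of an assumed 'true' function, sin(2*pi*x), on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def knn_predict(x0, train_x, train_y, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN fit on the points (xs, ys)."""
    errors = [(y - knn_predict(x, train_x, train_y, k)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)                  # fixed seed so runs are repeatable
train_x, train_y = make_data(50, rng)
test_x, test_y = make_data(500, rng)    # stands in for new data cases

results = {}
for k in (50, 5, 1):                    # from least to most flexible
    results[k] = (mse(train_x, train_y, train_x, train_y, k),
                  mse(test_x, test_y, train_x, train_y, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With `k=50` every prediction is the mean of the whole training set, so the model is too simple and both errors are high (underfitting). With `k=1` each training point is its own nearest neighbour, so the training error is zero while the error on new data is not: the model has fitted the noise (overfitting). An intermediate `k` sits between the two extremes.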

Skip to 4 minutes and 12 seconds So far we’ve left unspecified what the measure of complexity is; the axis just says complexity. In this course, and in general, it’s very common to use the number of parameters in a statistical model as the measure of complexity. This works perfectly for linear models, but it is problematic for non-linear models. For example, there can be very simple non-linear models (simple in the sense that they have only a small number of parameters) that are able to massively overfit data. Here’s a very small example: a cosine model with two parameters, which we are able to overfit to a dataset of any size.

Skip to 5 minutes and 9 seconds Essentially, the non-linearity gives us a degenerate case, where we’re able to overfit the two-parameter model to any data. The fit just uses shorter and shorter cycles in the cosine so that the curve can run through every data point. So this is a bit of a warning, really. The idea that the number of free parameters is a good measure of complexity is imperfect when we’re working with non-linear models. It generally works very well, but there are cases where it breaks down.
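The transcript does not show the cosine model used in the script, so the Python sketch below uses a well-known stand-in: the classifier sign(sin(ωx)), which has a single free parameter ω and is a cosine with a fixed phase shift, since sin(ωx) = cos(ωx − π/2). For inputs placed at x_i = 10^(-i), a closed-form choice of ω reproduces any pattern of +1/−1 labels. The function names `fit_omega` and `predict` are hypothetical, and the choice of inputs is an assumption made so that the construction works.

```python
import math


def fit_omega(labels):
    """Frequency that reproduces `labels` (each +1 or -1) at x_i = 10**-i, i = 1..n.

    Uses omega = pi * (1 + sum of 10**i over the positions labelled -1).
    """
    return math.pi * (1 + sum(10 ** i for i, y in enumerate(labels, start=1) if y == -1))


def predict(omega, x):
    """One-parameter classifier: the sign of sin(omega * x)."""
    return 1 if math.sin(omega * x) > 0 else -1


labels = [+1, -1, -1, +1, -1, +1]                    # any pattern of +1 / -1 can be used
xs = [10 ** -i for i in range(1, len(labels) + 1)]   # 0.1, 0.01, ..., 1e-06
omega = fit_omega(labels)
predictions = [predict(omega, x) for x in xs]

print("omega =", omega)
print("labels reproduced:", predictions == labels)
```

The fitted ω grows by roughly a factor of ten for every extra point, which is the "shorter and shorter cycles" behaviour: one parameter is enough to memorise the labels, so parameter count understates the flexibility of the model. In floating-point arithmetic the construction is only reliable for a modest number of points, because ω quickly becomes very large.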


A discussion of the bias/variance decomposition of expected error for a statistical model, the relationship of expected error to data size and to model complexity, and how we can measure the complexity of statistical models (including the weaknesses of the chosen measure of complexity).

This is not a video-exercise step: there is no associated code exercise for you to complete. You can, though, look at the code used in this video, which can be found in the Tidbits Bias Variance.R file.
