Skip to 0 minutes and 9 seconds Now, we’ve already talked a little bit about overfitting. The idea of overfitting is that you get a model that does very well on the training data but is unlikely to do well on new data. It’s not generalizable; it’s fitting the noise in the training data. We can see that this red curve is doing that, as opposed to, say, the green curve. In an example like this, it’s immediately obvious that overfitting is closely related to what we call the smoothness of the function. This green function is quite smooth. This red function is not at all smooth. And when you overfit, you’re often going to end up with a function that’s not very smooth.
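The contrast between the red and green curves can be sketched in a few lines of Python. This is only an illustrative stand-in, not the models shown in the video: the data-generating function, the noise level, and the two predictors (`wiggly`, which copies the nearest training label, and `smooth`, a Gaussian-weighted average) are all assumptions chosen for the example.

```python
import math
import random

random.seed(0)

def true_fn(x):
    # Hypothetical underlying signal; the lecture does not specify one.
    return math.sin(2 * math.pi * x)

# Noisy training sample, plus a second sample drawn from the same process.
train_x = [i / 19 for i in range(20)]
train_y = [true_fn(x) + random.gauss(0, 0.3) for x in train_x]
new_x = [(i + 0.5) / 19 for i in range(19)]
new_y = [true_fn(x) + random.gauss(0, 0.3) for x in new_x]

def wiggly(x):
    """Copy the label of the nearest training point, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def smooth(x, width=0.1):
    """Gaussian-weighted average of all training labels."""
    weights = [math.exp(-((x - xi) / width) ** 2) for xi in train_x]
    return sum(w * yi for w, yi in zip(weights, train_y)) / sum(weights)

def mse(model, xs, ys):
    """Mean squared error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_err_wiggly = mse(wiggly, train_x, train_y)
train_err_smooth = mse(smooth, train_x, train_y)

print("training error  wiggly:", train_err_wiggly, " smooth:", train_err_smooth)
print("new-data error  wiggly:", mse(wiggly, new_x, new_y),
      " smooth:", mse(smooth, new_x, new_y))
```

The wiggly predictor reproduces every training label, so its training error is zero by construction, while the smooth one leaves some training error behind. How the two compare on the new sample depends on the data and the noise, which is why the script prints those errors rather than assuming them.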
Skip to 1 minute and 9 seconds Now, consider what happens when you do an optimization on a loss function.
Skip to 1 minute and 31 seconds So here we have the loss function, and here we have one dimension of parameters. Of course, when we optimise the loss function, we seek to minimise it: we move from wherever we started down along the loss surface, following the gradient, until we find a local optimum or, hopefully, the global optimum. Now, if we’re overfitting, then as we continue moving along this loss surface toward the optimal point, our function is going to get less and less smooth. We might be up here with the green one. But by the time we come down here, we’re with the red one.
Skip to 2 minutes and 20 seconds Now, if this is the case, it makes sense to think, hey, if I stop before I move all the way down to the optimum of the loss surface, then I’m likely to get a function that does not overfit. Now, this is a very ad hoc idea: if reaching the optimal point on the loss surface gives me a function that overfits, then stopping before I get there might give me a function that is probably smoother and probably overfits less. It’s an intuitive idea, but it’s also very ad hoc. It might be that you get a function that’s terrible.
Skip to 3 minutes and 12 seconds But it might be that you’re lucky, and you do indeed get a function that is smoother and likely to generalise. This is the idea behind early stopping.
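A minimal sketch of this naive form of early stopping, under stated assumptions: the model is a degree-9 polynomial fitted by plain gradient descent on mean squared error, the data are synthetic, and "stopping early" simply means taking 50 steps instead of 5000. None of these choices come from the lecture. The size of the weight vector is used here as a rough, assumed proxy for how far the fitted function has moved from the flat starting point.

```python
import math
import random

random.seed(0)

DEGREE = 9  # flexible enough to chase noise in 15 points

def features(x):
    """Polynomial features 1, x, ..., x^DEGREE (each in [-1, 1] for x in [-1, 1])."""
    return [x ** k for k in range(DEGREE + 1)]

def make_data(n):
    """Hypothetical data: a sine signal plus Gaussian noise."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + random.gauss(0, 0.3) for x in xs]
    return [features(x) for x in xs], ys

def predict(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def mse(w, X, y):
    return sum((predict(w, f) - t) ** 2 for f, t in zip(X, y)) / len(y)

def gradient(w, X, y):
    """Gradient of the mean squared error with respect to the weights."""
    g = [0.0] * len(w)
    for f, t in zip(X, y):
        r = predict(w, f) - t
        for j, fj in enumerate(f):
            g[j] += 2 * r * fj / len(y)
    return g

def gradient_descent(X, y, steps, lr):
    """Start from all-zero weights and take a fixed number of gradient steps."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(w, X, y)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

X_train, y_train = make_data(15)
X_new, y_new = make_data(200)

# Conservative step size: every feature is at most 1 in magnitude, so the
# curvature of the loss is at most 2 * (DEGREE + 1).
lr = 1.0 / (2 * (DEGREE + 1))

w_early = gradient_descent(X_train, y_train, steps=50, lr=lr)    # stopped early
w_late = gradient_descent(X_train, y_train, steps=5000, lr=lr)   # run much longer

for name, w in [("early", w_early), ("late", w_late)]:
    print(name, "train loss:", mse(w, X_train, y_train),
          "new-data loss:", mse(w, X_new, y_new), "weight norm:", norm(w))
```

With this step size the training loss can only go down as more steps are taken, and the weights only grow away from zero, so the early-stopped model has the higher training loss and the smaller weights. Whether it also does better on the new data is not guaranteed, which matches the point above that the idea is ad hoc; the script prints the new-data loss so the two can be compared for this particular sample.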
Early Stopping and Overfitting
A discussion of the intuitive idea behind using (the naive version of) early stopping to avoid overfitting. We discuss the intuitive relation between overfitting, function smoothness and model complexity, in the context of the optimization of a loss function.
© Dr Michael Ashcroft