Skip to 0 minutes and 1 second When I teach the idea that we should split our dataset into three rather than two, and have training, validation and test data, a lot of students are perplexed as to why we need this third dataset: why the validation data that we used to select the best model doesn’t also give us an unbiased estimate of the performance of that model on new data. So here’s a little walk-through to give you a real demonstration of why that isn’t the case. I’ll explain why, and the article you’ve just read explains it as well.
Skip to 0 minutes and 41 seconds But I think this example really drives home the fact that the validation data that we use to select the model will not give us an unbiased estimate of that model’s performance on new data. So theoretically, we remember that the performance of a model on new data is itself a random variable. This performance is going to vary depending on the samples that were used for training the model, and the samples that are used in the validation data for evaluating the model. Now we can see this variance in a really simple example. Imagine that our data set is just a coin. It’s a fair coin, so the chances of it coming up heads or tails are equal, at 0.5 each.
Skip to 1 minute and 37 seconds Now we can imagine that we’ve got a bunch of data, historic data, about the coin coming up heads or tails. And we can train a whole bunch of models on this. Now in this case, we know that no model can do better than 50% accuracy in general, because everything here is irreducible error. There is no pattern to be found. No model can hope to do better than simply random guessing. Nonetheless, if we were to build sufficiently many models, some of them will, purely by chance, do better or worse than 50% on the validation data.
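To put a rough number on that claim, here is a minimal Python sketch (not the course's R code) that computes the exact probability that at least one of several random guessers scores above 50% on a small validation set. The function name `prob_best_beats_half` and the validation set size of 10 flips are assumptions made for illustration; the video does not state the size used.

```python
from math import comb


def prob_best_beats_half(n_models: int, n_val: int) -> float:
    """Probability that at least one of n_models independent random guessers
    scores strictly above 50% accuracy on n_val fair coin flips."""
    # Probability that a single guesser gets at most half of the flips right.
    p_at_most_half = sum(comb(n_val, k) for k in range(n_val // 2 + 1)) / 2 ** n_val
    # The guessers are independent, so all of them stay at or below 50%
    # with probability p_at_most_half ** n_models.
    return 1.0 - p_at_most_half ** n_models


# Assumed setting: 20 models (as in the video) and 10 validation flips (hypothetical).
print(prob_best_beats_half(n_models=20, n_val=10))
```

Under these assumptions a single guesser beats 50% well under half the time, yet with 20 guessers it is almost certain that at least one of them does.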
Skip to 2 minutes and 23 seconds Now because we are going to pick the best model of however many models we’ve built, the chances are very high that the model we select is going to have performed better than 50% on the validation data. And so if we were to use its performance on the validation data as an estimate of how well it will perform in future, our estimate is going to be biased because we picked the best model. So here in this example, called coin flippers, if you source the code, you can run it. By default it will build 20 models that try to predict heads or tails based on the training data.
Skip to 3 minutes and 13 seconds Now all these models are just going to throw out random heads or tails as guesses on both the validation data and the test data. But remember, that’s fine, because no model can do better than random guessing. When we run it, we see that the best model has an accuracy of 0.8 on the validation data. That is to say, of all 20 models we created, we saw how they performed on the validation data. We know that, in general, all of them will perform at 0.5 in the long run. But on the validation data, one of them managed to perform with 0.8 accuracy. It got 80% of the cases right. Now, we picked that one because it did best. This is obviously biased.
Skip to 4 minutes and 1 second It did better than 50% only because of the variance of throwing a whole bunch of models at this entirely random data. We then see how this model alone performs on the test data. And here it is getting 0.5. This estimate of its performance on the test data is unbiased, because we’re only seeing how that single model will perform on the test data. If you run this yourself a few times, you’ll see that the pattern basically is, when we throw a bunch of models at the validation data and pick the best one, it will typically have a performance better than 50%.
Skip to 4 minutes and 42 seconds But then when we see how that model performs on its own on the test data, it will, on average, end up getting a performance of 50%. Now we can see this even more clearly if we plot the performance of the selected model on the validation data and on the test data over multiple runs of this experiment. So we can look at what happens over 1,000 runs. And we see exactly what we expect: over 1,000 runs, the mean performance of the selected model on the validation data is up at 0.78, far higher than what we know to be the true expected performance of 0.5.
Skip to 5 minutes and 38 seconds So it’s clearly a biased estimate of the expected performance of the selected model. But the selected model’s performance on the test data is indeed converging on 0.5, which is the known expected performance.
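The experiment described in the video can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the course's Tidbits Test Data.R code: the helper names `accuracy` and `run_once`, the validation and test set sizes of 10 flips each, and the fixed random seed are all hypothetical choices. Only the 20 models and the 1,000 runs come from the video.

```python
import random
from statistics import mean


def accuracy(guesses, truth):
    """Fraction of guesses that match the true outcomes."""
    return sum(g == t for g, t in zip(guesses, truth)) / len(truth)


def run_once(rng, n_models=20, n_val=10, n_test=10):
    """Build n_models random guessers, select the one with the best
    validation accuracy, and return (its validation accuracy, its test accuracy)."""
    val_truth = [rng.choice("HT") for _ in range(n_val)]
    test_truth = [rng.choice("HT") for _ in range(n_test)]
    results = []
    for _ in range(n_models):
        # Each "model" simply guesses heads or tails at random.
        val_guess = [rng.choice("HT") for _ in range(n_val)]
        test_guess = [rng.choice("HT") for _ in range(n_test)]
        results.append((accuracy(val_guess, val_truth),
                        accuracy(test_guess, test_truth)))
    # Selection uses the validation accuracy only; the test data plays no part.
    return max(results, key=lambda r: r[0])


rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable
runs = [run_once(rng) for _ in range(1000)]
mean_val = mean(v for v, _ in runs)
mean_test = mean(t for _, t in runs)
print(f"mean validation accuracy of selected model: {mean_val:.2f}")
print(f"mean test accuracy of selected model:       {mean_test:.2f}")
```

With these assumed sizes, the mean validation accuracy of the selected model should come out well above 0.5, while its mean test accuracy should stay close to 0.5. The gap arises only because the validation data was used to choose the model.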
Why do we need test data?
We give an example that illustrates the need for held-out test data in validation techniques whenever we want an unbiased estimate of the performance of the final (selected) model.
This is not a video-exercise step - there is no optional associated code exercise which you are able to complete. You can, though, look at the code used in this video. It can be found in the Tidbits Test Data.R file.
Please note that the audio quality in parts of this video is lower than in other videos in this course.
© Dr Michael Ashcroft