
Cross Validation Exercise 1

The first exercise on cross validation.
So now we’re going to look at an example exercise where we perform cross-validation. And we’re actually going to use cross-validation to work out an optimal value for a hyperparameter. So what are hyperparameters? Well, hyperparameters are not parameters of the model; they’re parameters of the algorithm that generates the model. Examples we have seen so far are the order of a polynomial regression model and the number of hidden nodes in a neural network. Since we’ve worked with both polynomial regression and neural networks in these examples, we’ve actually been using hyperparameters. The question then is, obviously: how can we work out a good value for a hyperparameter?
How can we work out what order a polynomial regression model should be, or how many hidden nodes a neural network should have? Well, the simplest way to do it is to build models with different values of the hyperparameter, and then perform model selection on the resulting models. That is what we’ll do here. In this example, we’ll do it with polynomial regression: we’ll build a series of polynomial regression models of different orders, and then evaluate the performance of each model on validation data to see which one performs best. OK, so this example is in Cross Validation Ex1.
Like always, we prepare the data.
We’ve just got a synthetic y versus x. We’re going to do a two-way split, which will give us training and test data. But of course, in this case, because we’re doing cross-validation, the training data will also participate in the validation itself. How is that going to work? Well, remember that cross-validation splits the training data into subsets, builds a series of models using every subset bar one, and evaluates each of those models on the remaining held-out subset, doing this with each possible subset held out in turn. Of course, you’ve gone over that in the article, so we don’t need to labour the point.
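A minimal sketch of this kind of two-way split in R. The data-generating formula, sample size, and split proportion here are illustrative assumptions, not the actual contents of Cross Validation Ex1.R:

```r
set.seed(1)
# Illustrative synthetic data: y is a noisy polynomial function of x
dat <- data.frame(x = runif(200, -10, 10))
dat$y <- dat$x^4 - 3 * dat$x^2 + rnorm(200, sd = 500)

# Two-way split: 80% training/validation, 20% test.
# The training portion will later be split again into folds for cross-validation.
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```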
Let’s see how this works in this case. What we’re going to do is we’re going to build polynomial regression models of order three, four, five, and six. And we’ll use tenfold cross-validation to evaluate their performance. I’ve built this function here, this sapply function that will do that.
What it’s going to do is split the training data up into 10 subsets. And inside the implicit loop of the sapply function, it will build models from each set of subsets bar a particular one, and see how those models perform in predicting the held-out subset. You, of course, can attempt to replicate this to get some practice.
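One way to sketch this tenfold cross-validation over polynomial orders with sapply. The synthetic data and variable names here are illustrative assumptions; the function in Cross Validation Ex1.R may be structured differently:

```r
set.seed(1)
# Illustrative training data (stand-in for the training split in the exercise)
x <- runif(160, -10, 10)
train <- data.frame(x = x, y = x^4 - 3 * x^2 + rnorm(160, sd = 500))

k <- 10
folds <- sample(rep(1:k, length.out = nrow(train)))  # randomly assign rows to 10 folds

orders <- 3:6
cv_tse <- sapply(orders, function(ord) {
  sum(sapply(1:k, function(f) {
    # Fit on all folds bar fold f, then evaluate on the held-out fold f
    fit   <- lm(y ~ poly(x, ord), data = train[folds != f, ])
    preds <- predict(fit, newdata = train[folds == f, ])
    sum((train$y[folds == f] - preds)^2)  # total squared error on the hold-out
  }))
})
names(cv_tse) <- paste0("order_", orders)
cv_tse
```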
The result, note, is total squared error rather than mean squared error. And here we see the total squared error results for the four different models. Third order: over 60 million. Fourth order: a little over 50 million, 50.18 million. Fifth order: 50.20 million. And sixth order: 50.20 million. So the best model, the best order of polynomial regression model, was clearly fourth order.
We wouldn’t normally look at the results by eye. We’d automate it, allowing the computer to examine the total squared error results and find the best model, which is what we do here. And now we’re going to create a model of that order using the whole training set and see how it performs on the test data. What I want to do is compare how it performs on the test data to how it performed on the validation data, so I’m going to convert the total squared error into a mean squared error.
And then also generate the mean squared error on the test data.
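These selection and evaluation steps might look like the following sketch. The cross-validation totals are the illustrative figures quoted in the video, and the synthetic data is an assumption standing in for the exercise’s actual data:

```r
set.seed(1)
# Illustrative training and test splits
x <- runif(200, -10, 10)
y <- x^4 - 3 * x^2 + rnorm(200, sd = 500)
train <- data.frame(x = x[1:160], y = y[1:160])
test  <- data.frame(x = x[161:200], y = y[161:200])

# Cross-validation total squared errors, roughly as quoted in the video
cv_tse <- c(order_3 = 60.5e6, order_4 = 50.18e6,
            order_5 = 50.20e6, order_6 = 50.20e6)
orders <- 3:6

# Automated model selection: pick the order with the smallest CV error
best_order <- orders[which.min(cv_tse)]

# Convert total squared error to mean squared error for comparability
cv_mse <- min(cv_tse) / nrow(train)

# Refit at the chosen order on the whole training set, evaluate on test data
final_fit <- lm(y ~ poly(x, best_order), data = train)
test_mse  <- mean((test$y - predict(final_fit, newdata = test))^2)
```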
Now that we have the model’s performance on both the test data and the cross-validated data, it would be possible to perform a statistical test to make sure that the model is not doing significantly worse on the test data than it was on the validation data. If we can confirm that this is not happening, we can be confident that we weren’t merely selecting that model based on luck in how well it performed on that particular validation data. At any rate, let’s see how our chosen model performs.
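One plausible form such a test could take is a one-sided t-test comparing per-observation squared errors. The video doesn’t perform the test, so this is purely a hedged sketch, and the error vectors below are simulated stand-ins; in practice they would come from the cross-validation hold-out predictions and the test-set predictions respectively:

```r
set.seed(1)
# Stand-in squared errors (hypothetical; replace with real CV and test residuals^2)
sq_err_val  <- rchisq(160, df = 3)   # from the cross-validation hold-outs
sq_err_test <- rchisq(40,  df = 3)   # from the test set

# One-sided test: is the test error significantly LARGER than the validation
# error? A non-significant result is the reassuring outcome here.
res <- t.test(sq_err_test, sq_err_val, alternative = "greater")
res$p.value
```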
And I’ve also worked out the standard deviation of the residuals on the test data, and I’m going to use that to create some confidence intervals.
And there we are: confidence intervals, exactly like we talked about in the article. I’m just plotting the prediction, the regression curve, plus or minus two standard deviations, where the standard deviation of the residuals was found from the test data. So the black line is our regression curve; the red lines are our confidence intervals at two standard deviations.
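A sketch of producing that plot in base R, assuming illustrative synthetic data and a third-order fit; the exercise file’s actual data, chosen order, and plotting code may differ:

```r
set.seed(1)
# Illustrative training and test splits
x <- runif(200, -10, 10)
y <- x^3 - 5 * x + rnorm(200, sd = 50)
train <- data.frame(x = x[1:160], y = y[1:160])
test  <- data.frame(x = x[161:200], y = y[161:200])

fit <- lm(y ~ poly(x, 3), data = train)

# Standard deviation of the residuals on the TEST data
s <- sd(test$y - predict(fit, newdata = test))

# Regression curve plus/minus two standard deviations
grid <- data.frame(x = seq(-10, 10, length.out = 100))
curve_hat <- predict(fit, newdata = grid)

plot(train$x, train$y, pch = 16, col = "grey",
     xlab = "x", ylab = "y")
lines(grid$x, curve_hat, col = "black")         # regression curve
lines(grid$x, curve_hat + 2 * s, col = "red")   # upper confidence band
lines(grid$x, curve_hat - 2 * s, col = "red")   # lower confidence band
```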
The first exercise on cross validation. The associated code is in the Cross Validation Ex1.R file. Interested students are encouraged to replicate what we go through in the video themselves in R, but note that this is an optional activity intended for those who want practical experience in R and machine learning.
In this video exercise, we perform cross-validation to determine a good value for a hyperparameter of the training algorithm. The example looks at determining the optimal order of a polynomial regression model on a synthetic data set.
We divide the data into training/validation and test subsets, and then perform cross-validation using the training/validation data. We build a set of polynomial regression models of different orders and evaluate their performance via cross-validation. We use these results to determine the best order of polynomial regression model for this problem, build a model of this order from the full training/validation data, and obtain an unbiased estimate of this model’s expected performance on new data using the test data.
In addition, we look at how we can create confidence intervals around the regression curve for our chosen model, and discuss how statistical hypothesis tests on the chosen model’s performance on the validation and test data can be used as an additional safeguard to ensure that our model-selection process led to a reasonable result.
Note that the stats R package is used in this exercise. It is distributed as part of base R, so it should already be present on your system. In general, you can install packages using the install.packages function in R.
This article is from the free online course Advanced Machine Learning, created by FutureLearn - Learning For Life.
