[0:01] OK, so now we're going to work through an example exercise that looks at the statistical tests we talked about in the article. This is Statistical Test Ex1.

[0:16] Now follow along while I look over the code. You can, of course, run the entire function once you've sourced the code, just by calling the statistical testing exercise one function. We'll go through it line by line, and if you're interested, you can attempt to replicate it yourself. So we've prepared the data.

[0:38] This data is a set of careers from, I think, 1950 USA, and we're going to look at the relationship between income and prestige for different careers in America at that time. This dataset has only 45 rows. Since it's such a small dataset, what we'll do is all-but-one (leave-one-out) cross-validation. We'll still want a test set, though, so that we can get an unbiased estimate of the chosen model's performance on new data. So we'll do a two-way split between training data and test data, and then do cross-validation on the training data. Let's do that random split now: 35 rows in training, 10 in test.
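The split described above might be sketched as follows. The exercise's actual code is in *Statistical Test Ex1.R*; the *Duncan* dataset in the *car* package matches the description (45 US occupations from 1950, with income and prestige columns), but whether it is the exact dataset used here is an assumption, and the seed is arbitrary.

```r
library(car)        # provides the Duncan occupational-prestige dataset
data(Duncan)

set.seed(1)                    # arbitrary seed, just so the split is reproducible
n <- nrow(Duncan)              # 45 rows
trainIdx <- sample(n, 35)
train <- Duncan[trainIdx, ]    # 35 rows for cross-validation
test  <- Duncan[-trainIdx, ]   # 10 rows held out for the final evaluation
```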

[1:46] Now we're going to do all-but-one cross-validation for three different model types: ordinary least squares, Poisson regression, and a second-order polynomial regression model. Since there are 35 rows in the training data, that means we'll build 35 models of each type, each using all of the training data bar one row, and then see how well each model performs on its held-out row, combining the results. We do that for ordinary least squares, for Poisson regression, and for polynomial regression. Now we want to find the best and second-best algorithms according to these cross-validation results.
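A minimal sketch of that leave-one-out loop, assuming prestige is regressed on income as described (the `train` data frame and the exact model formulas are assumptions, not the exercise's literal code):

```r
# For each row i, fit all three models on the other 34 rows and record the
# residual on the single held-out row.
loocvResiduals <- function(train) {
  n <- nrow(train)
  res <- matrix(NA, n, 3, dimnames = list(NULL, c("ols", "poisson", "poly")))
  for (i in seq_len(n)) {
    fit  <- train[-i, ]                 # all training rows bar one
    hold <- train[i, , drop = FALSE]    # the single held-out row
    mOls  <- lm(prestige ~ income, data = fit)
    mPois <- glm(prestige ~ income, data = fit, family = poisson)
    mPoly <- lm(prestige ~ poly(income, 2, raw = TRUE), data = fit)
    res[i, "ols"]     <- hold$prestige - predict(mOls, hold)
    res[i, "poisson"] <- hold$prestige - predict(mPois, hold, type = "response")
    res[i, "poly"]    <- hold$prestige - predict(mPoly, hold)
  }
  res
}

loocvRes <- loocvResiduals(train)   # 35 holdout residuals per model
```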

[2:41] Here are our results. We'll calculate the mean squared errors from the residuals, work out which model was best, which was second best, and which was worst, and give the mean squared error for each. So here we go: ordinary least squares was best, with a mean squared error of 381. Second best was polynomial regression, with a mean squared error of 396. And the worst was Poisson regression, with a mean squared error of 440. Now we want to do a t-test to get some indication of how confident we can be that the ordinary least squares algorithm really can be expected to perform better on new data than the polynomial regression model.
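Turning the holdout residuals into a ranking can be sketched like this, assuming `loocvRes` is a 35 × 3 matrix of holdout residuals with one column per model:

```r
mses <- colMeans(loocvRes^2)    # mean squared error for each model
ranking <- sort(mses)           # smallest MSE first
best       <- names(ranking)[1]
secondBest <- names(ranking)[2]
worst      <- names(ranking)[3]
ranking                         # the three MSEs, best to worst
```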

[3:39] So we're going to test how probable it is that we would see the difference we observe in the situation where, in the long run, the expected performance of ordinary least squares and polynomial regression is equal. We ran this t-test and we see that the p-value for the t-test between OLS and polynomial regression was 0.46. That is a very high value: we cannot rule out the possibility that polynomial regression is in fact equally as good as, or better than, ordinary least squares. Or at least, we can't rule that out based on the performance we've seen in this cross-validation. And that's not very surprising, because there was very little data.
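One natural way to run this comparison is a t-test on the squared holdout errors of the two models; since both models were evaluated on the same 35 holdout rows, a paired test fits, though whether the exercise pairs the samples is an assumption:

```r
# Compare squared holdout errors of the best and second-best models.
ttRes <- t.test(loocvRes[, "ols"]^2, loocvRes[, "poly"]^2, paired = TRUE)
ttRes$p.value   # a large p-value means we cannot rule out equal expected MSE
```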

[4:31] Even doing all-but-one cross-validation, we only managed to evaluate our models on 35 rows. If you were to see something like this, you would need to concede that you didn't really have enough validation data (in this case, cross-validation data) to confidently determine which model can be expected to perform better on new data; that is, which model is the best.

[5:03] Nonetheless, we will proceed with our chosen model, the ordinary least squares model, and see how well it performs on the test data.

[5:30] We see that our chosen algorithm, ordinary least squares, obtained a mean squared error of 113 on the test data. But we want more information than that: we want some sort of confidence interval around this unbiased estimate of expected performance on new data. And we can get one, once again, using a t-test. We'll look at the squared residuals of the ordinary least squares model on the test data, and we'll set the confidence level to 0.995. That is, we want the interval within which we can have 99.5% confidence that the true expected error (the expected mean squared error) of our chosen model on new data lies.
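The final evaluation and confidence interval might be sketched as follows, assuming `train` and `test` come from the earlier split and the same model formula as before (both assumptions):

```r
# Refit the chosen model on all training data, then evaluate on the test set.
chosen <- lm(prestige ~ income, data = train)
sqErr  <- (test$prestige - predict(chosen, test))^2   # squared test residuals

mean(sqErr)   # unbiased estimate of the expected MSE on new data

# A one-sample t-test on the squared errors gives a 99.5% confidence interval
# for the expected MSE.
t.test(sqErr, conf.level = 0.995)$conf.int
```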

[6:30] And here we have the 99.5% confidence interval for our chosen model: it lies between negative 97 and positive 324, which, again, is a huge interval. Essentially, we shouldn't trust this estimate of the expected performance of our chosen model on new data one bit. The estimate says we expect the mean squared error of the ordinary least squares model on new data to be 113. That's an unbiased estimate, but we should have very little confidence in it. Again, this is unsurprising, because we had so little data: our test data had only 10 rows.

[7:20] As the amount of test data increases, we will be able to be more and more confident in our estimate of the expected performance of our chosen model. But in this case, we simply should have very little confidence in it at all.

[7:37] You might be wondering what's going on when the lower limit of our confidence interval for a mean squared error estimate is a negative number. A mean squared error is an average of squares, and squares cannot be negative, so a negative value is actually impossible. What is going on is that we're using the t-test here, and this is only an approximation: it makes a number of assumptions about the statistic, in this case the mean squared error, that are not true in reality. The mean squared error, for example, cannot really be modelled with a normal distribution. So in this case, our confidence is so low that these assumptions are failing badly, and the confidence interval is spreading out into impossible negative values.

[8:30] Once again, this simply shows that we should have very, very little confidence in the estimate of the expected mean squared error of our chosen OLS model.

# Statistical Tests Exercise 1

A code exercise about using statistical hypothesis testing in model evaluation. The associated code is in the *Statistical Test Ex1.R* file. Interested students are encouraged to replicate what we go through in the video themselves in R, but note that this is an optional activity intended for those who want practical experience in R and machine learning.

On a small dataset we perform all-but-one (leave-one-out) cross-validation for model selection, and then evaluate our best-performing model on some held-out test data. During this process, we perform a simple statistical test comparing the performance of the best and second-best models in the model selection step. This gives us an idea of how confident we should be that the best-performing model really is better than the second best. In addition, we calculate a confidence interval for the expected MSE of our chosen model based on its performance on the test data.

Note that the *utils* and *car* R packages are used in this exercise. You will need to have them installed on your system. You can install packages using the *install.packages* function in R.
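For example (note that *utils* ships with base R, so typically only *car* needs installing):

```r
install.packages("car")   # utils is part of the base R distribution
library(car)
```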

Please note that the audio quality in parts of this video is lower than in other videos in this course.

© Dr Michael Ashcroft