Skip to 0 minutes and 2 seconds All right, now we’re looking at the second learning graph example. Now, what we’re going to do here is very similar to what we did in the first learning graph example, but we’re going to be doing it with a bunch more models. And we’re going to see the differences in the learning graphs for the different types of models, and in particular for basic models compared to complex models. First of all, let’s generate the data. Here, I’m just going to use synthetic data.
Skip to 0 minutes and 37 seconds And once again, we’re going to use this synthetic data. We can have a little bit of a look at it. I called it train, because as I said, you’d be doing learning graphs on the training data, not on the complete data. It’s got 500 rows. It’s got an x column and a y column. We’ll be estimating the y based on the x. So as before, we’re going to be generating models from 10%, 20%, up to 100% of the training data, and estimating their in-sample and out-of-sample mean squared error. The out-of-sample error we’ll estimate with 10-fold cross-validation. But rather than doing it for a single ordinary least squares model, we’re going to be doing it for six polynomial regression models.
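The procedure just described can be sketched in R for a single linear model. This is only a sketch: the quartic used to generate the data, the noise level and the helper name `cv_mse` are assumptions made for illustration, not the contents of Learning Graphs Ex2.R, so the error values will not match those in the video.

```r
# Sketch only. The quartic below is an assumed data-generating function,
# not the one used in Learning Graphs Ex2.R.
set.seed(1)
n <- 500
x <- runif(n, -3, 3)
train <- data.frame(x = x,
                    y = 2 * x^4 - 3 * x^3 - 10 * x^2 + 5 * x + rnorm(n, sd = 10))

# In-sample MSE and k-fold cross-validated (out-of-sample) MSE for one formula.
# Hypothetical helper; it assumes the response column is called `y`.
cv_mse <- function(formula, data, k = 10) {
  in_sample <- mean(residuals(lm(formula, data = data))^2)
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    held_out <- folds == i
    fit <- lm(formula, data = data[!held_out, ])        # fit on the other folds
    pred <- predict(fit, newdata = data[held_out, ])    # predict the held-out fold
    sq_err[held_out] <- (data$y[held_out] - pred)^2
  }
  c(in_sample = in_sample, out_of_sample = mean(sq_err))
}

# Fit on the first 10%, 20%, ..., 100% of the training rows.
sizes <- round(seq(0.1, 1, by = 0.1) * n)
lg <- data.frame(rows = sizes,
                 t(sapply(sizes, function(m) cv_mse(y ~ x, train[seq_len(m), ]))))
print(lg)
```

Each row of `lg` gives one point on each of the two lines of a learning graph: the in-sample error and the cross-validated error at that amount of training data.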
Skip to 1 minute and 21 seconds And the orders are going to be order one, two, three, four, five, and 20. Now, of course polynomial regression of order one is just linear regression. So you can look over these lines and my comments a little bit more slowly in your own time, but it’s doing exactly the same thing that we did in the last exercise, only now for these six models rather than a single one.
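Extending the same idea to the six polynomial orders might look like the following sketch. It assumes the models are fitted with `lm` and `poly` from base R; the data-generating quartic and the helper name `poly_mse` are hypothetical and chosen only for illustration.

```r
# Sketch only. Synthetic data from an assumed quartic, not the course's function.
set.seed(2)
n <- 500
x <- runif(n, -3, 3)
train <- data.frame(x = x,
                    y = 2 * x^4 - 3 * x^3 - 10 * x^2 + 5 * x + rnorm(n, sd = 10))

# In-sample and 10-fold CV mean squared error of an order-`d` polynomial fit.
# Hypothetical helper; assumes columns named `x` and `y`.
poly_mse <- function(d, data, k = 10) {
  f <- as.formula(sprintf("y ~ poly(x, %d)", d))
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  cv_err <- unlist(lapply(seq_len(k), function(i) {
    fit <- lm(f, data = data[folds != i, ])
    (data$y[folds == i] - predict(fit, newdata = data[folds == i, ]))^2
  }))
  c(in_sample = mean(residuals(lm(f, data = data))^2),
    out_of_sample = mean(cv_err))
}

# One row per (training size, polynomial order) combination: 10 sizes x 6 orders.
grid <- expand.grid(rows = round(seq(0.1, 1, by = 0.1) * n),
                    order = c(1:5, 20L))
errs <- t(mapply(function(m, d) poly_mse(d, train[seq_len(m), ]),
                 grid$rows, grid$order))
results <- cbind(grid, errs)
print(head(results))
```

Order one reduces to ordinary linear regression, since `poly(x, 1)` spans the same model as `y ~ x`.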
Skip to 1 minute and 51 seconds Once we’ve got these results, we’re going to graph them, just like before. Except we’re now going to get six graphs.
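One way to draw six such graphs with base R graphics is sketched below. The function name `plot_learning_graphs`, the shared y-axis limit and the colours are illustrative choices, and the data come from an assumed quartic rather than from the course file.

```r
# Sketch only. Synthetic data from an assumed quartic, not the course's function.
set.seed(4)
n <- 500
x <- runif(n, -3, 3)
train <- data.frame(x = x,
                    y = 2 * x^4 - 3 * x^3 - 10 * x^2 + 5 * x + rnorm(n, sd = 10))

# Hypothetical helper: in-sample and 10-fold CV MSE of an order-`d` polynomial.
poly_mse <- function(d, data, k = 10) {
  f <- as.formula(sprintf("y ~ poly(x, %d)", d))
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  cv_err <- unlist(lapply(seq_len(k), function(i) {
    fit <- lm(f, data = data[folds != i, ])
    (data$y[folds == i] - predict(fit, newdata = data[folds == i, ]))^2
  }))
  c(in_sample = mean(residuals(lm(f, data = data))^2),
    out_of_sample = mean(cv_err))
}

grid <- expand.grid(rows = round(seq(0.1, 1, by = 0.1) * n), order = c(1:5, 20L))
results <- cbind(grid, t(mapply(function(m, d) poly_mse(d, train[seq_len(m), ]),
                                grid$rows, grid$order)))

# One panel per polynomial order, all on a shared y-axis so levels are comparable.
# Errors far above the limit (small samples, high order) fall off the panel.
plot_learning_graphs <- function(results) {
  orders <- unique(results$order)
  ylim <- c(0, 2 * max(results$in_sample))
  old_par <- par(mfrow = c(2, ceiling(length(orders) / 2)))
  on.exit(par(old_par))
  for (d in orders) {
    r <- results[results$order == d, ]
    plot(r$rows, r$out_of_sample, type = "l", col = "red", ylim = ylim,
         xlab = "Training rows", ylab = "MSE", main = paste("Order", d))
    lines(r$rows, r$in_sample, col = "blue")
    legend("topright", legend = c("out-of-sample", "in-sample"),
           col = c("red", "blue"), lty = 1, bty = "n")
  }
  invisible(length(orders))
}

if (!interactive()) pdf(NULL)  # avoid writing Rplots.pdf when run as a script
n_panels <- plot_learning_graphs(results)
```

A shared y-axis makes it easy to compare the level each model converges to, at the cost of clipping the largest errors.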
Skip to 2 minutes and 7 seconds And notice that they look quite different for the different models.
Skip to 2 minutes and 13 seconds The low-order polynomial models, order one, order two, notice that the lines are very close together from the start. They’re not getting a great deal of improvement from more data.
Skip to 2 minutes and 30 seconds Now, as we go to the higher orders, three, but particularly four and five, there is a bit of a gap at the start, but that’s disappeared pretty much by 150 to 200 rows or so. Notice also, though, that these models are converging at a much lower level than the order one and order two models. These more complex models are basically zooming in on a mean squared error of around about 6,000, whereas the order one and two models are zooming in on a mean squared error of up around 10,000 or 11,000.
Skip to 3 minutes and 11 seconds So clearly, the more complex polynomial regression models, orders four and five, are performing better, and they appear to need maybe 200, but certainly say 150 rows of training data. Let’s jump to the really complex model, order 20. Now, this too is zooming in on about the same, 5,000. But clearly, it still needs a lot more data to be able to really maximise its potential. It doesn’t perform at all well up until about 150 rows. It’s off this graph. And the lines are still clearly separate up at the maximum amount of training data, 500. Now, this is actually exactly what we would expect from a model that is overly complex.
Skip to 4 minutes and 10 seconds If you give it enough data, this overly complex model will end up performing as well as an optimal model. If you give it less than that amount, it’s going to overfit. So the order 20 model, if you give it a sufficient amount of data, will converge to the same sort of performance as the order four and order five models. But it needs that extra data to be able to do so. If it doesn’t get that extra data, it’s going to overfit. The order four and order five models, they appear to be roughly the right complexity to model this function.
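The overfitting argument can be illustrated with a small sketch that compares an order 4 and an order 20 fit on 50 rows and on 500 rows. Note the swapped-in technique: for brevity this uses a fresh held-out sample as the out-of-sample estimate rather than 10-fold cross-validation, and both the data-generating quartic and the helper names `make_data` and `mse` are assumptions.

```r
# Sketch only. Synthetic data from an assumed quartic, not the course's function.
set.seed(3)
make_data <- function(n) {
  x <- runif(n, -3, 3)
  data.frame(x = x,
             y = 2 * x^4 - 3 * x^3 - 10 * x^2 + 5 * x + rnorm(n, sd = 10))
}
train <- make_data(500)
test <- make_data(2000)  # fresh rows standing in for out-of-sample data

# Mean squared error of a fitted model on a given data frame.
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Fit order 4 and order 20 on the first 50 rows and on all 500 rows.
compare <- do.call(rbind, lapply(c(50, 500), function(m) {
  d <- train[seq_len(m), ]
  do.call(rbind, lapply(c(4L, 20L), function(deg) {
    fit <- lm(as.formula(sprintf("y ~ poly(x, %d)", deg)), data = d)
    data.frame(rows = m, order = deg,
               train_mse = mse(fit, d), test_mse = mse(fit, test))
  }))
}))
print(compare)
```

On the same rows the order 20 fit can never have a higher training error than the order 4 fit, because the order 4 model is nested inside it. What matters is how far its error on unseen data sits above its training error, and how that gap changes as rows are added.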
Skip to 4 minutes and 45 seconds If you actually look at how we generated the synthetic data, you’ll find, I think, that it is a fourth order polynomial. So that’s no surprise. They managed to converge to performing very well, with not much of a difference between the two lines, very quickly, needing about 150 to 200 rows of data. The simple models, order one and order two, they converged to doing as well as they can very quickly. The two lines come together very quickly. But they’re simply too unsophisticated. They’re too simple to be able to model the data-generating function well. And so, although they converged to doing as well as they can quickly, they actually can’t do very well.
Skip to 5 minutes and 27 seconds Their mean squared error is up at about 11,000, or almost twice what the order four and five models achieve.
Learning Graphs Exercise 2
A second exercise video for learning graphs. The associated code is in the Learning Graphs Ex2.R file. Interested students are encouraged to replicate what we go through in the video themselves in R, but note that this is an optional activity intended for those who want practical experience in R and machine learning.
We will generate learning graphs for six different polynomial regression models, varying in complexity, on synthetic data. We discuss the six different graphs and the information they give us regarding the utility of the different model types for this problem. These graphs, and this discussion, form the basis for the quiz given in the next step.
Note that the stats R package is used in this exercise. It ships with a standard R installation and is loaded by default, so no separate installation should be needed. Other packages can be installed using the install.packages function in R.
© Dr Michael Ashcroft