Skip to 0 minutes and 11 seconds Hello again, and welcome back to Advanced Data Mining with Weka. We’re going to look at the time series forecasting package now to do roughly what we did in the last lesson without the time series forecasting package. I’ve got the airline data loaded here. The time series package has given me this additional Forecast tab. I’m going to go straight to that, and without any more ado I’m just going to click Start and see what happens. Well, the time series package transforms the data into a large number of attributes. Unfortunately, you don’t get to see the attributes in the Preprocess panel. We still just have those two attributes there. You don’t see the generated attributes there.
Skip to 0 minutes and 50 seconds You have to go to the Forecast panel and look here. Here’s the original attributes, and here’s transformed
Skip to 1 minute and 0 seconds training data: passenger_numbers; we’ve got month, quarter, date-remapped. The date-remapped is like what we did for the date in the last lesson. We did it manually, which changed it from milliseconds since January 1, 1970 into something more sensible. This actually does a better job, because it takes proper account of which years are leap years and which years aren’t. Then we’ve got these lagged variables. Before, we just had passenger_numbers lagged by 12, but now we’ve got lags of 1, 2, 3, right up to 12, for 12 months, I guess.
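To make the idea of lagged variables concrete, here is a minimal sketch in plain Python of how lags 1 up to some maximum could be built from a single series. It is not Weka's implementation; the function name `make_lags` and the sample numbers are made up for illustration, and unknown values are represented as `None`.

```python
def make_lags(series, max_lag):
    """Return one row per time step: [value, lag_1, ..., lag_max_lag].

    A lag that would reach back before the start of the series is
    unknown, which is represented here as None.
    """
    rows = []
    for t, value in enumerate(series):
        lags = [series[t - k] if t - k >= 0 else None
                for k in range(1, max_lag + 1)]
        rows.append([value] + lags)
    return rows

# Made-up numbers, not the airline data.
series = [10, 12, 15, 11, 13]
rows = make_lags(series, max_lag=3)
for row in rows:
    print(row)
```

Note that the first few rows contain unknown values, because there is nothing earlier in the series to lag from. With 12 lags, that affects the first 12 instances.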
Skip to 1 minute and 31 seconds We’ve got the square of the date-remapped and the cube of the date-remapped, in case you need those, and a bunch of other things, the date-remapped times these lagged variables. That’s a lot of variables. Underneath here is the generated model, which is very complicated. Let’s see how well it does. Actually, it doesn’t show here how well it does. To see that, we have to turn on Perform Evaluation. Let me click that here. Run it again, and we get a root-mean-squared
Skip to 2 minutes and 2 seconds error of 10.6 on the training set, which looks good: last time we got 16.0. That was the best figure we got. But remember, this is the error on the training set. That’s always very misleading. Let’s make a simpler model. There’s a lot of attributes here. We can’t edit the generated attributes, like I said, but we can apply a filter. So I’m going to go to Advanced Configuration, and for my base learner, I’m going to choose the FilteredClassifier. And in the FilteredClassifier, I’m going to specify linear regression just like we had before, and for the filter, I’m going to choose the Remove attribute filter. Here it is.
Skip to 2 minutes and 53 seconds I’m going to configure that to keep attributes number 1, 4, and 16, which I happen to know are the correct ones: I enter those three indices and set invertSelection to True, so these are the three attributes that I leave. Well, let’s just see what happens. Go back and look at my attributes, and here’s the generated attributes that we saw before. Now here’s the filtered attributes. We’ve got passenger_numbers, we’ve got date-remapped, and we’ve got this lag by 12. This is what we did in the last lesson, remember? Let’s see how we get on here. We got a root-mean-squared error of 27.8.
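The effect of listing indices and inverting the selection can be sketched in plain Python. This only mimics the behaviour described above and is not Weka code; `remove_attributes` and the placeholder attribute names are assumptions for illustration, with indices counted from 1 as in the lecture.

```python
def remove_attributes(rows, indices, invert_selection=False):
    """Drop the columns at the given 1-based indices.

    With invert_selection=True the listed columns are the ones kept,
    and every other column is dropped.
    """
    selected = {i - 1 for i in indices}

    def keep(col):
        return (col in selected) if invert_selection else (col not in selected)

    return [[v for col, v in enumerate(row) if keep(col)] for row in rows]

# Placeholder names standing in for 20 generated attributes.
header = [f"attr{i}" for i in range(1, 21)]
kept = remove_attributes([header], [1, 4, 16], invert_selection=True)
print(kept)
```

Without the inversion, the same three indices would remove those attributes and leave the other seventeen.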
Skip to 3 minutes and 35 seconds Actually, we got that in the last lesson, but we got even better results by deleting the first 12 instances. Remember, the first 12 instances have lagged variables with unknown values, and linear regression does bad things with unknown values, at least as far as time series are concerned. So I want to delete the first 12 instances.
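As a rough sketch of what deleting those leading instances amounts to, the following plain Python drops rows from the start of the data until it reaches one with no unknown values. The helper name `drop_leading_unknown` and the 30 made-up values are assumptions for illustration; this is not Weka code.

```python
def drop_leading_unknown(rows):
    """Remove rows from the start until the first row with no None values."""
    start = 0
    while start < len(rows) and any(v is None for v in rows[start]):
        start += 1
    return rows[start:]

# 30 made-up monthly values, each paired with its lag-12 value (None if unknown).
series = list(range(100, 130))
rows = [[y, series[t - 12] if t >= 12 else None] for t, y in enumerate(series)]
trimmed = drop_leading_unknown(rows)
print(len(rows), len(trimmed))
```

With a lag of 12, exactly the first 12 rows are removed, and the first remaining row pairs month 13 with month 1.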
Skip to 3 minutes and 58 seconds Now, I could do that by applying two filters, one removing attributes and one removing instances, and I could use the MultiFilter. But actually on the time series forecasting panel, there’s an easy way of doing that, which you really need to learn, because you’re going to be doing it a lot. In Advanced Configuration, we’re going to look at Lag creation and the More options. We’re going to say remove leading instances with unknown lag values. Let me run that, and now I get a root-mean-squared error of 15.8, and a model which is exactly the same
Skip to 4 minutes and 33 seconds as the model we got in the last lesson: 1.07 times lag_passenger_numbers plus 12.7. That’s what we got before. Now, let’s just return to this full model that we had. We won’t use the filtered classifier; we’ll just use linear regression. Here it is. Now, we get a root-mean-squared error of 8.7. It looks fantastic. But the model looks extremely complicated. We looked at it before. Here it is again. Look at the complexity of this model. So it’s probably overfitted. What we’d like to do is to evaluate this on held-out data that isn’t used for training. We can do that with the Evaluation panel.
Skip to 5 minutes and 20 seconds I’m going to evaluate on – we can either have a fraction here or a number of instances – I’m going to evaluate on 24 instances, that is two years’ worth of instances and run that. I get an error on the test data of 59. That’s huge. The error on the training data is only 6.4. So let’s just have a look at this on the slide. With the full model, all the attributes, we’ve got this enormous gap between the training error and the test error. And with this simple model, with just two attributes there, there’s a little gap, but not very big. So we could try reducing the attributes in other ways. We could actually use the AttributeSelectedClassifier.
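The hold-out procedure can be sketched in plain Python: train on everything except the last few instances, then compare the root-mean-squared error on the two parts. This is only an illustration under stated assumptions. The series is synthetic rather than the airline data, the model is a one-input least-squares line on the lag-12 value, and the names `rmse`, `fit_line`, and `holdout_errors` are made up here.

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def holdout_errors(xs, ys, n_test):
    """Train on all but the last n_test instances; return (train RMSE, test RMSE)."""
    train_x, test_x = xs[:-n_test], xs[-n_test:]
    train_y, test_y = ys[:-n_test], ys[-n_test:]
    slope, intercept = fit_line(train_x, train_y)
    predict = lambda x: slope * x + intercept
    return (rmse(train_y, [predict(x) for x in train_x]),
            rmse(test_y, [predict(x) for x in test_x]))

# Synthetic monthly series (trend plus a yearly cycle); NOT the airline data.
series = [100 + 2 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(72)]
lag12, target = series[:-12], series[12:]   # leading 12 instances already dropped
train_error, test_error = holdout_errors(lag12, target, n_test=24)
print(train_error, test_error)
```

The split is chronological, with the test instances taken from the end of the series, not chosen at random. On this synthetic series the lag-12 relationship is exactly linear, so both errors come out essentially zero; on real data, it is the gap between the two figures that signals overfitting.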
Skip to 6 minutes and 9 seconds I won’t do that for you, but to do that I’d have to choose the metalearner AttributeSelectedClassifier and specify linear regression as the base learner and then specify some attribute selection method. If I left that at all the defaults, I would in fact get four attributes selected. And I’d get a training and test error of 11 and 19. Still some indication of overfitting. The gap between these two figures really indicates overfitting. Now, we reduced the model to two attributes using a filter, the Remove filter. But actually there is a simpler way of doing that, which you need to learn, in the Forecast panel.
Skip to 6 minutes and 51 seconds If you go to Lag creation, by default it’s going to create lags from 1 to 12, which we saw. But if you use custom lag lengths, we can increase the minimum lag to 12, and now it’s only going to create a single lag, of length 12. I can remove the powers of time. Remember we had the time squared and the time cubed. We can remove the product of time and lagged variables. And if I go to periodic attributes here and click Customize, then I can choose which of these attributes it generates. Now, I’m not going to include any of those. So that will get us the simplest attribute set. I’ll just run that, and let’s look
Skip to 7 minutes and 37 seconds now at the attributes being used, just three of them: passenger_numbers, date-remapped, and this lag by 12. Down here, of course, we’ve got the same result as we got before. We’ve got the same model and the same training and test errors. If we plot these things, this is the training data. Now remember we’re ignoring the first 12 instances at the beginning because we have unknown values for the lagged variable, and we’re reserving 24 instances at the end for testing. So if we look now at the full model, we get this red line, and you can see that the predictions over the test data are starting to vary from those data points.
Skip to 8 minutes and 14 seconds If you look at the simple model, the one with just two attributes, then we get a more accurate line. Here they are, in fact, both together, and you can see the blue one from the simple model is more accurate than the red one for the full model. We’re using one-step-ahead predictions to evaluate the error here, which means that errors can propagate. If you look at the solid red line toward the end, the first of those big dips is an error, and then the second sort of ‘double dip’ is an error that’s propagated from the first error.
Skip to 8 minutes and 46 seconds Once it starts making an error in this kind of evaluation, when we’re evaluating the one step ahead each time, the errors are going to propagate. So it’s a pretty bad thing: once you start making errors, they get worse and worse. OK. That’s it. Weka’s time series forecasting package makes it easy to experiment with lagged variables and other kinds of things like that. It automatically generates many attributes, perhaps too many attributes, so it’s a good idea to always try simpler models. You can use the Remove filter, which we did at first, or you can choose which attributes you want using the Lag creation and Periodic attributes tabs under Advanced Configuration.
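One way an error can carry forward is when a wrong value is later used as a lagged input. The sketch below illustrates that with the simple model from this lesson, 1.07 times the lag-12 value plus 12.7, forecasting by feeding each prediction back in. Whether Weka's evaluation works exactly this way is not confirmed here; the function name `recursive_forecast` and the flat made-up history are assumptions for illustration.

```python
def recursive_forecast(history, steps, a, b, lag=12):
    """Forecast `steps` values with y[t] = a * y[t - lag] + b.

    Each new forecast is appended to the series, so later forecasts
    use earlier forecasts as their lagged input.
    """
    values = list(history)
    for _ in range(steps):
        values.append(a * values[-lag] + b)
    return values[len(history):]

a, b = 1.07, 12.7                 # coefficients of the simple model in this lesson
history = [100.0] * 12            # made-up flat history, not the airline data
wrong = list(history)
wrong[0] += 10.0                  # a single error of 10 in one value

clean_forecast = recursive_forecast(history, 24, a, b)
wrong_forecast = recursive_forecast(wrong, 24, a, b)
diff = [w - c for w, c in zip(wrong_forecast, clean_forecast)]
print(diff[0], diff[1], diff[12])
```

The single error shows up again 12 steps later, multiplied by 1.07, and then once more 12 steps after that, multiplied by 1.07 again, while the forecasts in between are unaffected. That is the same pattern as one dip producing a second dip a year on.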
Skip to 9 minutes and 30 seconds As always in data mining, you need to be wary of evaluation based on the training data, and you can hold data out using the Evaluation tab. Finally, we’re evaluating time series using repeated one-step-ahead predictions, which means that errors propagate.
Using the time series forecasting package
Dealing manually with time series is a pain, as we learned in the last lesson. Weka’s time series forecasting package automatically produces lagged variables, plus many others – perhaps too many! It transforms the data by adding a large number of attributes, which, unfortunately, invites overfitting. This is indicated by a large discrepancy between error on the training set and error on independent test data. You can configure Weka to reduce the number of added attributes.