10.9

In this lesson, we’re going to start talking seriously about time series forecasting. We’re going to look at linear regression with lags. We’re not going to use the time series forecasting package yet; we’ll start that in the next lesson. We’re going to load a time series data set here. We’re going to go to the Explorer. I’m going to load airline. This is where my Weka datasets are. I don’t know where yours are. I’m going to load airline.arff. Here it is. I’m going to just have a look at this data with the edit button. You can see that there’s a passenger_numbers attribute and then a Date attribute that goes from the first of January 1949 through to the first of December 1960.

58.2

So this is ancient airline passenger data. We’re going to go to Classify here, and we’re going to predict with linear regression in the functions category. This is important. We’re going to predict passenger_numbers. It’s the first attribute, so we need to set it here from the default, because Weka by default predicts the last attribute. I’m going to just click Start. We’re going to be looking at the root-mean-squared error here. 46.6 is what we get. We could look at the classifier errors. Now, this is a linear regression, so we’re expecting a straight line here. That’s what linear regression predicts.

100.2

On the y-axis I’m going to put the predicted passenger numbers; on the x-axis I’m going to put the date, and there we have it. This is the predicted line. The size of these crosses incidentally indicates the size of the error at that point, but for our purposes here, it’s a linear regression. Not really very interesting. One thing that’s a little bit surprising is the model is 0 times date plus this constant and that would be a horizontal line if it was really true. There’s something a little bit funny about this. What is funny about it is the date. If I go back and look here, the date attribute has got values ranging from these numbers here. Is that 662 billion?

152.7

– 662 billion here. And that’s because these dates are measured in milliseconds since January 1, 1970. So I’m going to convert them into months since the beginning of the dataset. I’m going to do that with a filter. There’s different ways of doing this, but I’m going to use the AddExpression filter, and I’m going to make an expression that takes the second attribute, the date attribute, that’s a2. And I’m going to divide that by – that’s in milliseconds. I’m going to make it seconds, and then I’m going to make it minutes, and then I’m going to make it hours, and then I’m going to make it days. Then I’m going to make it years. 365¼ days in a year.

203.1

I’m going to add 21 to get from 1949 to 1970. I’m going to make this in months. It took me a little bit of a while to figure this out. I hope it’s going to work. I’m going to call that attribute NewDate. Let’s see what happens here. I’m going to apply the filter, and now I’ve got NewDate, which goes from round about 0 to about 143. Now, there’s a little issue here with leap years, right? I’m using this figure of 365.25 days in a year, which is pretty accurate on average, but I should really take into account exactly which years are leap years and so on, so there’s a bit of inexactness going on here. But never mind.
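The arithmetic in that expression can be checked outside Weka. Here is a minimal Python sketch of the same steps, not the AddExpression filter itself. The names months_since_start and to_ms are made up for illustration, and it assumes the dates are stored as milliseconds since January 1, 1970 in UTC.

```python
from datetime import datetime, timezone

MS_PER_DAY = 1000 * 60 * 60 * 24
DAYS_PER_YEAR = 365.25  # average year length; ignores which years are leap years


def months_since_start(ms_since_1970):
    """Convert milliseconds since 1970 into approximate months since January 1949."""
    years_since_1970 = ms_since_1970 / MS_PER_DAY / DAYS_PER_YEAR
    years_since_1949 = years_since_1970 + 21  # 1970 - 1949 = 21
    return years_since_1949 * 12


def to_ms(year, month, day):
    """Hypothetical helper: milliseconds since 1970-01-01 UTC for a calendar date."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return moment.timestamp() * 1000


first = months_since_start(to_ms(1949, 1, 1))
last = months_since_start(to_ms(1960, 12, 1))
print(to_ms(1949, 1, 1))  # a large negative number, because 1949 is before 1970
print(round(first, 2), round(last, 2))
```

The first date comes out very close to 0 and the last very close to 143, but not exactly, which is the leap-year inexactness mentioned above.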

248.3

It’s just a bit approximate. I’m going to delete the Date attribute, remove the Date attribute. I’m going to look at the model again. I’m going to remember every time – this is a bit of a nuisance – every time I’ve got to remember to predict passenger_numbers. And if I run that, then we’re getting this model 2.66 times the NewDate plus 90. It’s the same model as before, but we’ve kind of scaled NewDate, so now this coefficient, which used to be rounded down to 0, is something more sensible. OK. So far so good, and so far not very interesting. Here is the regression line, and you can see the data. The data’s kind of cyclic when you look at it.
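To see why the earlier coefficient showed up as 0, you can rescale the new one back to milliseconds. This is a rough Python sketch. It assumes the display shows four decimal places and uses the same average month length as the conversion above; the variable names are made up for illustration.

```python
# Average month length in milliseconds, using 365.25 days per year.
MS_PER_MONTH = 365.25 / 12 * 24 * 60 * 60 * 1000

slope_per_month = 2.66  # coefficient on NewDate reported in the lesson
slope_per_ms = slope_per_month / MS_PER_MONTH  # same slope, per millisecond

print(f"{slope_per_ms:.4f}")  # what a four-decimal display would show
print(f"{slope_per_ms:.3e}")  # the same number in scientific notation
```

The slope per millisecond is around one billionth, so it vanishes when rounded, even though the line itself is unchanged.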

288.7

Passenger numbers, it depends on the month, you know, and yet the regression line is just a straight linear prediction. Not so interesting. Let’s do something a little bit more interesting. I’m going to copy the passenger_numbers attribute. We’re going to add a delayed version of passenger_numbers. I’m going to use the Copy filter to create a new attribute. I’m going to copy the first attribute and apply that. And here we’ve got Copy of passenger_numbers. I’m going to take this attribute and subtract 12, I’m going to lag it. I’m going to delay it by 12 months, so it’s going to contain last year’s value for that month. I’m going to do that with a TimeSeriesTranslate. I’m going to configure that.

333

I’m going to translate the third attribute. I’m going to translate it by 12 months, subtract 12 months from that. I think that’s OK. And then I need to actually – this particular filter doesn’t work on the class, so I’m going to set the class back to passenger_numbers, and then I’m going to run it and see what happens here. If I go to Edit, now I can see this is my new attribute, and you can see that that 112 is this 112 here. In fact, this is a delayed version of this attribute. This gives for this month, month number 13, this gives the figure for the year before and these are unknown values. Terrific! That’s what I wanted to do.
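The effect of the lag can be sketched in a few lines of Python. This is an illustration of the idea, not the TimeSeriesTranslate filter. The lag function and the 24 monthly values are made up, and None stands in for Weka’s unknown values.

```python
def lag(values, steps):
    """Delay a series by `steps` positions, padding the start with None."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    padded = [None] * steps + list(values)
    return padded[:len(values)]  # keep the original length


# Made-up monthly series covering two years (not the airline data).
series = list(range(101, 125))
lagged = lag(series, 12)

# Print month number, this month's value, and the value 12 months earlier.
for month, (current, year_before) in enumerate(zip(series, lagged), start=1):
    print(month, current, year_before)
```

The first 12 rows have no year-before value, and row 13 carries the value from row 1, which is the same pattern as in the Edit window.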

391.5

Then I’m going to go back and predict this with linear regression. I need to remember to predict passenger_numbers. There we go, and now I get a different model and a better root-mean-squared error, 31.7. This is a model that uses the date and then a little bit of the 12-month-before copy. Now actually, this is not a very good model. It’s a little bit crazy, and the reason it’s a little bit crazy is because of those missing values. We’ve got missing values at the beginning of the dataset, and we’re going to get much better results if we delete those instances with missing values. I’m going to do that with a filter.

434.3

I’m going to do that with an instance filter called RemoveRange and I’m going to remove instances from 1–12. And if I apply that, then now if I look at my data, I don’t have missing values. This starts out with the 112 data which is 12 months before, and this starts out on the 13th month of the original data, which is what I want. So I’m then going to go now and classify that with linear regression. Don’t forget to predict passenger_numbers. There we go. And now I get a much smaller root-mean-squared error of 16, and I’m getting quite a sensible model. This says passenger numbers increase a little bit.
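Here is a small Python sketch of the same idea: drop the rows that have no lagged value, then compare a straight-line fit on the date with a fit on the 12-month lag. It is a simplification under stated assumptions. The series is synthetic (a trend plus an annual sine wave, not the airline data), each model uses a single predictor, the error is measured on the training rows, and fit_line and rmse are illustrative helpers rather than Weka’s LinearRegression, so the numbers are not comparable with the ones in the lesson.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root-mean-squared error of the fitted line on the given rows."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic monthly series: upward trend plus a 12-month cycle (made up).
months = list(range(144))
series = [100 + 2 * t + 30 * math.sin(2 * math.pi * t / 12) for t in months]

# Lag by 12 months, then drop the first 12 rows, which have no lagged value.
lag_12 = [None] * 12 + series[:-12]
rows = [(t, lagged, y) for t, lagged, y in zip(months, lag_12, series)
        if lagged is not None]
ts = [row[0] for row in rows]
lags = [row[1] for row in rows]
ys = [row[2] for row in rows]

trend_slope, trend_intercept = fit_line(ts, ys)
lag_slope, lag_intercept = fit_line(lags, ys)
rmse_trend = rmse(ts, ys, trend_slope, trend_intercept)
rmse_lag = rmse(lags, ys, lag_slope, lag_intercept)

print("trend only:", rmse_trend)
print("lag 12 only:", rmse_lag)
```

On this artificial series the lagged model fits almost perfectly because each value is exactly last year’s value plus a constant. Real data will not be that clean, but the direction of the improvement is the same as in the lesson.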

487.6

Take the passenger_numbers value of the year before, add 7% and then just a little offset here. I could try and visualize this model. I’ll just show you. If I do it this way, it’s not really very informative, because this is predicted passenger_numbers on the y-axis against Date on the x-axis. And you can’t see any pattern here, there is actually a cyclic pattern, but it’s completely obscured by the size of these x’s, which are not very interesting for our purposes at the moment. In order to get a better look at that I’m going to use the AddClassification filter. I’m going to add a classification. It’s a supervised attribute filter, AddClassification. I’m going to add the classification created by linear regression.

542.5

Output the classification, and I need here to say what we’re going to be predicting, which is passenger_numbers. I’m going to apply this filter, and now I get a new attribute, classification, which I can then visualize. So I’m going to look at classification against NewDate. And this shows you this cyclic prediction that we’re getting here. So adding this delayed attribute gives us a cyclic prediction. Let’s go back to the slide and have a look at this. We have a graph here, which shows the prediction with lag_12. There is no prediction for the first 12 instances, I deleted those.

580.7

So these are the predictions, this cyclic wave, and you can see this fits pretty well the actual values of passenger_numbers, which are the black dots here. It’s a much better fit, this cyclic prediction, than the original rather boring red linear prediction, and these are the two equations of those lines. So adding this simple lag variable allows us to break away from the linear paradigm, even though we’re using linear regression, and get nonlinear predictions. I think that’s pretty exciting, actually. I’ve done a lot of things rather quickly here, and you’re going to be redoing them yourself with a different classifier. I’ve got a list of some of the pitfalls that I’ve run into, and you might want to refer back to this.

630.9

Just to summarize. We’ve learned that linear regression can be used for time series forecasting and that lagged variables yield much more complex models than straight line ones. In this case, we chose the appropriate lag by eyeballing the data and noticing that it varied in an annual cycle. And we can include more than one lagged variable with different lags, and we could think about seasonal effects, you know. We could think about yearly, quarterly, daily, hourly data. Of course, doing all of this manually is a pain, adding these variables. So the time series forecasting package helps you do this in a much easier, quicker, more convenient way.