Lag creation and overlay data

There are many parameters and options for deriving time-dependent attributes, as Ian Witten explains.
Hi! Welcome back to New Zealand for some more Advanced Data Mining with Weka. This is the last lesson on the time series forecasting facilities. We’re going to look at some features that we haven’t looked at so far. First of all, the timestamp. Any attribute of type “date” is used as the timestamp by default, but you can change this under the basic configuration parameters. I’ve loaded the airline data once again, and if I go to the Forecast panel, it’s going to use Date as the timestamp, but I could change that to another attribute if I wanted. Also, the periodicity. We’ve been detecting the periodicity automatically.
This data is monthly; I think there are 143 monthly instances, but we can specify something else if we prefer. We could actually specify, let’s say, weekly. This is not necessarily a very sensible thing to do, but what would happen if we specified weekly? First of all, it affects the lagged variables, the variables that are generated. Now we’ve got a large number of lagged variables. Actually, with weekly data, we’ve got 52 lagged variables generated. 52 weeks in a year, a whole year’s worth. As well as that, Weka inserts interpolated instances for the missing values.
So if we’re trying to do this weekly and the data was only monthly, then there’s a whole lot of weeks which need to be interpolated, and these are them. These weeks, and there’s a long list of weeks here, have been interpolated into the data. Then, of course, in order to get values for the training instances, they’re all missing values, so Weka interpolates the values for all of the attributes. These values have been interpolated. In this case, the airline data monthly is 144 instances. Weekly, we’ve got 573 instances here, and if I were to specify hourly we’d have 104,000 instances.
The periodicity, as I said, determines what attributes are created: different numbers of lagged variables depending on whether it’s monthly, weekly, daily or hourly. If it’s daily, then we include a Day-of-the-week attribute and Weekend attributes. If it’s hourly, we include a Morning or Afternoon attribute. Of course, you can override all of these attributes using the Advanced Configuration panel.
I bet you’re tired of the airline data now. I’m going to open another dataset, the Apple stocks data. We need to find this data. When you install a package in Weka, it installs the package information in your home folder, so I’m going to go to my home folder, wekaFiles / Packages / timeseriesForecasting package, and here I’ve got some sample data, time series forecasting data.
I’m going to open appleStocks. Now, this data contains more than one thing to predict. It’s actually got the daily high, low, opening, and closing values for Apple stocks in the year 2011, plus the sales volume. I’m going to go here – I need to tell it what to forecast. I’m going to forecast Close. Let me just see what happens. It’s generated lags here. It’s generated 12 lags. I think I want to tell it this data is weekly actually. I don’t think it’s figured that out. The Periodicity is weekly. No, I’m sorry, the periodicity is daily for this data. Let me do that, and now I’ve got 7 lagged variables, so a whole week’s worth of lagged variables.
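To make the idea of lagged variables concrete, here is a minimal Python sketch. It is not Weka's implementation: the function name `make_lagged_rows`, the `lag-k` labels, and the toy numbers are all assumptions made for illustration. The only figures taken from the lesson are the lag counts it reports (52 for weekly, 7 for daily).

```python
# Illustrative sketch only; this is not Weka's code.
# Lag counts per periodicity as reported in the lesson.
LAGS_SEEN_IN_LESSON = {"weekly": 52, "daily": 7}


def make_lagged_rows(values, max_lag):
    """Return one row per time step: the current value plus lag-1..lag-max_lag.

    A lag that reaches back before the start of the series is unknown (None).
    """
    rows = []
    for t, current in enumerate(values):
        row = {"value": current}
        for k in range(1, max_lag + 1):
            row[f"lag-{k}"] = values[t - k] if t - k >= 0 else None
        rows.append(row)
    return rows


closing = [10.0, 12.0, 15.0, 11.0, 14.0]  # made-up numbers, not the Apple data
rows = make_lagged_rows(closing, max_lag=2)
for row in rows:
    print(row)
```

The first `max_lag` rows contain unknown lag values, which is why the leading instances of a series are a special case.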
There were some missing values, and instances were inserted, a few instances. Those were mostly weekends, actually, those instances. That’s what the skip list is for. I don’t really want to include weekends, because the stock market is closed. If I type “weekend” here, and do it again, then I will have reduced the number of interpolated instances. There are still a few of them – 5 of them – and those correspond to holidays when the stock exchange was closed. I can actually specify a list of dates here, as well as the word “weekend”. Let’s specify a list of dates in the format that’s on the slide. Let me just try that. Now I’m hoping for no interpolated instances. Yep, there’s none there.
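The effect of the skip list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lesson does not say which interpolation method Weka uses, so linear interpolation between the nearest observed values is assumed here, and `fill_daily_gaps`, `skip_weekends`, and `skip_dates` are hypothetical names, not Weka options. The prices are made up; 17 January 2011 (a Monday) stands in for a market holiday.

```python
from datetime import date, timedelta


def fill_daily_gaps(observations, skip_weekends=False, skip_dates=()):
    """Return (day, value, interpolated) for every day on the daily time axis.

    observations maps date -> value. Skipped days (weekends or listed dates)
    are left off the time axis, so no instance is created for them. Any other
    missing day gets a linearly interpolated value (an assumption).
    """
    skip_dates = set(skip_dates)
    days = sorted(observations)
    axis = []
    day = days[0]
    while day <= days[-1]:
        skipped = (skip_weekends and day.weekday() >= 5) or day in skip_dates
        if day in observations or not skipped:
            axis.append(day)
        day += timedelta(days=1)

    known = [i for i, d in enumerate(axis) if d in observations]
    filled = []
    for i, d in enumerate(axis):
        if d in observations:
            filled.append((d, observations[d], False))
            continue
        left = max(j for j in known if j < i)
        right = min(j for j in known if j > i)
        lo, hi = observations[axis[left]], observations[axis[right]]
        value = lo + (hi - lo) * (i - left) / (right - left)
        filled.append((d, value, True))
    return filled


def count_interpolated(filled):
    return sum(1 for _, _, interpolated in filled if interpolated)


# Friday and the following Tuesday; the weekend and the Monday are missing.
prices = {date(2011, 1, 14): 100.0, date(2011, 1, 18): 104.0}

no_skip = fill_daily_gaps(prices)
weekends = fill_daily_gaps(prices, skip_weekends=True)
both = fill_daily_gaps(prices, skip_weekends=True,
                       skip_dates=[date(2011, 1, 17)])
print(count_interpolated(no_skip), count_interpolated(weekends),
      count_interpolated(both))
```

With no skip list, three instances are interpolated; skipping weekends leaves one (the Monday); adding that date to the skip list leaves none, which mirrors the sequence in the demonstration.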
I think what I’d like to do is to specify under the lags, I want to use maybe 2 weeks’ worth – that would be 10 working days. Let’s up that number to 10. OK, that’s the data prepared. Now, let’s do some evaluation on this data. First of all, I’m going to remove the leading instances, which are the ones with unknown lag values, which is a good idea. And then we’re going to hold out some of the instances. Let’s go and Remove leading instances, and then go to Evaluation. We’re going to evaluate on training and test, and I’m going to leave this at 30%. We’re going to use 30% of the dataset for testing. OK.
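The evaluation set-up just described (drop the leading instances, hold out 30%, measure mean absolute error) can be sketched as follows. Several things here are assumptions rather than Weka's behaviour: the held-out 30% is taken from the end of the series, and a naive "predict the previous value" forecast stands in for the learned model. All names and numbers are illustrative.

```python
def lagged_rows(values, max_lag):
    """Pair each value with its lags (lag-1 first); unknown lags are None."""
    rows = []
    for t in range(len(values)):
        lags = [values[t - k] if t - k >= 0 else None
                for k in range(1, max_lag + 1)]
        rows.append((values[t], lags))
    return rows


def remove_leading(rows):
    """Drop the instances whose lag values are not all known."""
    return [row for row in rows if None not in row[1]]


def holdout_split(rows, test_fraction=0.3):
    """Keep the final test_fraction of the rows for testing (assumed)."""
    n_test = round(len(rows) * test_fraction)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


series = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20]  # made-up values
rows = remove_leading(lagged_rows(series, max_lag=2))
train, test = holdout_split(rows, test_fraction=0.3)

# Naive stand-in model: forecast each value as its lag-1 value.
actual = [value for value, _ in test]
predicted = [lags[0] for _, lags in test]
mae = mean_absolute_error(actual, predicted)
print(len(train), len(test), mae)
```

Removing the leading instances costs exactly `max_lag` rows, so with 10 lags the first 10 instances would be dropped before the split.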
I’m going to look here at the mean absolute error. We’ve got these numbers here, 7.7 on the slide. You can see that, since we’ve removed the leading instances, we get slightly better results than if we hadn’t done that. We can predict more than one target with this data, and if we do that we’re going to get lagged versions of each of the targets, and that might help. Let’s go and predict Close and High. We’re going to get lagged values of both of these variables, and it’s possible that we might get better predictions. Well, actually, we don’t.
These are the values we get: 8 on the test data and 3.4 on the training data, slightly worse than before. If we were to select all of the variables as targets, we’d get even worse results. We get quite bad overfitting here, with a much smaller training error, 2.5, than the test error, 9.6.
Now, another thing that you need to know about is overlay data. Overlay data is additional data that might be relevant to the prediction. It’s not to be forecast. It can’t be predicted, and it’s available in the future. Overlay data is available in the future.
We don’t have overlay data for the Apple stocks problem, but I’m going to cheat by using one of the existing attributes as though it were overlay data, as though we knew it even in the future. Let me just predict Close. I’m going to go and specify some overlay data. We’re going to use Open as overlay data, and I can then see what happens. I got a complaint here from Weka. It’s unable to generate a future forecast because there’re no future values available for the overlay data. Well, let’s just stop it trying to generate future forecasts. If I just take out these output future predictions and do it again, then I won’t get that error message.
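The reason for that complaint can be illustrated with a small sketch. A future forecast needs an overlay value for every future step, because the overlay value is an input to the model rather than something it predicts. Everything below is hypothetical: the function names, the stand-in model `toy_model`, and the numbers are assumptions for illustration, not Weka's API or the Apple data.

```python
def one_step_features(target_history, max_lag, overlay_value):
    """Inputs for one forecast step: recent lags (lag-1 first) plus overlay."""
    lags = list(reversed(target_history[-max_lag:]))
    return {"lags": lags, "overlay": overlay_value}


def forecast(target_history, future_overlay, steps, max_lag, predict):
    """Forecast `steps` values ahead, feeding predictions back in as lags."""
    if len(future_overlay) < steps:
        raise ValueError(
            "cannot forecast: no future overlay values for every step")
    history = list(target_history)
    forecasts = []
    for step in range(steps):
        features = one_step_features(history, max_lag, future_overlay[step])
        value = predict(features)
        forecasts.append(value)
        history.append(value)  # the prediction becomes lag-1 for the next step
    return forecasts


def toy_model(features):
    # Hypothetical stand-in for a trained learner: average of lag-1 and overlay.
    return 0.5 * features["lags"][0] + 0.5 * features["overlay"]


close_history = [100.0, 102.0]  # made-up closing prices
future_open = [104.0, 106.0]    # overlay values assumed known in advance

print(forecast(close_history, future_open, steps=2, max_lag=2,
               predict=toy_model))

try:
    forecast(close_history, [], steps=2, max_lag=2, predict=toy_model)
except ValueError as error:
    print(error)
```

Treating Open as overlay data is a cheat for exactly this reason: in practice the opening price of a future day is not known ahead of time, whereas something like a weather forecast is.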
Back on the slide, we can see that the overlay data has improved things quite a bit. By including Open, the test error has got down to 5.9, and if we include High as well, it gets down even further. And although I won’t do this for you, if I were to change the base learner to SMO, a better learner, I would get even better results, down to a very small error on the test data, 2.4. In fact, I would get these graphs if I looked at the predictions. Again, to save time, I won’t do that, but you can see the prediction on the training data, the prediction on the test data. We’re getting very good predictions using this overlay data.
Well, we’ve covered quite a few options in the time series forecasting package. When you’re starting with a new dataset, you should start by getting the time axis right. Don’t forget that missing instances are automatically interpolated, and you can select the periodicity yourself if you like, and there’s a skip facility to ensure that time increases linearly. Then you need to select your target, what you’re going to predict (or targets). Overlay data can help a lot, obviously. If you can get hold of it, that’s always wonderful.

There are many parameters and options for deriving time-dependent attributes, such as which attribute holds the timestamp and what is the periodicity of the data. Periodicity affects the lagged variables that are generated. Weka interpolates instances for missing dates, which you can suppress manually if you wish. You can predict several variables, and Weka generates lagged values of each one. You can incorporate “overlay data” – additional data that might be relevant to the prediction that is available for the future (e.g., weather forecasts).

This article is from the free online

Advanced Data Mining with Weka

Created by
FutureLearn - Learning For Life
