Hello! My name is Bernhard Pfahringer. I’m with the Computer Science Department here at the University of Waikato, the home of Weka and MOA. Today, I’m going to tell you a little bit more about MOA. Specifically, we’re going to talk about change. Change is everywhere in data streams. It’s the distinguishing feature of data stream mining. To think about it like you want to predict electricity consumption, there is a big peak in the morning and there’s another peak in the late evening. There’s some level of consumption during the day, and there might be a much lower level of consumption during nighttime. There’s different levels in summer to winter and so on and so forth.
If we look at Tweet streams, for instance, right now the Zika virus is trending. A couple of months ago, at least in New Zealand, it was Rugby World Cup. Again, we see change happening all the time. New topics are coming in, becoming more interested, peak, and then they slowly die away, making place for new, other, topics. How do we deal with change? Some algorithms can just basically, because of the way they work, implicitly adapt to change, but a better way is to do it explicitly. We need a change detector. There are a couple of algorithms that have been described in the literature for that purpose, one of them is ADWIN, which is short for Adaptive Windowing.
It’s used in a number of algorithms inside MOA. It keeps an adaptive sliding window, which tries to estimate the mean of a numeric variable that you’re monitoring. Whenever things are stable, the window grows, but once there seems to be some change, we cut the window into two parts, the old part and the new part. The old part is being discarded. The new part, hopefully, gives us the new, correct mean. Now, this is not just a heuristic algorithm. It comes with guarantees, theoretical guarantees, on the rates of false positives and the rates of false negatives and on the size related to the rate of change.
I don’t have to worry about the size and it being too big because the way it’s stored internally it uses an exponential compression scheme, so that’s fine, as well. Now, ADWIN is being used in a number of algorithms, as I said. One of them is the Hoeffding Adaptive Tree, which, as the name implies, is a variant of the Hoeffding Tree that Albert has explained to you previously. Now, in the Hoeffding Adaptive Tree, you do not just grow the tree on the most promising tests, you also look at the second best ones. So you grow alternative branches, as well, and you monitor the performance of those alternatives over time using ADWIN.
Whenever ADWIN detects that these alternative sections now have become better or are out-performing the official main structure, you replace the main with the alternative. Of course, at the same time, you start growing new alternatives, as well, to be prepared for further change in the future. OK, so the Hoeffding Adaptive Tree was a single classifier. Of course, we all know that, when it comes to utmost predictive accuracy, you want to have ensembles. A very simple ensemble method is bagging. Now, bagging is easily made into an online algorithm, because, just think about it. If you bag, say, an incremental algorithm like the Hoeffding Tree, you have an incremental ensemble immediately. No problem there.
Of course, there’s a little issue, and that is the way bagging distributes examples across the different classifiers. In the batch-learning version, you do bootstrap sampling with replacement. Classifier 1 in our simple example here – which has a dataset which comprises only four examples, A, B, C, and D – might see a sample that is B, A, C, B. Classifier 2 might get D, B, A, D. Classifier 4, a very extreme one, might actually get three copies of B and maybe only one C and never see A or D. How do we solve this issue with getting a procedure that’s basically batch oriented, like bootstrap sampling, and turn it into something incremental? Well, there’re two simple steps to look at here.
First of all, we can actually look at our sub-samples by sorting them. Thus, we restore the original order of the data. A, B, C, D was the original order. Classifier 1 would see A, B, B, C. Classifier 2 would see A, B, D, D. The extreme one would see B, B, B, and C. The next step is to actually look at that as every classifier sees all the examples in order – A, B, C, and D – but with a weight attached. For the first classifier, the weight would be 1 for A, 2 for B, 1 for C, and 0 for D. The extreme classifier would have 0 for A, 3 for B, 1 for C, 0 for D.
Now, that looks promising. In 2001, there was a very interesting paper called “Online Bagging and Boosting” authored by Oza and Russell, where they had this following genius insight. Their insight was that we can actually approximate the binomial distribution with a Poisson distribution where the mean is set to 1. That gives you roughly the same behavior that the binomial distribution in the batch-based bagging has. Roughly about 37% of the time you get a 0 weight. 37% of the time you get a weight of 1. 18% of the time you get a weight of 2, and higher weights are more and more unlikely, but still possible.
Now, using that in online bagging makes it very easy to, whenever an example comes in, you just draw the weight from the Poisson distribution for the first classifier, another weight for the second classifier, and so for all the classifiers. Now, we have a fully incremental setup. In MOA, we have an algorithm called ADWIN Bagging that couples that with ADWIN, as well, for explicit change detection again. We use Poisson with a mean of 1 to weight every example differently for every classifier, but we also detect when things deteriorate. We do that by monitoring the overall performance of the ensemble and whenever things fall below a threshold, we identify – the worst classifier is being removed and replaced by a new one.
Now, in a number of experiments, what we also found is that we can improve on that by playing around with the parameters. For the Poisson distribution, if we use a mean that is larger than 1, like 2 or maybe even up to 6, what we find is that across a large range of benchmark datasets in stream mining, we get better results. We’ve called it leveraging bagging. Again, it’s coupled with ADWIN for explicit change detection. When things get bad, you remove the worst classifier and replace it by a new one. In summary, when looking at change in classifiers and explicit change detection.
Hoeffding Adaptive Tree is a single classifier, and two variants of bagging, ADWIN bagging and leveraging bagging, as an example of ensembles that deal explicitly with change. I invite you to do your own experiments with MOA classifiers and change detection. For that purpose in MOA, you will find that data stream generators also come with a component that also simulates change over time. For instance, in the RandomRBF generator, you get drift by basically having the centers of the kernels move around in space.