Skip to 0 minutes and 11 seconds Hello! My name is Albert Bifet. I’m a member of the Weka machine learning group and I work at Telecom ParisTech in Paris, France. We’re going to talk about data stream mining in Weka and MOA. Data stream mining is a way of doing real-time analytics, so this is going to be very, very important for big data and the Internet of Things. As you know, in Weka usually what we do is that we store all the dataset in memory and then what we do is that we build our classifier using this dataset that is stored in memory. This is what is called a batch setting.
Skip to 0 minutes and 41 seconds In the incremental setting, what we do is that we update our classifier one instance by one instance. There’s a huge difference between the two settings. Let’s look at the incremental setting with more detail. So the idea is that, what we do is we process an example every time, so only one example, not a dataset of examples. We use a limited amount of memory, because we don’t store all the dataset in memory. We work in a limited amount of time, so we need to be very, very fast. And we are ready to predict at any point. That’s the main difference.
Skip to 1 minute and 14 seconds So in the batch setting what we do is that we process all the dataset and then we are ready to predict at the end of building the classifier. Weka has many different incremental methods. To know which ones are incremental, we need to look that they should implement the UpdateableClassifier interface. We can find many different methods, as NaiveBayes, NaiveBayesMultinomial, k-nearest neighbors, Stochastic Gradient Descent, and also some decision trees, such as the Hoeffding Tree. As you know, the standard decision tree is not incremental, so we need to have all the dataset in memory. The Hoeffding Tree is the first – is a state-of-the-art incremental decision tree learning.
Skip to 1 minute and 57 seconds The Hoeffding Tree was proposed by Pedro Domingos and his group around 2000, and the difference with the standard decision trees and in the standard decision tree what we do is that when we need to decide when we want to split or not, what we do is that we look at the data that we have memory and then we compute the statistics and then we decide if we split or not. But, in the incremental setting, we don’t store the dataset in memory, so then, what we need to do if we need to decide if we want to split or not, we need to wait for new instances to arrive.
Skip to 2 minutes and 31 seconds So how many instances do we need to decide if we need to split or not? This is something that is computed using the Hoeffding bound, and this is why the Hoeffding tree has this name.Another interesting thing of the Hoeffding Tree is that it has theoretical guarantees that if the number of instances that we are using to build the model is large enough, the decision tree is going to be similar to a decision tree built using the standard decision tree method. Let’s see an example in Weka. We’re going to generate a dataset and then we’re going to evaluate using a Hoeffding Tree. Let’s start generating the data. We are going to use the Random radial basis function generator.
Skip to 3 minutes and 20 seconds This is a generator that generates data by creating for each class a random set of centers. In this case, we’re going to use 50 centroids, 10 attributes, 2 classes. We’re going to create 1000 instances. Let’s generate the data. Now we have the data, 1000 instances. Now, let’s classify it using the Hoeffding Tree. Let’s choose the classifier. Hoeffding Tree. We’re going to run a 10-fold cross-validation. We see this is very fast, and we get an accuracy of 71%.Now, let’s do it again but generating 100,000 instances. We change this parameter. We generate the data. Then again we run the 10-fold cross-validation. We see that it takes more time, but still it’s really fast.
Skip to 4 minutes and 19 seconds At the end, what we get is the accuracy goes to 89%. With 100,000 instances, we get 89% and with 1000 instances, we were getting only 71%. Increasing the number of instances that the Hoeffding Tree processes that allows us to have a much better accuracy. In this lesson, we have seen the two settings of Weka, the batch setting and the incremental setting. In the batch setting what we did is that we stored all of the dataset in memory. In the incremental setting, what we did is that we built the classifier one instance by one instance. The nice thing about the incremental setting is that we can be much more efficient and that we use less memory and we are much faster.
Incremental classifiers in Weka
Albert Bifet introduces data stream mining. It requires incremental operation rather than the batch mode used so far. Weka includes many different incremental methods. Updating decision trees presents an interesting challenge that is solved using the “Hoeffding bound” to estimate how many instances should be examined before deciding whether to split a node. Incremental methods like this typically require more training data to reach a given level of accuracy than batch-mode ones, but they can be very fast, and use much less memory.
© University of Waikato, New Zealand. CC Creative Commons Attribution 4.0 International License.