Skip to 0 minutes and 11 seconds Hello! In the last lesson, we looked at using a classifier in Weka, J48.
Skip to 0 minutes and 16 seconds In this lesson, we’re going to look at another of Weka’s principal features: filters. One of the main messages of this course is that it’s really important when you’re data mining to get close to your data, and to think about preprocessing it, or filtering it in some way, before applying a classifier. I’m going to start by using a filter to remove an attribute from the weather data. Let me start up the Weka Explorer and open the weather data.
Skip to 0 minutes and 46 seconds I’m going to remove the “humidity” attribute: that’s attribute number 3. I can look at filters; just like we chose classifiers using this Choose button on the Classify panel, we choose filters by using the Choose button here. There are a lot of different filters. AllFilter and MultiFilter are ways of combining filters. We have supervised and unsupervised filters. Supervised filters are ones that use the class value for their operation. They aren’t as common as unsupervised filters, which don’t use the class value. There are attribute filters and instance filters. We want to remove an attribute, so we’re looking for an attribute filter. There are so many filters in Weka that you just have to learn to look around and find what you want.
Skip to 1 minute and 35 seconds I’m going to look for removing an attribute. Here we go, “Remove”. Now, before, when we configured the J48 classifier, we clicked here. I’m going to click here, and we can configure the filter. This is “A filter that removes a range of attributes from the dataset”. I can specify a range of attributes here. I just want to remove one. I think it was attribute number 3 we were going to remove. I can invert the selection and remove all the other attributes and leave 3, but I’m just going to leave it like that. Click OK, and watch “humidity” go when we apply the filter. Nothing happens until you apply the filter.
Skip to 2 minutes and 18 seconds I’ve just applied it, and here we are, the “humidity” attribute has been removed. Luckily I can undo the effect of that and put it back by pressing the “Undo” button. That’s how to remove an attribute.
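To make the effect of the Remove filter concrete, here is a minimal plain-Python sketch of the same operation on the 14-instance nominal weather data. It is not Weka's API: `remove_attributes` and its `invert` parameter are hypothetical names chosen to mirror the attribute range and "invert selection" options described above.

```python
# Plain-Python sketch of an attribute-removal filter; this is not Weka's API.
# The table is the 14-instance nominal weather data used in the lesson.
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
ROWS = [line.split(",") for line in DATA.splitlines()]


def remove_attributes(names, rows, indices, invert=False):
    """Drop the attributes at the given 1-based indices.

    With invert=True, keep only those attributes instead.
    Returns new lists, so the original data stays available (like Undo).
    """
    selected = {i - 1 for i in indices}
    keep = [i for i in range(len(names)) if (i in selected) == invert]
    return [names[i] for i in keep], [[row[i] for i in keep] for row in rows]


# Remove attribute number 3 ("humidity"), as in the lesson.
names, rows = remove_attributes(ATTRIBUTES, ROWS, [3])
print(names)  # ['outlook', 'temperature', 'windy', 'play']
```

Calling `remove_attributes(ATTRIBUTES, ROWS, [3], invert=True)` would instead keep only "humidity" and drop everything else, which corresponds to inverting the selection.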
Skip to 2 minutes and 29 seconds Actually, the bad news is there is a much easier way to remove an attribute: you don’t need to use a filter at all. If you just want to remove an attribute, you can select it here and click the “Remove” button at the bottom. It does the same job. Sorry about that. But filters are really useful, and can do much more complex things than that. Let’s, for example, remove not an attribute but all instances where humidity has the value “high”; that is, where attribute number 3 takes its first value. That’s going to remove 7 instances from the dataset. There are 14 instances altogether, so we’re going to be left with a reduced dataset of 7 instances.
Skip to 3 minutes and 10 seconds Let’s look for a filter to do that. We want to remove instances, so it’s going to be an instance filter. I just have to look down here and see if there is anything suitable. How about RemoveWithValues? – the RemoveWithValues filter. I can click that to configure it, and I can click “More” to see what it does. Here it says it “Filters instances according to the value of an attribute”, which is exactly what we want. We’re going to set the “attributeIndex”; we want the third attribute (humidity), and the first value. We can remove a number of different values; we’ll just remove the first value. Now we’ve configured that. Nothing happens until we apply the filter.
Skip to 4 minutes and 1 second Watch what happens when we apply it. We still have the “humidity” attribute there, but we have zero elements with high humidity. In fact, the dataset has been reduced to only 7 instances. Recall that when you do anything here, you can save the results. So we could save that reduced dataset if we wanted, but I don’t want to do that now. I’m going to undo this.
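The instance filter just applied can be sketched in plain Python as well. This is an illustration under assumptions, not Weka's implementation: `remove_with_values` is a hypothetical helper, and the value order "high", "normal" for humidity is taken from the lesson's statement that "high" is the first value.

```python
# Plain-Python sketch of an instance filter; this is not Weka's API.
# Columns: outlook, temperature, humidity, windy, play.
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
ROWS = [line.split(",") for line in DATA.splitlines()]

# Assumed declaration order of the humidity values ("high" is the first value).
HUMIDITY_VALUES = ["high", "normal"]


def remove_with_values(rows, attribute_index, value_indices, declared_values):
    """Drop every instance whose attribute takes one of the listed values.

    attribute_index and value_indices are 1-based, matching how the lesson
    refers to "attribute number 3" and "the first value".
    """
    unwanted = {declared_values[i - 1] for i in value_indices}
    column = attribute_index - 1
    return [row for row in rows if row[column] not in unwanted]


# Remove instances where attribute 3 (humidity) has its first value (high).
reduced = remove_with_values(ROWS, 3, [1], HUMIDITY_VALUES)
print(len(ROWS), "->", len(reduced))  # 14 -> 7
```

The humidity column is still present in `reduced`; only the rows with the value "high" are gone, which matches what the Preprocess panel shows after applying the filter.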
Skip to 4 minutes and 28 seconds We removed the instances where humidity is high. When we’re looking for filters, we have to think about whether we want a supervised or an unsupervised filter and whether we want an attribute filter or an instance filter, and then just use common sense to look down the list of filters to see which one we want. Sometimes when you filter data you get much better classification. Here’s a really simple example. I’m going to open the “glass” dataset that we saw before. Here’s the glass dataset. I’m going to use J48, which we did before. It’s a tree classifier.
Skip to 5 minutes and 6 seconds I’m going to start that, and I get an accuracy of 66.8%. Let’s remove Fe, that is, Iron. Remove this attribute, and we get a smaller dataset. Go and run J48 again. Now we get an accuracy of 67.3%. So we’ve improved the accuracy a little bit by removing that attribute. Sometimes the effect is pretty dramatic. Actually, in this dataset, I’m going to remove everything except the refractive index and Magnesium (Mg). I’m going to remove all of these attributes, and am left with a much smaller dataset with two attributes. Apply J48 again.
Skip to 6 minutes and 3 seconds Now I’ve got an even better result, 68.7% accuracy. I can visualize that tree, of course – remember? – by right-clicking here and visualizing the tree, and have a look and see what it means. It’s much easier to visualize trees when they are smaller. This is a good one to look at and consider what the structure of this decision is. That’s it for now. We’ve looked at filters in Weka; supervised versus unsupervised, attribute versus instance filters. To find the right filter you need to look. They can be very powerful, and judiciously removing attributes can both improve performance and increase comprehensibility. Bye for now!
Using a filter
Weka includes many filters that can be used before invoking a classifier to clean up the dataset, or alter it in some way. Filters help with data preparation. For example, you can easily remove an attribute. Or you can remove all instances that have a certain value for an attribute (e.g. instances for which humidity has the value high). Surprisingly, removing attributes sometimes leads to better classification, and also to simpler decision trees.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.