
The Attribute Selected Classifier

Experimenting with a dataset to select attributes and applying a classifier to the result risks cheating! Ian Witten explains what to do about it.
We’re going to open a dataset, glass.arff, apply J48, and see what we get. Then we’re going to use Wrapper attribute selection with J48, and we’re going to see what attributes that gives us. Then we’re going to use just those attributes and run classification with J48 again on the new dataset. Let’s go over to Weka and try it. I’ve got the glass data open here. I’m going to classify it with J48 with default parameters, and I get 67%. Then I’m going to go to Attribute selection. I’m going to use the Wrapper method, and within the Wrapper method I’m going to choose J48. I’m going to leave everything else at its default value and see what attributes I get.
And we get the attribute set {RI, Mg, Al, K, Ba}. So I’m going to go back. RI, Mg, Al, K, Ba; delete the rest; and now I’ve got just that set {RI, Mg, Al, K, Ba}.
I’m going to classify it again with J48, and I get better accuracy: 71%. (I’m just going to go back and undo that filter.) Back to the slide here. I got improved accuracy. If I did the same thing with IBk, I’d get 71% for IBk on the glass dataset and 78% if I used just the attributes selected by wrapper attribute selection using IBk.
The question is: is this cheating?
And the answer is: yes, it certainly is. The reason is that we’re using the entire dataset to decide on the attribute subset. We should really use only the training data to decide on all of those things and then test the final result on the test set. This is just what the AttributeSelectedClassifier does. Remember the FilteredClassifier we used for supervised discretization? Well, the AttributeSelectedClassifier is the analogous thing for attribute selection. It selects attributes based on the training data only, even within a cross-validation. Then it trains the classifier, again on the training data only. And then it evaluates the whole thing on the test data. I’m going to use that to wrap J48 and see what I get.
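To see why selecting attributes on the full dataset really is cheating, here is a stdlib-only Python sketch. The labels and attributes are pure random noise, so no honest scheme can do better than about 50%; yet selecting the single best-matching attribute on the whole dataset first makes cross-validation report a much higher figure. The one-attribute “selector” here is a deliberately trivial stand-in, not Weka’s wrapper algorithm, and all names and numbers are illustrative:

```python
import random

# Labels and attributes are pure noise, so honest accuracy should be ~50%.
# "Selection" is a trivial 1-attribute wrapper: pick the attribute (possibly
# complemented) that best matches the class labels.

random.seed(1)
n, m = 60, 500                                      # instances x binary attributes
y = [random.randint(0, 1) for _ in range(n)]
X = [[random.randint(0, 1) for _ in range(m)] for _ in range(n)]

def agreement(j, rows, labels):
    """Fraction of rows where attribute j equals the class label."""
    return sum(r[j] == c for r, c in zip(rows, labels)) / len(rows)

def cross_val_accuracy(select_on_full_data, k=10):
    correct = 0
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        train = [i for i in range(n) if i not in test]
        # Cheating: the selector sees every instance, including the test fold.
        pool = list(range(n)) if select_on_full_data else train
        Xp, yp = [X[i] for i in pool], [y[i] for i in pool]
        j = max(range(m), key=lambda a: abs(agreement(a, Xp, yp) - 0.5))
        flip = agreement(j, Xp, yp) < 0.5
        correct += sum((1 - X[i][j] if flip else X[i][j]) == y[i] for i in test)
    return correct / n

print("cheating:", cross_val_accuracy(True))    # well above 50% on pure noise
print("honest:  ", cross_val_accuracy(False))   # close to 50%, as it should be
```

The honest variant, which re-selects inside each training fold, is exactly the discipline the AttributeSelectedClassifier enforces.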
It’s a little bit complex to set up. I’m going to get the AttributeSelectedClassifier from meta. I’m going to use the J48 classifier. I’m going to use the wrapper subset evaluator, and within that I get to choose a classifier to wrap for attribute selection.
I can choose any classifier – I don’t have to choose J48 again – but I will: I’m going to use J48 both for attribute selection and for classification. Leave everything else at its default value, and run it again. It’s finished now, and I get an accuracy of 72%. Back on the slide. This is not cheating. Actually, it’s slightly surprising that I get a higher accuracy (72%) when I’m not cheating than the 71% I got when I was cheating. But, you know, we should really use the Experimenter to get reliable results; this is just the result of one run.
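The GUI configuration just described has a command-line equivalent. This is a sketch of the option layout, assuming weka.jar is on the classpath and that your Weka version uses the standard option names shown:

```shell
# AttributeSelectedClassifier wrapping J48 for both selection and classification.
# -E: attribute evaluator (WrapperSubsetEval; -B names the classifier to wrap)
# -S: search method; -W: the final classifier; -t: training file (10-fold CV by default)
java -cp weka.jar weka.classifiers.meta.AttributeSelectedClassifier \
  -t glass.arff \
  -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.trees.J48" \
  -S "weka.attributeSelection.BestFirst" \
  -W weka.classifiers.trees.J48
```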
Doing the same thing with IBk: I got a dramatic improvement in IBk by using the correct attributes, 78% – by cheating, that is. If I then use the AttributeSelectedClassifier, well, of course I can decide what I’m going to wrap. If I were using IBk, then I’d probably want to select attributes using IBk. That would give me the figure of 71% at the bottom right. Of course, I can use a different classifier to do the attribute selection than the one I’m using for classification. The 69% figure is when I wrap IBk, doing attribute selection using IBk and then classifying using J48. The 74% is when I do attribute selection using J48 and then classify using IBk.
And it’s slightly surprising – you would expect that you would get a better attribute subset by using the classifier that you’re going to be using for classification, so it’s slightly surprising to see that 74% coming in larger than the 71% figure. But, you know, surprising things happen, and if we did a more extensive run with the Experimenter we probably wouldn’t find that. I’m going to check the effectiveness of the AttributeSelectedClassifier for getting rid of redundant attributes. I’m going to open the diabetes dataset and I’m going to use the AttributeSelectedClassifier with Naive Bayes. Remember with Naive Bayes, when you add redundant attributes the performance of Naive Bayes gets worse.
So I’m going to add redundant attributes, copies of attributes, and then I’m going to use the AttributeSelectedClassifier and see if the performance still gets worse. I’m hoping it doesn’t. The AttributeSelectedClassifier should get rid of those redundant copied attributes. I’m going to open diabetes, and I’m going to use the AttributeSelectedClassifier. I’m going to go through this one more time, because the configuration takes a bit of thought. I’m going to use Naive Bayes as my classifier. And I’m going to use the WrapperSubsetEvaluator. Within that, as the classifier to wrap, I’m going to choose Naive Bayes again. I don’t have to, but it’s probably better to use the same classifier. I’m going to leave everything else at its default value, and let’s run that.
Here I get an accuracy of 75.7%. Back on the slide then. If I just run Naive Bayes on the diabetes dataset, I get 76.3%. Using attribute selection in the proper way – that is, not cheating – with the AttributeSelectedClassifier, I get 75.7%. It’s a little disappointing that attribute selection didn’t help much on this dataset. But let’s now copy the attributes. I’m going to copy the first attribute. Naive Bayes gives me 75.7%, and the AttributeSelectedClassifier also gives me 75.7%. If I add a bunch more copies of that attribute, 9 further copies, then the performance of Naive Bayes deteriorates to 68.9%, whereas the AttributeSelectedClassifier stays the same, because it’s resistant to these redundant attributes.
And if I add further copies, then Naive Bayes will slowly get worse and worse, whereas the AttributeSelectedClassifier continues at its standard level of 75.7%. The conclusion is that attribute selection does a good job of removing redundant attributes. In this lesson, we’ve looked at the AttributeSelectedClassifier, which selects attributes based on the training set only, which is the right way to do it.
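The reason duplicated attributes hurt Naive Bayes is visible in the arithmetic of the posterior itself: every copy multiplies in the same likelihood again, so that one attribute’s evidence is counted many times over. A minimal sketch, with made-up likelihood numbers chosen purely for illustration:

```python
# Toy Naive Bayes posterior for one test instance, showing why duplicated
# attributes hurt: each copy re-multiplies the same likelihood, so one
# attribute's (possibly noisy) evidence is over-counted.

def posterior_yes(likelihood_yes, likelihood_no, prior_yes=0.5, copies=1):
    """P(yes | evidence) when the attribute is duplicated `copies` times."""
    p_yes = prior_yes * likelihood_yes ** copies
    p_no = (1 - prior_yes) * likelihood_no ** copies
    return p_yes / (p_yes + p_no)

# A mildly informative attribute value that slightly favours "no":
print(posterior_yes(0.4, 0.6, copies=1))   # ≈ 0.40 — a gentle nudge
print(posterior_yes(0.4, 0.6, copies=10))  # ≈ 0.017 — near-certain "no"
```

With ten copies, a mild 60/40 hint becomes near-certainty, which is exactly the kind of distortion the attribute selection step protects against by discarding the copies.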

Experimenting with a dataset to select attributes and applying a classifier to the result is cheating if performance is evaluated using cross-validation, because the entire dataset is used to determine the attribute subset. You mustn’t use the test data when selecting attributes! But with cross-validation you don’t really have an opportunity to use the training data only. Enter the AttributeSelectedClassifier, which solves the problem – just as the FilteredClassifier did for supervised discretization. (Does that ring a bell? You saw it before, in Week 2.)

This article is from the free online

More Data Mining with Weka

Created by
FutureLearn - Learning For Life
