Skip main navigation

“Wrapper” attribute selection

Fewer attributes often yield better performance! Ian Witten explains the "wrapper" method of attribute selection.
In this lesson we’re going to talk about attribute selection using the wrapper method. D’you remember way back in Data Mining with Weka, in the first class, you looked at glass.arff, you ran J48, you removed some attributes and – much to your surprise – you sometimes got better results with fewer attributes? Well, that was a laborious manual process, where you started with the full attribute set and removed the best attribute by selectively trying all possibilities; then you carried on doing that. You probably remember the pain involved. Well, of course there’s a better way, and that’s what the Select Attributes panel does.
We’re going to go to Weka, and I’ve opened the glass data set – there it is: 214 instances. I’m going to go to the Select Attributes panel. We’re talking here about the wrapper method of attribute selection, and that involves wrapping up a classifier. We’re going to wrap up J48, which is exactly what you did back then all those weeks ago. I’m going to use 10-fold cross-validation, which actually is what you did – although in Class 1 of Data Mining with Weka you’d never heard of cross-validation. That looks pretty good to me.
I’m going to select a threshold of –1: I’ll explain that later on. Then we have a search method. We’re going to use the BestFirst search, but we’re going to search backwards – I’ll talk about that later on. And we’re going to have a search termination … yes, we’re going to leave it at that. OK. Let’s just run it. Now it’s running, doing all those cross-validations. And, lo and behold, it’s got the same attribute subset as you got before, and it’s got a “Merit of best subset” of 74%. Going back to the slide here, that’s really the same thing as we got before. Same subset, and the “merit” is the same as the accuracy.
It’s a little bit of a coincidence that we got the same results, because Weka doesn’t do exactly the same thing, you know, the setting of the random number generator and so on. But anyway, we did get the same result here in this situation.
A good question is: how many subsets we had to evaluate? How many attributes we had to evaluate? How much experimentation we had to do? I’m going to go back here, and I’m going to set the searchTermination to 1 – and again I will explain that in a minute – and run it again. And here it tells me that it’s evaluated 36 subsets. Back on the slide, you can count these subsets. It took the complete attribute set and then it tried removing all of the 9 attributes, one by one. That’s 9 more evaluations. Then it removed another attribute, 8 evaluations, and another one and another one, which gave it the final attribute subset.
But to check that it was the final attribute subset and you couldn’t do better by removing another attribute, it had to do a further 5 evaluations. And if you add up all of those, you get 36 subsets evaluated. The wrapper method involves an evaluation method and a search method. Let’s talk about search. We were doing backwards searching. We started with all 9 attributes, and selected one to remove, and so on and so forth until we decided to stop; the searchTermination criteria. It would be equally viable to do forwards search, starting with a 0-attribute subset and adding the best attribute each time until you decided to stop. Or you could do bi-directional search.
You could start with some random attribute subset – actually, Weka allows you to specify what attribute subset to start with – and then either add or subtract an attribute depending on which gives the most performance improvement.
Or you could do exhaustive search: in this case there are 512 possible subsets of 9 attributes and you could simply try them all. The searchTermination criterion is interesting.
When we did this manually, we stopped as soon as the results started to get worse: we got a local maximum in the search space. But you might do better by plowing on through that minimum that you get, going a little bit further to see if perhaps you might reach an even higher peak further on. If you set the searchTermination criterion to something greater than 1, then Weka will try a little bit harder, go a little bit further, before deciding to abandon the search. Now, I’m not going to show you all these different searches, but here are some results. It’s a pretty [lengthy] process. I showed you Backwards search, and we got that first subset at a 0.72 evaluation.
And then we set the searchTermination up to 5 which gives us a chance of powering on past a local maximum, finding an even bigger maximum in the search space, and that gives us a better evaluation. Or with Forwards search, you get that 3-attribute subset RI, Al and Ca. If you search on a little bit further instead of terminating the search prematurely, you can get a better subset with better accuracy. And Bi-directional search will give that 3-attribute subset, and again you can improve that by setting the searchTermination criterion to search a little bit further.
Note that we are always finding a local optimum, but setting the searchTermination criterion to more than 1 gives you a chance of traversing a valley in the search space to find a better local optimum. I turns out that, on this dataset, “Al” is the single best attribute to use (OneR will confirm that for you), so all Forward search results will include “Al”. Curiously, “Al” is the best single attribute to drop. So if you start with a full set, the best one to drop is “Al”. This sounds pretty strange, and I must admit it is pretty unusual. But nevertheless it’s true, and it’s certainly not impossible. Let’s go back to Weka here. I’m going to set cross-validation and see what happens.
What it’s doing now is doing the attribute evaluation 10 separate times. It’s showing us here how many times this attribute, RI, appeared in the final attribute subset. In this case, it appeared in 9 out of the 10 attribute subsets.
Coming back to the slide: in how many folds does this attribute appear in the final subset? You can see that RI and Mg and Ba appear in all 10 of the folds, and Al, Si, K and Fe appear in not too many, 2 or 3 of the folds. This gives you an indication of the stability of the attribute selection method. For this dataset it’s not really very stable, as we’ve seen by getting all those different subsets when we try different parameters of the wrapper method. If we do forward search, of course, we will definitely choose Al, so this was done with Backwards search. The gory details of the Wrapper method.
In general, Weka implementations follow descriptions in the research literature, so these parameters came from the research literature. It tries to do a 5-fold cross-validation by default, not a 10-fold cross-validation, but it doesn’t necessarily do all 5 folds. It does at least 2 and up to 5 runs, and stops when the standard deviation is less than a user-specified threshold. Setting a negative threshold, which is what we did, forces a single cross-validation each time. The BestFirst search method is the default, and the searchTermination defaults to 5 for traversing valleys. The Wrapper method uses cross-validation to select the best attribute to add or drop at each stage. If we go back to Weka, there’s another attribute evaluator, which is called the ClassifierSubsetEvaluator.
That allows us to specify a classifier and also a hold-out file, so here we would use the hold-out file to evaluate each subset in turn. That’s attribute selection using the Wrapper method.
We use a classifier to find a good attribute set: we used J48. We wrap the classifier in a cross-validation loop.
There are two components here: the attribute evaluator, which evaluates a subset of attributes; and the search method, which searches through the attribute space. Searching can be forwards, backwards, or bidirectional starting from any subset.
It’s computationally intensive: m^2 subsets need to be evaluated for m attributes, and there’s an exhaustive method which evaluates 2^m subsets. Greedy search always finds a local optimum in the search space, and you can traverse valleys by increasing the searchTermination parameter.

Fewer attributes often yield better performance! In a laborious manual process, you can start with the full attribute set and remove the best attribute by selectively trying all possibilities, and carry on doing that. Weka’s Select Attributes panel accomplishes this automatically. The “wrapper” method wraps a classifier in a cross-validation loop: it searches through the attribute space and uses the classifier to find a good attribute set. Searching can be forwards, backwards, or bidirectional, starting from any subset.

This article is from the free online

More Data Mining with Weka

Created by
FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education