[0:11] Hello again! We’re going to look in this lesson at another discretization technique, supervised discretization. Now you’re probably thinking, “why doesn’t he just tell us the best method of discretization and get on with it, instead of going into all of these different methods?” The answer is, there is no universally best method. It depends on the data, and what I’m trying to do in this course is to equip you to find out the best method for use with your data. Also, this class is about the FilteredClassifier, which is a slightly different classification methodology, which is useful more generally than just in discretization. Let’s take a look.
[0:47] Discretization is about transforming numeric attributes to nominal, and supervised discretization is about taking the class into account when making those discretization decisions. Here’s the situation. In this diagram, we have numeric x attribute values along the top, and then underneath the discretized versions, which we’ll call y.
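As a rough illustration of turning a numeric attribute x into a nominal attribute y, here is a minimal Python sketch. The function name `discretize`, the letter labels, and the cut points are all assumptions made for this example; this is not how Weka implements its filters.

```python
from bisect import bisect_left

def discretize(x, cut_points):
    """Map numeric x to a nominal bin label: 'a' for the lowest bin, 'b' next, and so on.

    cut_points must be sorted in ascending order. A value equal to a cut point
    is placed in the lower bin, so bin i covers (cut_points[i-1], cut_points[i]].
    """
    return chr(ord("a") + bisect_left(cut_points, x))

# Hypothetical cut points chosen for illustration; they are not from the lesson.
cuts = [70.5, 77.5]
y = [discretize(x, cuts) for x in [64, 72, 85]]
```

Two cut points give three bins. Unsupervised methods choose the cut points from the x values alone; supervised methods also look at the class.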
[1:06] There are two bins shown here, out of many, probably: bins c and d. It so happens that all the instances for which the x attribute is in bin c have class 1, and all the instances for which the x attribute is in bin d have class 2, with just one exception: there’s one instance in bin c which has class 2. It might be a good idea to shift that boundary just a little bit downwards in the direction of the arrow. Then we’d find that this particular attribute, the discretized version of the attribute, has a precise correspondence with the class values of the instances.
[1:44] There’s a bit of motivation for why it might be useful to take the class values into account when making those discretization decisions. Supervised discretization. How do you do it? Well, the most common way is to use an entropy heuristic, like the one pioneered by C4.5 (which we call J48 in Weka). I’m looking now at the numeric weather data and the temperature attribute, which ranges from 64 to 85. Underneath are the class values of the instances. So there’s an instance with temperature 64 which has a “yes” class. And there are two instances with a temperature of 72: one has a “no” class, and the other has a “yes” class.
[2:25] If we take the entropy of that boundary there, underneath the boundary are 4 “yes”s and 1 “no”, and above the boundary are 5 “yes”s and 4 “no”s. We can calculate the entropy as 0.934 bits. We talked about entropy when we talked about C4.5 in the previous class, and I’m not going to go over that again. The heuristic then is to choose the split point with the smallest entropy, which corresponds to the largest information gain, and to continue recursively until some stopping criterion is met. At the bottom, we’ve continued splitting and made 5 splitting decisions, creating 6 bins for our supervised discretization. Let’s do this in Weka. We’re going to use the supervised discretization filter on the ionosphere data.
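The split heuristic can be sketched in a few lines of Python. This sketch assumes that "the entropy of a boundary" means the size-weighted average of the entropies of the two sides, which is the usual C4.5-style definition; the function names are made up for this example. Under that assumption the 4/1 versus 5/4 split evaluates to roughly 0.895 bits rather than the 0.934 quoted, so the slide may be using a different split or convention.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c > 0)

def split_entropy(left_counts, right_counts):
    """Average entropy of the two sides of a boundary, weighted by their sizes."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * entropy(left_counts) + (n_right / n) * entropy(right_counts)

def best_split(values, labels):
    """Return (cut_point, entropy) for the lowest-entropy boundary.

    Candidate cut points are midpoints between adjacent distinct values.
    Returns None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can separate equal values
        left = [sum(1 for _, y in pairs[:i] if y == c) for c in classes]
        right = [sum(1 for _, y in pairs[i:] if y == c) for c in classes]
        score = split_entropy(left, right)
        if best is None or score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best

# The boundary from the lesson: 4 yes / 1 no below, 5 yes / 4 no above.
boundary_bits = split_entropy([4, 1], [5, 4])
```

A full supervised discretizer would apply `best_split` recursively to each side until a stopping criterion is met; that part is left out here.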
[3:15] We already know that J48 gets 91.5% accuracy on this data. Let’s go now to Weka, and I’ve got the data loaded in here. I’m going to choose the supervised discretization filter, not the unsupervised one we looked at in the last lesson. It’s got some parameters. The most useful is makeBinary. We saw how useful that was in the activity that we’ve just done, associated with the last lesson, but we’ll leave it at False for the moment. I’m going to apply that discretization. Now we can see that a1, the first attribute, has been discretized into 2 bins, which is not surprising, because, in fact, it was a binary attribute.
[3:56] The second attribute has been discretized into 1 bin, so that will not participate in the decision in any form. The third attribute has been discretized into 4 bins, the fourth into 5 bins, and so on. I think all of the attributes are discretized into somewhere between 1 and 6 bins. Now we can just go ahead and apply J48. But there’s a problem here, if we’re using cross-validation. If we were to apply J48 with cross-validation to this discretized dataset, then the class values in the test sets would already have been used to help with the discretization decisions. That’s cheating! You shouldn’t use any information about the class values in the test set to help with the learning method.
[4:42] So I’m going to undo the effect of this filter. This is quite a general problem. If I just go to my slide here: we use a FilteredClassifier, which is a meta classifier. We met a meta classifier in the last course when we looked at bagging. Now we’re going to look at another meta classifier. I’m going to choose the FilteredClassifier from “meta”. If I look at the “More” information, it tells us it’s a “class for running an arbitrary classifier on data that has been passed through an arbitrary filter.” The structure of the filter is based exclusively on the training data. That’s exactly what we want.
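The idea of a classifier that wraps a filter can be sketched in Python as follows. Everything here is hypothetical and heavily simplified: `FilteredClassifierSketch`, `MeanMidpointDiscretizer` and `MajorityPerBin` are names invented for this example, the data has a single numeric attribute, and the filter is a deliberately crude supervised discretizer. It is not Weka's code. What it shows is that the filter's structure is learned inside `fit`, from training data only, and then reused unchanged at prediction time.

```python
from bisect import bisect_left
from collections import Counter

class MeanMidpointDiscretizer:
    """Crude supervised filter: cut points midway between adjacent class means."""
    def fit(self, xs, ys):
        sums, counts = Counter(), Counter()
        for x, y in zip(xs, ys):
            sums[y] += x
            counts[y] += 1
        means = sorted(sums[c] / counts[c] for c in counts)
        self.cuts = [(a + b) / 2 for a, b in zip(means, means[1:])]
        return self

    def transform(self, xs):
        return [bisect_left(self.cuts, x) for x in xs]

class MajorityPerBin:
    """Tiny base classifier: predicts the majority training class of each bin."""
    def fit(self, bins, ys):
        self.default = Counter(ys).most_common(1)[0][0]
        per_bin = {}
        for b, y in zip(bins, ys):
            per_bin.setdefault(b, Counter())[y] += 1
        self.table = {b: c.most_common(1)[0][0] for b, c in per_bin.items()}
        return self

    def predict(self, bins):
        return [self.table.get(b, self.default) for b in bins]

class FilteredClassifierSketch:
    """Fits the filter on the training data, then the classifier on filtered data."""
    def __init__(self, data_filter, classifier):
        self.data_filter = data_filter
        self.classifier = classifier

    def fit(self, xs, ys):
        self.data_filter.fit(xs, ys)  # filter structure comes from training data only
        self.classifier.fit(self.data_filter.transform(xs), ys)
        return self

    def predict(self, xs):
        # Test instances pass through the already-fitted filter; their classes are never seen.
        return self.classifier.predict(self.data_filter.transform(xs))

model = FilteredClassifierSketch(MeanMidpointDiscretizer(), MajorityPerBin())
model.fit([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
predictions = model.predict([0, 20])
```

Because the wrapper behaves like a single classifier, a cross-validation routine that calls `fit` once per fold automatically re-learns the discretization on each training fold.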
[5:24] It’s got two parameters here: the classifier, which by default is J48 and happens to be the one that we want; and a filter, which by default is the supervised discretization filter, which is also what we want. We can just run this now. That gives us 91.2% accuracy, which is quite good, nearly as good as on the original, undiscretized attributes. We found before that setting the makeBinary option was quite productive. I’m going to set that to True, and run it again. We get a very good result, 92.6% accuracy, which is better than what we got on the undiscretized version of the [dataset]. How big is the tree here? The tree has got 17 nodes, so it’s not too big.
[6:18] That’s using the FilteredClassifier to avoid cheating, so that the test sets used within the cross-validation do not participate in choosing the discretization boundaries. Let’s try cheating! We’ll probably get a better result if we set makeBinary to True in the filter, apply the filter to the data, and then go back to Classify and apply J48. This is cheating, because we’re using cross-validation and the class values in the test sets have been used to help get those discretization boundaries. Sure enough, cheating pays off: 94% accuracy. That is not representative of what we’d expect if we used this system on fresh data. OK, that’s it.
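The difference between the honest and the cheating procedure comes down to where the cut point is learned relative to the cross-validation loop. The Python sketch below makes that explicit. It rests on several assumptions: a single numeric attribute, one cut point placed midway between the lowest and highest class means, a majority-class-per-bin classifier, and interleaved rather than stratified folds. The names `learn_cut`, `fit_on_bins` and `cross_validate` are invented for this example.

```python
from collections import Counter

def learn_cut(xs, ys):
    """Crude supervised cut point: midway between the lowest and highest class means."""
    means = []
    for c in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == c]
        means.append(sum(vals) / len(vals))
    return (min(means) + max(means)) / 2

def fit_on_bins(bins, ys):
    """Return a predictor giving the majority class per bin (overall majority if unseen)."""
    per_bin = {}
    for b, y in zip(bins, ys):
        per_bin.setdefault(b, Counter())[y] += 1
    default = Counter(ys).most_common(1)[0][0]
    return lambda b: per_bin[b].most_common(1)[0][0] if b in per_bin else default

def cross_validate(xs, ys, k, cut=None):
    """k-fold accuracy with interleaved folds.

    cut=None : the cut point is learned from each training fold (honest).
    cut=value: a fixed cut point is used, which cheats if it was learned
               from data that includes the test folds.
    """
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train_xs = [x for i, x in enumerate(xs) if i not in test_idx]
        train_ys = [y for i, y in enumerate(ys) if i not in test_idx]
        fold_cut = cut if cut is not None else learn_cut(train_xs, train_ys)
        predict = fit_on_bins([x > fold_cut for x in train_xs], train_ys)
        correct += sum(predict(xs[i] > fold_cut) == ys[i] for i in test_idx)
    return correct / len(xs)

# Toy data, made up for this example.
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
honest = cross_validate(xs, ys, k=4)
cheating = cross_validate(xs, ys, k=4, cut=learn_cut(xs, ys))
```

On cleanly separable toy data like this the two estimates coincide. The optimistic bias of the cheating version only shows up on noisier data, where the boundaries can adapt to the test instances' classes.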
[7:10] Supervised discretization is when you take the class into account when making discretization boundaries, which is often a good idea. It’s important that the discretization is determined solely by the training set and not the test set. To do this when you’re cross-validating, you can use the FilteredClassifier, which is designed for exactly this situation. It’s useful with other supervised filters, as well.
“Supervised” discretization methods take the class into account when setting discretization boundaries, which is often a very good thing to do. But wait! You mustn’t use the test data when setting discretization boundaries, and with cross-validation you don’t really have an opportunity to use the training data only. Weka’s solution is the FilteredClassifier, and it’s important because the same issue occurs in other contexts, not just discretization.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.