Discretizing is transforming numeric attributes to nominal. There are a couple of obvious ways of doing this. We’ve got a numeric attribute with a certain range. We could take that range and chop it into a certain number of equal parts, or “bins”. Just divide it into equal bins, and wherever a numeric value falls, we take that bin and use its identification as the discretized version of the numeric value. Or instead of using equal-width bins, we can adjust the width to make the number of instances that fall into each bin approximately the same: equal-frequency binning. We’re going to talk about those two things. We’ll talk briefly about the choice of the number of bins. Then we’ll talk about how to exploit the ordering information that’s implicit in a numeric value, but is lost in the nominal value that you convert it to.
Let’s look at equal-width binning. I’m going to take ionosphere.arff, which has got a lot of numeric attributes. I’m going to use J48. I’ve set Weka up here with ionosphere. I’ve run J48, and I get 91.5% accuracy. Let’s go and look at some of these numeric attributes. The first one, a1, has got just two distinct values, 0 and 1, actually.
You can see the two values here. The third attribute has got a bunch of different values ranging between –1 and +1, and kind of scrunched up towards the top end. The fourth attribute also varies between –1 and +1. It looks like it could almost be a normal distribution. I’m going to go to a filter here, an unsupervised attribute filter called Discretize. Amongst the parameters here is the number of bins, and I’m going to use 40 bins. As for the equal-frequency option, we’re going to use equal-width binning, not equal frequency, so leave that as False. I’m going to run it, and look at the result. Here is the first attribute from 0 to 1, just two values.
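As a rough illustration of what equal-width binning does to a single attribute, here is a minimal Python sketch. It is not Weka's implementation: the function name `equal_width_bins`, the sample values, and the choice to put a value that sits exactly on a boundary into the upper bin are all assumptions made for the example.

```python
def equal_width_bins(values, num_bins):
    """Map each numeric value to the index of one of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no range to divide; everything goes in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / num_bins
    labels = []
    for v in values:
        index = int((v - lo) / width)
        # The maximum value would land one past the end, so clamp it to the last bin.
        labels.append(min(index, num_bins - 1))
    return labels


# Hypothetical attribute values between -1 and +1, bunched towards the top end.
values = [-1.0, -0.9, 0.0, 0.2, 0.85, 0.9, 0.95, 1.0]

two_bins = equal_width_bins(values, 2)
four_bins = equal_width_bins(values, 4)
print(two_bins)   # [0, 0, 1, 1, 1, 1, 1, 1]
print(four_bins)  # [0, 0, 2, 2, 3, 3, 3, 3]
```

With four bins, bin 1 ends up empty and bin 3 holds half the values: equal-width bins take no account of how the values are distributed.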
Here’s the one that was all scrunched up to the top end. This is –1, this is 0, and this is +1. Here is the one that looked kind of normal. You can see it is sort of normal-ish except for a bunch of extra values down here at –1 and +1. I can look at those in the Edit panel, actually. If I undo the effect of that, and go and look in the Edit panel and sort by this attribute, you can see all the –1’s here, and then a bunch of numbers, and then up at the top you can see a bunch of extra +1’s in this column. Now I’ve applied the filter again.
I’m going to classify it and see what J48 makes of that. We get 87.7% accuracy, which is not very good. I can go back and change the number of bins. I’m going to go straight to 2 bins here. I’m going to, first of all, undo the effect of this, and then apply the 2-bin version. Well, this attribute had just two values to start off with, but you can see that this other attribute is now discretized into 2 bins. If I run J48 again, I get 90.9%, which is pretty good, actually. Going back to the slide, you can see the results for different numbers of bins here.
The last one, 90.9% with 2 bins, is not too much worse than the original undiscretized version at 91.5%. What’s more, the tree has only got 13 nodes. It’s a much smaller, much more economical tree than the one we had before, with very little loss in accuracy. So that looks really good. I’m going to move now to equal-frequency binning. Let’s go back here, and take the Discretize filter and change it to equal frequency. I’m going to go back to 40 bins here, and I’m going to run that. First, I need to undo the discretization, and then I’m going to apply this filter. Well, it can’t do much with the first attribute; that was binary to start off with.
But here, you can see that this is where they were all scrunched up towards the top end. This is –1, this is 0, and this is +1. You can see that, where possible, it’s chosen the size of the bins to equalize the frequency. It can’t do anything with this large bin at the top, or this one at the bottom, or this one in the middle, because all of the instances in each of them share a single value: +1 here, 0 here, and –1 here. But where it can, it has equalized the frequency. This is the one that used to look normal. You can see there are some extra –1’s, 0’s, and +1’s, and it’s equalized the frequency by choosing appropriate bin widths.
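The behaviour just described, equalizing the counts where possible but never splitting a run of identical values, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm Weka uses: the helper names `equal_frequency_cuts` and `assign_bins`, the rule of pushing a cut point forward past tied values, and the sample data are all invented for the example.

```python
from bisect import bisect_left


def equal_frequency_cuts(values, num_bins):
    """Pick cut points so that bins hold roughly equal numbers of values.

    A cut is only ever placed between two distinct values, so instances that
    share the same value always stay together in one bin.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, num_bins):
        pos = max(1, i * n // num_bins)  # ideal start index of bin i
        # Slide forward past ties: identical values cannot be separated.
        while pos < n and ordered[pos] == ordered[pos - 1]:
            pos += 1
        if pos == n:
            break  # everything left is one tied run; no further cuts possible
        cut = (ordered[pos - 1] + ordered[pos]) / 2
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def assign_bins(values, cuts):
    """Bin index = number of cut points lying below the value."""
    return [bisect_left(cuts, v) for v in values]


# Hypothetical attribute: half of the instances have exactly the value +1.
values = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

cuts = equal_frequency_cuts(values, 4)
bins = assign_bins(values, cuts)
print(cuts)  # [0.125, 0.875]
print(bins)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
```

Four bins were requested, but only three come out: the six instances at +1 form one large bin that cannot be divided, while the remaining values are split three and three.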
I can go and classify. J48 gives me 87%. It’s a bit disappointing, not very good at all. I can try with different numbers of bins. Let me change this to 2 bins. I need to undo this one first. Then apply. It hasn’t done much here, where there were originally just 2 values, but you can see that here we’ve got 2 bins holding equal numbers of instances. That’s what histogram equalization, equal frequency, is trying to do: make bins with the same number of instances in each. If I just run J48 on that, I get 83%, which, again, is pretty disappointing. Coming back to the slide, you can see that all of these equal-frequency binning results are worse than the original results. The size of the tree is not hugely smaller, either. So they’re not really very good.
Which method should you use? How many bins should you use? Well, these are experimental questions. There’s a theoretical result called “proportional k-interval discretization” which says that the number of bins should be proportional to the square root of the number of instances.
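To make the square-root rule concrete, here is a small Python sketch. The function name `pki_num_bins` and the parameter `constant` are hypothetical, and the figure of 351 instances is assumed to be the size of the ionosphere dataset used in the demonstration.

```python
import math


def pki_num_bins(num_instances, constant=1.0):
    """Number of bins proportional to the square root of the number of instances.

    `constant` is the constant of proportionality; the rule itself does not
    fix its value, so it is left as a parameter here.
    """
    return max(1, round(constant * math.sqrt(num_instances)))


# Assumed dataset size: 351 instances.
for constant in (0.5, 1.0, 2.0):
    print(constant, pki_num_bins(351, constant))
# 0.5 9
# 1.0 19
# 2.0 37
```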
This result doesn’t really help you very much in choosing the number of bins, because it doesn’t tell you what the constant of proportionality should be. It’s an experimental question. A more interesting question is how to exploit ordering information. In the numeric version of the attribute, which is shown at the top, we’ve got a value v here, and there’s an ordering relationship between different values of this attribute. However, when we discretize it here into five different bins, then there’s no ordering information between these bins. Which is a problem, because we might have a test in a tree, “is x < v?”, before discretization.
After discretization, to get the equivalent test, we would need to ask “is y = a?”, “is y = b?”, “is y = c?” and replicate the tree underneath each of these nodes. That’s clearly inefficient, and is likely to lead to bad results. There’s a little trick here. Instead of discretizing into five different values a to e, we can discretize into four different binary attributes, that is, k–1 binary attributes for k bins. The first attribute, z1, says whether the value is in the first range, a, and the second attribute, z2, says whether it’s in a or b. The third, z3, says whether it’s in a, b, or c. The fourth, z4, says whether it’s in the first four ranges.
If in our tree we have a test “is x < v?”, then if x is less than v, z3 is True (and so is z4), whichever of the first three ranges x falls in; otherwise z3 is False. So an equivalent test on the binary attributes is “is z3 = True?” If we take that tree we had before, testing on “x < v”, an equivalent test is “is z3 True?”. Then we have the same kind of economy of the tree underneath this without replicating different subtrees. That’s very easy in Weka. We just go to our filter, and we set makeBinary to True. That allows us to retain the ordering information that’s implicit in the original numeric attribute.
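The k–1 binary attributes can be written out explicitly. The Python sketch below illustrates the encoding as described in the talk; it does not show what Weka's makeBinary option produces internally. The function name `ordered_binary_encode` and the bin labels a to e are assumptions for the example.

```python
def ordered_binary_encode(bin_index, num_bins):
    """Encode a bin (0-based index) as num_bins - 1 binary attributes.

    Attribute z_j (j = 1 .. num_bins - 1) is True when the value lies in one
    of the first j bins, so the ordering of the bins is preserved.
    """
    return [bin_index < j for j in range(1, num_bins)]


BIN_NAMES = "abcde"  # five bins, so four binary attributes z1..z4
encoded = {
    name: ordered_binary_encode(i, len(BIN_NAMES))
    for i, name in enumerate(BIN_NAMES)
}
for name, z in encoded.items():
    print(name, z)
# a [True, True, True, True]
# b [False, True, True, True]
# c [False, False, True, True]
# d [False, False, False, True]
# e [False, False, False, False]

# A single test on z3 separates bins a, b, c from bins d, e,
# which is what "is x < v?" does when v is the boundary between c and d.
below_v = [name for name, z in encoded.items() if z[2]]
print(below_v)  # ['a', 'b', 'c']
```

One binary test therefore stands in for the threshold test on the original numeric attribute, with no need to replicate subtrees under separate “is y = a?”, “is y = b?”, “is y = c?” branches.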
Discretizing numeric attributes
Discretizing is transforming numeric attributes to nominal. You might want to do that in order to use a classification method that can’t handle numeric attributes (unlikely), or to produce better results (likely), or to produce a more comprehensible model such as a simpler decision tree (very likely). This video explains two simple methods, equal-width and equal-frequency binning; and a third, non-obvious, method that preserves the ordering information implicit in a numeric attribute even though it has been converted to nominal. Using these methods in Weka is easy!
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.