Skip to 0 minutes and 11 seconds Hello again! We’re just going to take one last look at discretization. We’re going to look at how J48 does discretization. Because it does deal with numeric attributes, so somehow, inside of J48, it’s got to effectively discretize an attribute. Or at least it’s got to be able to determine a split point for a numeric attribute. Let’s just review how J48 works. It’s a top-down, recursive, divide-and-conquer algorithm.
Skip to 0 minutes and 39 seconds The algorithm involves first of all selecting an attribute to split on at the root node (that’s the “outlook” attribute in this case), creating a branch for every possible value of that attribute (“sunny”, “overcast”, and “rainy”), splitting the instances into subsets, then going down those three branches and repeating recursively for each branch – selecting an attribute and so on – using only instances that reach the branch.
Skip to 1 minute and 4 seconds The key questions are: what’s the best attribute to split on?
Skip to 1 minute and 8 seconds And: when should you stop the process? The answer to the first question is, the attribute with the greatest “information gain” – at least, that’s J48’s answer. Information gain is the amount of information that’s gained by knowing the value of the attribute, which is the entropy of the distribution before the split minus the entropy of the distribution after it. Entropy is defined in terms of p log p’s. We talked about that briefly in the previous course. The details we didn’t really go into, and I don’t think they’re too important for you at this point.
Skip to 1 minute and 44 seconds In the previous example, the weather data, the information gain for “outlook” was 0.247 bits according to that calculation, and that was the greatest information gain of all of the attributes. So that’s the one that was split on. That’s how it works. Now let’s look at a numeric attribute. This is the “temperature” attribute in the numeric weather data. Here the split point is a number. The trouble is, there’s an infinite number of numbers, so we can’t possibly try them all. However, we’re going to choose split points mid-way between adjacent values in the training set. This reduces it to a finite problem, n–1 possibilities.
Skip to 2 minutes and 29 seconds So here, for the “temperature” attribute, it goes from 64 at the bottom end to 85 at the top end.
Skip to 2 minutes and 35 seconds Below are the class values of the instances: when the value of temperature was 64 it was a “yes” instance; and there were two instances where the value was 72, one “no” and one “yes” instance. There are just n–1 possibilities, and we’re going to try them all. Try all possible boundaries. If we take that split point that’s shown, on the left side of the split point we’ve got 4 “yes”s and 1 “no”, and on the right side we’ve got 5 “yes”s and 4 “no”s. We can calculate the entropy before the split and the entropy after the split and subtract them, and the information gain in this case is 0.001 bits.
Skip to 3 minutes and 11 seconds We might choose that if that’s the greatest information gain of all the possible split points. That’s how it’s going to work for numeric attributes. Here’s an example. We’ve already split, let’s say, on “outlook”. We’ve chosen that at the root. If we look at the “sunny” branch, then there are 5 instances whose outlook is “sunny”. Those are in that little table there. If you look at the value of “humidity”, it’s 70 and 70 for the two “yes” instances, and 85, 90, and 95 for the three “no” instances. That neatly splits the instances into “yes”s and “no”s if we choose a split point somewhere between 70 and 85. We’re going to choose it halfway, at the point 75.
Skip to 3 minutes and 57 seconds Well, 75 isn’t halfway between 70 and 85, but we’ve got two things at 70 and one thing at 85, so we’re going to use sort of a weighted halfway point. It’s going to be a third of the way from 70 to 85. That’s where we get the 75 from. That’s the split point, and in this case, that’s the end of the matter. We’ve discriminated the instances into “yes” and “no” instances down that branch. In a more complicated example, you can imagine we might split on the same attribute more than once.
Skip to 4 minutes and 27 seconds If we have a nominal attribute, once we split on it – once we split on “outlook” here – then we’ve used all of the information in the “outlook” attribute, and we certainly aren’t going to split further down the tree on the same attribute. But that’s not true for numeric attributes. We could split with a certain threshold for a numeric attribute, and further down the tree we might split again on the same attribute, but with a different threshold.
Skip to 4 minutes and 51 seconds Let’s just think about this issue: about discretization when building a tree, as I’ve described, versus discretization in advance, which we looked at in the last couple of lessons. When you do discretization when building a tree, you’re determining the boundaries in a more specific context, in the context of that subtree. A subset of the instances get down here, so you’ve got a more specific dataset to maybe give you a better determination of where a discretization boundary should be. On the other side, the negative side, your decision is based on a small subset of the overall information.
Skip to 5 minutes and 32 seconds You’ve always got to remember when you’re working with a tree that as you get further down to the bottom of the tree, you’ve got smaller and smaller and smaller numbers of instances. Although you might have a large dataset to begin with, by the time you get way down the tree you’re only dealing with a small number of instances, which is maybe not a good basis on which to make a decision. Another issue is computational complexity. As I’ve described the algorithm, for every internal node, the instances that reach it must be sorted separately for each numeric attribute, so that you can work out which split point on which attribute gives you the best information-gain value.
Skip to 6 minutes and 7 seconds The complexity of sorting is n log n, where n is the number of things you’re sorting. It looks like you’ve got to repeatedly sort instances, which could be a bit demanding computationally. But in fact repeated sorting can be avoided with a better data structure. So it’s not computationally disastrous to do internal discretization. That’s it. C4.5 incorporated discretization very early on. Pre-discretization, as we’ve seen in the last lessons, is an alternative, which has been refined later. Supervised discretization uses essentially the same entropy heuristic as C4.5. We can retain the ordering information that numeric attributes imply. We don’t have to keep on sorting them as we go further down the tree. Will internal discretization in J48 outperform pre-discretization?
Skip to 7 minutes and 0 seconds Well, there are arguments both for and against, which we’ve talked about.
Discretization in J48
J48 effectively discretizes numeric attributes as it goes along, which sounds good because split points are chosen in a local context, taking into account just the instances that reach that node of the tree. But discretizing numeric attributes in advance may outperform this, because more data is available in the global context, leading to more reliable decisions. Which is better? it’s an experimental question! Luckily Weka makes it easy to perform the necessary experiments.
Note: in the 4th slide, “Information gain for the temperature attribute”, the entropy after the split should be 0.895 (not 0.939), making the information gain 0.045 (not 0.001).
© University of Waikato, New Zealand. CC Creative Commons Attribution 4.0 International License.