## Want to keep learning?

This content is taken from the The University of Waikato's online course, Data Mining with Weka. Join the course to learn more.
3.9

## The University of Waikato

Skip to 0 minutes and 11 seconds Hi! We’re continuing our exploration of simple classifiers by looking at classifiers that produce decision trees. We’re going to look at J48. We’ve used this classifier quite a bit so far. Let’s have a look at how it works inside. J48 is based on a top-down strategy, a recursive divide and conquer strategy. You select which attribute to split on at the root node, and then you create a branch for each possible attribute value, and that splits the instances into subsets, one for each branch that extends from the root node. Then you repeat the the procedure recursively for each branch, selecting an attribute at each node, and you use only instances that reach that branch to make the selection.

Skip to 1 minute and 2 seconds At the end you stop, perhaps you might continue until all instances have the same class. The trick is, the question is, how do you select a good attribute for the root node. This is the weather data, and as you can see, outlook has been selected for the root node.

Skip to 1 minute and 20 seconds Here are the four possibilities: outlook, windy, humidity, and temperature. These are the consequences of splitting on each of these attributes. What we’re really looking for is a pure split, a split into pure nodes. We would be delighted if we found an attribute that split exactly into one node where they are all yeses, another node where they are all nos, and perhaps a third node where they are all yeses again. That would be the best thing. What we don’t want is mixtures, because when we get mixtures of yeses and nos at a node, then we’ve got to split again. You can see that splitting on outlook looks pretty good.

Skip to 1 minute and 57 seconds We get one branch with two yeses and three nos, then we get a pure yes branch for overcast, and, when outlook is rainy, we get three yeses and two nos. How are we going to quantify this to decide which one of these attributes produces the purest nodes? We’re on a quest here for purity. The aim is to get the smallest tree, and top-down tree induction methods use some kind of heuristic. The most popular heuristic to produce pure nodes is an information theory-based heuristic. I’m not going to explain information theory to you, that would be another MOOC of its own – quite an interesting one, actually.

Skip to 2 minutes and 43 seconds Information theory was founded by Claude Shannon, an American mathematician and scientist who died about 12 years ago. He was an amazing guy. He did some amazing things. One of the most amazing things, I think, is that he could ride a unicycle and juggle clubs at the same time when he was in his 80’s. That’s pretty impressive. He came up the whole idea of information theory and quantifying entropy, which measures information in bits.

Skip to 3 minutes and 13 seconds This is the formula for entropy: the sum of p log p’s for each of the possible outcomes. I’m not really going to explain it to you. All of those minus signs are there because logarithms are negative if numbers are less than 1 and probabilities always are less than 1. So, the entropy comes out to be a positive number. What we do is we look at the information gain. How much information in bits do you gain by knowing the value of an attribute? That is, the entropy of the distribution before the split minus the entropy of the distribution after the split. Here’s how it works out for the weather data. These are the number of bits.

Skip to 3 minutes and 53 seconds If you split on outlook, you gain 0.247 bits. I know you might be surprise to see fractional numbers of bits, normally we think of 1 bit or 8 bits or 32 bits, but information theory shows how you can regard bits as fractions. These produce fractional numbers of bits. I don’t want to go into the details. You can see, knowing the value for windy gives you only 0.048 bits of information. Humidity is quite a bit better; temperature is way down there at 0.029 bits. We’re going to choose the attribute that gains the most bits of information, and that, in this case, is outlook. At the top level of this tree, the root node, we’re going to split on outlook.

Skip to 4 minutes and 39 seconds Having decided to split on outlook, we need to look at each of 3 branches that emanate from outlook corresponding to the 3 possible values of outlook, and consider what to do at each of those branches. At the first branch, we might split on temperature, windy or humidity. We’re not going to split on outlook again because we know that outlook is sunny. All instances that reach this place, the outlook is sunny. For the other 3 things, we do exactly the same thing. We evaluate the information gain for temperature at that point, for windy and humidity, and we choose the best. In this case, it’s humidity with a gain of 0.971 bits.

Skip to 5 minutes and 18 seconds You can see that, if we branch on humidity, then we get pure nodes: 3 nos in one and 2 yeses in the other. When we get that, we don’t need to split anymore. We’re on a quest for purity. That’s how it works. It just carries on until it reaches the end, until it has pure nodes. Let’s open up Weka, and just do this with the nominal weather data. Of course, we’ve done this before, but I’ll just do it again. It won’t take long. J48 is the workhorse data mining algorithm. There’s the data. We’re going to choose J48. It’s a tree classifier.

Skip to 6 minutes and 1 second We’re going to run this, and we get a tree – the very tree I showed you before – split

Skip to 6 minutes and 8 seconds first on outlook: sunny, overcast and rainy. Then, if it’s sunny, split on humidity, 3 instances reach that node. Then split on normal, 3 yes instances reach that node, and so on. We can look at the tree using Visualize the tree in the right-click menu. Here it is.

Skip to 6 minutes and 33 seconds These are the number of yes instances that reach this node and the number of no instances. In the case of this particular tree, of course we’re using cross validation here. It’s done an 11th run on the whole dataset. It’s given us these numbers by looking at the training set. In fact, this becomes a pure node here. Sometimes you get 2 numbers here – 3/2 or 3/1. The first number indicates the number of correct things that reach that node, so in this case the number of nos. If there was another number following the 3, that would indicate the number of yeses, that is, incorrect things that reach that node. But that doesn’t occur in this very simple situation.

Skip to 7 minutes and 18 seconds There you have it, J48: top-down induction of decision trees. It’s soundly based in information theory. It’s a pretty good data mining algorithm. 10 years ago I might have said it’s the best data mining algorithm, but some even better ones, I think, have been produced since then. However, the real advantage of J48 is that it’s reliable and robust, and, most importantly, it produces a tree that people can understand. It’s very easy to understand the output of J48. That’s really important when you’re applying data mining. There are a lot of different criteria you could use for attribute selection. Here we’re using information gain. Actually, in practice, these don’t normally make a huge difference.

Skip to 8 minutes and 5 seconds There are some important modifications that need to be done to this algorithm to be useful in practice. I’ve only really explained the basic principles. The actual J48 incorporates some more complex stuff to make it work under different circumstances in practice. We’ll talk about those in the next lesson.

# Decision trees

Another simple method is to build a decision tree from the training data. Start at the top, with the whole training dataset. Select which attribute to split on first; then create a branch for each of its values. This splits the training data into subsets. Repeat the procedure for each branch, selecting an attribute at each node based on just the instances that reach it. This top-down, recursive, divide-and-conquer strategy is adopted by J48 (aka C4.5), which uses a measure called “information gain” to choose the attribute at each stage.