Skip main navigation

New offer! Get 30% off your first 2 months of Unlimited Monthly. Start your subscription for just £35.99 £24.99. New subscribers only T&Cs apply

Find out more

Pruning decision trees

To prevent decision trees from overfitting the data, you need to prune them. Ian Witten demonstrates the difference between pruned and unpruned trees.
Hi! In the last class, we looked at a bare-bones algorithm for constructing decision trees. To get an industrial strength decision tree induction algorithm, we need to add some more complicated stuff, notably pruning. We’re going to talk in this [class] about pruning decision trees. Here’s a guy pruning a tree, and that’s a good image to have in your mind when we’re talking about decision trees. We’re looking at those little twigs and little branches around the edge of the tree, seeing if their worthwhile, and snipping them off if they’re not contributing. That way, we’ll get a decision tree that might perform worse on the training data, but perhaps generalizes better to independent test data. That’s what we want.
Here’s the weather data again. I’m sorry to keep harking back to the weather data, but it’s just a nice simple example that we all know now. I’ve added here a new attribute. I call it an ID code attribute, which is different for each instance.
I’ve just given them an identification code: a, b, c, and so on. Let’s just think from the last lesson, what’s going to happen when we consider which is the best attribute to split on at the root, the first decision. We’re going to be looking for the information gain from each of our attributes separately. We’re going to gain a lot of information by choosing the ID code. Actually, if you split on the ID code, that tells you everything about the instance we’re looking at. That’s going to be a maximal amount of information gain, and clearly we’re going to split on that attribute at the root node of the decision tree.
But that’s not going to generalize at all to new weather instances. To get around this problem, having constructed a decision tree, decision tree algorithms then automatically prune it back. You don’t see any of this, it just happens when you start the algorithm in Weka. How do we prune? There are some simple techniques for pruning, and some more complicated techniques for pruning. A very simple technique is to not continue splitting if the nodes get very small. I said in the last lesson that we’re going to keep splitting until each node has just one class associated with it. Perhaps that’s not such a good idea.
If we have a very small node with a couple instances, it’s probably not worth splitting that node. That’s actually a parameter in J48. I’ve got Weka going here. I’m going to choose J48 and look at the parameters.
There’s a parameter called minNumObj. If I mouse over that parameter, it says “The minimum number of instances per leaf”. The default value for that is 2. The second thing we do is to build a full tree and then work back from the leaves. It turns out to be better to build a full tree and prune back rather than trying to do forward pruning as you’re building the tree. We apply a statistical test at each stage. That’s the confidenceFactor parameter. It’s here. The default value is 0.25. “The confidence factor used for pruning [smaller values incur more pruning].” Then, sometimes it’s good to prune an interior node, and to raise the subtree beneath that interior node up one level. That’s called subtreeRaising.
That’s this parameter here. We can switch it on or switch it off. “Whether to consider the subtree raising operation during pruning.” Subtree raising actually increases the complexity of the algorithm, so it would work faster if you turned off subtree raising on a large problem. I’m not going to talk about the details of these methods. Pruning is a messy and complicated subject, and it’s not particularly illuminating. Actually, I don’t really recommend playing around with these parameters here. The default values on J48 tend to do a pretty good job. Of course, it’s become apparent to you now that the need to prune is really a result of the original unpruned tree overfitting the training dataset. This is another instance of overfitting.
Sometimes simplifying a decision tree gives better results, not just a smaller, more manageable tree, but actually better results. I’m going to open the diabetes data. I’m going to choose J48, and I’m just going to run it with the default parameters. I get an accuracy of 73.8%, evaluated using cross-validation. The size of the tree is 20 leaves, and a total of 39 nodes. That’s 19 interior nodes and 20 leaf nodes. Let’s switch off pruning. J48 prunes by default. We’re going to switch off pruning. We’ve got an unpruned option here, which is false, which means it’s pruning. I’m going to change that to true – which means it’s not pruning any more – and run it again.
Now we get a slightly worse result, 72.7%, probably not significantly worse. We get a slightly larger tree – 22 leaves and 43 nodes. That’s a double whammy, really. We’ve got a bigger tree, which is harder to understand, and we’ve got a slightly worse prediction result. We would prefer the pruned tree in this example on this dataset. I’m going to show you a more extreme example with the breast cancer data. I don’t think we’ve looked at the breast cancer data before. The class is no-recurrence-events versus recurrence-events, and there are attributes like age, menopause, tumor size, and so on. I’m going to go classify this with J48 in the default configuration.
I need to switch on pruning – that is, make unpruned false – and then run it. I get an accuracy of 75.5%, and I get a fairly small tree with 4 leaves and 2 internal nodes. I can look at that tree here, or I can visualize the tree.
We get this nice, simple little decision structure here, which is quite comprehensible and performs pretty well, 75% accuracy. I’m going to switch off pruning. Make unpruned true, and run it again. First of all, I get a much worse result, 69.6% – probably signficantly worse than the 75.5% I had before. More importantly, I get a huge tree, with 152 leaves and 179 total nodes. It’s massive. If I try to visualize that, I probably won’t be able to see very much. I can try to fit that to my screen, and it’s still impossible to see what’s going on here. In fact, if I look at the textual description of the tree, it’s just extremely complicated. That’s a bad thing.
Here, an unpruned tree is a very bad idea. We get a huge tree which does quite a bit worse than a much simpler decision structure. J48 does pruning by default and, in general, you should let it do pruning according to the default parameters. That would be my recommendation. We’ve talked about J48, or, in other words, C4.5. We talked about the progression from C4.5 by Ross Quinlan. Here is a picture of Ross Quinlan, an Australian computer scientist, at the bottom of the screen. The progression from C4.5 from Ross to J48, which is the Java implementation essentially equivalent to C4.5. It’s a very popular method. It’s a simple method and easy to use.
Decision trees are very attractive because you can look at them and see what the structure of the decision is, see what’s important about your data. There are many different pruning methods, and their main effect is to change the size of the tree. They have a small effect on the accuracy, and it often makes the accuracy worse. They often have a huge effect on the size of the tree, as we just saw with the breast cancer data. Pruning is actually a general technique to guard against overfitting, and it can be applied to structures other than trees, like decision rules. There’s a lot more we could say about decision trees.
For example, we’ve been talking about univariate decision trees – that is, ones that have a single test at each node. You can imagine a multivariate tree, where there is a compound test. The test of the node might be ‘if this attribute is that AND that attribute is something else’. You can imagine more complex decision trees produced by more complex decision tree algorithms. In general, C4.5/J48 is a popular and useful workhorse algorithm for data mining.

Decision trees run the risk of overfitting the training data. One simple counter-measure is to stop splitting when the nodes get small. Another is to construct a tree and then prune it back, starting at the leaves. For this, J48 uses a statistical test which is rather unprincipled but works well. For example, on the breast-cancer dataset it generates a tree with 4 leaves (6 nodes in total) that gets an accuracy of 75.5%. With pruning switched off, the tree has 152 leaves (179 nodes) whose accuracy is only 69.6%.

This article is from the free online

Data Mining with Weka

Created by
FutureLearn - Learning For Life

Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now