New offer! Get 30% off one whole year of Unlimited learning. Subscribe for just £249.99 £174.99. New subscribers only T&Cs apply

# Decision trees and rules

Any decision tree has an equivalent set of rules ... and any rule set has an equivalent decision tree. But as Ian Witten explains, it's not so simple.
10.9
Hello again! Welcome to Class 3 of More Data Mining with Weka. In this class, we’re going to look at rules and clustering. In the first couple of lessons, we’re going to look at decision rules. I’m going to look in this lesson at rules versus trees, in abstract, as it were. In the next lesson, we’ll look at how to generate decision rules. We talked a lot about decision trees in Data Mining with Weka. For any decision tree, you can read off an equivalent set of rules. You just go along the leaves. For each of the five leaves, you just read off the conditions above that leaf that get you from the root to the leaf.
50.6
“If outlook is sunny and humidity is high then ‘no’” – that’s a rule corresponding to the left-most leaf. That’s easy. We call this a “decision list” if we execute these rules in order. If we execute them in order, then we can delete some of these clauses. We can delete the “and humidity is normal” in the second rule, because we’ve already dealt with the “humidity is high” case in the first rule, and there are no other options – if it’s not high, it’s got to be normal.
80.4
Similarly, in the fourth rule, we can delete the “outlook = rainy” bit, because we’ve already dealt with sunny and overcast, and if we pass through those rules, then the outlook has got to be rainy. Using this technique, if we execute our rules as a decision list, we can make the rule set a little bit simpler. There are other rule sets that are equivalent that are even simpler. Let’s look at this 3-rule set.
104.9
Imagine: is there an equivalent tree? For any tree, you can make rules; for any set of rules, can you make a tree? The answer is, you can. This is slightly more abstract than the previous example.
117.3
I’ve got: “If x=1 and y=1 then a. If z=1 and w=1 then a. Otherwise, b.” I’m assuming that both x and y have three possible values. We branch on x and y down the left-hand branch. If they’re both 1, that’s a; fantastic! Otherwise, though, if y=2, then we’ve got to check z and w. That’s done in that gray tree underneath, which is a little bit complex. To make matters much worse, we’ve got to repeat this tree. If y=3, then we’ve got to repeat the same tree. That little gray triangle stands for that whole gray tree replicated. There are a total of 4 copies of it in this.
162.7
We start with a rather simple set of rules, and we end up with a pretty complicated tree. So in one sense, trees and rules are equivalent.
170.8
They can describe the same things: given a tree, you can create a set of rules; given a set of rules, you can create a tree. But in practice, they’re very different. Because – particularly if rules are expressed as a decision list and they’re executed in order – they can be much smaller than trees.
188.1
People like rules: they’re easy to read and understand. It’s tempting to view them as independent “nuggets of knowledge”. If you’re executing them as a decision list – and you usually are – then the meaning of the rule must be taken in the context of the rules that precede it. They don’t really stand independently, although they look like they do. One way of creating rules – let’s say you want to create rules – you could just create a decision tree. We know how to do that – the top-down, divide-and-conquer method used by J48 – and read rules off the tree, one rule for each leaf, like we did at the beginning. Very straightforward, but the rules will contain repeated tests.
230
You can get rid of some of those quite easily, but more effective conversions are not so easy to do. Another completely different approach for generating rules is to do it bottom-up, a “covering method” that’s called separate-and-conquer. We work on the different classes in turn. For each class, in turn, we find the rules that cover all its instances. We do that by first identifying a useful rule, then separating out those instances it covers; and then carry on and conquer the remaining instances in that class, finding more rules for them. Here’s a simple little example. We’re going to generate a rule for class a. We start out with a rule that says “Everything is class a!”
271.8
Of course, that rule is not correct. So we add clauses to that rule. “If x > 1.2 then class = a.” That rule is still not correct, but it’s better than the first rule.
284.6
Then we can add another clause to make it even more correct: “If x > 1.2 and y > 2.6, then class = a.” That’s completely correct. It does miss out one other a. We could add a new rule for that, or we could decide that maybe we don’t need to because it’s just one instance that’s being missed out.
303.5
Then, for a rule for class b, we could say “if x <= 1.2 than class = b”: that gets half of them. Another rule could be “if x > 1.2 and y <= 2.6, then class = b.” We could get rid of the first test if we knew those rules were going to be executed in order. We could add more rules to get a perfect rule set, or we could stop there with the rules that we have. The termination conditions are something that you have to decide on when you devise a rule-making algorithm. Here’s a decision tree that corresponds to the rule set we just looked at. Rules sets can be more perspicuous than decision trees.
341
Decision trees sometimes have to contain replicated subtrees. Also, there’s a big difference between the divide-and-conquer and the separate-and-conquer, the top-down and the bottom-up, algorithms. In a multi-class situation, the covering algorithm focuses on one class at a time, it gets rules for that class; whereas the divide-and-conquer decision tree algorithm takes all the classes into account when making any of the decisions. Here are the details of a simple bottom-up, covering algorithm for creating rules. It’s called PRISM. It consists of three loops. The outer loop is over each class.
377.4
Then, [in] the middle loop, we’re going to cover some of the instances in that class, and then carry on creating more rules that cover more of the instances until we’re satisfied we’ve covered enough. The inner loop is taking a rule and adding clauses to the rule until it’s accurate enough. It’s a simple, iterative algorithm for covering a dataset class-by-class, instance-by-instance, bottom-up; elaborating the rules with more conditions as we go along. That’s it. We’ve seen that decision trees and rules have got the same expressive power, but either can be more perspicuous than the other. It just depends on the dataset. Rules can be created using a bottom-up covering process. They’re often executed as decision lists, in order.
428.5
If rules assign different classes to an instance, then the first rule wins if you’re executing them in order – which means that the rules are not really independent nuggets of knowledge. Still, people like rules, and they often prefer them to trees.

Any decision tree has an equivalent set of rules … and any rule set has an equivalent decision tree. So they’re the same? It’s not so simple. If you read rules off a tree, you usually end up with far more rules than necessary, particularly if – as is usual – the rules end up being checked one at a time, in order. And conversely, there are small rule sets for which the equivalent tree is outrageously complex. Whereas trees are created top-down using a “divide and conquer” approach, rules are more naturally created bottom-up, using a covering process.