In the last lesson, we looked at decision rules versus decision trees. They are kind of similar representations, but they are interestingly different in how they express things. We also looked at a top-down covering algorithm for generating rules called PRISM. You will realize that PRISM is not really a terribly good machine learning method. It’s not really meant for serious use. In this lesson, we’re going to look at a couple of schemes for rule learning and show how to use them in Weka. The first scheme we’re going to look at is called PART, and it’s a way of forming rules from partial decision trees.

It’s the basic separate-and-conquer algorithm: make a rule, remove the instances it covers, and continue creating rules for the remaining instances. To make a rule, PART builds a tree: it builds and prunes a decision tree for the current set of instances, and reads off the rule for the largest leaf – the most important rule, if you like. Then it throws the tree away and carries on with the covering algorithm. It seems very wasteful, and I suppose perhaps it is a bit wasteful, but it turns out you can build just a partial tree – you don’t have to build a whole tree. That’s how PART works.
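
If it helps to see the shape of that loop, here is a minimal Java sketch of the separate-and-conquer structure PART follows. The class name and the ruleFromPrunedTree helper are hypothetical stand-ins: in the real PART, that step builds and prunes a partial C4.5 tree and reads off the rule for its largest leaf, so treat this as an outline rather than Weka’s implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of the separate-and-conquer loop (hypothetical types).
// An instance is a double[]; a "rule" is just a predicate saying which
// instances it covers.
public class SeparateAndConquerSketch {

    // Stand-in for PART's real step: build and prune a (partial) decision
    // tree on the remaining instances and read off the rule for its
    // largest leaf. Here it is only a placeholder.
    static Predicate<double[]> ruleFromPrunedTree(List<double[]> remaining) {
        return x -> true;
    }

    static List<Predicate<double[]>> separateAndConquer(List<double[]> data) {
        List<double[]> remaining = new ArrayList<>(data);
        List<Predicate<double[]>> ruleSet = new ArrayList<>();
        while (!remaining.isEmpty()) {
            Predicate<double[]> rule = ruleFromPrunedTree(remaining); // make a rule
            ruleSet.add(rule);
            remaining.removeIf(rule);  // remove the instances it covers...
        }                              // ...and carry on with what is left
        return ruleSet;
    }
}
```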

The second method is called RIPPER; in fact, in Weka it’s called JRip. It’s a basic incremental reduced-error pruning algorithm; there’s a class of algorithms that go by this name. The idea is that PRISM is good at producing rules, but it produces exact rules. Typically, we want to produce rules that are not necessarily exact, but merely very good.

What incremental reduced-error pruning does is to take the instances, the training set, and split them into two sets, one called Grow and one called Prune, in the ratio 2:1. It uses the Grow set for growing rules, adding clauses to a rule until you get a perfect rule. Then it uses the Prune set for pruning rules, deleting clauses from the rule until you’re left with a good rule. For each class, while there are instances of that class in both these sets, we’re going to use PRISM to create the best perfect rule for that class. Then we’re going to calculate the worth of the rule. Now, we need some measure of the “worth” of a rule.
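
To make the split concrete, here is one way to carve a training set into Grow and Prune sets in the ratio 2:1 using Weka’s Instances class; the file path is an assumption, and this is just an illustration of the split, not JRip’s internal code.

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: split a training set 2:1 into Grow and Prune sets, as
// incremental reduced-error pruning requires. The path is hypothetical.
public class GrowPruneSplit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        data.randomize(new Random(1));         // shuffle before splitting
        Instances grow  = data.trainCV(3, 0);  // two thirds: grow rules here
        Instances prune = data.testCV(3, 0);   // one third: prune rules here

        System.out.println("Grow: " + grow.numInstances()
                         + "  Prune: " + prune.numInstances());
    }
}
```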

There are different ways of measuring the worth of a rule, and different incremental reduced-error pruning algorithms do different things. For example, you might just use the success rate, or you might use something more complicated, perhaps even some entropy metric. Anyway, whatever you do, let’s assume you’ve got a way of measuring the worth of a rule. We calculate the worth of the rule, then omit the final condition – the last one that we added – and look at the worth of that shorter rule. If the rule is at least as good without it, we take away that final condition, and we carry on trying to remove conditions from the rule until we’re left with the best version of the rule.
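
Here is a toy Java sketch of one possible worth measure, the rule’s success rate on the Prune set, together with the backward pruning loop just described. The rule representation (a list of conditions as predicates over a double[] instance, with the class value stored in the last element) is something I have made up for illustration; it is not how Weka represents rules internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy sketch of reduced-error pruning of a single rule (simplified types).
// An instance is a double[] whose last element is its class; a rule is a
// conjunction of conditions plus a predicted class.
public class RulePruningSketch {

    // One possible "worth" measure: the rule's success rate on the Prune
    // set, i.e. the fraction of the instances it covers that it gets right.
    static double worth(List<Predicate<double[]>> conds, double predictedClass,
                        List<double[]> pruneSet) {
        int covered = 0, correct = 0;
        for (double[] inst : pruneSet) {
            if (conds.stream().allMatch(c -> c.test(inst))) {
                covered++;
                if (inst[inst.length - 1] == predictedClass) correct++;
            }
        }
        return covered == 0 ? 0.0 : (double) correct / covered;
    }

    // Prune backwards: keep dropping the final condition (the last one
    // added) for as long as the shorter rule is at least as good on the
    // Prune set. Stop at one condition so the rule never becomes empty.
    static void pruneBack(List<Predicate<double[]>> conds, double predictedClass,
                          List<double[]> pruneSet) {
        while (conds.size() > 1) {
            double withLast = worth(conds, predictedClass, pruneSet);
            List<Predicate<double[]>> shorter =
                new ArrayList<>(conds.subList(0, conds.size() - 1));
            double withoutLast = worth(shorter, predictedClass, pruneSet);
            if (withoutLast >= withLast) {
                conds.remove(conds.size() - 1);  // shorter rule is no worse: keep pruning
            } else {
                break;                           // removing the condition hurts: stop
            }
        }
    }
}
```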

So we build up the rule on the Grow set, and then we prune it back on the Prune set until we get a rule whose worth is good. It turns out it’s better to prune backwards than to prune on the way forwards. Again, it sounds a bit wasteful, but it’s a good idea to build up the whole rule and then prune backwards. Then we just carry on: we select the rule with the largest worth, prune it, and remove the instances it covers, carrying on with the basic covering algorithm. RIPPER follows this with a fiendishly complicated global optimization step that’s really detailed, really complex, not really very principled, but works really well.

I’m not going to tell you about that. It’s just not worthwhile – you’d never remember it; it’s just too hard – I mean, I don’t remember it. It’s just really complicated. But this is the basic kind of incremental reduced-error pruning algorithm that RIPPER uses to generate the rule set in the first place. All right, let’s go to Weka. I’ve loaded the diabetes dataset. I’m going to try J48, PART, and JRip. So here we are in Weka.

Here’s the diabetes dataset: 768 instances. I go to Classify, and I’ve already run J48. This is the result from J48.

I’ve got a decision tree here, quite a complicated decision tree: it’s got 20 leaves and a total of 39 nodes. It gets 74% accuracy. PART produces a rule set that looks like this. See these rules – “If plas <= 127 and mass <= 26.4, etc. then tested_negative” (in this dataset, there are two classes, negative and positive). There’s a rule for negative, and a rule for positive, and so on. This rule set has 13 rules, involving 25 separate tests.
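
If you would rather drive the same comparison from the Weka Java API instead of the Explorer, a sketch like the following works. The file path is an assumption, and the exact percentages depend on the cross-validation seed, but you should see figures close to the ones quoted above.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Run J48 and PART on the diabetes data with 10-fold cross-validation,
// then build each on the full dataset to inspect the tree and rule list.
public class TreesVersusRules {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] schemes = { new J48(), new PART() };
        for (Classifier scheme : schemes) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(scheme, data, 10, new Random(1));
            System.out.printf("%s: %.1f%% correct%n",
                    scheme.getClass().getSimpleName(), eval.pctCorrect());

            scheme.buildClassifier(data);  // refit on all the data...
            System.out.println(scheme);    // ...and print the model itself
        }
    }
}
```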

We get 75% accuracy. RIPPER does really well, 76% accuracy, there at the top. It has only 4 rules. Amazing, eh? In fact, going back to the slide, here are the results, and here are the 4 rules. Actually, RIPPER starts out by taking the majority class – in this case tested_negative – and leaving that to the end. So it only produces rules for the other classes, and then leaves the majority class for the default clause like this. So tested_positive is the smaller class, and tested_negative is the larger class. These are the rules it’s come up with. Only 4 rules, 9 tests, and the best performance of all. That’s pretty amazing. PART is quick and quite an elegant algorithm, really.
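
JRip slots into exactly the same pattern; here is a minimal sketch (again, the path is an assumption). Printing the built model shows the rule list, ending with the default rule for the majority class, tested_negative.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Evaluate JRip (Weka's RIPPER) on the diabetes data, then print its rules.
public class JRipOnDiabetes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        JRip ripper = new JRip();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ripper, data, 10, new Random(1));
        System.out.printf("JRip: %.1f%% correct%n", eval.pctCorrect());

        ripper.buildClassifier(data);
        System.out.println(ripper);  // rule list plus the number of rules
    }
}
```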

Repeatedly constructing decision trees and discarding them is less wasteful than it sounds. Incremental reduced-error pruning is a standard technique.

RIPPER does incremental reduced-error pruning followed by a global optimization step: it usually produces fewer rules than PART.

Generating good decision rules

Here are a couple of schemes for rule learning. The first, called PART, is a way of forming rules from partial decision trees. The second, called Ripper (JRip in Weka), uses incremental reduced-error pruning, followed by a fiendishly complicated global optimization step that’s detailed, complex, unprincipled, but works well and often generates unbelievably small rule sets that do an excellent job. I’ll spare you the details – you really don’t want to know! Both methods are easy to use in Weka.
