This content is taken from the University of Waikato's online course, More Data Mining with Weka.

Skip to 0 minutes and 11 seconds Association rules are about finding associations between attributes. Between any attributes. There’s no particular class attribute. Rules can predict any attribute, or indeed any combination of attributes. For this we need a different kind of algorithm. The one that we use in Weka, the most popular association rule algorithm, is called Apriori. I don’t know if you remember the weather data from Data Mining with Weka. Here’s this little dataset with 14 instances and a few attributes. Well, here are some association rules. “If outlook=overcast, then play=yes.” If you look at that, there are 4 “overcast” instances, and it’s “yes” for all of

Skip to 0 minutes and 54 seconds them: that rule is 100% correct. “If temperature=cool, then humidity=normal”; that’s also 100% correct. “If outlook=sunny and play=no, then humidity=high.” We don’t have to predict “play” or indeed any particular attribute. If you look at rule #4, “outlook=sunny and play=no,” the first 2 instances satisfy that rule, and there are no other instances that satisfy that rule. So it’s 100% correct, but it only covers 2 instances. There are lots of 100% correct rules for the weather data. I think there are 336 rules that are 100% correct. Somehow we need to discriminate between these rules. The way we’re going to do this is to look at the “support”, the number of instances that satisfy a rule.

Skip to 1 minute and 44 seconds The “confidence” is the proportion of the instances matching a rule’s left-hand side for which the conclusion also holds, and the “support” is the number of instances that satisfy the whole rule. Here I’ve got the same rules. They all have 100% confidence, but they’ve got different degrees of support, different numbers of instances. We’re looking for high support/high confidence rules, but we don’t really want to specify 100% confidence and look for all of those rules, because, like I said, there are hundreds of them and a lot of them have very low support. Typically what we do is specify a minimum degree of confidence and seek the rules with the greatest support with that minimum confidence. I want to introduce you to the idea of an “itemset”.
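To make the two measures concrete, here is a minimal Python sketch that counts support and confidence for two of the rules quoted above. It is not Weka code. The rows are the nominal weather data as distributed with Weka, typed in by hand, so check them against your own copy; the helper names `count`, `support` and `confidence` are made up for this illustration.

```python
# The 14-instance nominal weather data (typed in by hand; verify against your copy).
ATTRIBUTES = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
DATA = [dict(zip(ATTRIBUTES, row)) for row in ROWS]


def count(conditions):
    """Number of instances matching every attribute=value pair in `conditions`."""
    return sum(all(inst[a] == v for a, v in conditions.items()) for inst in DATA)


def support(lhs, rhs):
    """Instances that satisfy the whole rule: left-hand side and conclusion."""
    return count({**lhs, **rhs})


def confidence(lhs, rhs):
    """Of the instances matching the left-hand side, the fraction where rhs holds."""
    return support(lhs, rhs) / count(lhs)


# "If outlook=overcast, then play=yes": support 4, confidence 1.0
print(support({"outlook": "overcast"}, {"play": "yes"}),
      confidence({"outlook": "overcast"}, {"play": "yes"}))

# "If temperature=cool, then humidity=normal": support 4, confidence 1.0
print(support({"temperature": "cool"}, {"humidity": "normal"}),
      confidence({"temperature": "cool"}, {"humidity": "normal"}))
```

Representing each side of a rule as a small dictionary keeps the counting to a single function; a real implementation would index the data rather than rescan it for every rule.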

Skip to 2 minutes and 29 seconds An itemset is a set of attribute-value pairs, like “humidity=normal and windy=false and play=yes”. An itemset has got a certain support given a dataset. Here there are 4 instances in the dataset that are in that itemset. We can take that itemset and permute it in 7 different ways to produce rules, all of which have a support of 4. “If humidity=normal and windy=false then play=yes” has a support of 4 and a confidence of 4/4 – that’s 100% – because all of the instances for which humidity=normal and windy=false have play=yes. As we go down this list of rules, we get a lower degree of confidence.

Skip to 3 minutes and 17 seconds The last rule, for example, doesn’t have anything on the left-hand side: “anything implies humidity=normal, windy=false and play=yes” has a support of 4, but there are 14 instances that satisfy the left-hand side. All of the instances satisfy the left-hand side, so the confidence is 4/14. You can see that as you go down this list of rules, the confidence is decreasing from 100% through 4/6 (67%) down to quite a low value, 4/14. What Apriori does is generate high-support itemsets. Then, given an itemset, it gets all the rules from it, and just takes those with more than a minimum specified degree of confidence.
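The 7 rules from that one itemset can be enumerated mechanically: pick any proper subset of the items as the left-hand side (including the empty set) and put the rest on the right. The sketch below does this in Python as an illustration only, not as Weka's implementation. The weather rows are typed in by hand, and `count` and `rules_from_itemset` are hypothetical helper names.

```python
from itertools import combinations

# The 14-instance nominal weather data (typed in by hand; verify against your copy).
ATTRIBUTES = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
DATA = [dict(zip(ATTRIBUTES, row)) for row in ROWS]

ITEMSET = {"humidity": "normal", "windy": "false", "play": "yes"}


def count(conditions):
    """Number of instances matching every attribute=value pair in `conditions`."""
    return sum(all(inst[a] == v for a, v in conditions.items()) for inst in DATA)


def rules_from_itemset(itemset):
    """All rules lhs -> rhs where lhs and rhs split the itemset and rhs is non-empty."""
    items = list(itemset.items())
    itemset_support = count(itemset)  # the same for every rule built from this itemset
    rules = []
    for lhs_size in range(len(items)):  # 0 .. n-1, so the right-hand side is never empty
        for lhs_items in combinations(items, lhs_size):
            lhs = dict(lhs_items)
            rhs = {a: v for a, v in items if a not in lhs}
            rules.append((lhs, rhs, itemset_support, itemset_support / count(lhs)))
    # Highest confidence first, as in the list discussed above.
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


for lhs, rhs, sup, conf in rules_from_itemset(ITEMSET):
    print(f"{lhs} => {rhs}  support={sup}  confidence={sup}/{count(lhs)} ({conf:.0%})")
```

An empty left-hand side matches all 14 instances, which is why the last rule comes out at 4/14.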

Skip to 4 minutes and 10 seconds The strategy is to iteratively reduce the minimum support until the required number of rules is found with a given minimum confidence. That’s it for this lesson. There are far more association rules than classification rules. We need different techniques. The “support” and “confidence” are two important measures. Apriori is the standard algorithm, and I just want to show you that algorithm over here in Weka. In order to use Apriori, I go to the Associate panel. There are a few association rule algorithms, of which by far the most popular is Apriori; that’s the default one. Then I just run that to get association rules.

Skip to 4 minutes and 53 seconds We want to specify the minimum confidence value and seek rules with the most support, and the details of that are in the next lesson.
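The overall strategy (fix a minimum confidence, then keep lowering the minimum support until enough rules turn up) can be sketched as follows. This is a toy illustration under stated assumptions, not Weka's Apriori: it enumerates every possible itemset by brute force, whereas the real algorithm builds larger itemsets only from smaller ones that already have enough support. The rows are typed in by hand, and `all_itemsets`, `rules_from_itemset` and `find_rules` are hypothetical names.

```python
from itertools import combinations, product

# The 14-instance nominal weather data (typed in by hand; verify against your copy).
ATTRIBUTES = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
DATA = [dict(zip(ATTRIBUTES, row)) for row in ROWS]


def count(conditions):
    """Number of instances matching every attribute=value pair in `conditions`."""
    return sum(all(inst[a] == v for a, v in conditions.items()) for inst in DATA)


def all_itemsets():
    """Brute force: every set of attribute=value pairs, each attribute used at most once."""
    values = {a: sorted({inst[a] for inst in DATA}) for a in ATTRIBUTES}
    for size in range(1, len(ATTRIBUTES) + 1):
        for attrs in combinations(ATTRIBUTES, size):
            for vals in product(*(values[a] for a in attrs)):
                yield dict(zip(attrs, vals))


def rules_from_itemset(itemset, itemset_support, min_confidence):
    """Rules built from one itemset that reach the minimum confidence."""
    items = list(itemset.items())
    for lhs_size in range(len(items)):  # right-hand side is never empty
        for lhs_items in combinations(items, lhs_size):
            lhs = dict(lhs_items)
            conf = itemset_support / count(lhs)
            if conf >= min_confidence:
                rhs = {a: v for a, v in items if a not in lhs}
                yield lhs, rhs, itemset_support, conf


def find_rules(num_rules, min_confidence):
    """Lower the minimum support one instance at a time until enough rules are found."""
    itemsets = [(s, count(s)) for s in all_itemsets()]
    for min_support in range(len(DATA), 0, -1):
        rules = [rule
                 for itemset, sup in itemsets if sup >= min_support
                 for rule in rules_from_itemset(itemset, sup, min_confidence)]
        if len(rules) >= num_rules:
            return min_support, sorted(rules, key=lambda r: (-r[3], -r[2]))
    return 0, []


min_support, rules = find_rules(num_rules=3, min_confidence=1.0)
print("stopped at minimum support", min_support)
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} => {rhs}  support={sup}  confidence={conf:.0%}")
```

Asking for 3 rules at 100% confidence on these rows should stop at a minimum support of 4 and return the three support-4 rules mentioned in this lesson: outlook=overcast gives play=yes, temperature=cool gives humidity=normal, and humidity=normal with windy=false gives play=yes.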

Association rules

Association rule learners find associations between attributes. Between any attributes: there’s no particular class attribute. Rules can predict any attribute, or indeed any combination of attributes. To find them we need a different kind of algorithm. “Support” and “confidence” are two measures that are used to evaluate and rank rules. The most popular association rule learner, and the one used in Weka, is called Apriori.
