3.12

The University of Waikato

Skip to 0 minutes and 11 secondsWe left off last lesson looking at an itemset with 3 items and a support of 4, which means that there are 4 instances in the dataset for which those conditions are true. This itemset can be massaged into 7 possible rules by choosing different things on the left- and right-hand side. I said that the strategy of Apriori is to specify the minimum confidence level, and then iteratively reduce the support until enough rules are found with greater than this confidence. These rules have got support 4 and confidence values ranging from 100%, 4/4, to whatever 4/14 is. For the weather data, Apriori will first generate itemsets with support 14. There aren’t any of those.

Skip to 1 minute and 0 secondsIf there were, it would find rules in them with greater than the minimum confidence level; the default for Weka is 90% confidence level. Since there weren’t any itemsets with support 14, it would decrease the support to 13 and carry on decreasing it until it had found sufficient rules to satisfy the specified number of rules. Actually, the weather data has 336 rules with confidence of 100%. The reason why it works in this slightly crazy way is that if you started looking at high confidence rules, then you’d find huge numbers of very low support, very high confidence rules. From a large dataset, you’d have massive numbers of 100% confidence rules that weren’t very interesting because they had tiny support.

Skip to 1 minute and 46 secondsThat’s why it does this. Let’s go over to Weka. I’ve opened the weather data, the 14-instance weather data. I’m going to go to Associate and run Apriori, that’s the default association rule learner. This is the output I get.

Skip to 2 minutes and 7 secondsHere are the 10 rules; the default number of rules is 10. You can see this is the support of these rules, and it ranges from 4 down to 3 down to 2. These last 2 rules only have 2 instances in the dataset that satisfy those rules. The rules are all 100% confident. Let’s go back to the slide here. We specify the minimum confidence level; the default is 90%. We specify how many rules we want; the default is 10. We express the support as the proportion of the number of instances. Then we run the Apriori algorithm several times. We start at 100% support, and decrease it by 5% each time.

Skip to 2 minutes and 53 secondsWe stop when we get the right number of rules with more than the minimum confidence level, or we would stop when we reached another parameter, lowerBoundMinSupport. So there are quite a lot of parameters here. Let’s take a look at how this works for the weather data. On the output that I just showed you, it said that the minimum support of the top – it specifies the minimum support – is 0.15. 0.15 times 14 instances is 2 – 0.15 is the proportion of the total number of instances. The minimum confidence is 0.9; that was set as the default parameter. It actually performed 17 cycles, reducing the support each time.

Skip to 3 minutes and 37 secondsJust looking down at the bullet point underneath, the 17 cycles corresponded to having a support of 100%, and then it reduces by 5% each time – 95%, 90%, 85%, and so on. It actually got right down to 15%. When you translate those percentages into a number of instances, it was using 14 instances, 13 instances, then it did it again with 13 instances. It’s a little crazy on this tiny dataset – it’s doing a bit of extra work – but on a large dataset that wouldn’t happen. It got down to 3 instances at the 20% level, and it only found 8 rules with confidence greater than 90% and support 3.

Skip to 4 minutes and 17 secondsThat’s why it was forced to go down to a support of 2.

Skip to 4 minutes and 25 secondsWhat are these itemsets? Well, we can look at the itemsets. The itemsets are based on the final support values. Let’s go back to Weka and have a look at the itemsets. Here are the parameters for Apriori.

Skip to 4 minutes and 42 secondsAs I said before, there are quite a few of them. This is the amount by which the support is reduced each time, by 5%. This is when it stops, when it gets to a support of 10%. It’s looking for rules with a confidence greater than 90%. It’s looking for 10 rules. It’s starting as a support of 100%, which is normally what one does. Here we can output the itemsets, so let’s output the itemsets, and run it again. We’ve got the same rules, and here are the itemsets. These are the itemsets with support of at least 2. The way it works is it starts with itemsets with just 1 item, and these are the support for each of these itemsets.

Skip to 5 minutes and 37 secondsThen it adds new conditions to these to generate 2-item itemsets. This L1 here, these are itemsets with 1 item, and down here, these are itemsets with 2 items. That’s how it generates the itemsets. It’s a little bit convoluted, and it does this for efficiency reasons for large datasets. Here are the itemsets with 3, and here are the itemsets with 4 conditions in them. In fact, there weren’t any itemsets with more than that with this support level. Coming back to the slide, there were 12 1-item sets with support of at least 2. There were 47 2-item sets, 39 3-item sets, 6 4-item sets, and 0 5-item sets, which is where it stopped.

Skip to 6 minutes and 34 secondsThat’s how it goes through this, and for each itemset it converts it into rules and looks for rules with greater than the minimum confidence. Here it ended up with the 10 rules that we saw before. It’s a little bit complicated, but as I say, it’s done for efficiency reasons. There are some other parameters. In the Weka implementation, “car”, class attributes, always produces rules that predict the class attribute. You can filter rules according to a statistical test, a Chi-squared statistical test, but that’s actually unreliable because we’re making a very large number of statistical tests here and significant results will be found just by chance. We’ve talked about confidence, but there are different metrics that can be used for ranking rules.

Skip to 7 minutes and 21 secondsYou can also remove all attributes whose values are all “missing”. Those are extra parameters. You’re going to look at the supermarket data and do some market basket analysis. This data was collected from an actual New Zealand supermarket.

Skip to 7 minutes and 35 secondsLet’s just go and have a quick look at this: supermarket.arff. Here it is.

Skip to 7 minutes and 43 secondsYou can see there are departments here: there’s baking needs, and coupons, and tea, and biscuits – very popular in New Zealand – frozen foods, razor blades, gardening aids, spices. A large number of attributes, 217 attributes, and 4,600 instances in this dataset. There are 1 million attribute values if you multiply those numbers together. In this dataset, missing values are used to indicate that the basket did not contain that item. In fact, 92% of the values are missing. That means that the average basket contains 220 attributes times 8%, that’s only 18 items – the average number of items in your supermarket basket. The most popular items are bread-and-cake, vegetables, frozen foods, and biscuits. That’s Apriori.

Skip to 8 minutes and 36 secondsIt makes multiple passes through the data generating 1-item sets, 2-item sets, and so on, with more than the minimum support. It turns each one into rules and checks their confidence. It’s fast and efficient, providing that the data fits into main memory. Weka invokes the algorithm several times, gradually reducing the support each time until sufficient high-confidence rules have been found. There are many parameters that control this iteration.

Learning association rules

Apriori’s strategy for generating association rules is to specify a minimum “confidence” and iteratively reduce “support” until enough rules are found. Doing this efficiently involves an interesting and subtle algorithm. “Market basket analysis” is often considered a prime candidate for association rule mining, and we look at some data that records the contents of shopping baskets in a real New Zealand supermarket.