4.15

## The University of Waikato

[0:11] There are two ways of making a classifier cost-sensitive. The terminology is a little confusing: the first method is called “cost-sensitive classification” and the second is called “cost-sensitive learning”. For cost-sensitive classification, we adjust a classifier’s output by re-calculating the probability threshold. I’ve opened the german_credit dataset, with 1000 instances, and I’m going to classify it with Naive Bayes. I get this confusion matrix here. With Output predictions set, I can see in the output the actual predictions for the 1000 instances. I’ve written those down here – not all 1000, just every 50th.

[1:05] I’ve got 20 results here: the actual class of the instance, the predicted class, and Naive Bayes’ probability that the instance is a “good” one rather than a “bad” one. I’ve sorted this list by the probability column. In effect, Naive Bayes checks whether the “good” probability is bigger than the “bad” probability, which is the same as asking whether the “good” probability is bigger than 0.5. It’s like drawing a horizontal line at 0.5, between instance number 750 and instance number 800: everything above that line is classified as “good”, and everything below it as “bad”.
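The thresholding that Naive Bayes effectively performs can be sketched as follows. The probabilities here are illustrative stand-ins, not the actual values from the lecture’s table:

```python
# Classify as "good" when P(good) exceeds the threshold; otherwise "bad".
# Illustrative sorted probabilities, not the actual Weka output.
p_good = [0.98, 0.91, 0.83, 0.52, 0.49, 0.21]

threshold = 0.5
predicted = ["good" if p > threshold else "bad" for p in p_good]
print(predicted)  # ['good', 'good', 'good', 'good', 'bad', 'bad']
```

The “horizontal line at 0.5” in the sorted table is exactly this comparison applied instance by instance.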

[1:55] Going back to that “classified as” matrix, the confusion matrix: 605 plus 151 is 756 instances classified as “good”, and 95 plus 149 is 244 classified as “bad”. Within those, if we look at the table of actual classes and count the “bad” ones above the line, we find 151 “bad” instances above the line – those are misclassifications – and 95 “good” ones below the line, which are also misclassifications. We don’t actually have to use a threshold of 0.5. This is exactly the same table of actual and predicted classes, but with the threshold changed to 0.833.
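These tallies can be checked directly from the confusion matrix reported in the lecture (rows are actual classes, columns predicted):

```python
# Confusion matrix for Naive Bayes on german_credit at threshold 0.5,
# keyed as (actual, predicted), using the counts from the lecture.
confusion = {("good", "good"): 605, ("good", "bad"): 95,
             ("bad",  "good"): 151, ("bad",  "bad"): 149}

classified_good = confusion[("good", "good")] + confusion[("bad", "good")]
classified_bad  = confusion[("good", "bad")] + confusion[("bad", "bad")]
errors          = confusion[("bad", "good")] + confusion[("good", "bad")]

print(classified_good, classified_bad, errors)  # 756 244 246
```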

[2:46] That gives me the classification matrix shown here and, using the cost matrix we were talking about in the last lesson – where one kind of error costs 5 times as much as the other – a total cost of 517, versus 850 for the threshold of 0.5 on the previous slide. If you count the instances above the line, there are 501 of them (448 + 53), of which 53 are “bad”. Then count the instances below the line and look at the number of “good” ones there; those are the errors.
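The two total costs can be reproduced from the confusion matrices. The sketch below assumes misclassifying a “bad” instance as “good” costs 5 and the reverse costs 1; the 252 and 247 entries at threshold 0.833 follow from german_credit’s 700 “good” / 300 “bad” class split:

```python
def total_cost(confusion, cost_bad_as_good=5, cost_good_as_bad=1):
    """Total cost = each off-diagonal (error) count times its per-error cost."""
    return (cost_bad_as_good * confusion[("bad", "good")]
            + cost_good_as_bad * confusion[("good", "bad")])

# Confusion matrices at the two thresholds, keyed as (actual, predicted).
at_0_5   = {("good", "good"): 605, ("good", "bad"): 95,
            ("bad",  "good"): 151, ("bad",  "bad"): 149}
at_0_833 = {("good", "good"): 448, ("good", "bad"): 252,
            ("bad",  "good"): 53,  ("bad",  "bad"): 247}

print(total_cost(at_0_5))    # 850
print(total_cost(at_0_833))  # 517
```

Raising the threshold trades cheap errors (rejecting good customers) for fewer expensive ones (accepting bad customers), which is why the total cost drops.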

[3:29] In general, it’s not hard to show that, given a general cost matrix 0, λ, μ, 0 – zeroes on the diagonal, with λ the cost of misclassifying a “good” instance and μ the cost of misclassifying a “bad” one – you minimize the expected cost by classifying an instance as “good” whenever its probability exceeds μ/(λ + μ), which is where the 0.833 for this problem comes from. That’s what you do for Naive Bayes, but what about methods that don’t produce probabilities? Well, they almost all do produce probabilities. Let’s look at J48. Imagine J48 with minNumObj set to 100; I’ve done this to force a small tree. I won’t run it for you, but I’d get the tree shown here. If you look at the tree, its leaves have effectively got probabilities.
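The threshold falls out of comparing the two expected costs; for this problem’s costs of 1 and 5 it gives the 0.833 used above:

```python
# With p = P(good), the expected cost of each prediction is:
#   predict "good": mu  * (1 - p)   (pay mu when the instance is actually bad)
#   predict "bad":  lam * p         (pay lam when the instance is actually good)
# Predicting "good" is cheaper when mu * (1 - p) < lam * p,
# i.e. when p > mu / (lam + mu).
lam, mu = 1, 5
threshold = mu / (lam + mu)
print(round(threshold, 3))  # 0.833
```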

[4:16] The leftmost leaf at the bottom is predicting “good”, and there are 37 exceptions – 37 “bad” instances. The “good” probability for this leaf is 1 − 37/108 (108 being the total number of instances that reach the leaf), which is 0.657. You’ll find that number in the list of probabilities in the table on the right. The next leaf is predicting “bad”, with 68 exceptions out of 166, so its “good” probability is 0.410; you’ll see that number in the table too. And so on. We can get probabilities from J48, and from other methods as well. To do this in Weka, we use the CostSensitiveClassifier with minimizeExpectedCost set to true.
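Those two leaf probabilities are just class proportions among the training instances reaching each leaf:

```python
# Leaf probabilities read off the small J48 tree from the lecture:
# a "good" leaf with 37 "bad" exceptions among 108 instances, and
# a "bad" leaf with 68 "good" exceptions among 166 instances.
p_good_leaf1 = 1 - 37 / 108   # "good" leaf
p_good_leaf2 = 68 / 166       # "bad" leaf

print(round(p_good_leaf1, 3))  # 0.657
print(round(p_good_leaf2, 3))  # 0.41
```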

[5:10] I’ve got the credit dataset open. If I just run J48 with that cost matrix, I get a cost of 1027. Over in Weka, I select the CostSensitiveClassifier: Meta > CostSensitiveClassifier. I configure it with the appropriate cost matrix – a 2-by-2 matrix, the one we’ve been using all along, with a 5 there – and then set minimizeExpectedCost to true. That gives us cost-sensitive classification. If I run that with J48 (did I select J48? No; I should have selected J48 here), I get this little matrix here, and a total cost of 770.

[6:18] Back on the slide, that’s the middle section: the cost of 770 with the confusion matrix shown. Actually, J48 isn’t very good at producing probabilities, and it’s advantageous to use bagging, which we talked about in Data Mining with Weka. J48 produces a restricted set of probabilities, but the bagging technique enriches the set of probabilities produced. If you used bagged J48 as the classifier – I won’t do this for you – you’d get a better confusion matrix, with a lower cost of 603, or 0.603 per instance, since there are 1000 instances. That was what we’re calling “cost-sensitive classification”, where you adjust the probability threshold.
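Why bagging enriches the probabilities can be seen schematically: a single tree can only emit one fixed proportion per leaf, while averaging estimates from trees built on bootstrap resamples produces many more distinct values. This is a toy sketch of that idea (one leaf’s class counts, resampled), not Weka’s actual Bagging implementation:

```python
import random

random.seed(1)

# Toy "leaf": 108 instances reach it, of which 71 are "good" (p_good ≈ 0.657),
# matching the leftmost leaf of the J48 tree in the lecture.
labels = ["good"] * 71 + ["bad"] * 37

def leaf_probability(sample):
    """Fraction of "good" labels in a sample of instances reaching the leaf."""
    return sum(1 for y in sample if y == "good") / len(sample)

# A single tree gives exactly one coarse estimate for every instance at this leaf.
single = leaf_probability(labels)
print(round(single, 3))  # 0.657

# Bagging-style estimate: average over trees built from bootstrap resamples,
# which smooths the estimate and widens the set of values the ensemble can emit.
estimates = [leaf_probability([random.choice(labels) for _ in labels])
             for _ in range(10)]
bagged = sum(estimates) / len(estimates)
print(0.0 <= bagged <= 1.0)  # True
```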