This content is taken from the University of Waikato's online course, More Data Mining with Weka.

## The University of Waikato

[0:11] There are two ways of making a classifier cost-sensitive. The terminology is a little bit confusing. The first method is going to be called “cost-sensitive classification” and the second method is going to be called “cost-sensitive learning”. For cost-sensitive classification, what we do is adjust a classifier’s output by re-calculating the probability threshold. I’ve opened the german_credit dataset, with 1000 instances. I’m going to classify this with Naive Bayes. I get this matrix here. If I set Output predictions, which I’ve done, then I can see in the output the actual predictions for the 1000 instances. I’ve written those down here – not all 1000, I’ve just taken every 50th.

[1:05] I’ve got 20 results here: the actual class of the instance, the predicted class of the instance, and Naive Bayes’ probability that the instance is a “good” one rather than a “bad” one. And I’ve sorted this list by the probability column. In effect, Naive Bayes looks to see if the “good” probability is bigger than the “bad” probability, which is the same as asking, “is the good probability bigger than 0.5?” It’s like drawing a horizontal line at 0.5, between instance number 750 and instance number 800. Everything above that line is going to be classified as “good”, and everything below the line is going to be classified as “bad”.

[1:55] Going back to that “classified as” matrix, the confusion matrix, 605 plus 151, that’s 756 instances that are going to be classified as “good”. 95 plus 149, that’s [244], are going to be classified as “bad”. Then, within those, if we were to look at the matrix with the actual classes, and we counted the number of “bad” ones above the line, we’d find 151 “bad” ones above the line – those are misclassifications – and 95 “good” ones below the line, which are also misclassifications. We don’t actually have to use a threshold of 0.5. This is exactly the same table of the actual and predicted classes, but I’ve changed the threshold to 0.833.

[2:46] That gives me the classification matrix that’s shown here, and a total cost – using the cost matrix we were talking about in the last [lesson], where one kind of error costs 5 times the cost of the other kind of error – we get a total cost here of 517, versus 850 for the threshold of 0.5 on the previous slide. You can see that if you count up the numbers above the line, then there are 501 of them (448+53), of which 53 are “bad”. Then count up the number below the line and look at the number of “good” ones there; those are the errors.
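The two totals quoted above can be checked by hand. The following is a minimal sketch in Python, assuming the cost matrix charges 1 for a “good” instance classified as “bad” and 5 for a “bad” instance classified as “good” (which is what the totals 850 and 517 imply). The function name `total_cost` is only for illustration. The confusion matrix for the 0.833 threshold is filled in from the counts given (448 and 53 above the line) together with the 700 “good” and 300 “bad” instances in the dataset.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over every (actual, predicted) cell."""
    return sum(
        confusion[actual][predicted] * cost[actual][predicted]
        for actual in range(len(confusion))
        for predicted in range(len(confusion[actual]))
    )

# Rows are the actual class, columns the predicted class, in the order (good, bad).
COST = [[0, 1],
        [5, 0]]

# Naive Bayes with the default threshold of 0.5.
nb_threshold_050 = [[605, 95],
                    [151, 149]]

# Naive Bayes with the threshold moved to 0.833.
nb_threshold_0833 = [[448, 700 - 448],
                     [53, 300 - 53]]

print(total_cost(nb_threshold_050, COST))   # 850
print(total_cost(nb_threshold_0833, COST))  # 517
```

Moving the threshold produces more errors overall (305 instead of 246), but far fewer of the expensive kind, which is why the total cost drops.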

[3:29] In general, it’s not hard to show that, given a general cost matrix 0, λ, μ, 0, you minimize the expected cost by classifying an instance as “good” when its “good” probability exceeds a threshold of μ/(λ + μ), which is where we got the 0.833 from for this problem: 5/(1 + 5). That’s what you do for Naive Bayes, but what about methods that don’t produce probabilities? Well, they almost all do produce probabilities. Let’s look at J48. Imagine J48 with minNumObj set to 100. I’ve done this to force a small tree. I won’t do it for you, but I’d get the tree shown here. If I look at the tree, the leaves of the tree have effectively got probabilities.
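The threshold formula follows from comparing the expected cost of the two possible decisions. Here is a minimal sketch, assuming λ is the cost of calling a “good” instance “bad” and μ is the cost of calling a “bad” instance “good”; the function names are illustrative only.

```python
from fractions import Fraction

def expected_costs(p_good, lam, mu):
    """Expected cost of predicting "good" and of predicting "bad"."""
    cost_if_good = (1 - p_good) * mu   # wrong only when the instance is really bad
    cost_if_bad = p_good * lam         # wrong only when the instance is really good
    return cost_if_good, cost_if_bad

def threshold(lam, mu):
    """Probability of "good" at which both decisions cost the same."""
    return Fraction(mu, lam + mu)

def classify(p_good, lam, mu):
    """Pick the decision with the smaller expected cost."""
    cost_if_good, cost_if_bad = expected_costs(p_good, lam, mu)
    return "good" if cost_if_good < cost_if_bad else "bad"

t = threshold(1, 5)
print(t, round(float(t), 3))   # 5/6 0.833
print(classify(0.80, 1, 5))    # bad: 0.80 is below 5/6
print(classify(0.90, 1, 5))    # good: 0.90 is above 5/6
print(threshold(1, 1))         # 1/2, the usual 0.5 threshold
```

Setting (1 − p)·μ equal to p·λ and solving for p gives p = μ/(λ + μ), so with equal costs the rule reduces to the familiar 0.5 cut-off.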

[4:16] The leftmost leaf at the bottom is predicting “good”, and there are 37 exceptions, 37 “bad” instances. The “good” probability for this leaf is 1 – 37/108 (the total number of instances that reach that leaf), which is 0.657. You’ll find that [number] in the list of probabilities in the table on the right. The next leaf is predicting “bad”, and there are 68 exceptions out of 166. So the “good” probability for that leaf is 68/166, which is 0.410, and you’ll see that number in the list in the table on the right. And so on. We can get probabilities from J48, and from other methods as well. To do this in Weka, we use the CostSensitiveClassifier with “minimizeExpectedCost = true”.
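The leaf arithmetic can be written out directly. This is a small sketch using the two leaves described above; the helper `good_probability` is hypothetical and not part of Weka.

```python
def good_probability(predicted, total, exceptions):
    """Fraction of the training instances at a leaf that are "good".

    `exceptions` counts the instances at the leaf whose actual class
    differs from the class the leaf predicts.
    """
    if predicted == "good":
        return 1 - exceptions / total
    return exceptions / total

print(round(good_probability("good", 108, 37), 3))  # 0.657
print(round(good_probability("bad", 166, 68), 3))   # 0.41
```

Every instance that reaches the same leaf gets the same probability, so a small tree can only produce a handful of distinct values.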

[5:10] I’ve got the credit dataset open. If I just run J48 with that cost matrix, I get a cost of 1027. Over in Weka, I’m going to select the CostSensitiveClassifier, Meta > CostSensitiveClassifier. I’m going to configure that to have the appropriate cost matrix. I need to put in the cost matrix here, a 2 by 2 cost matrix, and I want the one we’ve been using all along, with a 5 there. Then I want to set minimizeExpectedCost to true. That gives us cost-sensitive classification. If I run that with J48 (did I select J48? No; I should have selected J48 here). Now, if I run that with J48, I get this little matrix here, and a total cost of 770.

[6:18] In fact, back on the slide, that’s the middle section of the slide, the cost of 770 with the confusion matrix that’s shown. Actually, J48 isn’t very good at producing probabilities, and it’s advantageous to use bagging. We talked about bagging in Data Mining with Weka. J48 produces a restricted set of probabilities, but using the bagging technique enriches the set of probabilities produced. If you just used bagged J48 – I won’t do this for you, but if you used that as the classifier – then you’d get a lower cost, a better confusion matrix, with a cost of 603. Or 0.603 per instance, because there are 1000 instances. That was what we’re calling “cost-sensitive classification”, where you adjust the probability threshold.

[7:08] The second method we’re going to call “cost-sensitive learning”, where, instead of adjusting the output of the classifier, the probability threshold, we’re going to learn a different classifier. Here’s a way to think about that. Suppose we created a new dataset by replicating some instances in the old dataset. To simulate the cost matrix we’ve been talking about, suppose we added 4 copies of every “bad” instance. The new dataset would have 700 “good” instances and 1500 “bad” instances. Then we’d rerun, say, J48. When you think about it, that will effectively make errors on the “bad” instances 5 times as expensive as errors on the “good” instances. In practice, we won’t actually copy the instances, we’ll re-weight them internally in Weka.
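The replication idea can be sketched in a few lines. This is only an illustration of the arithmetic in the transcript, assuming each instance is weighted by the cost of misclassifying its class. It does not reproduce how Weka re-weights instances internally, and all names are hypothetical.

```python
# Cost of misclassifying an instance of each class (from the 5-to-1 cost matrix).
MISCLASSIFICATION_COST = {"good": 1, "bad": 5}

# Class counts in the original credit dataset.
counts = {"good": 700, "bad": 300}

# Replication: every instance appears as many times as its misclassification
# cost, so each "bad" instance gets 4 extra copies.
replicated = {cls: n * MISCLASSIFICATION_COST[cls] for cls, n in counts.items()}
print(replicated)  # {'good': 700, 'bad': 1500}

def weighted_errors(errors_by_class):
    """Total weight of the misclassified instances, which equals their total cost."""
    return sum(MISCLASSIFICATION_COST[cls] * n for cls, n in errors_by_class.items())

# Re-weighting: keep one copy of each instance and attach the weight instead.
# With the Naive Bayes errors at threshold 0.5 (95 "good" and 151 "bad"
# misclassified), the weighted error count is the total cost.
print(weighted_errors({"good": 95, "bad": 151}))  # 850
```

A learner that minimizes the weighted error count on the re-weighted data is therefore minimizing total cost on the original data, which is the point of the replication argument.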

[8:01] The way to do this is to use the same classifier, CostSensitiveClassifier, but set minimizeExpectedCost to false. We had it true before, now we’re going to set it to false, which is the default. We’re going to try that with Naive Bayes and J48. Here we are. Let’s use J48 first. We’re going to set minimizeExpectedCost to false, and run that. Now we get a total cost of 658 with this confusion matrix.

[8:37] That corresponds to the middle line on this slide: J48 has a cost of 658. If we were to use Naive Bayes, we’d get a cost of 530; and if we used bagged J48, we’d get a cost of 581. In general, these are a little bit better – certainly for J48, the results of cost-sensitive learning are a little better than the results of cost-sensitive classification that we looked at before. Here’s what we’ve learned. Cost-sensitive classification adjusts a classifier’s output to optimize a given cost matrix. Cost-sensitive learning, on the other hand, learns a new classifier to optimize with respect to a given cost matrix, effectively by duplicating – or, really, internally re-weighting – the instances in accordance with the cost matrix.

[9:33] Both of these are done with the Weka classifier CostSensitiveClassifier; [it] implements both of those with a switch to choose which one to use. And there are ways in Weka to store and load the cost matrix automatically.

# Cost-sensitive classification

There are two different ways to make a classifier cost-sensitive. One is to create the classifier in the usual way, striving to minimize the number of errors rather than their cost – but then adjust its output to reflect the different costs by recalculating the probability threshold used to make classification decisions. This can be done even for methods that don’t use probabilities explicitly. The second is to learn a different classifier, one that takes the costs into account internally rather than by post-processing the output. This can be done by re-weighting the instances in a way that reflects the error costs.