Hello again. You know, the trouble with life is that sometimes everything just comes down to money. In this lesson and the next we’re going to look at counting the cost in data mining applications. What is success? Well, that’s a pretty good question, I suppose. In data mining terms, we’ve looked at the classification rate, measured on a test set, or holdout, or cross-validation. But essentially we’re trying to minimize the number of errors, or maximize the classification rate. In real life, different kinds of errors might have different costs, and minimizing the total errors might be inappropriate. Now we looked at the ROC curve in Class 2, and that shows you the tradeoffs between the different error costs.

But it’s not really appropriate if you actually know the error costs: then, we want to pick a particular point on this ROC curve. We’re going to look at the credit rating dataset, credit-g.arff. (In this dataset, the class value is “good” or “bad”.) It’s worse to class a customer as “good” when they’re “bad” than it is to class a customer as “bad” when they’re “good”. The idea is that if you class someone as “good” when they’re “bad” and you give them a loan, then they’re going to run away with all your money, whereas if you make an error the other way round then you might have an opportunity to rectify it later on.

To tell you the truth, I know nothing about the credit rating industry, but let’s just suppose that’s the case. Furthermore, let’s suppose that the cost ratio is 5 to 1. I’ve got the credit dataset open here, and I’m going to run J48. What I get is an error rate of 29.5%, a success rate of 70–71%. Down here is the confusion matrix. I’ve copied those over here on to this slide. You can see that the cost here, the number of errors, is effectively the 183 plus 112, those off-diagonal elements of the confusion matrix. If errors cost the same amount, that’s a fair reflection of the cost of this confusion matrix.
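[For reference, here is a minimal sketch of the same run using Weka’s Java API – assuming Weka 3.8 on the classpath, credit-g.arff in the working directory, and the Explorer’s default 10-fold cross-validation.]

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PlainJ48 {
  public static void main(String[] args) throws Exception {
    // Load the credit rating data; the class is the last attribute.
    Instances data = DataSource.read("credit-g.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // 10-fold cross-validation of J48, as in the Explorer.
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(new J48(), data, 10, new Random(1));

    System.out.printf("Correctly classified: %.1f%%%n", eval.pctCorrect());
    System.out.println(eval.toMatrixString("=== Confusion Matrix ===\n"));
  }
}
```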

However, if the cost matrix is different, then we need to do a different kind of evaluation. On the Classify panel, we can do a cost-sensitive evaluation. Let me go and do that for you. In the More options menu, we’re going to do a cost-sensitive evaluation. I need to set a cost matrix. This interface is a little weird. I want a 2×2 matrix; I’m going to resize this. Here we’ve got a cost of 1 for both kinds of error, but I want a cost of 5 for this kind of error. Just close that and then run this again. Now I’ve got the same result, the same confusion matrix, but I’ve got some more figures here.

I’ve got a total cost of 1027 and an average cost of 1.027. (There are 1000 instances in this dataset.) Coming back to the slide, the cost here is computed by taking the 183 in the lower left and multiplying it by 5 – because that’s the cost of errors down there – adding the 112 times 1, and getting 1027. For the baseline, let’s go and have a look at ZeroR. I’m going to run ZeroR on this. Here it is. Here I get a cost of 1500. I get this confusion matrix. Over here on the slide, there’s the confusion matrix.
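[In code, the same cost-sensitive evaluation amounts to passing a CostMatrix to the Evaluation constructor – again just a sketch, under the same assumptions as before. Rows of the cost matrix are actual classes and columns are predicted classes, matching the confusion matrix; in credit-g, class 0 is “good” and class 1 is “bad”.]

```java
import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveEval {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("credit-g.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // Rows are actual classes, columns are predicted classes.
    CostMatrix costs = new CostMatrix(2);
    costs.setCell(0, 1, 1.0);   // actual "good", predicted "bad": cost 1
    costs.setCell(1, 0, 5.0);   // actual "bad", predicted "good": cost 5

    // Evaluate J48 as before, but weigh errors by the cost matrix.
    Evaluation eval = new Evaluation(data, costs);
    eval.crossValidateModel(new J48(), data, 10, new Random(1));

    System.out.printf("Total cost: %.0f, average cost: %.3f%n",
        eval.totalCost(), eval.avgCost());
  }
}
```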

And although I’ve only got 300 errors here, they’re expensive errors, they each cost $5, so I’ve got a cost of 1500. This is classifying everything as “good”, because there are more “good” instances than “bad” in this dataset. If I were to classify everything as “bad” the total cost would only be 700. That’s actually better than either J48 or ZeroR. Obviously we ought to be taking the cost matrix into account when we’re doing the classification, and that’s exactly what the CostSensitiveClassifier does. We’re going to take the CostSensitiveClassifier, select J48, define a cost matrix, and see what happens. It’s meta > CostSensitiveClassifier, which is here.
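[To spell out the arithmetic behind those two baselines: credit-g contains 700 “good” instances and 300 “bad” ones, so classifying everything as “good” incurs 300 × $5 = $1500, whereas classifying everything as “bad” incurs 700 × $1 = $700.]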

I can define a classifier: I’m going to choose J48, which is here. I need to specify my cost matrix. I want it 2×2; I’ll need to resize that. I want to put a 5 down here. Cool. I’m just going to run it.
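[The equivalent setup in code – again a sketch under the same assumptions – wraps J48 in weka.classifiers.meta.CostSensitiveClassifier.]

```java
import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveJ48 {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("credit-g.arff");
    data.setClassIndex(data.numAttributes() - 1);

    CostMatrix costs = new CostMatrix(2);
    costs.setCell(0, 1, 1.0);   // actual "good", predicted "bad": cost 1
    costs.setCell(1, 0, 5.0);   // actual "bad", predicted "good": cost 5

    // Wrap J48 so the costs are taken into account when building the model.
    CostSensitiveClassifier csc = new CostSensitiveClassifier();
    csc.setClassifier(new J48());
    csc.setCostMatrix(costs);

    // Evaluate with the same cost matrix to get total and average cost.
    Evaluation eval = new Evaluation(data, costs);
    eval.crossValidateModel(csc, data, 10, new Random(1));
    System.out.printf("Correct: %.1f%%, total cost: %.0f%n",
        eval.pctCorrect(), eval.totalCost());
  }
}
```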

Now I get a worse classification rate. We’ve only got 60–61% accuracy, but we’ve got a smaller cost, 658. And we’ve got a different confusion matrix. Back here on the slide you can see that. The old confusion matrix looked like this, and the new confusion matrix is the one [below]. You can see that the number of expensive errors has been reduced from 183 to 66. That brings the average cost down to 0.66 per instance instead of 1.027, despite the fact that we now have a worse classification rate. Let’s look at what ZeroR does with the CostSensitiveClassifier. It’s kind of interesting because we’re going to get a different rule.

Instead of classifying everything as “good”, we’re going to classify everything as “bad”. We’re going to make 700 mistakes, but they’re cheap mistakes. It’s only going to cost us $700. That’s what we’ve learned today. Is classification accuracy the best measure? Very likely it isn’t, because in real life different kinds of errors usually do have different costs. If you don’t know the costs, you might just want to look at the tradeoff between the error costs – different parts of the space – and the ROC curve is appropriate for that.

But if you do know the costs – the cost matrix – then you can do cost-sensitive evaluation to find the total cost on the test set of a particular learned model; or you can do cost-sensitive classification, that is, take the costs into account when producing the classifier.

The CostSensitiveClassifier does this: it makes any classifier cost-sensitive. How does it do this? Very good question. We’re going to find out in the next lesson.

Counting the cost

So far we’ve taken the classification rate – computed on a test set, or holdout, or cross-validation – as the measure of a classifier’s success. We’re trying to maximize the classification rate, that is, minimize the number of errors. But in real life, different kinds of error often have different costs. If the costs are known, they can be taken into account when evaluating a classifier’s performance. Error costs can also be taken into account when using a learning method to create a classifier – regardless of which learning method is used – to get a classifier that minimizes the cost rather than the error rate.


This video is from the free online course More Data Mining with Weka, by The University of Waikato.