Counting the cost

If errors have different costs, "classification rate" is inappropriate. Ian Witten shows how to take account of cost when measuring performance.
Hello again. You know, the trouble with life is that sometimes everything just comes down to money. In this lesson and the next we’re going to look at counting the cost in data mining applications. What is success? Well, that’s a pretty good question, I suppose. In data mining terms, we’ve looked at the classification rate, measured on a test set, or holdout, or cross-validation. But essentially we’re trying to minimize the number of errors, or maximize the classification rate. In real life, different kinds of errors might have different costs, and minimizing the total errors might be inappropriate. Now we looked at the ROC curve in Class 2, and that shows you the different tradeoffs between the different error costs.
But it’s not really appropriate if you actually know the error costs: then, we want to pick a particular point on this ROC curve. We’re going to look at the credit rating dataset, credit-g.arff. It’s worse to class a customer as “good” when they’re “bad” than it is to class a customer as “bad” when they’re “good”. (In this dataset, the class value is “good” or “bad”.) The idea is that if you class someone as “good” when they’re “bad” and you give them a loan, then they’re going to run away with all your money, whereas if you make an error the other way round then you might have an opportunity to rectify it later on.
To tell you the truth, I know nothing about the credit rating industry, but let’s just suppose that’s the case. Furthermore, let’s suppose that the cost ratio is 5 to 1. I’ve got the credit dataset open here, and I’m going to run J48. What I get is an error rate of 29.5%, a success rate of 70–71%. Down here is the confusion matrix. I’ve copied those over here on to this slide. You can see that the cost here, the number of errors, is effectively the 183 plus 112, those off-diagonal elements of the confusion matrix. If errors cost the same amount, that’s a fair reflection of the cost of this confusion matrix.
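As a rough check on the equal-cost figures just quoted, here is a minimal Python sketch that counts errors straight from the confusion matrix. The two off-diagonal counts (112 and 183) are the ones read off the slide; the diagonal counts are an assumption, filled in so that the rows add up to 700 “good” and 300 “bad” customers, which is the class split the lesson uses later on.

```python
# Rows are actual classes, columns are predicted classes, ordered (good, bad).
# Off-diagonal counts come from the lesson; diagonal counts are assumed so
# that the rows sum to 700 "good" and 300 "bad" instances.
confusion = [
    [588, 112],  # actual good: predicted good, predicted bad
    [183, 117],  # actual bad:  predicted good, predicted bad
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))

# With equal error costs, every off-diagonal instance counts once.
errors = total - correct
error_rate = errors / total
success_rate = correct / total

print(f"errors={errors}, error rate={error_rate:.1%}, success rate={success_rate:.1%}")
```

Only the off-diagonal cells matter for the error count, so the assumed diagonal affects the success rate but not the 183 plus 112 total.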
However, if the cost matrix is different, then we need to do a different kind of evaluation. On the Classify panel, we can do a cost-sensitive evaluation. Let me go and do that for you. In the More options menu, we’re going to do a cost-sensitive evaluation. I need to set a cost matrix. This interface is a little weird. I want a 2×2 matrix; I’m going to resize this. Here we’ve got a cost of 1 for both kinds of error, but I want a cost of 5 for this kind of error. Just close that and then run this again. Now I’ve got the same result, the same confusion matrix, but I’ve got some more figures here.
I’ve got a total cost of 1027 and an average cost of 1.027. (There are 1000 instances in this dataset.) Coming back to the slide, the cost here is computed by taking the 183 in the lower left and multiplying it by 5 – because that’s the cost of errors down there – and the 112 times 1, adding those up, and I get 1027. If I take the baseline, let’s go and have a look at ZeroR. I’m going to run ZeroR on this. Here it is. Here I get a cost of 1500. I get this confusion matrix. Over here on the slide, there’s the confusion matrix.
And although I’ve only got 300 errors here, they’re expensive errors, they each cost $5, so I’ve got a cost of 1500. This is classifying everything as “good”, because there are more “good” instances than “bad” in this dataset. If I were to classify everything as “bad” the total cost would only be 700. That’s actually better than either J48 or ZeroR. Obviously we ought to be taking the cost matrix into account when we’re doing the classification, and that’s exactly what the CostSensitiveClassifier does. We’re going to take the CostSensitiveClassifier, select J48, define a cost matrix, and see what happens. It’s meta > CostSensitiveClassifier, which is here.
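The three totals quoted so far (1027 for J48, 1500 for ZeroR, 700 for calling everyone “bad”) can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not of what Weka does internally. The helper name `total_cost` is made up for illustration, the zero cost for correct predictions is an assumption, and the diagonal of the J48 confusion matrix is inferred from the 700/300 class split rather than read from the lesson.

```python
# COST[actual][predicted], classes ordered (good, bad).
# Calling a "bad" customer "good" costs 5; the opposite error costs 1.
# Correct predictions are assumed to cost nothing.
COST = [
    [0, 1],
    [5, 0],
]

def total_cost(confusion, cost=COST):
    """Multiply each confusion-matrix cell by its cost and add everything up."""
    return sum(
        count * cost[actual][predicted]
        for actual, row in enumerate(confusion)
        for predicted, count in enumerate(row)
    )

# Rows are actual classes, columns are predicted classes, ordered (good, bad).
matrices = {
    "J48": [[588, 112], [183, 117]],           # diagonal inferred, not from the lesson
    "everything good (ZeroR)": [[700, 0], [300, 0]],
    "everything bad": [[0, 700], [0, 300]],
}

costs = {name: total_cost(matrix) for name, matrix in matrices.items()}

for name, value in costs.items():
    n = sum(sum(row) for row in matrices[name])
    print(f"{name}: total cost {value}, average cost {value / n:.3f}")
```

Laid out this way, it is easy to see why the trivial “everything bad” rule wins: it makes the most errors, but none of them is the expensive kind.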
I can define a classifier: I’m going to choose J48, which is here. I need to specify my cost matrix. I want it 2×2; I’ll need to resize that. I want to put a 5 down here. Cool. I’m just going to run it.
Now I get a worse classification error. We’ve only got 60–61% accuracy, but we’ve got a smaller cost, 658. And we’ve got a different confusion matrix. Back here on the slide you can see that. The old confusion matrix looked like this, and the new confusion matrix is the one [below]. You can see that the number 183 of expensive errors has been reduced to 66. That brings the cost down, the average cost, to 0.66 per instance instead of 1.027, despite the fact that we now have a worse classification rate. Let’s look at what ZeroR does with the CostSensitiveClassifier. It’s kind of interesting because we’re going to get a different rule.
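To see the accuracy and the cost move in opposite directions, the two confusion matrices can be compared side by side in Python. Treat this as a sketch under stated assumptions: the lesson gives only the 66 expensive errors and the total cost of 658 for the cost-sensitive run, so the 328 cheap errors are inferred from 658 minus 5 × 66, and the diagonal counts in both matrices are inferred from the 700/300 class split. The function names are made up for illustration.

```python
# COST[actual][predicted], classes ordered (good, bad); correct
# predictions are assumed to cost nothing.
COST = [[0, 1],
        [5, 0]]

def accuracy(confusion):
    """Fraction of instances on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def total_cost(confusion, cost=COST):
    """Cost-weighted sum over every cell of the confusion matrix."""
    return sum(
        count * cost[a][p]
        for a, row in enumerate(confusion)
        for p, count in enumerate(row)
    )

# Rows are actual classes, columns are predicted classes, ordered (good, bad).
plain_j48 = [[588, 112],
             [183, 117]]
# 66 expensive errors is from the lesson; 328 is inferred from the 658 total.
cost_sensitive_j48 = [[372, 328],
                      [66, 234]]

results = {
    "J48": (accuracy(plain_j48), total_cost(plain_j48)),
    "CostSensitiveClassifier + J48": (
        accuracy(cost_sensitive_j48),
        total_cost(cost_sensitive_j48),
    ),
}

for name, (acc, cost) in results.items():
    print(f"{name}: accuracy {acc:.1%}, total cost {cost}, average {cost / 1000:.3f}")
```

The cost-sensitive run makes about a hundred more errors overall, but it trades expensive errors for cheap ones, which is why its total cost is lower.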
Instead of classifying everything as “good”, we’re going to classify everything as “bad”. We’re going to make 700 mistakes, but they’re cheap mistakes. It’s only going to cost us $700. That’s what we’ve learned today. Is classification accuracy the best measure? Very likely it isn’t, because in real life different kinds of errors usually do have different costs. If you don’t know the costs, you might just want to look at the tradeoff between the error costs – different parts of the space – and the ROC curve is appropriate for that.
But if you do know the costs – the cost matrix – then you can do cost-sensitive evaluation to find the total cost on the test set of a particular learned model; or you can do cost-sensitive classification, that is, take the costs into account when producing the classifier.
The CostSensitiveClassifier does this: it makes any classifier cost-sensitive. How does it do this? Very good question. We’re going to find out in the next lesson.

So far we’ve taken the classification rate – computed on a test set, or holdout, or cross-validation – as the measure of a classifier’s success. We’re trying to maximize the classification rate, that is, minimize the number of errors. But in real life, different kinds of error often have different costs. If the costs are known, they can be taken into account when evaluating a classifier’s performance. Error costs can also be taken into account when using a learning method to create a classifier – regardless of which learning method is used – to get a classifier that minimizes the cost rather than the error rate.

This article is from the free online course More Data Mining with Weka.