
# Evaluating 2-class classification

Threshold curves show different tradeoffs between error types. Ian Witten explains how the area under the ROC curve measures a classifier's accuracy.
10.9
Hello again! At the end of the last lesson we were looking at a two-class dataset where the accuracy on one of the classes was very high and the accuracy on the other class was not very high. But because an overwhelming majority of the instances were in the first class, the overall accuracy looked very high. In this lesson, we’re going to take a closer look at this kind of situation and come up with a more subtle way of evaluating classifiers under these circumstances. Here in Weka I’ve opened the weather data. 14 instances, a simple, artificial dataset, and I’m going to classify it with Naive Bayes. I’ve selected NaiveBayes here, and there it is. I’m interested in the Confusion Matrix.
60.1
In fact, I’ve put it over on the slide here. Here is the confusion matrix. You can see there are “a”s and “b”s, “yes”s and “no”s. There are 7 “a”s that are classified as “a”s, and 2 “a”s that are classified as “b”s, incorrectly. There’s 1 “b” that’s classified as “b” – that’s correct – and 4 “b”s that are classified as “a”, incorrectly. I want to introduce some terminology here. We’re going to talk about “true positives”, those 7 correctly classified “a”s; and “true negatives”, that 1 correctly classified “b”. “False positives” are negative instances that are incorrectly assigned to the positive class. They look like they’re positives, but they’re false. That’s the 4. And “false negatives”, conversely, are positive instances that are incorrectly assigned to the negative class. That’s the 2.
111.6
We’re going to be interested in the “true positive rate”, that is the accuracy on class “a”, which is 7 (the number of true positives), divided by the total size of class “a”, that is 9; and the “false positive rate”, which is the number of false positives, 4, divided by the total number of negative instances, that is 5. That’s 0.80. That’s 1 minus the accuracy on class “b”. The main point of this lesson is that there’s a tradeoff between these things. You can trade off the accuracy on class “a” against the accuracy on class “b”. You can get better accuracy on class “a” at the expense of accuracy on class “b”, and vice versa.
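The arithmetic above can be checked with a few lines of Python. This is only a sketch: the four counts are the ones read off the confusion matrix in the lesson, while the function name `rates` and its parameter names are chosen here for illustration.

```python
def rates(tp, fn, fp, tn):
    """Return (TP rate, FP rate, overall accuracy) from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)              # accuracy on the positive class ("a")
    fp_rate = fp / (fp + tn)              # 1 minus the accuracy on the negative class ("b")
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return tp_rate, fp_rate, accuracy


# Counts from the lesson's confusion matrix:
# 7 "a"s classified as "a", 2 "a"s classified as "b",
# 4 "b"s classified as "a", 1 "b" classified as "b".
tp_rate, fp_rate, accuracy = rates(tp=7, fn=2, fp=4, tn=1)

print(f"TP rate  = {tp_rate:.3f}")   # 0.778, that is 7/9
print(f"FP rate  = {fp_rate:.3f}")   # 0.800, that is 4/5
print(f"Accuracy = {accuracy:.3f}")  # 0.571, that is 8/14
```

Note that the overall accuracy hides how different the two per-class figures are.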
157.4
To show you what I mean, let’s go back to Weka. In the More options menu, I’m going to output the predictions. Let’s just run Naive Bayes again. I’m interested in this table of predictions. These are the 14 instances. For this instance, which is actually a “no”, Naive Bayes had a prediction probability of 92.6% for the “yes” class and 7.4% for the “no” class. These two things add up to 1. Because the probability for the “yes” class was greater than the probability for the “no” class, Naive Bayes predicted “yes”. Incorrectly, as it turns out, because it was actually a “no” – that’s why there’s a plus in this error column. That’s the way Naive Bayes gets all of its predictions.
203.4
It takes the “yes” probability and the “no” probability, and sees which is larger, and predicts a “yes” or a “no” accordingly. Over on the slide, I’ve got the same data, and then I’ve processed it on the right into a simpler table, with just the actual class and the probability of the “yes” class that’s output by Naive Bayes. I’ve sorted the instances in decreasing order of prediction probability. At the top, we’ve got an instance which is actually a “no” that Naive Bayes predicts to be a “yes”, because the prediction probability for “yes” is 0.926, which is way larger than the prediction probability for a “no”, 1 minus that.
247
In fact, if you think about it, it’s like Naive Bayes is drawing a line at the 0.5 point – that horizontal line – and everything above that line it’s predicting to be a “yes”; everything below that line it’s predicting to be a “no”. The true positives are those “yes”s above the line – that’s 7 of them.
274.7
The “yes”s below the line are positive instances that are incorrectly predicted to be “no”s; there are 2 of them. So the “true positive” rate is 7 / 9. Conversely, for the “no” class, things below the line are predicted as a “no”. There’s only one correct prediction there below the line. That’s the very last entry. There are 4 “no”s above the line that are incorrectly predicted to be “yes”s because they are above the line. That gives a false positive rate of 0.8. Like I say, there’s a tradeoff. We could change things if we put the line in a different place. Naive Bayes puts it at 0.5.
313.9
But if we were to move the line from 0.5 (that’s the P line) to 0.75 (that’s the Q line), then we’d have a true positive rate of 5/9 – that’s those 5 “yes”s above the line compared with the 4 “yes”s below the line – and a false positive rate of 0.2. That’s the Q line. We’re going to plot these points on a graph. We’re going to plot the accuracy on class “a” (TP) against 1 minus the accuracy on class “b” (FP). You can see the P and Q points on the graph. Now we can get other points on the graph by putting the line in different places.
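To make the effect of moving the line concrete, here is a minimal sketch that applies two different thresholds to the same predictions. The probabilities below are made-up toy values (8 instances, 4 per class) and are not the Weka output for the weather data; the function name `tp_fp_rates` is also just an illustrative choice.

```python
def tp_fp_rates(labels, yes_probs, threshold):
    """Predict "yes" when P(yes) exceeds the threshold; return (TP rate, FP rate)."""
    tp = sum(1 for y, p in zip(labels, yes_probs) if y == "yes" and p > threshold)
    fp = sum(1 for y, p in zip(labels, yes_probs) if y == "no" and p > threshold)
    n_pos = labels.count("yes")
    n_neg = labels.count("no")
    return tp / n_pos, fp / n_neg


# Hypothetical data, sorted by decreasing P(yes) like the table on the slide.
labels    = ["no", "yes", "yes", "no", "yes", "yes", "no", "no"]
yes_probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]

print(tp_fp_rates(labels, yes_probs, 0.5))   # (0.75, 0.5)
print(tp_fp_rates(labels, yes_probs, 0.75))  # (0.5, 0.25)
```

Raising the threshold lowers the false positive rate, but it lowers the true positive rate too: that is the tradeoff.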
357.8
In the extreme, we could put the line right at the very top above the first instance. That means that we’d be classifying everything as a “no”, which gives us 100% accuracy on the “no” class – that’s an FP rate of 0 – and 0% accuracy on the “yes” class – that’s a TP rate of 0. That’s the 0, 0 point on the graph.
382.3
Then, if we take our horizontal line and move it down the table one instance at a time, we’re going to be moving up along that red line until we get to the top, the upper right-hand corner, which corresponds to a line underneath the whole table where we classify everything as a “yes”, getting 100% accuracy on the “yes” class; and nothing as a “no”, getting 0% accuracy on the “no” class, the “b” class. You can get different tradeoffs between accuracy on class “a” and accuracy on class “b” by putting the line at different points. That’s for a single machine learning method. What about a different machine learning method? Well, different machine learning methods will give you different red lines.
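The procedure of sliding the line down the table can be sketched as follows. The scores are hypothetical toy values, not the Weka output, the name `roc_points` is illustrative, and the sketch assumes that no two instances share the same probability.

```python
def roc_points(labels, yes_probs):
    """Trace (FP rate, TP rate) points by moving the line down the sorted table."""
    ranked = sorted(zip(yes_probs, labels), reverse=True)  # highest P(yes) first
    n_pos = sum(1 for _, y in ranked if y == "yes")
    n_neg = len(ranked) - n_pos

    tp = fp = 0
    points = [(0.0, 0.0)]          # line above the table: everything is a "no"
    for _, actual in ranked:       # move the line below one more instance
        if actual == "yes":
            tp += 1                # a "yes" above the line: the curve steps up
        else:
            fp += 1                # a "no" above the line: the curve steps right
        points.append((fp / n_neg, tp / n_pos))
    return points                  # last point is (1.0, 1.0): everything is a "yes"


# Hypothetical data: 4 "yes" and 4 "no" instances.
labels    = ["no", "yes", "yes", "no", "yes", "yes", "no", "no"]
yes_probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]

points = roc_points(labels, yes_probs)
for fp_rate, tp_rate in points:
    print(f"FP rate {fp_rate:.2f}  TP rate {tp_rate:.2f}")
```

With 8 instances this yields 9 points: one for each position of the line, including the two extremes.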
425
There’s one, the dashed line down a little bit below. That’s actually worse than the Naive Bayes line with the P and the Q on it, because where you want to be is in the top left-hand corner. The top left-hand corner corresponds to perfect accuracy on class “a” and perfect accuracy on class “b”. That’s where you’d like to be. So lines that push up toward that top corner, that top red dotted line, are better. That’s where you want to be. One way of evaluating the overall merit of a particular classifier, say the Naive Bayes one shown in the P–Q line, is to look at the area under the curve. That’s the area shown there.
471.1
If that area is large, then we’re going to get a better classifier evaluated across all the different possible tradeoffs, the different thresholds. The area under the curve is a way of measuring classifier accuracy independent of the particular tradeoff that you happen to choose. Actually, in Weka, you can look at this curve. It’s called a “threshold curve”, and we’re going to visualize the threshold curve for the positive class. That’s what we get. It’s not a smooth curve, it’s a bit of a jagged curve. In fact, we plot the y axis against the x axis – true positive rate against false positive rate – and each of these points corresponds to a particular point in the table.
521.2
There are 13 points, plus 1 at the beginning and 1 at the end; 15 points altogether. The point that I’ve circled there corresponds to a false positive rate of 2/5 and a true positive rate of 5/9. All the other points correspond to different positions of the line in the table. What we want to measure is the area under the curve. It’s called an ROC, “Receiver Operating Characteristic”, curve, for historical reasons. Weka prints out the area under the ROC curve. In this case it’s 0.5778. If we could find a classifier that pushed a bit more up towards the top left, then that would be better, give us a better area.
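One simple way to measure the area under a jagged curve like this is the trapezoid rule, sketched below. This is not necessarily how Weka computes its figure; the points are hypothetical (they do not come from the weather data), and `area_under_curve` is an illustrative name. The points are assumed to be listed in order from the lower left to the upper right.

```python
def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (FP rate, TP rate) points."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2   # vertical steps have zero width
    return area


# Hypothetical jagged curve for a toy classifier.
toy_curve = [
    (0.0, 0.0), (0.25, 0.0), (0.25, 0.25), (0.25, 0.5), (0.5, 0.5),
    (0.5, 0.75), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0),
]

# A curve that goes straight up to the top left-hand corner.
perfect_curve = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(area_under_curve(toy_curve))      # 0.625
print(area_under_curve(perfect_curve))  # 1.0
```

The closer the curve pushes towards the top left, the closer the area gets to 1.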
564.7
And actually, if we were to evaluate J48 – which I won’t do, but it’s very simple – on the same dataset (just run J48 and look at the curve), we’ll get a curve like this, the dashed blue line, which is better. The area under that curve is 0.6333, which is better than Naive Bayes. We’re looking at threshold curves that plot the accuracy of one class against the accuracy on the other class, and that depict the tradeoff between these two things. ROC curves plot the true positive rate against the false positive rate. They go from the lower left to the upper right, and good ones stretch up towards the top left corner.
604.6
In fact, a diagonal line corresponds to a random decision, so you shouldn’t go below the diagonal line. The area under the curve is a measure of the overall quality of a classifier. It turns out that it’s equal to the probability that the classifier ranks a randomly chosen positive test instance above a randomly chosen negative one.
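The ranking interpretation can be checked numerically. The sketch below computes the area in two ways on the same hypothetical toy scores (not the weather data): once by counting positive/negative pairs that are ranked correctly, and once by adding up rectangles under the stepped curve. The function names are illustrative, and the sketch assumes there are no tied probabilities.

```python
def auc_by_ranking(labels, yes_probs):
    """Fraction of (positive, negative) pairs where the positive gets the higher score."""
    pos = [p for y, p in zip(labels, yes_probs) if y == "yes"]
    neg = [p for y, p in zip(labels, yes_probs) if y == "no"]
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))


def auc_by_curve(labels, yes_probs):
    """Area under the stepped ROC curve, accumulated one instance at a time."""
    ranked = sorted(zip(yes_probs, labels), reverse=True)
    n_pos = labels.count("yes")
    n_neg = labels.count("no")
    tp = 0
    area = 0.0
    for _, actual in ranked:
        if actual == "yes":
            tp += 1                              # curve steps up
        else:
            area += (tp / n_pos) * (1 / n_neg)   # curve steps right: add a rectangle
    return area


# Hypothetical data: 4 "yes" and 4 "no" instances.
labels    = ["no", "yes", "yes", "no", "yes", "yes", "no", "no"]
yes_probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]

print(auc_by_ranking(labels, yes_probs))  # 0.625 (10 of the 16 pairs are ranked correctly)
print(auc_by_curve(labels, yes_probs))    # 0.625
```

Both calculations agree because each step to the right on the curve adds one rectangle whose height is the fraction of positives already ranked above that negative.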

In the last lesson we encountered a two-class dataset where the accuracy on one class was high and the accuracy on the other was low. Because the first class contained an overwhelming majority of the instances, the overall accuracy looked high. But life’s not so simple. In practice, there’s a tradeoff between the two error types – a different classifier may produce higher accuracy on one class at the expense of lower accuracy on the other. We need a more subtle way of evaluating classifiers that makes this tradeoff explicit. Enter the ROC curve …

Note: Current versions of Weka have a different interface for outputting predictions from that shown in the video (at 2:42). Instead of selecting “Output predictions”, you now choose PlainText in the “Output predictions” selector. To get the output shown in the video, configure PlainText (double-click it) and set outputDistribution to True.