
# Comparing classifiers

Ian Witten discusses how to reliably compare two classifiers. This involves rejecting the "null hypothesis" at a given statistical significance level.
Hello, and welcome back! We’re going to be using the Experimenter quite a lot in this course. I’m going to show you how to use it to compare classifiers.
Here’s a question: Is J48 better than ZeroR or OneR on the Iris dataset? Of course, we could fire up the Explorer. You know how to do this, so I’m not going to do it for you. We can open the dataset, we can get the results for these three different machine learning methods, and we can see that J48 with 96% cross-validation accuracy is better than OneR, which is better than ZeroR. But the question is, how reliable is this comparison? Things could change if we happened to choose a different random number seed. The Experimenter helps produce more reliable comparisons between datasets and classification algorithms. I’m going to fire up the Experimenter.
I’m going to open the Iris dataset, and use the same 3 classification algorithms, and compare them. Here we are in the Experimenter. I’m going to create a new experiment. I’m going to open a dataset. I’m going to add 3 classification algorithms.
I can reorder these algorithms, by the way. If I select one and go up, and select another one and go down, I can reorder them. I’m going to go to Run and run this. Then I’m going to go to the Analyse panel and click Experiment – that’s important – and then click Perform test. Back to the slides here – that’s what I did. I switched to the Analyse panel and clicked these things and got these results, which look like this, actually. Now, we can see the 3 figures for the 3 classification algorithms on the Iris dataset. We can see that both OneR and ZeroR are worse than J48, just looking at the numbers.
The star (*) means that ZeroR is significantly worse than J48. The absence of the star on OneR means that we cannot be sure that OneR is significantly worse than J48 at the 5% level of statistical significance. In other words, J48 seems better than ZeroR, and we’re pretty sure (5% level) that this is not due to chance. It seems to be better than OneR, but this may be due to chance, we can’t rule it out at the 5% level of statistical significance. Now, I could add a bunch more datasets. In fact, I’ll just go and do that.
I’ll rerun the experiment. It’ll take a little bit of time. Then I’ll analyze the results.
Over here on the slide, these are the results I get. So I can see that at the 5% level of significance J48 is significantly better than both OneR and ZeroR on 3 of the datasets. That’s looking at the stars; the stars mean that those methods are significantly worse than J48. In other words J48 is significantly better than them. It’s significantly better than OneR on breast-cancer and german_credit, and it’s significantly better than ZeroR on the iris and pima_diabetes datasets. So you can see from the table of figures and the stars where the significant results are. Now, what if we wanted to know whether OneR was significantly better than ZeroR?
This slide does not tell us, because on this slide we’re comparing everything to J48.
If we go back to the Experimenter and select something different for the test base: I’m selecting OneR for the test base and performing the test. Now, I’ve got OneR in the first column, and things are being compared with it. Going back to the slide, having changed the test base, I can see that OneR is significantly worse than ZeroR on the german_credit dataset, about the same on the breast-cancer dataset, and significantly better on all the rest of the datasets. Another thing that we can do is change the order of the columns in this matrix. If I go back to the Experimenter and select for the row – currently the Dataset is selected – I’m going to select Scheme for the row.
For the column, currently Scheme is selected, and I’m going to select Dataset for the column. Then perform the test again. Now we get the datasets going along horizontally here, this is the list of datasets; and we get the algorithms going vertically. So I can see whether J48 performs significantly better or worse on the iris dataset than it does, say, on the breast-cancer dataset. What we’ve looked at is comparing classifiers. In statistical terms, people talk about the “null hypothesis”: that one classifier’s performance is the same as another’s. If the result that we observe would be highly unlikely were the null hypothesis true, we reject the null hypothesis.
We reject the hypothesis that they’re the same at the 5% level of statistical significance. So the Experimenter tells you when the null hypothesis is being rejected. Or, equivalently, we can say that A performs significantly better than B at the 5% level. In the Experimenter, we can change the significance level. It’s common to use 5%; 1% for critical applications, maybe medical applications; perhaps 10% for less critical applications. We can change the comparison field. We have used percent correct, but we can change that in the Experimenter, and it’s common to compare over a set of datasets.
We might say that on these datasets, method A has so many wins and so many losses over method B, referring to the number of times A is significantly better than B or B is significantly better than A. There’s a problem you ought to be aware of: the multiple comparison problem. If you make a large number of tests, some of them will appear to be significant just by chance. As usual, this is not an exact science. The interpretation of results requires a certain amount of care.
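
To get a feel for the multiple comparison problem, here is a minimal Python sketch. It assumes that the tests are independent and that no real difference exists in any of them, which is a simplification; the function name `chance_of_false_positive` is made up for this illustration and is not part of Weka.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    looks significant at level `alpha` purely by chance.

    Assumption: every null hypothesis is actually true and the tests
    are independent of one another.
    """
    # Each test avoids a false positive with probability (1 - alpha),
    # so all of them avoid one with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        print(f"{k:3d} tests: {chance_of_false_positive(k):.3f}")
```

With a single test the chance of a spurious star is just the significance level, but it grows quickly as more comparisons are made, which is why a table with many datasets and many algorithms needs careful reading.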

How can you reliably compare two classifiers? Experimental results always depend on the random number sequence – might the conclusion be different if you used a different random number seed? Statisticians talk about the “null hypothesis”, which is that one classifier’s performance is the same as the other’s. We’re usually hoping that the results of an experiment reject the null hypothesis! This involves a certain level of statistical significance: we might reject the hypothesis at the 5% level of statistical significance, meaning that a difference as large as the one observed would be highly unlikely (less than 1 chance in 20) if their performance really were the same.
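
The idea of rejecting the null hypothesis can be sketched with a plain paired t-test in Python. This is only an illustration under stated assumptions: the accuracy figures are invented, the helper names are hypothetical, and the Experimenter's own test is, as far as we know, a corrected variant that this sketch does not reproduce. The critical value 2.262 is the standard two-sided 5% value for 9 degrees of freedom (10 paired results).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples, e.g. accuracies on the same folds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        # All differences identical: no variation to test against.
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


def is_significant(scores_a, scores_b, critical_value=2.262):
    """Reject the null hypothesis of equal performance if |t| is large.

    The default critical value assumes exactly 10 paired results and a
    two-sided test at the 5% level.
    """
    return abs(paired_t_statistic(scores_a, scores_b)) > critical_value


if __name__ == "__main__":
    # Invented accuracies (%) over the same 10 folds, for illustration only.
    method_a = [96, 94, 95, 97, 93, 96, 95, 94, 96, 95]
    method_b = [33, 34, 33, 32, 34, 33, 33, 34, 33, 33]
    method_c = [95, 95, 94, 96, 94, 95, 96, 93, 97, 94]

    print("A vs B significant:", is_significant(method_a, method_b))
    print("A vs C significant:", is_significant(method_a, method_c))
```

Here the large, consistent gap between the first two invented methods leads to rejecting the null hypothesis, while the small and inconsistent gap between the first and third does not, which mirrors the presence or absence of a star in the Experimenter's output.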