How do I evaluate a classifier's performance?
This week is all about evaluation.
Last week you downloaded Weka and looked around the Explorer and a few datasets. You used the J48 classifier. You used a filter to remove attributes and instances. You visualized some data and some classification errors. Along the way you encountered a few datasets: the weather data (both nominal and numeric versions), the glass data, and the iris data.
As you have seen, data mining involves building a classifier for a dataset. (Classification is the most common problem, though not the only one.) Given an example or “instance”, a classifier’s job is to predict its class. But how good is it? Predicting the class of the instances that were used to train the classifier is pretty trivial: you could just store them in a database. But we want to be able to predict the class of new instances, ones that haven’t come up before.
The aim is to estimate the classifier’s performance on new, unseen, instances. Testing using the training data is flawed because it “predicts” data that was used to build the classifier in the first place. We need to go beyond what we see in the training data and predict outcomes – class values – for data that has never been seen before. The only way to know whether a classifier has any value is to test it on fresh data.
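To see why testing on the training data says so little, here is a minimal Python sketch of the "just store them in a database" idea. It is an illustration only, not how Weka works: the function names (train_lookup, percent_correct) and the handful of instances are made up for this example and are not taken from the weather dataset.

```python
def train_lookup(train):
    """'Train' by memorizing every instance: attribute values -> class."""
    table = {attributes: label for attributes, label in train}

    def predict(attributes):
        # Returns None for any instance that was never stored.
        return table.get(attributes)

    return predict


def percent_correct(model, instances):
    """Percentage of instances whose predicted class equals the real class."""
    correct = sum(1 for attributes, actual in instances
                  if model(attributes) == actual)
    return 100.0 * correct / len(instances)


# Hypothetical training instances: ((outlook, temperature), class).
train = [
    (("sunny", "hot"), "no"),
    (("overcast", "hot"), "yes"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "no"),
]

# Hypothetical new instances whose attribute values never occurred in training.
unseen = [
    (("sunny", "mild"), "no"),
    (("overcast", "cool"), "yes"),
]

model = train_lookup(train)
train_accuracy = percent_correct(model, train)    # every stored instance is "predicted" correctly
unseen_accuracy = percent_correct(model, unseen)  # the table has no answer for any of these

print("On training data:", train_accuracy, "%")
print("On unseen data:  ", unseen_accuracy, "%")
```

The memorizer looks perfect when scored on the instances it stored and is useless on anything else, which is exactly the gap that evaluating on the training data hides.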
But where does the fresh data come from? It’s a conundrum, and that is what we explore this week. The basic idea is to split the data into two parts: the training data and the test data. The training data is used to build the model – the rules, if you like – that say how instances should be classified. After this is done, the test data is used to evaluate the model (rules). To do this, the model built during the training phase is applied to each test instance, and the result is the predicted class for that instance. The system compares this to the real class value defined for the instance and calculates what percentage of the predictions are correct.
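The split-train-test-score procedure just described can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Weka's implementation: the 66% training fraction, the random seed, the nearest-neighbour "model", and the twelve toy instances are all choices made for this example.

```python
import random


def split_train_test(instances, train_fraction=0.66, seed=1):
    """Shuffle a copy of the data and cut it into training and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_nearest_neighbour(train):
    """Build a model: predict the class of the closest training instance."""
    def predict(attributes):
        def squared_distance(stored):
            return sum((a - b) ** 2 for a, b in zip(stored[0], attributes))
        return min(train, key=squared_distance)[1]

    return predict


def percent_correct(model, test):
    """Apply the model to each test instance and compare with the real class."""
    correct = sum(1 for attributes, actual in test
                  if model(attributes) == actual)
    return 100.0 * correct / len(test)


# Made-up data: ((attribute1, attribute2), class).
data = [
    ((1.0, 1.2), "a"), ((0.8, 1.0), "a"), ((1.1, 0.9), "a"),
    ((1.3, 1.1), "a"), ((0.9, 1.4), "a"), ((1.2, 1.3), "a"),
    ((3.0, 3.2), "b"), ((2.8, 3.1), "b"), ((3.3, 2.9), "b"),
    ((3.1, 3.4), "b"), ((2.9, 2.7), "b"), ((3.2, 3.0), "b"),
]

train_set, test_set = split_train_test(data)   # training data and test data
model = train_nearest_neighbour(train_set)     # built from the training data only
accuracy = percent_correct(model, test_set)    # scored on the held-out test data only

print(len(train_set), "training instances,", len(test_set), "test instances")
print("Percent correct on test data:", accuracy)
```

The important design point is that the model never sees test_set while it is being built; the test instances are used only for scoring. The two classes in this toy data are deliberately far apart, so the resulting figure says nothing about how hard real datasets are.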
In the first activity you’re going to experience what it’s like to actually be a classifier yourself, by constructing a decision tree interactively. In subsequent activities we’ll look at evaluation, including training and testing, baseline accuracy, and cross-validation.
At the end of the week you will know how to evaluate the performance of a classifier on new, unseen, instances. And you will understand how easy it is to fool yourself into thinking that your system is doing better than it really is.