Training and testing

How can you evaluate how well a classifier does? Ian Witten explains that training set performance is misleading, and demonstrates some alternatives.
Hi! Here we’re going to look at training and testing in a little bit more detail. Here’s the situation. We’ve got a machine learning algorithm, and we feed into it training data, and it produces a classifier – the basic machine learning situation. We can test that classifier with some independent test data. We can put that into the classifier and get some evaluation results. And, separately, we can deploy the classifier in some real situation to make predictions on fresh data coming from the environment. It’s really important in classification to remember that you only get reliable evaluation results if the test data is different from the training data.
That’s what we’re going to look at in this lesson. What if you only have one dataset? If you just have one dataset, you should divide it into two parts. Maybe use some of it for training and some of it for testing. Perhaps 2/3rds of it for training and 1/3rd of it for testing. It’s really important that the training data is different from the test data. Both training and test sets are produced by independent sampling from an infinite population. That’s the basic scenario here, but they’re different independent samples. It’s not the same data. If it is the same data, then your evaluation results are misleading. They don’t reflect what you should actually expect on new data when you deploy your classifier.
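The 2/3 and 1/3 division described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Weka does it internally: the function name `holdout_split`, the seed value and the 30-item stand-in dataset are all assumptions made for the example.

```python
import random

def holdout_split(instances, train_fraction=2/3, seed=1):
    """Shuffle a copy of the data and cut it into disjoint training and test parts.

    Hypothetical helper for illustration; the seed value is arbitrary.
    """
    shuffled = list(instances)              # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # random order, so the cut is a random split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))                      # stand-in for 30 labelled instances
train, test = holdout_split(data)

print(len(train), len(test))                # 20 10
print(set(train) & set(test))               # set(): no instance is in both parts
```

The point to notice is that the two parts never share an instance, which is what makes the test part usable for evaluation.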
Here we’re going to look at the “segment” dataset, which we used in the last lesson. I’m going to open “segment-challenge”. First of all, I’m going to use the J48 tree learner. I’m going to use a supplied test set, and I will set it to the appropriate file, segment-test.arff. I’m going to open that. Now we’ve got a test set, and let’s see how it does. In the last lesson, on the same data with the user classifier, I think I got 79% accuracy. J48 does much better; it gets 96% accuracy on the same test set. Suppose I were to evaluate it on the training set? I can do that by specifying “Use training set” under Test options. Now it will train it again and evaluate it on the training set, which is not what you’re supposed to do, because you get misleading results. Here it’s saying the accuracy is 99% on the training set. That is not representative of what we would get using this on independent data.
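To see why training set accuracy is optimistic, here is a small Python sketch using a deliberately extreme learner. It is not J48 and does not use the segment data: `RoteLearner`, `accuracy`, `label_of` and the toy dataset are all made up for the illustration. The learner simply memorizes every training instance and falls back to the most common training label for anything it has not seen.

```python
from collections import Counter

class RoteLearner:
    """Hypothetical classifier that memorizes its training data."""

    def fit(self, instances):
        # Remember the label of every training input exactly.
        self.table = {x: label for x, label in instances}
        # For unseen inputs, fall back to the most frequent training label.
        self.default = Counter(label for _, label in instances).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)

def accuracy(model, instances):
    correct = sum(model.predict(x) == label for x, label in instances)
    return correct / len(instances)

def label_of(x):
    # Toy concept to learn: multiples of 3 are "pos", everything else is "neg".
    return "pos" if x % 3 == 0 else "neg"

train = [(x, label_of(x)) for x in range(20)]
test = [(x, label_of(x)) for x in range(20, 30)]   # inputs never seen in training

model = RoteLearner().fit(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)

print(train_acc)   # 1.0
print(test_acc)    # 0.7
```

The learner scores perfectly on the data it memorized, but only 70% on independent data, where it can do no better than guess the majority label. The gap for a real learner such as J48 is usually much smaller, but it points in the same direction.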
If we had just one dataset, if we didn’t have a test set, we could do a percentage split. Here’s a percentage split. This is going to be 66% training data and 34% test data. It’s going to make a random split of the dataset. If I run that, I get 95%. That’s just about the same as what we got when we had an independent test set, just slightly worse. If I were to run it again with a different split, we’d expect a slightly different result. But actually, I get exactly the same result, 95.098%. That’s because Weka reinitializes the random number generator before each run. The reason is to make sure that you can get repeatable results.
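The effect of reinitializing the random number generator can be sketched in Python. This is an analogy for the behaviour described above, not Weka’s own code: `percentage_split` is a hypothetical helper, and the seed values 1 and 2 are arbitrary choices for the example.

```python
import random

def percentage_split(instances, train_percent, seed):
    """Hypothetical helper: random split using a generator created fresh for each run."""
    rng = random.Random(seed)               # reinitialized on every call
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))                     # stand-in for 100 instances

first = percentage_split(data, 66, seed=1)
second = percentage_split(data, 66, seed=1)
other = percentage_split(data, 66, seed=2)

print(first == second)                      # True: same seed, same split
print(len(first[0]), len(first[1]))         # 66 34
print(first == other)                       # a different seed gives a different shuffle
```

Starting from the same seed gives the same 66%/34% split every time, so any classifier evaluated on it gives the same accuracy. Changing the seed is what produces a different split.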
If Weka didn’t do that, the results that you got would not be repeatable. However, if you wanted to have a look at the differences that you might get on different runs, then there is a way of resetting the random number generator between each run. We’re going to look at that in the next lesson. That’s it for this lesson. The basic assumption of machine learning is that the training and test sets are independently sampled from the same infinite population. If you have just one dataset, you should hold part of it out for testing, maybe about a third as we just did, or perhaps 10%. We would expect a slight variation in results each time if we hold out a different set, but Weka produces the same results each time by design, because it reinitializes the random number generator before each run. We ran J48 on the segment-challenge dataset. Bye for now!

How can you evaluate how well a classifier does? Training set performance is misleading. It’s like asking a child to memorize 1+1=2, 1+2=3 and then testing them on exactly the same questions, whereas you really want them to be able to answer questions like 2+3=?. We want to generalize from the training data to get a more widely applicable classifier. To evaluate how well this has been done, it must be tested on an independent test set. If you only have one dataset, set aside part of it for testing and use the rest for training.

This article is from the free online course Data Mining with Weka.

Created by
FutureLearn - Learning For Life
