This content is taken from The University of Waikato’s online course, Data Mining with Weka.


Skip to 0 minutes and 11 seconds Hi! Before we go on to talk about some more simple classifier methods, we need to talk about overfitting. Any machine learning method may ‘overfit’ the training data: that is, it produces a classifier that fits the training data too tightly and doesn’t generalize well to independent test data. Remember the user classifier that you built yourself at the beginning of Week 2? Imagine tediously putting a tiny circle around every single training data point. You could very laboriously build a classifier that would be 100% correct on the training data, but it probably wouldn’t generalize well to independent test data. That’s overfitting. It’s a general problem, and we’re going to illustrate it with OneR.

Skip to 0 minutes and 58 seconds We’re going to look at the numeric version of the weather problem, where temperature and humidity are numbers, not nominal values. If you think about how OneR works, when it comes to making a rule on the temperature attribute, it’s going to make a complex rule that branches perhaps 14 different ways, one for each of the 14 instances in the dataset. Each branch is going to have zero errors; it’s going to get its instance exactly right. So if we branch on temperature, we get a perfect rule, with a total error count of zero. In fact, OneR has a parameter that limits the complexity of rules. I’m not going to talk about how it works.

Skip to 1 minute and 42 seconds It’s pretty simple, but it’s just a bit distracting and not very important. The point is that the parameter allows you to limit the complexity of the rules that are produced by OneR. Let’s open the numeric weather data.

Skip to 2 minutes and 5 seconds We can go to OneR, and choose it. There’s OneR, and let’s just create a rule. Here the rule is based on the outlook attribute. This is exactly what happened in the last lesson with the nominal version of the weather data. Let’s just remove the outlook attribute, and try it again.

Skip to 2 minutes and 37 seconds Now let’s see what happens when we classify with OneR.

Skip to 2 minutes and 48 seconds Now it branches on humidity. If humidity is less than 82.5%, it’s a ‘yes’ day; if it’s greater than 82.5%, it’s a ‘no’ day, and that rule gets 10 out of 14 instances correct. So far so good; that’s using the default setting of the OneR parameter that controls the complexity of the rules it generates. We can go and look at OneR – remember, you can configure a classifier by clicking on it. We see that there’s a parameter called minBucketSize, set to 6 by default, which is a good compromise value. I’m going to change that value to 1 and see what happens. Run OneR again, and now I get a different kind of rule.

Skip to 3 minutes and 39 seconds It’s branching many different ways on the temperature attribute. This rule is overfitted to the dataset.

Skip to 3 minutes and 51 seconds It’s a very accurate rule on the training data, but it won’t generalize well to independent test data.

Skip to 4 minutes and 0 seconds Now let’s see what happens with a more realistic dataset. I’ll open diabetes, which is a numeric dataset.

Skip to 4 minutes and 11 seconds All the attributes are numeric, and the class is either tested_negative or tested_positive. Let’s run ZeroR to get a baseline figure for this dataset. Here I get 65% for the baseline. We really ought to be able to do better than that. Now let’s run OneR with its default parameter settings, that is, a value of 6 for the parameter that controls rule complexity. We get 71.5%. That’s pretty good. We’re evaluating using cross-validation. OneR outperforms the baseline accuracy by quite a bit – 71% versus 65%. If we look at the rule, it branches on “plas”, the plasma glucose concentration. Depending on which of these regions the plasma glucose concentration falls into, we predict a negative or a positive outcome.
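For reference, ZeroR is the simplest possible baseline: ignore all the attributes and always predict the most frequent class. A minimal sketch (the class counts of 500 tested_negative to 268 tested_positive are this diabetes dataset’s standard split, which is where the 65% figure comes from):

```python
from collections import Counter

def zero_r(labels):
    """ZeroR baseline: always predict the most frequent class.
    Returns the predicted class and its training-set accuracy."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# 500 tested_negative vs. 268 tested_positive instances, so always
# predicting tested_negative is right about 65% of the time.
labels = ['tested_negative'] * 500 + ['tested_positive'] * 268
prediction, accuracy = zero_r(labels)
# prediction is 'tested_negative'; accuracy is 500/768, about 0.651
```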

Skip to 5 minutes and 4 seconds That seems like quite a sensible rule. Now, let’s change OneR’s parameter to make it overfit. We’ll configure OneR, find the minBucketSize parameter, and change it to 1.

Skip to 5 minutes and 18 seconds When we run OneR again, we get 57% accuracy, quite a bit lower than the ZeroR baseline of 65%. If you look at the rule, you’ll see it’s testing a different attribute, pedi, which – if you look at the comments in the ARFF file – happens to be the diabetes pedigree function, whatever that is. You can see that this attribute has a lot of different values, and it looks like we’re branching on pretty well every single one. That gives us lousy performance when evaluated by cross-validation, which is what we’re doing now. If you were to evaluate it on the training set, you would expect to see very good performance.

Skip to 6 minutes and 2 seconds Yes, here we get 87.5% accuracy on the training set, which is very good for this dataset. Of course, that figure is completely misleading; the rule is strongly overfitted to the training dataset, and doesn’t generalize well to independent test sets. That’s a good example of overfitting. Overfitting is a general phenomenon that plagues all machine learning methods. We’ve illustrated it by playing around with the parameter of the OneR method, but it happens with all machine learning methods. It’s one reason why you should never evaluate on the training set. Overfitting can occur in more general contexts.

Skip to 6 minutes and 39 seconds Let’s suppose you’ve got a dataset, and you choose from a very large number of machine learning methods – say a million different methods – picking the best one for your dataset using cross-validation. Because you’ve tried so many methods, you can’t expect to get the same performance on new test data. You’ve tried so many that the one you’ve ended up with is going to be overfitted to the dataset you’re using. It’s not sufficient just to use cross-validation and believe the results. In this case, you might divide the data three ways: into a training set, a test set, and a validation set. Choose the method using the training and test sets.

Skip to 7 minutes and 21 seconds By all means, use your million machine learning methods and choose the best on the training and test set or the best using cross-validation on the training set. But then, leave aside this separate validation set for use at the end, once you’ve chosen your machine learning method, and evaluate it on that to get a much more realistic assessment of how it would perform on independent test data. Overfitting is a really big problem in machine learning.

# Overfitting

“Overfitting” is a problem that plagues all machine learning methods. It occurs when a classifier fits the training data too tightly and doesn’t generalize well to independent test data. It can be illustrated using OneR, whose minBucketSize parameter, when set too low, makes it overfit numeric attributes. For example, on the numeric version of the weather data and on the diabetes dataset, we get good performance on the training data, but lousy performance on independent test sets – or with cross-validation. That’s overfitting.