Hello again! In this lesson we’re going to look at an important new concept called “baseline accuracy”. We’re going to use a new dataset, the “diabetes” dataset. I’ve got Weka here, and I’m going to open diabetes.arff.
There it is. Have a quick look at this dataset. The class is tested_negative or tested_positive for diabetes. We’ve got attributes like “preg”, which I think has to do with the number of times they’ve been pregnant; “age”, which is the age. Of course, we can learn more about this dataset by looking at the ARFF file itself. Here’s the diabetes dataset. You can see it’s diabetes in Pima Indians.
There’s a lot of information here.
The attributes: number of times pregnant, plasma glucose concentration, diabetes pedigree function, and so on. I’m going to use percentage split. I’m going to try a few different classifiers. Let’s look at J48 first, our old friend J48. Here we get 76%.
I’m going to look at some other classifiers. You’ll learn about these classifiers later on in this course, but right now we’re just going to look at a few. Let’s look at the NaiveBayes classifier in the “bayes” category, and run that. Here we get 77%, a little bit better, but probably not significant. Let’s choose, in the “lazy” category, IBk. Again, we’ll learn about this later on. Here we get 73%, quite a bit worse. We’ll use one final one, PART, “partial rules”, in the “rules” category. Here we get 74%. We’ll learn about these classifiers later, but they are just different classifiers, alternatives to J48.
You can see that J48 and NaiveBayes are pretty good, probably about the same: the 1% difference between them probably isn’t significant. IBk and PART are probably about the same performance; again, 1% between them. There’s a fair gap, I guess, between those bottom two and the top two, which probably is significant. I’d like to think about these figures. 76%, is it good to get 76% accuracy? If we go back and look at this dataset, the class, we see that there are 500 negative instances and 268 positive instances.
If you had to guess, you’d guess “negative”, and you’d be right 500/768 of the time, where 768 is the sum of these two things, the total number of instances. If you always guess “negative”, that works out to 65%. Actually, there’s a “rules” classifier called ZeroR, which does exactly that. The ZeroR classifier just looks for the most popular class and guesses that all the time. If I run this on the training set, that will give us the exact same number, 500/768, which is 65%. It’s a very, very simple, kind of trivial classifier, that always just guesses the most popular class.
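To make the arithmetic concrete, here is a minimal sketch in Python of what a majority-class baseline does. It is not Weka's ZeroR code; the function names `zero_r` and `accuracy` are made up for illustration, and the only data used are the class counts quoted above (500 negative, 268 positive).

```python
from collections import Counter


def zero_r(labels):
    """Return the most frequent class label (hypothetical helper, not Weka's API)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Class counts quoted in the lesson for the diabetes data.
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268

majority = zero_r(labels)
# Always guess the majority class, then score against the same labels.
baseline = accuracy([majority] * len(labels), labels)

print(majority, f"{baseline:.1%}")  # tested_negative 65.1%
```

The "model" here is a single label, which is why so little depends on which instances it was built from.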
It’s OK to evaluate that on the training set, because it’s hardly using the training set at all to form the classifier. That’s what we would call the “baseline”. The baseline gives 65% accuracy, and J48 gives 76% accuracy. It’s significantly above the baseline, but not all that much above the baseline. It’s always good when you’re looking at these figures to consider what the very simplest kind of classifier, the baseline classifier, would get you. Sometimes, baseline might give you the best results. I’m going to open a dataset here. We’re not going to discuss this dataset. It’s a bit of a strange dataset, not really designed for this kind of classification. It’s called “supermarket”.
I’m going to open “supermarket”, and without even looking at it I’m just going to apply a few schemes here. I’m going to apply ZeroR, and I get 64%. I’m going to apply J48, and I think I’ll use a percentage split for evaluation because it’s not fair to use the training set here. Now I get 63%. That’s worse than the baseline! If I try NaiveBayes (these are the ones I tried before) I get again 63%, worse than the baseline. If I choose IBk (this is going to take a little while here, it’s a rather slow scheme), here we are; it’s finished now: only 38%! That’s way, way worse than the baseline.
We’ll just try PART, partial decision rules: here we get 63%. The upshot is that the baseline actually gave a better performance than any of these classifiers, and one of them was really atrocious compared with the baseline. This is because, for this dataset, the attributes are not really informative.
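One way to keep the baseline in view is to report every result as a margin over ZeroR. The sketch below is only an illustration in Python: the helper name `margins_over_baseline` is hypothetical, and the figures are the rounded percentages quoted in this lesson for the supermarket data (ZeroR evaluated on the training set, the others with a percentage split).

```python
# Accuracies (percent) quoted in the lesson for the supermarket data.
results = {"ZeroR": 64, "J48": 63, "NaiveBayes": 63, "IBk": 38, "PART": 63}


def margins_over_baseline(results, baseline_name="ZeroR"):
    """Percentage-point difference between each classifier and the baseline."""
    base = results[baseline_name]
    return {
        name: acc - base
        for name, acc in results.items()
        if name != baseline_name
    }


margins = margins_over_baseline(results)

# List classifiers from best to worst margin.
for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    verdict = "beats" if margin > 0 else "does not beat"
    print(f"{name:<11}{margin:+d} points: {verdict} the baseline")
```

With these numbers every margin is negative, which is the warning sign described above.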
The rule here is: don’t just apply Weka to a dataset blindly. You need to understand what’s going on. When you do apply Weka to a dataset, always make sure that you try the baseline classifier, ZeroR, before doing anything else. In general, simplicity is best. Always try simple classifiers before you try more complicated ones. Also, you should consider, when you get these small differences, whether the differences are likely to be significant. We saw 1% differences in the last lesson that were probably not at all significant. You should always try a simple baseline. You should look at the dataset. We shouldn’t blindly apply Weka to a dataset; we should try to understand what’s going on.
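As a rough gauge of how noisy a single accuracy figure is, you can compute the standard error of a proportion, sqrt(p(1 - p)/n), for the test set. This is not something done in the lesson, and it is a back-of-the-envelope sketch rather than a proper significance test. It assumes a 66% training split for the diabetes data (the lesson does not state the percentage used), and the function name is made up for illustration.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test instances."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


n_total = 768          # instances in the diabetes data (500 + 268)
train_fraction = 0.66  # ASSUMPTION: the split percentage is not given in the lesson
n_test = n_total - round(n_total * train_fraction)

# Standard error for the 76% accuracy quoted for J48.
se = accuracy_standard_error(0.76, n_test)

print(n_test, f"{se:.3f}")  # 261 0.026
```

Under that assumption the standard error is about 2.6 percentage points, so a 1-point difference between two classifiers sits well inside the noise of a single split.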