Hello again! In data mining, people are always asking “how much data do I need?” We’re going to show you how you can address that question in this lesson using learning curves.
The advice on evaluation from Data Mining with Weka was: if you’ve got a large, separate test set, then just go ahead and use the test set. If you’ve got lots of data, use the holdout method. Otherwise, use 10-fold cross-validation – it’s the best way of getting the most reliable performance estimate out of a limited amount of data. You might repeat it 10 times or more, like the Experimenter does. But how much data is a lot?
Well, that’s a good question, and there is no answer: it depends. Supposing you’ve got 1000 instances. That sounds like quite a lot. If you’ve got a 2-class dataset with 500 of each class, then maybe that’s pretty good. If you’ve got a 10-class dataset with 1000 instances and the classes are unevenly distributed – so maybe for some classes there are only 10 or 15 instances – well, that doesn’t sound so good. Although perhaps you don’t care about those small classes. It depends on the number of attributes. Again, with your 1000-instance dataset, that sounds like a lot, but if you have 1000 attributes, that might not be such a lot of instances. It depends on the structure of the domain.
Are you looking for complicated decision boundaries? It depends on the kind of model, the sort of decision boundaries it makes. If you’ve got a machine learning technique that looks for linear decision boundaries, then they’re pretty simple: you might not need so much data as you would for techniques that look for more convoluted, non-linear boundaries, or for decision trees, perhaps. It’s an impossible question to answer in the abstract. The only way to look at it, really, is empirically, using learning curves. I’ve shown a plot here of a learning curve. As the size of the training data increases, the performance gets better and better, but of course, it asymptotes off.
The point where it starts to asymptote off is probably enough training data to get a reliable estimate. Let’s talk about how to plot a learning curve in Weka. We’re going to sample the data, and you need to understand the difference between sampling with replacement and sampling without replacement. When you sample, it’s really a question of whether you move or copy the data. If you sample “with replacement”, it’s as though you take an instance out of the original dataset, put it into the sample dataset, and then replace it back in the original dataset. You don’t really take it out.
Effectively, you copy the instance from the original dataset to the sample dataset. “Without replacement” means you move it: you can’t see it again; you can’t sample it twice. If you sample with replacement, then instances might occur more than once in the sample dataset. If you sample without replacement, then they can’t. That’s the first thing. We’re going to sample the training set, but not the test set. We want to find out how performance changes as the size of the training set increases. But the test set determines the reliability of our estimate – we don’t want to make that artificially smaller. We always want to use the same-size test set.
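The move-versus-copy distinction can be sketched in plain Python. This is just an illustration of the idea, not how Weka’s Resample filter is implemented:

```python
import random

random.seed(1)
data = list(range(10))  # a toy "dataset" of instance IDs

# With replacement: each draw is copied back into the pool,
# so the same instance may be drawn again.
with_repl = [random.choice(data) for _ in range(5)]

# Without replacement: each draw is moved out of the pool,
# so an instance appears at most once in the sample.
without_repl = random.sample(data, 5)

print(with_repl)
print(without_repl)
```

With a small dataset and a large sample you will often see duplicates in `with_repl`, but never in `without_repl`.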
We can do that in Weka by using the FilteredClassifier. There’s a Resample filter, and if we wrap that up in the FilteredClassifier, that means that the filtering will apply to the training data and not to the test data. I’m going to do that with the glass dataset. I’ve opened the glass dataset here. I’m going to go to Classify. In meta, I’m going to find the FilteredClassifier.
Then I’m going to choose J48 as the classifier. For the filter, I’m going to use the Resample filter.
It’s an unsupervised instance filter: we’re resampling instances. There it is. Here are the parameters. We can sample with or without replacement; I would like to sample without replacement, so I set noReplacement to true. I want a 50% sample. I can go ahead and run that. I’m doing 10-fold cross-validation, using a 50% sample of the training set and leaving the test set untouched. I get 65% performance. Back to the slide.
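The procedure just demonstrated can be mimicked away from Weka. Here is a rough Python sketch of the idea, using a synthetic two-class dataset and a simple nearest-mean classifier as stand-ins for the glass data and J48 (everything here is made up for illustration): in each fold of 10-fold cross-validation, only the training portion is subsampled to 50% without replacement, and the test fold is left untouched.

```python
import random

random.seed(42)

# Toy 2-class data: class 0 centred at 0.0, class 1 centred at 2.0.
data = [(random.gauss(0.0, 1.0), 0) for _ in range(100)] + \
       [(random.gauss(2.0, 1.0), 1) for _ in range(100)]
random.shuffle(data)

def nearest_mean_accuracy(train, test):
    # A minimal 1-D nearest-mean classifier, standing in for J48.
    means = {}
    for label in (0, 1):
        vals = [x for x, y in train if y == label]
        means[label] = sum(vals) / len(vals)
    correct = sum(1 for x, y in test
                  if min(means, key=lambda c: abs(x - means[c])) == y)
    return correct / len(test)

# 10-fold cross-validation: subsample ONLY the training folds
# (50%, without replacement), mirroring FilteredClassifier + Resample;
# the test fold is never touched.
folds = [data[i::10] for i in range(10)]
accs = []
for i in range(10):
    test = folds[i]
    train = [inst for j, f in enumerate(folds) if j != i for inst in f]
    train = random.sample(train, len(train) // 2)  # 50% sample, no replacement
    accs.append(nearest_mean_accuracy(train, test))

print(sum(accs) / len(accs))
```

The key design point is that the subsampling happens inside the cross-validation loop, after the test fold has been set aside – which is exactly what wrapping Resample in the FilteredClassifier achieves.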
Here is the 50% level: 65% performance.
I did this for other sample sizes, which enabled me to plot this learning curve empirically: the performance against the percentage of training data I’m using. I’ve shown the ZeroR performance there, for reference. The line’s a bit jagged, and to get a smoother line I’d want to do it several times with cross-validation. If I do 10 repetitions of J48, then I get this line here. (I did this with the Experimenter. It’s very easy to do.) Then I did 1000 repetitions. I get this red line here, this smooth line. You can look at this line and make your own judgment as to how much training data you need to get pretty close to the ultimate accuracy of J48 on this dataset.
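The smoothing effect of repetitions can also be sketched in the same toy setting (again, synthetic data and a stand-in classifier, not the Experimenter itself): keep the test set fixed, draw training subsamples of increasing size, and average the accuracy at each size over many repetitions.

```python
import random

random.seed(7)

# Toy 2-class data, standing in for the glass dataset.
data = [(random.gauss(0.0, 1.0), 0) for _ in range(150)] + \
       [(random.gauss(2.0, 1.0), 1) for _ in range(150)]
random.shuffle(data)
train_pool, test = data[:200], data[200:]  # test set is fixed, never subsampled

def accuracy(train, test):
    # Nearest-mean classifier in one dimension (a stand-in for J48).
    means = {c: sum(x for x, y in train if y == c) /
                max(1, sum(1 for _, y in train if y == c)) for c in (0, 1)}
    return sum(min(means, key=lambda m: abs(x - means[m])) == y
               for x, y in test) / len(test)

# Average accuracy at each training-set percentage over 100 repetitions,
# which smooths out the jaggedness of a single run.
curve = []
for pct in (10, 20, 50, 80, 100):
    n = len(train_pool) * pct // 100
    accs = [accuracy(random.sample(train_pool, n), test) for _ in range(100)]
    curve.append((pct, sum(accs) / len(accs)))

for pct, acc in curve:
    print(pct, round(acc, 3))
```

You can then read off where the averaged curve flattens, just as with the plot on the slide.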
It looks like, provided you have maybe 50–60% of the training data, you’re going to be fairly close to the final accuracy. That’s it for learning curves. The question is: how much data is enough? The answer is we don’t know! So you can plot a learning curve. We looked at sampling with and without replacement, but we didn’t want to sample the test set, because that would just decrease the reliability of the evaluation. We used the FilteredClassifier. Obviously, the performance figures you get are only estimates, and you can improve the reliability of those estimates by repeating the test several times.
How much data do you need? There is no easy answer; it depends on many features of the problem and dataset. One way to estimate it is to plot a learning curve, and this can be done using the Resample filter – along with the FilteredClassifier, to avoid resampling the test set (which is undesirable). The performance figures you obtain are estimates, and you can improve their reliability by repeating the experiment several times.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.