Using Weka in practice: some questions
At this stage of the course, people often start wondering about some basic practical issues in using Weka.
I’ve trained a classifier. How can I use it to classify fresh data?
It’s easy! If you right-click any item in Weka’s Result list (lower left in the Explorer’s Classify panel), the menu includes an item Re-evaluate model on current test set. This will be grayed out if you haven’t specified a Supplied test set (upper left in the Classify panel). It doesn’t matter whether the model was evaluated on the training set or a supplied test set, or with cross-validation or percentage split: the point is that Weka now has a model.
Setting (or resetting) the test set to any dataset and choosing that menu item will re-evaluate the model on that dataset. (I showed you how to specify a test set in the Week 2 video Training and testing.) Of course, the test set must have the same attributes as the training set, otherwise the model can’t be applied. The class values in the test set are irrelevant, because you’re interested only in what the predictions are, not whether they are right or wrong. But you’ll have to put something there to make a valid ARFF file; I’d use ? (i.e. missing value, discussed in Week 5) for each class value.
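If you want to generate such a file programmatically, here is a little Python sketch (my own illustration, not part of Weka) that writes fresh, unlabelled instances as a valid ARFF file, using ? for each class value. The attribute declarations follow the standard weather data; the fresh instances themselves are invented.

```python
# Sketch: write fresh, unlabelled weather instances as a valid ARFF file,
# using "?" (missing value) for every class value. Attribute names and
# values follow Weka's standard weather.nominal data; the two fresh
# instances are invented for illustration.

HEADER = """@relation weather-fresh
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data"""

fresh = [
    ("sunny", "hot", "high", "FALSE"),
    ("rainy", "mild", "normal", "TRUE"),
]

def to_arff(instances):
    # Append "?" as the (unknown) class value of each instance.
    rows = [",".join(values + ("?",)) for values in instances]
    return HEADER + "\n" + "\n".join(rows) + "\n"

print(to_arff(fresh))
```

Save the result as a .arff file, load it as the supplied test set, and Weka will happily produce predictions for every instance.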
If under More options you select Output predictions and set it to (say) Plain text (which you learned about in this week’s Quiz on Weka’s output), the predictions for each instance will be printed.
With cross-validation, Weka produces a model for each fold. Which one is used to classify fresh data?
Suppose Weka does 10-fold cross-validation. It runs the learning algorithm 10 times on different folds of the data (training on 9 folds and testing on the 10th), and averages the results to generate the performance evaluation – the number of correctly classified instances, for example. This is to ensure that the evaluation is not corrupted by testing on any of the data that was used for training.
It then runs the algorithm once more, training on all the data, to generate a single final model for printing. (The accuracy figures and confusion matrix you see in the output come from the ten cross-validation folds, not from this model.) Usually you’d prefer to have a single model rather than the 10 that are used in the cross-validation; and the 11th one stands a chance of being slightly better than the other 10 because it is based on slightly more data. This is the model that is used to classify fresh data. (In fact the “11th run” is actually done first, but that doesn’t affect anything.)
Weka adopts a similar procedure for percentage split evaluation. With a 66% split it generates a model from two-thirds of the dataset and evaluates it on the remaining third; then it generates a final model from the whole dataset for printing and any further classification.
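Both procedures can be sketched schematically in a few lines of Python. This is an illustration only, not Weka’s actual code: train and evaluate stand in for a real learning scheme and scorer, and real Weka also randomizes (and, for cross-validation, stratifies) the data before splitting it.

```python
# Schematic of Weka's evaluation procedure -- not Weka's actual code.
# "train" and "evaluate" are placeholders for a real scheme and scorer.

def train(instances):
    return {"n_train": len(instances)}        # placeholder "model"

def evaluate(model, instances):
    return model["n_train"]                   # placeholder "score"

def cross_validate(data, folds=10):
    scores = []
    for i in range(folds):
        test = data[i::folds]                                  # fold i
        rest = [x for j, x in enumerate(data) if j % folds != i]
        scores.append(evaluate(train(rest), test))             # runs 1..10
    reported = sum(scores) / folds            # the averaged evaluation
    final_model = train(data)                 # the "11th" model, all data
    return reported, final_model

def percentage_split(data, pct=0.66):
    cut = int(len(data) * pct)
    reported = evaluate(train(data[:cut]), data[cut:])
    return reported, train(data)              # final model from everything

avg, model = cross_validate(list(range(100)))
```

The key point the sketch makes is that the reported performance and the returned model come from different training runs.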
Isn’t it risky to use Weka in practice if I don’t know exactly how the classifiers work?
Well, of course it’s nice to understand how your tools work, and I’m hoping that you now do understand how simple classification methods work – after all, that’s this week’s Big Question! But next week we’ll learn about real-life methods, and the bad news is that you’ll never understand exactly how they work, in full and gory detail. Of course, you’re accustomed to operating with partial understanding: you use your car, and TV, and computer, and Internet, without understanding exactly how they work. In fact, if you take it down to the atomic level, you probably don’t understand how your coffee cup works!
My philosophy is that the most crucial thing to understand is evaluation. Instead of trying to understand exactly how the algorithms work, you’re better off being able to evaluate their effect. How well do they work? – that’s the key question. If you’re confident you can evaluate them properly, you shouldn’t be scared that you might mis-apply the technology. That’s why we spent the whole of Week 2 on evaluation. Although I’ve tried to convey the gist of how classifiers work, I’ve focused far more on evaluation – training and testing, performance estimation, cross-validation, baseline accuracy.
How do I make my data into an ARFF file?
If you’ve been paying attention you’ve already learned how to do this :-) (see Week 1 quiz More irises). First put the data into a spreadsheet and write it out in the well-known Comma Separated Values (.csv) format – that part is up to you.
Weka can open .csv files directly, but you will not be able to see the file in the Open file dialog box until you change the Files of Type setting beneath the list of files.
The first row of the .csv file should give the attribute names, separated by commas (you discovered this in the above-mentioned quiz). For example, the weather data spreadsheet’s first row would read

outlook,temperature,humidity,windy,play
Having read a .csv file into Weka, you can save it as ARFF in the Preprocess panel and take a look at the ARFF version in a text editor.
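If you are curious what that conversion involves, here is a rough Python sketch of a CSV-to-ARFF converter for purely nominal data (a hypothetical helper of my own, not Weka’s loader; real data would also need numeric attributes and quoting handled):

```python
# Sketch: convert a CSV string with nominal attributes to ARFF text.
# Not Weka's loader -- it treats every column as nominal and ignores
# numeric types, quoting, and missing values.
import csv
import io

def csv_to_arff(csv_text, relation="converted"):
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, data = rows[0], rows[1:]
    lines = ["@relation " + relation]
    for col, name in enumerate(names):
        # Collect each column's distinct values, in order of appearance,
        # to build its @attribute declaration.
        seen = []
        for row in data:
            if row[col] not in seen:
                seen.append(row[col])
        lines.append("@attribute %s {%s}" % (name, ",".join(seen)))
    lines.append("@data")
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

weather_csv = "outlook,windy,play\nsunny,FALSE,no\nrainy,TRUE,no\nsunny,TRUE,yes\n"
print(csv_to_arff(weather_csv, "weather-mini"))
```

Comparing this output with the ARFF file Weka saves is a good way to check your understanding of the format.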
Things can go wrong when reading .csv files. If a line contains the wrong number of attribute values – too many or too few commas – you’ll get an error message. Weka tells you the number of the offending line, so you should be able to pinpoint the problem in a text editor and correct it. You get an error even if the extra comma appears inside a quoted string. A common problem is a string that contains a newline, splitting the data line into two partial lines that both have the wrong number of commas. Again, look at Weka’s error message to find the offending line, and examine it carefully.
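You can hunt for such problems yourself. This little Python sketch (again my own illustration, not part of Weka) lists every line whose comma count differs from the header’s:

```python
# Sketch of a troubleshooting aid: report the lines of a .csv file whose
# comma count differs from the header's. Like Weka, it counts every
# comma -- even one inside a quoted string.

def bad_lines(csv_text):
    lines = csv_text.splitlines()
    expected = lines[0].count(",")            # header fixes the field count
    return [(n, line) for n, line in enumerate(lines, start=1)
            if line.count(",") != expected]

sample = ("outlook,windy,play\n"
          "sunny,FALSE,no\n"
          "rainy,TRUE\n"                      # too few commas: line 3
          "sunny,TRUE,yes,extra\n")           # too many commas: line 4
print(bad_lines(sample))
```

Run over a whole file, it will flag exactly the lines Weka would complain about, including the two partial lines produced by a stray newline.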
Excuse me, but …
… all these datasets are tiny, trivial. Does this stuff scale up? How large a dataset can Weka cope with? How many instances? How long does it take to open?
Good questions. But not so straightforward to answer. You will learn about this in the follow-up course More data mining with Weka. To give you an idea, there you will load one million instances of a 50-attribute dataset into the Explorer (took 12 secs on my little computer). With more memory you can do better, but there inevitably comes a point where a dataset is too big for the Explorer.
Beyond that you have to resort to the command line, which operates in stream-oriented fashion if you use certain “updateable” classifiers. Then there is no limit on size, and both training and test files of any size can be processed. As an example, you will process a triple-size training file (3 million instances).
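The stream-oriented idea itself is simple to sketch. Here is a toy Python “updateable” classifier (nothing like Weka’s real ones) that learns the majority class one instance at a time, so memory use stays constant no matter how big the training file is:

```python
# Toy "updateable" classifier, nothing like Weka's actual implementations:
# it predicts the majority class seen so far, one instance at a time.
from collections import Counter

class MajorityUpdateable:
    def __init__(self):
        self.counts = Counter()

    def update(self, class_value):
        # Incremental training step: one instance, constant memory.
        self.counts[class_value] += 1

    def predict(self):
        return self.counts.most_common(1)[0][0]

def train_from_stream(class_stream):
    # Only one instance is ever held in memory, so the "file" being
    # streamed can be arbitrarily large.
    model = MajorityUpdateable()
    for c in class_stream:
        model.update(c)
    return model

# A generator stands in for a huge training file streamed from disk:
# two-thirds "yes", one-third "no".
model = train_from_stream("yes" if i % 3 else "no" for i in range(30_000))
print(model.predict())   # -> "yes"
```

Weka’s updateable classifiers work on the same principle: training touches each instance once and never needs the whole dataset in memory.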