Reflect on your experience

Well over a million instances of this 55-attribute dataset are needed before it becomes too big for Weka to load into memory. Beyond that point you have to use “updateable” classifiers, and then there is no limit on size.

As you discovered in the quiz, you could load the 581,000-instance covtype dataset into the Explorer. You could also have loaded the double-size version (1,162,000 instances; try it if you like), but not the triple-size version (1,743,000 instances).

The command line interface allows larger files, provided it’s configured in a suitable way – updateable classifiers and no cross-validation. Using a test set and NaiveBayesUpdateable, it was able to process the triple-size training file; and in fact both training and test files of any size could be processed.
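To see why an updateable classifier has no size limit, it may help to look at a toy sketch of incremental training. This is not Weka's code: the `CountingModel` class and the `stream` generator are hypothetical names invented for illustration, and the sketch assumes nominal attributes only. The point is simply that the model sees one instance at a time and keeps nothing but counts, so memory use depends on the number of distinct counts, not on the number of instances.

```python
from collections import defaultdict
from typing import Iterator, Tuple

# (attribute values, class label); an assumed toy representation
Instance = Tuple[Tuple[str, ...], str]


class CountingModel:
    """Toy incremental model (hypothetical): stores counts, never instances."""

    def __init__(self) -> None:
        self.class_counts = defaultdict(int)
        # key: (attribute index, attribute value, class label)
        self.value_counts = defaultdict(int)
        self.seen = 0

    def update(self, values: Tuple[str, ...], label: str) -> None:
        """Absorb one instance, then let it be discarded."""
        self.class_counts[label] += 1
        for index, value in enumerate(values):
            self.value_counts[(index, value, label)] += 1
        self.seen += 1


def stream(n: int) -> Iterator[Instance]:
    """Hypothetical stand-in for reading a large file one row at a time."""
    for k in range(n):
        first = "a" if k % 2 else "b"
        second = "x" if k % 3 else "y"
        label = "pos" if k % 2 else "neg"
        yield ((first, second), label)


model = CountingModel()
for values, label in stream(100_000):
    model.update(values, label)

# The number of stored counts stays small however many instances stream past.
print(model.seen, len(model.value_counts))
```

Cross-validation does not fit this pattern because it needs to revisit the data in different splits, which is presumably why it has to be switched off.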

Weka reports the time that NaiveBayesUpdateable takes to build the model and then test it on the training data. On my computer, with the triple-size dataset, training takes 9 secs and testing 72 secs. Testing takes eight times as long as training, despite the fact that the training file contains 1,743,000 instances and the test file only 10,000!

Why? The answer lies in how the Naive Bayes algorithm works. Processing a single instance when training involves incrementing 55 counts, one for each attribute. Processing an instance when testing involves 55 multiplications for each of the 7 class values, 385 in all. Thus if multiplication and addition took comparable time, one would expect each test instance to take 7 times as long to process as each training instance.

This article is from the free online course:

More Data Mining with Weka

The University of Waikato