This content is taken from The University of Waikato's online course, More Data Mining with Weka.

Reflect on your experience

Well over a million instances of this 55-attribute dataset are needed before it becomes too big for Weka to load into memory. Beyond that point you have to use “updateable” classifiers, and then there is no limit on size.

As you discovered in the quiz, you could load the 581,000-instance covtype dataset into the Explorer. You could also have loaded the double-size version (1,162,000 instances; try it if you like), but not the triple-size version (1,743,000 instances).

The command line interface allows larger files, provided it’s configured in a suitable way – updateable classifiers and no cross-validation. Using a test set and NaiveBayesUpdateable, it was able to process the triple-size training file; and in fact both training and test files of any size could be processed.
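To make the idea of an "updateable" classifier concrete, here is a minimal sketch in Python. It is not Weka's code: the class `TinyUpdateableNB`, the helper `train_from_stream` and the generator `synthetic_rows` are hypothetical names invented for illustration, and the sketch assumes nominal attributes only. The point it illustrates is that the model sees one instance at a time and stores only counts, so memory use depends on the number of distinct (attribute, value, class) combinations and not on the number of instances.

```python
from collections import defaultdict


class TinyUpdateableNB:
    """Toy incremental Naive Bayes counts for nominal attributes (illustrative only)."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # Key: (attribute index, attribute value, class label) -> count
        self.value_counts = defaultdict(int)

    def update(self, attributes, label):
        """Fold a single training instance into the counts."""
        self.class_counts[label] += 1
        for index, value in enumerate(attributes):
            self.value_counts[(index, value, label)] += 1


def train_from_stream(rows):
    """Train on an iterable of (attributes, label) pairs without storing them."""
    model = TinyUpdateableNB()
    for attributes, label in rows:
        model.update(attributes, label)
    return model


def synthetic_rows(n):
    """Hypothetical stand-in for reading a very large file one line at a time."""
    for k in range(n):
        attributes = ("a" if k % 2 else "b", "x" if k % 3 else "y")
        yield attributes, "c%d" % (k % 7)


model = train_from_stream(synthetic_rows(100_000))
print("instances seen:", sum(model.class_counts.values()))
print("counts stored: ", len(model.value_counts))
```

However many rows the generator yields, the number of stored counts stays the same, which is why a classifier of this kind places no limit on the size of the training file.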

Weka reports the time that NaiveBayesUpdateable takes to build the model and then to test it. On my computer, with the triple-size dataset, training takes 10 seconds and testing 0.5 seconds. Training takes only 20 times longer than testing, despite the fact that the training file is about 175 times larger (1,743,000 instances compared with the test file's 10,000)!
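The arithmetic behind these figures can be checked with a few lines of Python. The numbers are the ones quoted above; the variable names are just illustrative, and the per-instance times are only as accurate as the rounded timings they are derived from.

```python
train_instances = 1_743_000
test_instances = 10_000
train_seconds = 10.0
test_seconds = 0.5

# How much longer training takes, and how much bigger the training file is
time_ratio = train_seconds / test_seconds
size_ratio = train_instances / test_instances

# Time spent on a single instance, in microseconds
train_us = train_seconds / train_instances * 1e6
test_us = test_seconds / test_instances * 1e6
per_instance_ratio = test_us / train_us

print(f"training time / testing time: {time_ratio:.1f}")
print(f"training size / test size:    {size_ratio:.1f}")
print(f"per-instance cost, testing vs training: {per_instance_ratio:.1f}x")
```

Seen per instance, testing is the more expensive operation: roughly 50 microseconds against a little under 6 for training.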

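Why? This involves thinking about how the Naive Bayes algorithm works. Processing a single instance when training involves incrementing 55 counts, one for each attribute. Processing an instance when testing involves 55 multiplications for each of the 7 class values. Thus if the time taken for multiplication and addition were comparable, one would expect testing a single instance to take 7 times as long as training on one. That is close to what the timings imply: per instance, testing (0.5 seconds for 10,000 instances) takes roughly 9 times as long as training (10 seconds for 1,743,000 instances).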
