Working with big data

Ian Witten shows how some classifiers can handle arbitrarily large datasets when invoked from the command line, because they work incrementally.
It’s time to talk about big data. Everyone’s talking about big data. I’ve heard people say it’s like teenage sex. Everyone talks about it, but no one’s actually doing it. Those people probably didn’t have teenage children. Anyway, different people mean different things by big data, and what I mean by big data is datasets that can’t fit into the Weka Explorer. The Explorer loads the entire dataset. When you load a dataset, it’s all got to fit into main memory. How much can it handle? Well, roughly speaking, a million instances with 25 attributes in the default configuration. Actually, if you go to the Explorer and right-click on Status, you can get memory information, and this gives three figures here.
The last figure is the total amount of memory that is allocated to Weka, which is actually a gigabyte. That’s the default configuration. The other two figures, well, it’s a little bit complicated. The most important thing is the difference between these two figures. If you want to find out more, then you should look up the Java methods freeMemory() and totalMemory(). Although Weka initializes itself with a gigabyte of memory, on my computer, there’s more. In fact, if I look on my computer, if I right-click on Computer here, I can get the properties. The properties will show me that I’ve got 8 GB of memory.
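The memory figures mentioned above can be reproduced outside Weka with the standard `java.lang.Runtime` methods. The following is a minimal sketch, not Weka's own code: it assumes the three figures in the Explorer correspond to `freeMemory()`, `totalMemory()` and `maxMemory()`, and the class name `MemoryReport` and its helper methods are made up for illustration.

```java
public class MemoryReport {

    /** Formats a byte count as whole mebibytes (hypothetical helper). */
    static String toMiB(long bytes) {
        return (bytes / (1024L * 1024L)) + " MiB";
    }

    /** Heap actually in use: the difference between reserved and free memory. */
    static long usedBytes(long total, long free) {
        return total - free;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long free = rt.freeMemory();   // unused part of the heap reserved so far
        long total = rt.totalMemory(); // heap the JVM has reserved so far
        long max = rt.maxMemory();     // ceiling the heap is allowed to grow to

        System.out.println("free  = " + toMiB(free));
        System.out.println("total = " + toMiB(total));
        System.out.println("max   = " + toMiB(max));
        System.out.println("used  = " + toMiB(usedBytes(total, free)));
    }
}
```

The "used" line is the difference between the two figures that the lecture calls the most important one. The ceiling reported by `maxMemory()` is fixed when the JVM starts, which is why the amount of memory available to Weka has to be configured at launch rather than from inside the Explorer.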
I could, in fact, arrange for Weka to initialize itself with more main memory, but I’m not going to do that now. I’m going to try and break it. Let’s see what happens when you break Weka. Well, we can do this by downloading a large dataset. But I’m going to introduce you to Weka’s data generator instead. On the Preprocess panel, there is a Generate button, and that will generate random data according to particular patterns. I’m going to use the LED24 data, show it, and generate it. What this has generated is a dataset with a hundred instances of the LED data, which has got 25 attributes. There they are, the hundred instances. That’s what’s loaded in.
But I can easily generate more than the default 100 instances: let’s generate 100,000 instances by just adding three zeros to this. Generate that. Now it’s generated 100,000 instances. Let’s go and classify this. We could choose, say, J48.
I’m going to use percentage split here. Cross-validation would take a long time. J48 is working away.
It’s finished now, and it’s come up with a percentage accuracy of 73%. Or we could use NaiveBayes, which I think will be a little bit quicker, and that comes up with an accuracy of 74%. Let’s go and generate a million instances then with the data generator. We’ve got 100,000; so there’s a million. We can generate that. It’ll take a few seconds. There are a million instances, and we can go and classify that with NaiveBayes.
After a few seconds, I get the result. Here we go, 74% again. Now, I could try this with J48, but I happen to know that J48 uses more memory than NaiveBayes, and it will crash on this dataset. As things get bigger, the Explorer starts to crash. Actually, I could go and try to generate, say two million instances of this dataset. The Explorer would crash if I did that. When you’re doing this kind of thing, you’re better off using the console version of the Explorer. If you go to your All Programs menu, you’ll find that there are a couple of versions of Weka that are installed for you automatically. One is Weka “with console”.
That brings up this console window, and it’s the console window that reports when things crash, out of memory errors, and so on. If you’re going to mess around with this kind of thing, I’d recommend using that version of Weka.
This is the error message that you ought to get when J48 crashes. Unfortunately, when things break, they tend to break in different ways on different computers, so you might not get this error message. You might see that Weka just goes into an infinite loop and waits forever. It depends. That’s why the console version is a better thing to use. To go further, first of all, we mustn’t use the Explorer, because it loads the entire dataset in. Secondly, we need to use “updateable” classifiers. These are incremental classification models that process a single instance at a time. They don’t load the whole dataset into memory. There are a few of them.
In fact, we looked at them in the activity associated with the last lesson. The one we’re going to use is NaiveBayesUpdateable, which is just like NaiveBayes but an updateable implementation. IBk is also an updateable classifier, and there are a few others. How much data can Weka handle? If you use the Simple Command Line interface and updateable classifiers, then it’s unlimited. Let’s open up the Simple Command Line interface. Here it is. I’m going to create a huge dataset. Actually, I’m going to create a pretty small dataset here with 100,000 instances in. I’m going to run the LED24 data generator and put that in this file here.
That has created that dataset of 100,000 instances, which I’m going to use as a test file. For a training file, I’m going to use 10 million instances. I could change this to 10 million and put this in the training file. However, that would take a few minutes, so I’m not going to do that. Instead, I’ve prepared these files in advance. Let me show you. Here we’ve got test.arff. The test file is 5 MB, with 100,000 instances. The training file is half a GB, with 10 million instances. I’ve done a really big training file here, which is 5 GB, with 100 million instances. Those are the files I’m going to use.
I just need to run the NaiveBayesUpdateable classifier with the training file. This is the very large training file. This is the much smaller test file. If I run that by typing Enter here, it’ll take 4 minutes and produce 74% accuracy with NaiveBayesUpdateable. I can’t do it with J48 because that’s not an updateable classifier. I can try it with a really big file, with any size file. If I were to use my 5 GB training file with 100 million examples in it, then it would run. It takes about 40 minutes on my computer. There you have it. The Explorer can handle about a million instances with 25 attributes, say. It depends.
You can increase the amount of memory allocated to the Explorer if your computer’s got more than 1 GB of main memory. We haven’t talked about how to do that, but it’s not difficult. The Simple Command Line interface works incrementally wherever it can. It doesn’t load the dataset into main memory the way the Explorer does. If you use updateable classifier implementations – you can find which ones are updateable using the Javadoc – then the Simple Command Line interface will work incrementally. Then you can work with arbitrarily large files, many gigabytes or hundreds of gigabytes. However, you shouldn’t use cross-validation.
If you were to specify cross-validation in the Simple Command Line interface, then it would have to load the whole file in at once. The Simple Command Line interface avoids loading the whole file only when you’re not using cross-validation. That’s why we use an explicit test file instead of the default of cross-validation. Working with big data can be difficult and quite frustrating.

Some classifiers work incrementally – that is, they update their model as the training dataset comes in, in a single pass through the dataset. When invoked from the command line, these classifiers can handle arbitrarily large datasets. In contrast, the Explorer loads in the entire dataset to begin with irrespective of which classifier is used, so it is limited by the amount of computer memory available. Note that cross-validation cannot work incrementally; you need to be careful about how you do the evaluation, maybe using an explicit test file.
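To make the idea of an incremental ("updateable") classifier concrete, here is a small self-contained sketch in plain Java. It is not Weka's NaiveBayesUpdateable: the class `IncrementalNaiveBayesSketch`, its methods, the synthetic data stream, and the add-one smoothing are all assumptions chosen for illustration. What it shares with the approach described above is that the model is only a table of counts, updated one instance at a time in a single pass, and that evaluation uses a separate test stream rather than cross-validation.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class IncrementalNaiveBayesSketch {
    private final int numClasses;
    private final int[] numValues;        // distinct values per nominal attribute
    private final long[] classCounts;     // instances seen per class
    private final long[][][] valueCounts; // [attribute][class][value]
    private long seen = 0;

    public IncrementalNaiveBayesSketch(int numClasses, int[] numValues) {
        this.numClasses = numClasses;
        this.numValues = numValues.clone();
        this.classCounts = new long[numClasses];
        this.valueCounts = new long[numValues.length][numClasses][];
        for (int a = 0; a < numValues.length; a++) {
            for (int c = 0; c < numClasses; c++) {
                valueCounts[a][c] = new long[numValues[a]];
            }
        }
    }

    /** Updates the counts from one instance; the last element of row is the class. */
    public void update(int[] row) {
        int label = row[numValues.length];
        classCounts[label]++;
        for (int a = 0; a < numValues.length; a++) {
            valueCounts[a][label][row[a]]++;
        }
        seen++;
    }

    /** Single pass over a stream; memory use does not grow with the stream length. */
    public void train(Iterator<int[]> stream) {
        while (stream.hasNext()) {
            update(stream.next());
        }
    }

    /** Returns the most probable class, using add-one smoothing (an assumption). */
    public int classify(int[] row) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            double score = Math.log((classCounts[c] + 1.0) / (seen + numClasses));
            for (int a = 0; a < numValues.length; a++) {
                score += Math.log((valueCounts[a][c][row[a]] + 1.0)
                        / (classCounts[c] + numValues[a]));
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    /** Fraction of a separate test stream that is classified correctly. */
    public static double accuracy(IncrementalNaiveBayesSketch model, Iterator<int[]> test) {
        long correct = 0;
        long total = 0;
        while (test.hasNext()) {
            int[] row = test.next();
            if (model.classify(row) == row[row.length - 1]) {
                correct++;
            }
            total++;
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** Lazily generates n toy instances: two binary attributes, class = first attribute. */
    static Iterator<int[]> syntheticStream(final long n) {
        return new Iterator<int[]>() {
            private long i = 0;

            @Override
            public boolean hasNext() {
                return i < n;
            }

            @Override
            public int[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int a0 = (int) (i % 2);
                int a1 = (int) ((i / 2) % 2);
                i++;
                return new int[] {a0, a1, a0};
            }
        };
    }

    public static void main(String[] args) {
        IncrementalNaiveBayesSketch model =
                new IncrementalNaiveBayesSketch(2, new int[] {2, 2});
        model.train(syntheticStream(1_000_000L)); // instances are never stored
        System.out.println("accuracy = " + accuracy(model, syntheticStream(1_000L)));
    }
}
```

Because each instance is discarded as soon as its counts are recorded, the training stream could be arbitrarily long. Cross-validation does not fit this pattern, since every instance would have to be revisited for several folds, which is why the sketch evaluates on a separate test stream.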

This article is from the free online course More Data Mining with Weka.
