Skip to 0 minutes and 11 seconds What we’re going to cover in this lesson is building models and evaluating them. The classes that we’re going to touch upon in this lesson are weka.classifiers.Evaluation for evaluating classifiers; some classifiers, filters we’ve already seen in the last lesson; and some randomization stuff where we’re going to use some Java classes. The first thing that we want to do is build a J48 classifier. I’m going to start up our Jython console again. For this script, we’ll load some data, configure the J48 classifier, build it and output the model.
Skip to 0 minutes and 53 seconds First of all, the imports: once again, our DataSource for loading data and our J48 classifier. Once again, we’re going to load our data using our environment variable; this time we’re loading the anneal UCI dataset, and, since it’s classification, we also have to set the class attribute. In this case, it’s the last one. So with numAttributes you can determine the number of attributes in a dataset, and with setClassIndex you can set which of these attributes is going to be the class attribute. However, since it’s an API, you usually start counting at 0, not 1. That’s why we have numAttributes() - 1. Next thing is I’m going to instantiate the J48 classifier, and we’re going to set some options.
Skip to 1 minute and 42 seconds In this case, we’re changing the confidence factor from the default value of 0.25 to 0.3. With the data available now and the classifier configured, we can build it, which simply happens with a buildClassifier call supplying the data. Then, as a final step, we’re outputting the model with a simple print statement. We run that. We can see the model that is being output after it’s built from the data. As a next step, we want to evaluate a model that we’ve built. In this case, we’re going to use cross-validation, because there’s no point in building a model if you don’t actually know if it’s any good. I’m going to open a new tab and import some more stuff again.
Skip to 2 minutes and 26 seconds In this case, we also need the Evaluation class and, since we’re cross-validating we also want to randomize the data. In that case, we’re importing the Random class. Just like before, we’re loading the anneal UCI dataset, setting the class attribute. Then we’re configuring the same classifier again; confidence factor once again 0.3. And then we’re setting up our evaluation. First of all, we’re initializing our Evaluation object with the current data in order to obtain the class priors.
Skip to 3 minutes and 7 seconds Then we’re calling the crossValidateModel method of the Evaluation object with the classifier template (not built); the data that we want to evaluate on; the number of folds (in our case, we’re doing 10-fold cross-validation); and a random number generator initialized with a seed value of 1. After that finishes, we’ll have basically all the statistics inside the Evaluation object, and we want to output some things. First thing, we want to output some summary statistics. There’s the so-called toSummaryString method. If you look at the Javadoc, you’ll realize there are actually several methods: one with no parameters, one with a Boolean parameter, and one with String and Boolean parameters, like we’re using here.
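To make the cross-validation call described at the start of this segment more concrete, here is a minimal plain-Python sketch of what splitting data into 10 folds with a seeded random number generator amounts to. It does not call Weka. The helper name cross_validation_folds is hypothetical, and the sketch leaves out refinements such as stratifying the folds by class.

```python
import random


def cross_validation_folds(num_instances, num_folds, seed):
    """Yield (train_indices, test_indices) for each fold.

    Hypothetical helper for illustration only; it is not part of Weka.
    """
    indices = list(range(num_instances))
    # A fixed seed makes the shuffle, and therefore the folds, repeatable.
    random.Random(seed).shuffle(indices)
    for fold in range(num_folds):
        # Every num_folds-th shuffled index goes into this fold's test set.
        test = indices[fold::num_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# Small made-up example: 20 instances, 10 folds, seed 1.
folds = list(cross_validation_folds(num_instances=20, num_folds=10, seed=1))
for train, test in folds:
    print(len(train), len(test))
```

Each instance ends up in exactly one test set, and rerunning with the same seed reproduces the same folds, which is why the script passes a seeded generator rather than an unseeded one.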
Skip to 4 minutes and 2 seconds Now the difference between Python and Java is that Python doesn’t have method overloading; it has optional parameters and named parameters. So in order for Jython to work, you basically have one method that has all the various parameters available. In this case, we have to provide a title for our summary string and indicate that we don’t want to output any complexity statistics, hence False. That is that. Since this is classification, we also want to output the confusion matrix, which you can do with the toMatrixString method.
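The contrast between several Java methods sharing one name and Python's optional and named parameters can be sketched in plain Python. The function to_summary_string below is a hypothetical stand-in, not Weka code, and its placeholder output is made up; it only shows how one signature with defaults can cover the three call shapes mentioned above.

```python
def to_summary_string(title="", print_complexity_statistics=False):
    """Hypothetical stand-in for the three Java variants; not Weka code."""
    lines = [title] if title else []
    lines.append("(summary statistics would go here)")
    if print_complexity_statistics:
        lines.append("(complexity statistics would go here)")
    return "\n".join(lines)


# No arguments, like the Java variant without parameters.
print(to_summary_string())
# Only the Boolean, passed as a named parameter.
print(to_summary_string(print_complexity_statistics=True))
# Title and Boolean, the call shape used in the lesson's script.
print(to_summary_string("=== J48 on anneal ===", False))
```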
Skip to 4 minutes and 36 seconds When we’re running this script, you’ll see in the output our usual summary statistics of accuracy, what’s missed, kappa statistic, all kinds of errors, coverage, and how many instances there were all together in the dataset – almost 900 in the anneal dataset case. The confusion matrix was also output. You can see there are hardly any instances that are not on the diagonal. According to our misclassified ones, it should be only 14. So we have 3 here, 2 there, 2 there, and 7 there, which is 14, so all is good. The final script that we want to do in this lesson is how we can actually use a built model to make predictions. I’m going to open up a new tab again.
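The check against the confusion matrix described above (off-diagonal entries adding up to the number of misclassified instances) can be written as a short plain-Python sketch. Only the four off-diagonal counts 3, 2, 2 and 7 come from the lesson; the matrix size, the positions of those counts, and the diagonal values are made up for illustration.

```python
# Rows are actual classes, columns are predicted classes.
# Only the four off-diagonal counts (3, 2, 2, 7) come from the lesson;
# their positions and the diagonal counts are made up for illustration.
confusion_matrix = [
    [10, 3, 0, 0],
    [2, 20, 0, 0],
    [0, 2, 30, 0],
    [0, 0, 7, 40],
]

total = sum(sum(row) for row in confusion_matrix)
# Correct predictions sit on the diagonal (actual == predicted).
correct = sum(confusion_matrix[i][i] for i in range(len(confusion_matrix)))
misclassified = total - correct

print("misclassified:", misclassified)
```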
Skip to 5 minutes and 30 seconds In this case, like in the first script, we are importing our DataSource for loading data and our J48 classifier. We are once again loading a dataset. In this case, we’re not using the usual anneal dataset but one that’s been stripped down a bit, the anneal_train set. But still, the class attribute is in the same location, so it’s the last one. Setting that. We are once again configuring our J48 classifier, because we were happy with that configuration, based on our cross-validation results – it resulted in excellent results. Then we are building our classifier on the data, once again using the buildClassifier method, and since we want to make predictions on unlabeled data, we are now loading the unlabeled data in.
Skip to 6 minutes and 26 seconds In this case, it’s the dataset anneal_unlbl, which basically has the same dataset structure, but just “missing values” for the class. We also set the class attribute for this one. It’s usually recommended that you check whether your training and test/unlabeled data are actually compatible. You can use a method of the Instances class called equalHeadersMsg for telling whether two datasets are the same. If you look at this code here, the unlabeled data is checked against the training data, and this will return a message, but only in the case where they are different (for instance, a different number of attributes, different types, or a different order of labels). Then it will output a message.
Skip to 7 minutes and 13 seconds Otherwise it will just output “None” or, in the Java case, “null”. In case we have a discrepancy between our datasets, then this will be output simply saying that they are not the same. And for making our predictions finally, since we now have our unlabeled data and our built model, we just iterate through our unlabeled data row by row, and then we obtain our class distribution by calling the distributionForInstance method. We want to know what the chosen class label is, so we’re using the classifyInstance method, which returns, in the case of a nominal class attribute, the label index (starting with 0).
Skip to 8 minutes and 6 seconds In order to determine what the string label actually is, we use the dataset, retrieve the class attribute, and then determine the string value that is associated with that particular index. Finally, with a simple print statement, we output our class distribution, our label index, and the associated label. Running that, we get an output like this. First you get an array, which is the class distribution; then the index of the label; and the label itself; all separated by hyphens. At the bottom, you can see you have 1, 2, 3, 4, 5, 6 labels all together there, so index 5, and the label is U in this case.
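The relationship between the class distribution, the label index, and the label string can be illustrated with a small plain-Python sketch. It does not call Weka: the list of six labels with U at index 5 is an assumption based on the output described above, and the distribution values are made up.

```python
# Assumed class labels for illustration: six labels, with "U" at index 5.
class_labels = ["1", "2", "3", "4", "5", "U"]

# Made-up class distribution for one instance (probabilities sum to 1).
distribution = [0.0, 0.0, 0.1, 0.0, 0.0, 0.9]

# The predicted label index is the position of the largest probability.
label_index = max(range(len(distribution)), key=distribution.__getitem__)
# Look up the string label that belongs to that index.
label = class_labels[label_index]

print(distribution, "-", label_index, "-", label)
```

In the lesson's script the index comes from classifyInstance and the string from the class attribute of the dataset; the list lookup here plays the role of that second step.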
Skip to 9 minutes and 4 seconds So what we’ve learned in this lesson is how to build a classifier. We can output statistics from cross-validation that we’ve obtained from a classifier on a particular dataset, and we also used a built model to actually make predictions on new, unlabeled data.
Peter demonstrates writing three Python scripts for Weka using the J48 classifier, using the anneal dataset. The first builds a classifier and outputs the model, the second evaluates a classifier and outputs summary statistics and a confusion matrix, and the third builds a classifier and makes predictions on an unlabeled dataset.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.