Skip to 0 minutes and 11 seconds
So far, we’ve been using Python from within the Java Virtual Machine. However, in this lesson, we’re going to invoke Weka from within Python. But you might ask, “why the other way? Isn’t it enough using Jython?” Well, yes and no. Jython limits you to pure Python code and to Java libraries, and Weka provides only modeling and some limited visualization. However, Python has so much more to offer. For example, NumPy, a library of efficient arrays and matrices; SciPy, for linear algebra, optimization, and integration; matplotlib, a great plotting library. You can check all this out on the Python wiki under Numeric and Scientific libraries. So what do we need?

Skip to 0 minutes and 59 seconds
Well, first of all we need to install Python 2.7, which you can download from python.org. But make sure the Java that you’ve got installed on your machine and Python have the same bit-ness: both have to be either 32-bit or 64-bit. You cannot mix things. You also have to set up an environment in which you can actually compile some libraries. On Linux, that’s an absolute no-brainer: a few lines on the command line and you’re done within 5 minutes. However, OSX and Windows involve quite a bit of work, so it’s not necessarily for the faint-hearted.
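The bit-ness of the Python side can be checked from within the interpreter itself. The following is a minimal sketch using only the standard library; the helper name `python_bitness` is made up for this illustration. The Java side has to be checked separately, for example by looking at the output of `java -version`, which typically mentions "64-Bit" for a 64-bit JVM.

```python
import platform
import struct


def python_bitness():
    """Return 32 or 64, derived from the pointer size of the running interpreter."""
    # "P" is the struct format code for a C pointer; its size is given in bytes.
    return struct.calcsize("P") * 8


# Report the interpreter's bit-ness together with the machine type.
print("Python is %d-bit on %s" % (python_bitness(), platform.machine()))
```

If the number reported here does not match the JVM, one of the two has to be reinstalled before the wrapper can work.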

Skip to 1 minute and 36 seconds
You then install the python-weka-wrapper library, which we’re going to use in today’s lesson; you’ll find it, along with instructions on how to install it on the various platforms, on that page. Good luck with that. I’ve got it already installed, so I’m going to talk a bit more about what the python-weka-wrapper actually is. This library fires up a Java Virtual Machine in the background and communicates with the JVM via the Java Native Interface. It uses the javabridge library for doing that, and the python-weka-wrapper library sits on top of that and provides a thin wrapper around Weka’s superclasses, like classifiers, filters, clusterers, and so on.

Skip to 2 minutes and 26 seconds
And, in contrast to the Jython code that we’ve seen so far, it provides a more “pythonic” API. Here are some examples. Python properties are used instead of the Java get/set-method pairs, for example options instead of getOptions/setOptions. It uses lowercase plus underscore instead of Java’s camel case, for example crossvalidate_model instead of crossValidateModel. It also has some convenience methods that Weka doesn’t have, for example data.class_is_last() instead of data.setClassIndex(data.numAttributes() - 1). And plotting is done via matplotlib. Right. So I presume you were lucky installing everything, and you’ve sorted everything out. I’ve already done that on my machine here because it takes way too long, and I’m going to fire up the interactive Python interpreter.
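To make the property idea concrete, here is a small, purely illustrative sketch. Both classes are hypothetical stand-ins written for this example and are not the wrapper's actual implementation: one mimics a Java object with a getOptions/setOptions pair, the other exposes that pair as a single Python property.

```python
class JavaStyleClassifier(object):
    """Hypothetical stand-in for a Java object with a get/set method pair."""

    def __init__(self):
        self._options = []

    def getOptions(self):
        return list(self._options)

    def setOptions(self, options):
        self._options = list(options)


class PythonicClassifier(object):
    """Hypothetical thin wrapper that exposes the pair as one property."""

    def __init__(self, jobject):
        self._jobject = jobject

    @property
    def options(self):
        # Reading the property delegates to the Java-style getter.
        return self._jobject.getOptions()

    @options.setter
    def options(self, value):
        # Assigning to the property delegates to the Java-style setter.
        self._jobject.setOptions(value)


cls = PythonicClassifier(JavaStyleClassifier())
cls.options = ["-C", "0.3"]   # instead of setOptions([...])
print(cls.options)            # instead of getOptions()
```

The caller only ever sees attribute access, which is what makes the API feel more like ordinary Python.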

Skip to 3 minutes and 22 seconds
For the first script, we want to revisit cross-validating a J48 classifier. As with all the other examples, we have to import some libraries. In this case, we’re communicating with the JVM, so we need some way of communicating with it and of starting and stopping it, so we import the weka.core.jvm module. We want to load data, so we’re going to import the converters, and we’re importing Evaluation and Classifier. First of all, we’re going to start the JVM. In this case, loading the Weka packages as well is not strictly necessary, but we’ll just do it. You can see a lot of output here. It basically tells you what the libraries are in the classpath, which is all good.

Skip to 4 minutes and 21 seconds
Next thing is we’re going to load some data, in this case our anneal dataset, once again using the same approach that we used with Jython, via the environment variable. That’s loaded. Then we’re going to set the class, which is the last one, and we’re going to configure our J48 classifier. Whereas in Jython we simply said “I want to have the J48 class”, we’re going to instantiate a Classifier object here and tell that class what Java class to use, which is our J48 classifier, and with what options. So the same confidence factor of 0.3. Once again, same thing for the Evaluation class.

Skip to 5 minutes and 11 seconds
We instantiate an Evaluation object with the training data to determine the priors, and then cross-validate the classifier on the data with 10-fold cross-validation. That’s done. And now we can also output our evaluation summary. This is simply done with Evaluation.summary(...): we pass the title, and we don’t want any complexity statistics being output. Since in our Jython example we also had the confusion matrix, we’re going to output that as well. Here’s our confusion matrix. One thing you should never forget is that, once you’re done, you also have to stop the JVM and shut it down properly. We can see, once again as with the other one, that we have 14 misclassified examples out of our almost 900 examples.

Skip to 6 minutes and 15 seconds
You can count those: 3, 2, 2, and 7, which is 14; here’s the confusion matrix as well. For the next script we’ll be plotting the classifier errors obtained from a LinearRegression classifier on a numeric dataset. Once again we’ll be using the errors between predicted and actual as the size of the bubbles. Once again I’m going to fire up the interactive Python interpreter. I’m going to import, as usual, a bunch of modules. New in this case is the plotting module for classifiers, which I’m also going to import here. We’ll start up our JVM. We’re loading our bodyfat dataset in, setting the class attribute. Then we’re going to configure our LinearRegression, once again turning off some bits, which makes it faster.

Skip to 7 minutes and 22 seconds
We’re going to evaluate it on our dataset with 10-fold cross-validation. Done. And now we can plot it with a single line. Of course, we’re cheating here a little bit, because the module does a lot of the heavy lifting, which we had to do manually with Jython. Here we go. Nice plot. Of course, you can also zoom in if you wanted to. Great. As a final step, stop the JVM again, and we can exit. The last script that we’re going to do in this lesson, we’ll be plotting multiple ROC curves, like we’ve done with Jython. Once again, the Python interpreter.

Skip to 8 minutes and 18 seconds
It’s a nice thing: we can just open it up and do stuff with it straight away. Import stuff. Once again we’re using a plotting module for classifiers. We are starting up the JVM; loading the balance-scale dataset like we did with Jython; and we also use the NaiveBayes classifier; as you can see, this time there are no options. Cross-validate the whole thing with 10-fold cross-validation. Then we use the plot_roc method to plot everything. We want to plot 0, 1, and 2 class label indices. Here we have those. Once again, we can see the AUC values for each of the labels, whether it’s L, B, or R.

Skip to 9 minutes and 23 seconds
Final step: stop the JVM again and exit.

Invoking Weka from Python

So far, we’ve been using Python from within Weka. However, in this lesson we work the other way round and invoke Weka from within Python. This allows you to take advantage of the numerous program libraries that Python has to offer. You need to install Python, and then the python-weka-wrapper library for Python. (You will probably need admin access to your computer for this.) Having set this up, we replicate some scripts from earlier lessons.

To set up Python and the python-weka-wrapper:

  • Download and install Python 2.7 (installation is easy on Linux, but can be challenging on Windows and OSX)
  • Download and install the python-weka-wrapper library
  • (Note: there is also a Python 3 version of python-weka-wrapper here; use that if you prefer)
  • Here are instructions for all this

Data files:

Evaluate a classifier and output summary statistics and a confusion matrix:
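A minimal sketch of this step, following the walkthrough in the video and assuming the python-weka-wrapper API of that period. The environment variable name `MOOC_DATA`, the random seed of 1, the title strings, and both function names are assumptions made for this illustration, not details confirmed by the video.

```python
import os


def data_file(name, env_var="MOOC_DATA"):
    """Build a dataset path from an environment variable (variable name assumed)."""
    return os.path.join(os.environ.get(env_var, "."), name)


def cross_validate_j48(arff_path, folds=10, seed=1):
    """Cross-validate J48 and print the summary and the confusion matrix."""
    # Imported lazily so this file can be loaded without the wrapper installed.
    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.core.classes import Random
    from weka.classifiers import Classifier, Evaluation

    jvm.start(packages=True)  # package support is optional for this script
    try:
        loader = Loader(classname="weka.core.converters.ArffLoader")
        data = loader.load_file(arff_path)
        data.class_is_last()  # the class attribute is the last one

        # Same confidence factor as in the Jython version of the script.
        cls = Classifier(classname="weka.classifiers.trees.J48",
                         options=["-C", "0.3"])

        evl = Evaluation(data)  # the data is used to determine the priors
        evl.crossvalidate_model(cls, data, folds, Random(seed))

        # Second argument False: no complexity statistics in the summary.
        print(evl.summary("=== J48 on anneal (stats) ===", False))
        print(evl.matrix("=== J48 on anneal (confusion matrix) ==="))
    finally:
        jvm.stop()  # always shut the JVM down properly
```

Calling `cross_validate_j48(data_file("anneal.arff"))` would run the whole experiment. As far as I know, a JVM started through javabridge cannot be restarted in the same Python process once it has been stopped, so a function like this is meant to be called once per interpreter session.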

Plot the errors made by LinearRegression:
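A sketch of how this plot might be produced, again assuming the python-weka-wrapper API of that period. The LinearRegression options shown (`-S 1` to turn off attribute selection, `-C` to skip the elimination of collinear attributes) are an assumption about which "bits" are turned off in the video; the seed and the function name are also made up for this example.

```python
# Assumed options: no attribute selection, no elimination of collinear attributes.
LINREG_OPTIONS = ["-S", "1", "-C"]


def plot_linear_regression_errors(arff_path, folds=10, seed=1):
    """Cross-validate LinearRegression and plot its errors as bubbles."""
    # Imported lazily so this file can be loaded without the wrapper installed.
    import weka.core.jvm as jvm
    import weka.plot.classifiers as plot_cls
    from weka.core.converters import Loader
    from weka.core.classes import Random
    from weka.classifiers import Classifier, Evaluation

    jvm.start()
    try:
        loader = Loader(classname="weka.core.converters.ArffLoader")
        data = loader.load_file(arff_path)
        data.class_is_last()

        cls = Classifier(
            classname="weka.classifiers.functions.LinearRegression",
            options=LINREG_OPTIONS)

        evl = Evaluation(data)
        evl.crossvalidate_model(cls, data, folds, Random(seed))

        # The single plotting line: bubble sizes reflect the prediction errors.
        # "predictions" is assumed to be a property, as in later wrapper versions.
        plot_cls.plot_classifier_errors(evl.predictions, absolute=False, wait=True)
    finally:
        jvm.stop()
```

With the bodyfat dataset from the video, the call would be `plot_linear_regression_errors("bodyfat.arff")`, with the path adjusted to wherever the data files live.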

Display multiple ROC curves, one for each class:
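A sketch of the ROC script along the same lines, assuming the `plot_roc` function of the wrapper's `weka.plot.classifiers` module. The seed and the function name are assumptions for this illustration; the class label indices 0, 1, and 2 are the ones mentioned in the video.

```python
# Class label indices to plot, one ROC curve per label (L, B, R in the video).
CLASS_INDICES = [0, 1, 2]


def plot_naive_bayes_roc(arff_path, folds=10, seed=1):
    """Cross-validate NaiveBayes and plot one ROC curve per class label."""
    # Imported lazily so this file can be loaded without the wrapper installed.
    import weka.core.jvm as jvm
    import weka.plot.classifiers as plot_cls
    from weka.core.converters import Loader
    from weka.core.classes import Random
    from weka.classifiers import Classifier, Evaluation

    jvm.start()
    try:
        loader = Loader(classname="weka.core.converters.ArffLoader")
        data = loader.load_file(arff_path)
        data.class_is_last()

        # NaiveBayes is used with its defaults, so no options are passed.
        cls = Classifier(classname="weka.classifiers.bayes.NaiveBayes")

        evl = Evaluation(data)
        evl.crossvalidate_model(cls, data, folds, Random(seed))

        # One curve per requested class label index.
        plot_cls.plot_roc(evl, class_index=CLASS_INDICES, wait=True)
    finally:
        jvm.stop()
```

With the balance-scale dataset from the video, the call would be `plot_naive_bayes_roc("balance-scale.arff")`, with the path adjusted to wherever the data files live.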

This video is from the free online course:

Advanced Data Mining with Weka

The University of Waikato
