
[0:11] This lesson is slightly different from the other ones, because we will be looking at a real-world challenge, and then, as a second part, we'll be looking at another scripting language called Groovy. The challenge comes from the annual shoot-out run by the Council for Near-Infrared Spectroscopy, and the shoot-out process works as follows. You build your model on the training data, which is called "calibration" data in infrared spectroscopy terms. You evaluate your model on a separate dataset, the test dataset, and then you generate and submit predictions to the shoot-out. However, we won't do the last step of submitting our predictions, because that particular challenge has already finished.

[0:51] But we're still going to use the data, which is publicly available at the link below, so you can download it and run the exercise yourself. What are you going to do? Well, first, you're going to download the CSV files for Datasets 1 and 2. I'm just going to go to their website here. Here are Datasets 1 and 2. For each of them, you download the CSV files, and you only need the calibration and the test set; we don't need the validation one.

[1:24] Then you generate data for Weka in ARFF format: one file for building the model (the calibration set) and one for evaluating it (the test set). The class attribute in the calibration dataset is called "reference value", and you shouldn't include the "sample #" attribute in your model. It is now up to you not only to produce a proper dataset with compatible training and test sets, but also to come up with a good regression scheme for predicting the reference value. But what do you have to beat? Well, in our case, on Dataset 1 you have to beat a correlation coefficient of 0.8644 and a root mean squared error of 0.384.

[2:09] And on Dataset 2 you have to beat a correlation coefficient of 0.9986 and a root mean squared error of 0.0026. Up for the challenge? Good luck!
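To get you started, here is a minimal sketch of how the data preparation and a baseline evaluation could be scripted against the Weka API in Groovy (the language introduced below). The CSV file names, the exact attribute names, and LinearRegression as a baseline scheme are assumptions for illustration, not the official solution:

```groovy
import weka.core.converters.ArffSaver
import weka.core.converters.CSVLoader
import weka.classifiers.Evaluation
import weka.classifiers.functions.LinearRegression
import weka.filters.Filter
import weka.filters.unsupervised.attribute.Remove

// hypothetical helper: convert a downloaded CSV file to ARFF,
// dropping the "sample #" attribute and setting "reference value" as the class
def csvToArff(String csvFile, String arffFile) {
    def loader = new CSVLoader()
    loader.setSource(new File(csvFile))
    def data = loader.dataSet

    def remove = new Remove()
    remove.setAttributeIndices("" + (data.attribute("sample #").index() + 1))
    remove.setInputFormat(data)
    data = Filter.useFilter(data, remove)
    data.setClassIndex(data.attribute("reference value").index())

    def saver = new ArffSaver()
    saver.instances = data
    saver.setFile(new File(arffFile))
    saver.writeBatch()
    return data
}

// file names are assumptions -- use whatever you saved the downloads as
def train = csvToArff("DS1_calibration.csv", "DS1_calibration.arff")
def test  = csvToArff("DS1_test.csv", "DS1_test.arff")

// a simple baseline; your job is to beat the figures quoted above
def cls = new LinearRegression()
cls.buildClassifier(train)
def eval = new Evaluation(train)
eval.evaluateModel(cls, test)
println "Correlation coefficient: " + eval.correlationCoefficient()
println "RMSE: " + eval.rootMeanSquaredError()
```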

[2:25] Now to the second part: using Groovy, another scripting language. As I already mentioned in the introduction, Groovy also runs in the JVM and can be installed through the Package Manager as well. If you haven't done so already, please open up the Package Manager and install the kfGroovy package (it doesn't matter which version).

Skip to 2 minutes and 48 secondsversion): “kf” for KnowledgeFlow Groovy. I’ve already done that, and I’m going to show you what the interface looks like. Once again, just like with the Jython console, you’ll find a Groovy console menu item under the Tools menu in the GUI Chooser. Once you’ve opened up that, you’ll find the appearance of the Groovy console very similar to that of the Jython console. On the top you write your script, and at the bottom you’ll see the output. However, in the Groovy console you cannot use multiple tabs, you have to open multiple instances; but for our purpose that is sufficient. Now before we start, just a few minor Groovy basics.

[3:30] The grammar of Groovy is derived from Java, with the exception that you don't have to write semicolons to finish a line, which makes it much nicer. def, for definition, defines a variable; you don't have to supply any types. Lists are very similar to the Python ones: square brackets, comma-separated, and the elements can be of mixed types. Maps are also very similar to the Python ones (where they're called "dictionaries"); however, you don't use curly brackets, you still use square brackets. Groovy also enhances the Java syntax. For example, you get multi-line strings by using triple single quotes, and you can use string interpolation.
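Here is a minimal sketch of these basics that you can paste straight into the Groovy console (all names and values are made up for illustration):

```groovy
def x = 42                           // def declares a variable; no type required
def list = [1, "two", 3.0]           // lists use square brackets and can mix types
def map = [name: "weka", folds: 10]  // maps also use square brackets, not curly ones

def text = '''a multi-line string
written with triple single quotes'''

def who = "Groovy"
println "Hello from ${who}!"         // string interpolation (double-quoted strings)
println list[1]                      // -> two
println map["name"]                  // -> weka
```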

[4:18] You also get default imports of commonly used packages, like java.lang, java.io, java.net, and so on. And, last but not least, closures. They are not quite the same as Java 8 lambdas, but they're a very powerful tool. They're basically anonymous code blocks, which can take parameters and return values, and can be assigned to variables. If you want to look up the differences between Java and Groovy, follow the link. One really funky thing about Groovy that I very much like is looping.
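Before moving on to looping, a quick closure sketch (again with made-up illustration values):

```groovy
// a closure assigned to a variable; it takes a parameter and returns a value
def square = { n -> n * n }
println square(4)                 // -> 16

// closures are typically passed to methods; 'it' is the implicit parameter
def nums = [1, 2, 3, 4]
println nums.collect { it * it }  // -> [1, 4, 9, 16]
```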

[4:48] Of course, you have the standard Java for-loop and while-loop, but – since everything is an object in Groovy – you can also use some additional methods called upto, times, and step, as long as you have number objects, like integers and so on. Looking at upto: 0.upto(10) basically outputs all the numbers from 0 to 10, both included. If you do times, for example 5.times, that outputs the numbers from 0 to 5 with 5 excluded, so it outputs the numbers 0, 1, 2, 3, 4.
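A quick sketch of all three loop methods, including step, which is described next; each takes a closure that receives the current number:

```groovy
0.upto(10) { println it }     // prints 0 through 10, both ends included
5.times { println it }        // prints 0, 1, 2, 3, 4 (the receiver, 5, is excluded)
0.step(10, 2) { println it }  // prints 0, 2, 4, 6, 8 (the bound, 10, is excluded)
```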

[5:34] Last but not least, you can also step through: 0.step(10, 2) means you go from 0 to 10 in steps of 2, so it outputs the numbers 0, 2, 4, 6, and 8. OK. With the basics out of the way, we're going to dive into writing one of the scripts we've already seen previously in Jython and python-weka-wrapper, and make some predictions with a built classifier. Once again, as always, we have some imports, and, similar to Jython, we simply import the classes directly. We once again do the trick with our environment variable; however, here we use System.getenv().
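A sketch of how this script might start; the MOOC_DATA environment variable is the same one used in the earlier Jython lessons:

```groovy
import weka.classifiers.trees.J48
import weka.core.converters.ConverterUtils.DataSource

// the same environment-variable trick as in Jython, but via System.getenv()
def dataDir = System.getenv("MOOC_DATA")
```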

[6:18] Then we load our training data, once again using the MOOC_DATA environment variable via our shortcut variable, loading the anneal_train dataset. We set the class attribute, once again the last one, and we also load the unlabeled data and set its class. Now we instantiate J48 and set some options. There's a minor difference to Jython here: you actually have to specify that this is a string array. So even though you have a list of strings, you have to say what you want to cast it to. And once again, we build our classifier on the training data and output the built model, just for the fun of it.
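Continuing the sketch (the unlabeled file name and the option values are assumptions):

```groovy
// load the training data and set the class attribute to the last one
def train = DataSource.read(dataDir + "/anneal_train.arff")
train.setClassIndex(train.numAttributes() - 1)

// load the unlabeled data and set its class attribute, too
def unlabeled = DataSource.read(dataDir + "/anneal_unlbl.arff")
unlabeled.setClassIndex(unlabeled.numAttributes() - 1)

// instantiate J48; unlike in Jython, the list of options must be cast to String[]
def cls = new J48()
cls.options = ["-C", "0.3"] as String[]
cls.buildClassifier(train)
println cls  // output the built model, just for the fun of it
```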

[7:21] Now we want to look at making predictions once again. First of all, we look at what labels we have. In this case, we use the previously mentioned .times method, which, given the number of values that the class attribute has, runs from 0 to that number minus 1, and we add each string label to a list, which we can then output with a simple println statement. We use the list's join method, joining all the elements of the list with a comma, generating a comma-separated string.

[8:18] Once again we use our times method, but this time with the number of instances in the dataset: for all the rows in the data we call the classifier's distributionForInstance method to retrieve the class distribution, and then simply output it. When we run this, you first see that it loads the whole thing into the JVM here at the top; it just outputs what we're loading. After that you can see the J48 tree that we built on the training data. Then we have our class labels, and finally the class distributions for all the rows in the data. Slightly different from Jython and python-weka-wrapper, but not too different.
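The prediction part of the script might look like this sketch (the variable names carry over from the fragment above):

```groovy
// collect the class labels, using .times on the number of class values
def labels = []
train.classAttribute().numValues().times { i ->
    labels << train.classAttribute().value(i)
}
println labels.join(",")  // one comma-separated string of labels

// output the class distribution for every row in the unlabeled data
unlabeled.numInstances().times { i ->
    def dist = cls.distributionForInstance(unlabeled.instance(i))
    println "${i + 1} - ${dist}"
}
```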

[9:05] As the second script, we'll be looking at outputting multiple ROC curves on the balance-scale data. We'll start from scratch. Once again, we have a bunch of imports that we need: the Evaluation class again; NaiveBayes as the classifier again; ThresholdCurve, which allows us to compute the ROC curves; DataSource for loading the data; and, once again, JFreeChart for the plotting. First, we load the data in again using our environment variable and set the class attribute as the last one. We instantiate NaiveBayes (no options are necessary). Then we cross-validate it, after initializing an Evaluation object on the training data.

[10:00] We do 10-fold cross-validation, and once again use a seed value of 1 for the random number generator. Having done that, we can create our plot dataset once again. It's just a simple XY dataset, and, as you can see, we use our .times again: since we want multiple curves, one for each label of the class attribute, we loop over the number of class labels with .times, retrieve the curve data for each label, and then pull out the False Positive Rate column and the True Positive Rate column.
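Putting the whole second script together as a sketch (the file name and the frame details are assumptions; the Weka and JFreeChart calls are standard API):

```groovy
import weka.classifiers.Evaluation
import weka.classifiers.bayes.NaiveBayes
import weka.classifiers.evaluation.ThresholdCurve
import weka.core.converters.ConverterUtils.DataSource
import org.jfree.chart.ChartFactory
import org.jfree.chart.ChartPanel
import org.jfree.chart.plot.PlotOrientation
import org.jfree.data.xy.DefaultXYDataset
import javax.swing.JFrame

// load the balance-scale data and set the class attribute to the last one
def data = DataSource.read(System.getenv("MOOC_DATA") + "/balance-scale.arff")
data.setClassIndex(data.numAttributes() - 1)

// 10-fold cross-validation of NaiveBayes, seed 1 for the random number generator
def eval = new Evaluation(data)
eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1))

// one ROC curve per class label, added as a series to a simple XY dataset
def plotData = new DefaultXYDataset()
def tc = new ThresholdCurve()
data.classAttribute().numValues().times { i ->
    def curve = tc.getCurve(eval.predictions(), i)
    def fpr = curve.attributeToDoubleArray(curve.attribute("False Positive Rate").index())
    def tpr = curve.attributeToDoubleArray(curve.attribute("True Positive Rate").index())
    def label = data.classAttribute().value(i)
    def auc = ThresholdCurve.getROCArea(curve)
    plotData.addSeries("${label} (AUC: ${auc})".toString(), [fpr, tpr] as double[][])
}

// a simple XY line chart with False/True Positive Rate axes, shown in a frame
def chart = ChartFactory.createXYLineChart("ROC", "False Positive Rate",
    "True Positive Rate", plotData, PlotOrientation.VERTICAL, true, true, false)
def frame = new JFrame("ROC curves")
frame.contentPane = new ChartPanel(chart)
frame.setSize(800, 600)
frame.visible = true
```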

[10:46] We turn those columns into lists and add them as a data series to our plot dataset, including the AUC in the series name. Having done that, we can create the plot, which is just a simple XY line chart with False Positive Rate and True Positive Rate as its axes. As the last step, as usual, we create a frame, embed a chart panel with the plot, and make the whole thing visible. And we run that. It takes a little while, and then we have the plot that we've already seen before. You've now seen quite a range of scripting languages that you can use with the Weka API, whether inside the JVM or outside it, using Python itself.

[11:44] And, last but not least, you also had some fun with a real-world data challenge, and I hope you were much, much better than I was.

A challenge, and some Groovy

We begin by looking at a real-world challenge: the IDRC (International Diffuse Reflectance Conference) Shootout challenge. The training data – called "calibration data" – and the test data are linked below. The challenge is tough: in the step that follows the upcoming quiz, Peter talks about his solution and how he arrived at it. Then, as a second part, we look at another scripting language called "Groovy" and use it to replicate two of the earlier Python scripts.

IDRC Shootout challenge:

Datasets:

Build a classifier and make predictions on an unlabeled dataset:

Display multiple ROC curves, one for each class:


This video is from the free online course "Advanced Data Mining with Weka" by The University of Waikato.
