This content is taken from The University of Waikato's online course, Advanced Data Mining with Weka. Join the course to learn more.

Skip to 0 minutes and 11 seconds Hi! In the last video we saw how we can use the MLR classifier in Weka to run learning algorithms implemented in R from within Weka. Before that, we saw how we can use the R console in Weka to run R commands, for example, to plot data. In today’s lesson, we’ll see how we can use the preprocessing tools implemented in R to preprocess data before we pass that data on to Weka learning algorithms. OK, let’s get started. We will use the Knowledge Flow environment in Weka to process data in R and then pass it on to a Weka learning algorithm. The Knowledge Flow environment, once the RPlugin has been installed, provides an RScriptExecutor component that executes a user-supplied R script.

Skip to 0 minutes and 51 seconds So we can put this on the canvas, and then right-click on it or double-click on it to configure it. What we can see here is that we can enter an R script, or we can load a script from a file. Before we start with our script we should load some data. So we need a data source. Let’s use an ArffLoader data source. OK, maybe we just use the iris data to start off with, and then we make a dataset connection to our RScriptExecutor.

Skip to 1 minute and 34 seconds Now, the data will be loaded as an ARFF file in Java, then it will be passed to the R environment, and the data that is processed by this R script will be passed back into Weka so that, for example, we can visualize it. We can put a scatterplot matrix here, and then, once we’ve configured our R script, we can connect it up to the RScriptExecutor. Something very simple that we can do in our R script is to delete one of the attributes from the incoming data. The incoming data is referred to by the name rdata, just as is done in the R console in the Explorer, as well.

Skip to 2 minutes and 9 seconds Then in brackets, we can specify which columns of this data we want to keep. Let’s say we just want to keep the first 4 attributes of the data and discard the class attribute. Now, this command will be executed and then the result will be passed into Weka as an ARFF file. We can now connect our RScriptExecutor component to the ScatterPlotMatrix component using a dataset connection. Right, let’s try this. So we run the flow, and now let’s check the plot by right-clicking on the component. As we can see, we’ve got a ScatterPlotMatrix here, which has 4 of the attributes but not the class attribute. So it all worked as intended. Let’s try something more sophisticated.
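As a rough sketch, the column selection described here could look like the following in plain R. Outside Weka there is no rdata, so R's built-in iris data frame is used as a stand-in (an assumption; its class column is called Species rather than class), and the variable name kept is purely illustrative.

```r
# Stand-in for the data frame that the RScriptExecutor supplies as `rdata`
# (assumption: R's built-in iris data, used only so the sketch runs outside Weka)
rdata <- iris

# Keep the first 4 columns (the numeric attributes) and drop the class column
kept <- rdata[1:4]

# Show which attributes remain
names(kept)
```

In the RScriptExecutor itself, the script would presumably just be the selection expression, since the transcript says the result of the command is what gets passed back to Weka.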

Skip to 3 minutes and 0 seconds We want to preprocess the data using one of R’s many preprocessing tools. More specifically, we want to use independent component analysis to decompose the input data into statistically independent components. We want to do that using the fastICA library in R. First, we need to install this library in R. We can do that from the Knowledge Flow if we enable the R perspective. Perspectives allow us to implement additional functionality in the Knowledge Flow. There’s an R console perspective here. Let’s tick this. Now you can see up here we have an additional R console, which is just the same as the R console in the Explorer. To install the fastICA package, we can just go install.packages("fastICA").

Skip to 3 minutes and 49 seconds Now that this is installed, we can use the library in our R script. We go back to the Data mining processes perspective, and now we can change our script to make use of this fastICA library. First of all, we need to load the library into R, so we have library(fastICA) as the first statement in our script. Now, for convenience, let’s just define a variable that allows us to specify the number of components we want to extract from our data. Let’s say we want to extract as many components as there are predictor attributes in our data. So we say num = ncol(rdata) - 1, where ncol is the function that gives us the number of columns in the data.

Skip to 4 minutes and 43 seconds So this will be 4 in the case of the iris data. Now, we can call the fastICA function.

Skip to 4 minutes and 49 seconds fastICA: we specify the data we want to use. We go from 1 to num – these are the columns that we want to perform our independent component analysis on. Then we say how many components we want to extract, also num. Right. Now, fastICA actually returns a list of results in R. If we check the fastICA documentation on the web, we see that there are actually several things that are returned by the fastICA function. Let’s search for “fastICA documentation R”. This page here, the rdocumentation.org site, looks helpful.
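The steps so far can be sketched in plain R as follows. This assumes the fastICA package is already installed, and it uses R's built-in iris data as a stand-in for the rdata that Weka would supply. The variable name result is illustrative and not taken from the video.

```r
library(fastICA)

# Stand-in for the data frame that Weka supplies as `rdata` (assumption)
rdata <- iris

# Number of predictor attributes: every column except the last (class) column
num <- ncol(rdata) - 1

# Run ICA on the predictor columns, extracting `num` components
result <- fastICA(rdata[1:num], num)

# fastICA returns a list; printing its element names shows what is available
names(result)
```

The script uses `<-` for assignment; the `=` form spoken in the video is equivalent at the top level of a script.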

Skip to 5 minutes and 43 seconds Right. If you look at the documentation for the function, you can see that it returns several values. It returns the pre-processed data matrix and then several other things. What we want here is the estimated source matrix. The source consists of the estimated components, the independent components. We want to use the S value from the result. We can get that value by adding “$S” at the end of the invocation of the function. This will extract the independent components from the result. Right. We are almost done now. This will actually return the independent components as a matrix; however, to be able to pass the data back into Weka, we need to make this matrix into a data frame.

Skip to 6 minutes and 30 seconds To make this into a data frame, we can just call the data frame function, data.frame, and we put the whole thing into brackets. Let’s try this out. Now we should see the independent components that have been extracted from the data. They are called X1 to X4. You can see here that the independent component analysis has produced the desired results. For example, if you look at the relationship between X1 and X4, those two components look pretty much statistically independent. We’ve run independent component analysis on the data and passed the data back into the Weka environment. Let’s run a learning algorithm on this data.
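A minimal sketch of the script at this stage is shown below, again assuming fastICA is installed and using the built-in iris data in place of Weka's rdata. The result is assigned to d here only so it can be inspected; inside the RScriptExecutor the data.frame(...) expression on its own would presumably be the value handed back to Weka.

```r
library(fastICA)

# Stand-in for the data frame that Weka supplies as `rdata` (assumption)
rdata <- iris

# Number of predictor attributes (all columns except the class)
num <- ncol(rdata) - 1

# Extract the estimated source matrix S and wrap it in a data frame,
# since a plain matrix cannot be passed back to Weka
d <- data.frame(fastICA(rdata[1:num], num)$S)

# The matrix has no column names, so data.frame assigns default ones
names(d)
dim(d)
```

fastICA starts from a random initial matrix, so the component values (and their order and sign) can differ from run to run; only the shape of the result is stable.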

Skip to 7 minutes and 11 seconds As you know, the Naive Bayes learning algorithm assumes that the attributes are conditionally independent given the class attribute. It’s a plausible hypothesis to assume that data preprocessed using ICA is easier to classify using Naive Bayes. In order to run a supervised learning algorithm on the data, we need to attach the original class labels to the data again. We can do that quite easily using the cbind function, the column bind function in R. We go “cbind” and then the two sets of columns that we want to bind together. Here we need to assign this data to a variable. Let’s call this variable d, so that we can refer to it in the cbind function.

Skip to 7 minutes and 54 seconds We want to bind the columns in d and the “class” column in the rdata data frame. The address of the column is given using square brackets again, and the index of this column is num + 1. That’s the last column in our rdata, so that’s the class column in the iris data. Now we will have labeled data. Let’s try this. Now we can see that the class attribute has been added. We have the data decomposed into independent components using ICA and then labeled again with the original class labels. Now we can run a learning algorithm on this data, for example Naive Bayes.
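Put together, the script described up to this point might look like the sketch below. As before, it assumes fastICA is installed and substitutes R's built-in iris data for Weka's rdata, so the class column is named Species here rather than class. The name labeled is illustrative; in the RScriptExecutor the cbind call would presumably be the final expression of the script.

```r
library(fastICA)

# Stand-in for the data frame that Weka supplies as `rdata` (assumption)
rdata <- iris

# Number of predictor attributes (all columns except the class)
num <- ncol(rdata) - 1

# Independent components as a data frame
d <- data.frame(fastICA(rdata[1:num], num)$S)

# Re-attach the class column, which is the last column (index num + 1) of rdata
labeled <- cbind(d, rdata[num + 1])

# Inspect the structure: four numeric components plus the class factor
str(labeled)
```

Selecting the class with single square brackets, rdata[num + 1], keeps it as a one-column data frame, so cbind preserves the original column name.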

Skip to 8 minutes and 38 seconds What we need to do is go into the evaluation package and choose the ClassAssigner to assign the class attribute to our data and make a dataset connection to this ClassAssigner. By default, it just uses the last attribute, so this is fine. We pass the data to the CrossValidationFoldMaker, and from there we pass the data to Naive Bayes – both the training set and the test set. After we have added the classifier, we need to evaluate the classifier. So we take a ClassifierPerformanceEvaluator, and we use a batch classifier connection to connect Naive Bayes to it. Finally, we want to see the output of the evaluation process so we use a TextViewer, and we use a text connection from the ClassifierPerformanceEvaluator.

Skip to 9 minutes and 37 seconds So let’s run this flow. OK, it’s finished, and we can get the cross-validated accuracy

Skip to 9 minutes and 45 seconds now in this text viewer: 98% accuracy. This is quite a high accuracy on the iris data, so it looks like independent component analysis has helped to improve performance. Note that, strictly speaking, we have performed semi-supervised learning in this experiment, because we used an unsupervised feature extraction method on the whole dataset before we applied cross-validation on the dataset. Right. This was to show you a bit of how to use R from the Knowledge Flow. We’ve covered all the aspects of how to use R in Weka now, so that’s it for this topic. See you later!

Using R to preprocess data

Tools implemented in R can preprocess data before passing it on to Weka learning algorithms. The Knowledge Flow’s RScriptExecutor component executes a user-supplied R script. Data can be loaded using an ArffLoader and passed to the RScriptExecutor, which is supplied with a script. Eibe demonstrates scripts that delete an attribute, produce a scatter plot matrix, and decompose the input into statistically independent components – after which the Naive Bayes classifier is run, and evaluated using cross-validation. R includes many other useful transformation methods. Detailed instructions are given in the accompanying download (these slides do not appear in the video itself).


