
Using R to preprocess data

Eibe Frank shows how the preprocessing tools implemented in R can be used to preprocess data before passing it on to Weka learning algorithms.
Hi! In the last video we saw how we can use the MLR classifier in Weka to run learning algorithms implemented in R from within Weka. Before that, we saw how we can use the R console in Weka to run R commands, for example, to plot data. In today’s lesson, we’ll see how we can use the preprocessing tools implemented in R to preprocess data before we pass that data on to Weka learning algorithms. OK, let’s get started. We will use the Knowledge Flow environment in Weka to process data in R and then pass it on to a Weka learning algorithm. The Knowledge Flow environment, once the RPlugin has been installed, provides an RScriptExecutor component that executes a user-supplied R script.
So we can put this on the canvas, and then right-click on it or double-click on it to configure it. What we can see here is that we can enter an R script, or we can load a script from a file. Before we start with our script we should load some data. So we need a data source. Let’s use an ArffLoader data source. OK, maybe we just use the iris data to start off with, and then we make a dataset connection to our RScriptExecutor.
Now, the data will be loaded as an ARFF file in Java, then it will be passed to the R environment, and the data that is processed by this R script will be passed back into Weka so that, for example, we can visualize it. We can put a scatterplot matrix here, and then, once we’ve configured our R script, we can connect it up to the RScriptExecutor. Something very simple that we can do in our R script is to delete one of the attributes from the incoming data. The incoming data is referred to by the name rdata, just as is done in the R console in the Explorer, as well.
Then in brackets, we can specify which columns of this data we want to keep. Let’s say we just want to keep the first 4 attributes of the data and discard the class attribute. Now, this command will be executed and then the result will be passed into Weka as an ARFF file. We can now connect our RScriptExecutor component to the ScatterPlotMatrix component using a dataset connection. Right, let’s try this. So we run the flow, and now let’s check the plot by right-clicking on the component. As we can see, we’ve got a ScatterPlotMatrix here, which has 4 of the attributes but not the class attribute. So it all worked as intended. Let’s try something more sophisticated.
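As a sketch of what this first attribute-deletion script might look like (using R's built-in iris data frame to stand in for the rdata object that Weka binds automatically inside the RScriptExecutor):

```r
# In the RScriptExecutor, the incoming Weka data is bound to `rdata`.
# Here we use R's built-in iris data frame to stand in for it.
rdata <- iris

# Keep only the first four columns (the predictors), discarding the class.
result <- rdata[1:4]

ncol(result)   # 4 columns remain
names(result)  # the Species (class) column is gone
```

In the actual Weka script the last expression is simply `rdata[1:4]`, whose value is passed back to Weka as the processed dataset.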
We want to preprocess the data using one of R’s many preprocessing tools. More specifically, we want to use independent component analysis to decompose the input data into statistically independent components. We want to do that using the fastICA library in R. First, we need to install this library in R. We can do that from the Knowledge Flow if we enable the R perspective. Perspectives allow us to implement additional functionality in the Knowledge Flow. There’s an R console perspective here. Let’s tick this. Now you can see up here we have an additional R console, which is just the same as the R console in the Explorer. To install the fastICA package, we can just go install.packages("fastICA").
Now that this is installed, we can use the library in our R script. We go back to the Data mining processes perspective, and now we can change our script to make use of this fastICA library. First of all, we need to load the library into R, so we have library(fastICA) as the first statement in our script. Now, for convenience, let’s just define a variable that allows us to specify the number of components we want to extract from our data. Let’s say we want to extract as many components as there are predictor attributes in our data. So we say num = ncol(rdata) – 1, where ncol is the function that gives us the number of columns in the data.
So this will be 4 in the case of the iris data. Now, we can call the fastICA function.
fastICA: we specify the data we want to use. We go from 1 to num – these are the columns that we want to perform our independent component analysis on. Then we say how many components we want to extract, also num. Right. Now, fastICA actually returns a list of results in R. If we check the fastICA documentation on the web, we see that there are actually several things that are returned by the fastICA function. Let’s search for “fastICA documentation R”. This page here, the site, looks helpful.
Right. If you look at the documentation for the function, you can see that it returns several values. It returns the pre-processed data matrix and then several other things. What we want here is the estimated source matrix, which consists of the estimated independent components. We want to use the S value from the result. We can get that value by adding “$S” at the end of the invocation of the function. This will extract the independent components from the result. Right. We are almost done now. This will actually return the independent components as a matrix; however, to be able to pass the data back into Weka, we need to make this matrix into a data frame.
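To see what the fastICA function returns, you can run it directly in an R session (this assumes the fastICA package is installed; here it is applied to the four iris predictors):

```r
library(fastICA)

# Run ICA on the four iris predictor columns, extracting four components.
res <- fastICA(iris[, 1:4], n.comp = 4)

# fastICA returns a list of results; the S element holds the estimated
# source matrix, i.e. the independent components themselves.
names(res)  # includes "S", the estimated source matrix
dim(res$S)  # one row per instance, one column per component
```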
To make this into a data frame, we can just call the data frame function, data.frame, and we put the whole thing into brackets. Let’s try this out. Now we should see the independent components that have been extracted from the data. They are called X1 to X4. You can see here that the independent component analysis has produced the desired results. For example, if you look at the relationship between X1 and X4, those two components look pretty much statistically independent. We’ve run independent component analysis on the data and passed the data back into the Weka environment. Let’s run a learning algorithm on this data.
As you know, the Naive Bayes learning algorithm assumes that the attributes are conditionally independent given the class attribute. It’s a plausible hypothesis to assume that data preprocessed using ICA is easier to classify using Naive Bayes. In order to run a supervised learning algorithm on the data, we need to attach the original class labels to the data again. We can do that quite easily using the cbind function, the column bind function in R. We go “cbind” and then the two sets of columns that we want to bind together. Here we need to assign this data to a variable. Let’s call this variable d, so that we can refer to it in the cbind function.
We want to bind the columns in d and the “class” column in the rdata data frame. The address of the column is given using square brackets again, and the index of this column is num + 1. That’s the last column in our rdata, so that’s the class column in the iris data. Now we will have labeled data. Let’s try this. Now we can see that the class attribute has been added. We have the data decomposed into independent components using ICA and then labeled again with the original class labels. Now we can run a learning algorithm on this data, for example Naive Bayes.
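Putting the pieces together, the complete script might look like the sketch below (again using iris in place of the rdata frame that Weka supplies, and assuming the fastICA package is installed):

```r
library(fastICA)

rdata <- iris  # in Weka, rdata arrives via the dataset connection

# Number of predictor attributes (all columns except the class).
num <- ncol(rdata) - 1

# Extract `num` independent components and wrap the matrix in a data frame.
d <- data.frame(fastICA(rdata[, 1:num], n.comp = num)$S)

# Re-attach the original class column so supervised learning is possible.
labeled <- cbind(d, rdata[num + 1])
```

In the actual RScriptExecutor script the final line would simply be `cbind(d, rdata[num + 1])`, since the value of the last expression is what gets passed back to Weka.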
What we need to do is go into the evaluation package and choose the ClassAssigner to assign the class attribute to our data, and make a dataset connection to this ClassAssigner. By default, it just uses the last attribute, so this is fine. We pass the data to the CrossValidationFoldMaker, and from there we pass the data to Naive Bayes – both the training set and the test set. After we have added the classifier, we need to evaluate the classifier. So we take a ClassifierPerformanceEvaluator, and we use a batch classifier connection to connect Naive Bayes to it. Finally, we want to see the output of the evaluation process, so we use a TextViewer, and we use a text connection from the ClassifierPerformanceEvaluator.
So let’s run this flow. OK, it’s finished, and we can get the cross-validated accuracy now in this text viewer: 98% accuracy. This is quite a high accuracy on the iris data, so it looks like independent component analysis has helped to improve performance. Note that, strictly speaking, we have performed semi-supervised learning in this experiment, because we used an unsupervised feature extraction method on the whole dataset before we applied cross-validation on the dataset. Right. This was to show you a bit of how to use R from the Knowledge Flow. We’ve covered all the aspects of how to use R in Weka now, so that’s it for this topic. See you later!

Tools implemented in R can preprocess data before passing it on to Weka learning algorithms. The Knowledge Flow’s RScriptExecutor component executes a user-supplied R script. Data can be loaded using an ArffLoader and passed to the RScriptExecutor, which is supplied with a script. Eibe demonstrates scripts that delete an attribute, produce a scatter plot matrix, and decompose the input into statistically independent components – after which the Naive Bayes classifier is run, and evaluated using cross-validation. R includes many other useful transformation methods. Detailed instructions are given in the accompanying download (these slides do not appear in the video itself).

This article is from the free online

Advanced Data Mining with Weka
