This content is taken from The University of Waikato's online course, Advanced Data Mining with Weka.

## The University of Waikato

[0:11] In this, the final lesson of this class, we’ll touch briefly on a couple of Knowledge Flow templates that we haven’t had time to look at so far, and we’ll leave you with some things to look at for Distributed Weka, if you wish to take it further. Here we are back in the Knowledge Flow. If we open up the Templates menu again and scroll down a little bit here, we can see a template called “Compute a correlation matrix and run PCA”, where PCA stands for Principal Components Analysis. Let’s open this one. What we have is our trusty ArffHeaderSparkJob, which loads our hypothyroid data again, and we have a little step here called the CorrelationMatrixSparkJob.

[0:58] And we have an ImageViewer and a TextViewer attached to that. This suggests that this job will produce some kind of an image that we can take a look at, and also some textual results. In the dialog for the CorrelationMatrixSparkJob here we have a few options, mainly related to exactly what sort of matrix is going to be computed, so we can compute either a correlation matrix or a covariance matrix. We have an option to run Principal Components Analysis. The algorithm for Principal Components Analysis can compute the principal components using either a correlation matrix as input or a covariance matrix. All right. Let’s run this now and see what it produces. It just takes a few seconds to run. And it’s finished.

[1:59] OK, let’s open up the TextViewer. In the TextViewer, we have the result of the Principal Components Analysis and the correlation matrix that was computed. We can see that the correlation matrix and the Principal Components Analysis only involve the numeric attributes that are present in the hypothyroid data. Let’s take a quick look in the ImageViewer now. If we open up the ImageViewer, we can see that we have a graphical heat map representation of our correlation matrix, where the colors indicate the magnitude of the correlations between the attributes – the numeric attributes – in the hypothyroid data. Right, let’s take a look at one more example before we finish with Distributed Weka.
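The choice between a correlation matrix and a covariance matrix matters when attributes are on different scales. The following is a minimal pure-Python sketch, not Distributed Weka's implementation: the data values are hypothetical, and power iteration stands in for a full eigendecomposition to find only the first principal component.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def correlation_matrix(rows):
    """Pearson correlation matrix: covariance rescaled by the standard deviations."""
    cov = covariance_matrix(rows)
    std = [math.sqrt(cov[j][j]) for j in range(len(cov))]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(len(cov))]
            for i in range(len(cov))]

def first_component(matrix, iterations=200):
    """Leading eigenvector of a symmetric matrix, found by power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: the second attribute is on a much larger scale than the first.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0], [5.0, 400.0]]

cov_pc = first_component(covariance_matrix(data))    # dominated by the large-scale attribute
corr_pc = first_component(correlation_matrix(data))  # weights both attributes equally
```

With the covariance matrix, the first component points almost entirely along the large-scale attribute; with the correlation matrix, both attributes contribute equally, which is why the correlation form is often preferred when units differ.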

[2:48] In the Templates menu here, we have a job called “run K-means||”. K-means parallel is, as the name suggests, a parallel version of the k-means algorithm. For clustering in Distributed Weka, unfortunately we can’t use the trick of creating a voted ensemble like we did in the classification case. It’s not possible to make a voted ensemble out of separate clustering models. This is why there is only k-means available in Distributed Weka so far, as it’s the only clustering algorithm that has been implemented in a distributed fashion, specifically for Distributed Weka. This job takes a little while to run, so through the magic of video editing, I’ve executed it in between cuts to save a little bit of time.
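One reason k-means suits a distributed implementation is that a centroid update needs only per-cluster sums and counts, and those can be computed on each data partition separately and then added together. The sketch below illustrates that general idea in plain Python; it is an assumption about how such an iteration can be structured, not the code used by Distributed Weka, and it does not cover centroid initialization. The function names and the sample points are hypothetical.

```python
from functools import reduce

def partial_sums(partition, centroids):
    """Map step: per-cluster coordinate sums and point counts for one partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        # Assign the point to the centroid with the smallest squared distance.
        nearest = min(range(len(centroids)),
                      key=lambda k: sum((p - c) ** 2
                                        for p, c in zip(point, centroids[k])))
        counts[nearest] += 1
        for j, value in enumerate(point):
            sums[nearest][j] += value
    return sums, counts

def merge(a, b):
    """Reduce step: add two partial results element-wise."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts

def kmeans_iteration(partitions, centroids):
    """One centroid update computed from per-partition statistics only."""
    sums, counts = reduce(merge, (partial_sums(p, centroids) for p in partitions))
    # An empty cluster keeps its previous centroid.
    return [[s / n for s in total] if n else list(old)
            for total, n, old in zip(sums, counts, centroids)]

# Hypothetical points split across two partitions.
partitions = [[[1.0, 1.0], [9.0, 9.0]], [[2.0, 2.0], [10.0, 10.0]]]
centroids = [[0.0, 0.0], [8.0, 8.0]]
updated = kmeans_iteration(partitions, centroids)
```

Because the merged statistics are identical to those computed over the whole dataset, the result is a single clustering model rather than a combination of separately trained ones.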

[3:47] The job actually takes longer to run than sequential Weka does if you were to run k-means in the Explorer on the hypothyroid dataset. This is simply due to the fact that there is a certain amount of overhead involved in Spark’s communication, the creation of its RDD data structures, and so forth; and that overhead actually outweighs the speed gained through parallel processing in this local case when we’re just using the cores that are available on our CPU. If our dataset were much larger and we were running on a real cluster, then we would have a true benefit from using a distributed approach.
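The trade-off between overhead and parallel speed-up can be made concrete with a deliberately idealised cost model: a fixed framework overhead plus the sequential work divided evenly among workers. This is a rough sketch only; the function names and all timings are hypothetical and were not measured on Weka or Spark.

```python
def parallel_time(sequential_seconds, workers, overhead_seconds):
    """Idealised run time: fixed overhead plus perfectly divided work."""
    return overhead_seconds + sequential_seconds / workers

def speedup(sequential_seconds, workers, overhead_seconds):
    """Ratio of sequential to parallel run time; below 1 means parallel is slower."""
    return sequential_seconds / parallel_time(sequential_seconds, workers,
                                              overhead_seconds)

# Hypothetical figures: 4 workers and 10 seconds of fixed overhead.
small_job = speedup(2.0, workers=4, overhead_seconds=10.0)     # overhead dominates
large_job = speedup(3600.0, workers=4, overhead_seconds=10.0)  # work dominates
```

Under these assumptions the two-second job runs several times slower in parallel, while the hour-long job approaches the ideal fourfold speed-up.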

[4:32] In the TextViewer, we can see the clustering results for k-means, which look exactly the same – or are in the same format – as if you were to run k-means in standalone Weka on your desktop. So where to from here? Experimenting with Distributed Weka in local mode using small datasets is the best way to get familiar with the capabilities of it, and explore what it has to offer. However, if you want to process larger datasets, then you’ll need to run on a cluster. We’ll take a little look at what’s available on the web to help you get started in that area.

[5:14] The first place to go for information is the main Apache Spark website, so let’s take a look at that. OK, under the documentation section here, we can find the documentation for the latest release of Spark. We go to that page, and there’s information on downloading, running some examples, and then down here a little ways we have information on launching on a cluster. The first thing to look at is the overview of cluster mode. This will describe exactly how a cluster is configured and set up to run. Then there are various different types of clusters that you can run Spark on. The simplest is called a stand-alone mode, and there is a documentation section here on that mode.

[6:14] That would be the one to start with first. There are several other modes of clustered running for Spark, including something called Mesos and YARN. These are different ways of managing the machines in a cluster. The stand-alone mode is the simplest. There are a number of blogs on the web that step you through the process of setting up a stand-alone cluster on a single machine. So if we search for “Apache Spark standalone cluster install”, there are a number of hits in Google for information on setting up a cluster. One that’s particularly concise, or, at least, I thought it was concise and could be a good place to start, is this one here.

[7:13] If we take a look at that, we can see a very short introduction to getting started with a Spark cluster running on a single machine. This is different from what we’ve been looking at so far, where we’ve been running in local mode. That’s where the entirety of Spark runs in a single JVM process. The stand-alone cluster running on a single machine involves multiple separate Java processes, and they communicate as if they were running on different machines. This tutorial is a reasonably short introduction to getting started with that. That’s it for this lesson.
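From the client's point of view, the difference between local mode and a stand-alone cluster shows up in the master URL handed to Spark: `local[K]` or `local[*]` for a single JVM, and `spark://HOST:PORT` (port 7077 by default) for a stand-alone master. The helper below is a hypothetical illustration of those URL forms; `master_url` is not part of Spark or Distributed Weka.

```python
def master_url(mode, host="localhost", port=7077, threads=None):
    """Build a Spark master URL (hypothetical helper, for illustration only)."""
    if mode == "local":
        # local[*] uses as many worker threads as there are logical cores.
        return "local[*]" if threads is None else f"local[{threads}]"
    if mode == "standalone":
        # A stand-alone master listens on port 7077 unless configured otherwise.
        return f"spark://{host}:{port}"
    raise ValueError(f"unsupported mode: {mode}")

local_all_cores = master_url("local")
local_two_threads = master_url("local", threads=2)
single_machine_cluster = master_url("standalone")
```

A stand-alone cluster on one machine therefore uses a `spark://localhost:7077` style address even though every process runs on the same host.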

[7:56] Today, we took a look at how you can use Distributed Weka to compute a correlation matrix in Spark and then use that correlation matrix as input to a Principal Components Analysis. We also took a look at the k-means algorithm running in a distributed fashion inside of Spark, and we took a little look at information on setting up Spark clusters. Well, I hope you’ve enjoyed learning about how to use Weka in a distributed processing environment, and now I’ll leave you with some links to further information on Distributed Weka and on Apache Spark.

# Miscellaneous Distributed Weka capabilities

There are other useful Knowledge Flow templates for Distributed Weka. One computes a correlation matrix for input to Principal Components Analysis; another runs a parallel version of the k-means clustering algorithm. To process large datasets you need to run Distributed Weka on a cluster. The Apache Spark website contains information on how to set up a cluster; this blog post explains how to run a Spark cluster on a single machine using separate Java processes that communicate as though they were running on different machines – which is different from the “local mode” we’ve been using, where the entirety of Spark runs in a single Java process.