Installing with Apache Spark

Mark Hall explains how to interact with Distributed Weka through the KnowledgeFlow environment.
In this lesson we’re going to install Distributed Weka and start to use some of the components that come with it. OK, here we are in Weka’s Package Manager. What we’re going to do here is scroll down a little bit in the package list, and we’re going to install Distributed Weka for Spark. And here it is, just down here. OK. So if I click install with this one selected, it tells me that I’m going to install the following package: distributedWekaSpark version 1.0.2. We click “Yes”. We click “OK”. And then it tells me that, in order to install this, we also need to install distributedWekaBase 1.0.12. At this stage, I’ll click “No”, because I already have this installed, and we won’t show it installing at the moment, because the download is fairly large for distributedWekaSpark, and it’ll take a little while.
OK. Once you’ve installed Distributed Weka, you need to make sure that you restart Weka, so that the newly installed packages are loaded correctly. The main way to interact with Distributed Weka is through the Knowledge Flow environment.
This allows us to chain together processing components in such a fashion that a given component will not execute until the previous one has completed executing. It’s also possible to use Distributed Weka from the command line, but the graphical user interface provided by the Knowledge Flow is a really convenient and easy way to edit the sometimes many parameters that are involved in setting up a Distributed Weka job. Let’s verify that our installation of Distributed Weka has proceeded correctly. All right, so in the Weka Knowledge Flow environment, you can see that there is on the left-hand side in the Design palette, a new folder called “Spark”.
If we open this up, we should find that there are a bunch of new components available to us. In particular, we have something called an ArffHeaderSparkJob, we have a WekaClassifierSparkJob, a WekaClassifierEvaluationSparkJob, and several others as well, that we’ll discuss shortly. The distributedWekaSpark package also comes with a bunch of example template flows. If we look in the templates folder, which is accessible from the templates button up here in the tool bar, we can see a bunch of entries that are prefixed with the word “Spark”. These are all example flows that we can execute right out of the box. They don’t require a Spark cluster to be installed and configured.
Spark has a very convenient local mode of operation, which allows you to use all of the cores in your CPU as processing nodes, if you like. So we can execute these particular example flows straight away without any further configuration. They are ready to go. Before we start running Distributed Weka examples, I need to introduce the dataset that we’re going to be looking at. We’re going to take a look at the hypothyroid data. This is a benchmark dataset from the UCI Machine Learning Repository. The goal in this data is to predict the type of thyroid disease a patient has using input variables such as demographic information about the patient, and various medical information as well.
In this dataset, there are 3,772 instances described by 30 attributes. A version of this data, in CSV format without a header row, can be found in the distributedWekaSpark package that you installed just before. If you browse to your home directory and look in the wekafiles/packages/distributedWekaSpark/sample_data directory, you’ll find it there. The data in ARFF format is also included with the Weka 3.7.13 distribution in the data folder. So you can also load it up into the Explorer and take a look in there. Why don’t we do that now? Here we are in the Weka Explorer. Let’s open the hypothyroid data.
If you browse to the Weka installation directory, Program Files here, Weka 3.7, and in the data directory, we can see the hypothyroid data. Let’s open that up. As I mentioned before, there are 3,772 instances in this dataset, and we can see the attributes here. We have the age and sex of the patient, and we have a bunch of attributes related to various medical information. Down at the bottom is the class attribute. You can see there are four different class values here. By far the largest class in the data is that of negatives. So these are patients that don’t have hypothyroid disease. Then we have 194 cases of compensated_hypothyroid, 95 cases of primary_hypothyroid, and only 2 cases of secondary_hypothyroid. All right.
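As a quick check on how skewed these classes are, the size of the negative class can be derived from the figures quoted above. This is a minimal sketch that assumes every one of the 3,772 instances carries a class label; the variable names are illustrative only.

```python
# Class counts quoted in the lesson; the negative count is derived,
# assuming no instance is missing its class label.
total_instances = 3772
minority_counts = {
    "compensated_hypothyroid": 194,
    "primary_hypothyroid": 95,
    "secondary_hypothyroid": 2,
}

negative_count = total_instances - sum(minority_counts.values())
class_counts = {"negative": negative_count, **minority_counts}

# Print each class with its share of the whole dataset.
for label, count in class_counts.items():
    print(f"{label:>24}: {count:5d} ({count / total_instances:.2%})")
```

Under that assumption the negative class comes to 3,481 instances, which is why it dominates the distribution shown in the Explorer.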
Those are the characteristics of the data. We can now return to the Knowledge Flow and start executing some Distributed Weka processes on this dataset. Before we do so, it’s worth spending a minute or two to explain why we’re going to be operating on comma-separated values, CSV files without a header, rather than ARFF. Systems like Hadoop and Spark split data files up into blocks. This is to facilitate distributed storage of large files out on the cluster and also to allow data-local processing. This is where the processing is taken to where the data resides.
So rather than move the data around, we take the processing to where the data is. Within frameworks like Hadoop and Spark, there are “readers”, as they’re called, for various text files and for various structured binary files. These readers maintain the integrity of the individual records within the files. They know where the boundaries between records are, and they don’t ever split a record in half. If we were to use ARFF within such a framework, we would need to write a special reader, due to the fact that ARFF files, as you know, have header information that occurs at the start of the file.
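To make the problem concrete, here is a minimal sketch of what happens when a file is cut into blocks on record boundaries. It is an illustration only: real Hadoop and Spark readers split on byte ranges rather than line counts, and the toy rows, the two-attribute schema and the `split_into_blocks` helper are all made up for this example.

```python
def split_into_blocks(lines, lines_per_block):
    """Cut a list of lines into blocks without ever splitting a line."""
    return [lines[i:i + lines_per_block]
            for i in range(0, len(lines), lines_per_block)]

# A toy ARFF file: header lines first, then the data rows (hypothetical values).
arff_lines = [
    "@relation toy",
    "@attribute age numeric",
    "@attribute sex {F,M}",
    "@data",
    "41,F",
    "23,M",
    "46,M",
    "70,F",
]
# The same records as header-less CSV.
csv_lines = arff_lines[4:]

arff_blocks = split_into_blocks(arff_lines, 4)
csv_blocks = split_into_blocks(csv_lines, 2)

# Only the first ARFF block carries any header information.
blocks_with_header = [any(line.startswith("@") for line in block)
                      for block in arff_blocks]
print(blocks_with_header)

# Every CSV block has the same shape: plain rows, nothing else.
for block in csv_blocks:
    print(block)
```

In this sketch the second ARFF block contains bare rows with no attribute names or types attached, whereas every CSV block looks the same and can be handled by one generic reader.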
That header information provides details on what attributes are in the data, their types, and legal values, and so forth. Now because the data file gets split up, only one of the blocks, or chunks, of data out on the cluster would have that header information. That is why we’d have to write a special reader to handle it. Distributed Weka for Spark, as it stands at the moment, operates just on CSV data, simply because there are readers already available within Spark and Hadoop for dealing with such data.
All right. Here we are back in the Knowledge Flow environment. Let’s execute the first Distributed Weka job in the list here: the “Create an ARFF header job”. Make it a little larger here. We’ll use this one to verify that everything is installed correctly and running properly. Now, the goal of this job on the hypothyroid data is to analyze that CSV file and produce some summary statistics, and do this in a distributed way. At the same time, it collects all the information that’s necessary to create an ARFF header, and it stores this. And then any future jobs that we run can make use of this ARFF header information straight away, and not be required to analyze the CSV data a second or third time before they can run.
What we can do is go ahead and execute this and see how it runs. First of all, make your log area – switch to the log from the status area down at the bottom here – and make it a little bit larger, so that we can see what’s happening in the log, because Spark generates a lot of log output; there is information about what it’s doing, and you’ll see any problems that occur in that log as well. We have just one job that’s going to be executed here – the job to create the ARFF header – and we’ll just run this right now and make sure everything is working correctly.
Later on, we’ll take a look at the parameters for the job, and I’ll explain a little bit about how it’s configured. Up here in the upper left-hand corner of the Knowledge Flow, we can press this Play button and start the flow running. As I said, we can see a lot of information being dumped into the log here. Most of this is coming from Spark. Our job has completed. You can see here it says “Successfully stopped” something called a “SparkContext”. All right, so what has this job produced? We can see here in the flow that we have a dataset connection coming out of the ArffHeaderSparkJob to a TextViewer.
So if we open up the TextViewer and show the results – I’m going to make this just a little bit larger here, so that it fills the screen – we can see that, as the name suggests, it has created an ARFF header for the hypothyroid data. In fact, it’s an ARFF header on steroids, because there is some extra information in here. What we can see at the top is standard ARFF header information. Here are all our attributes – just like we saw in the Explorer before – all the way down to Class here, where, in this row here, we can see all of the values of the class attribute listed.
Now, below this is a bunch of additional information that we’ve added into this header. The way that other jobs are programmed when they make use of this, is that they can either access this additional information or remove it and use a standard ARFF header. So what we have in this additional information is a bunch of summary statistics that have been computed on the hypothyroid data running in parallel in the Spark environment. You can see that for the “age” attribute here there are summary statistics that have been computed on that. We have a count, we have a sum, we have a sum of squares, we have minimum and maximum values, and we have a mean and a standard deviation, as well. This is a numeric attribute.
And similarly for other attributes. For nominal attributes, it computes a frequency distribution. So down here in the class, the summary attribute for the class, we can see the class label for each of the values of the class followed by an underscore and a number, and that number is the count for that particular class label. The ARFF header job has computed a header for us and a bunch of summary statistics. Next time, we’ll take a look at how that’s configured, and we’ll also look at running some other Distributed Weka jobs, as well.
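One reason a count, a sum and a sum of squares are convenient is that they can be computed separately on each block of data and then simply added together, with the mean and standard deviation derived at the end. The following is a minimal sketch of that idea, not Distributed Weka’s actual code: the `NumericStats` class, the toy partitions and the choice of the sample (n − 1) standard deviation are all assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class NumericStats:
    """Partial statistics for one numeric attribute on one block of data."""
    count: int = 0
    total: float = 0.0
    sum_sq: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, x):
        self.count += 1
        self.total += x
        self.sum_sq += x * x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def merge(self, other):
        # Partial results combine by simple addition (or min/max).
        return NumericStats(
            self.count + other.count,
            self.total + other.total,
            self.sum_sq + other.sum_sq,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count

    @property
    def std_dev(self):
        # Sample standard deviation derived from the running sums (assumed form).
        variance = (self.sum_sq - self.total ** 2 / self.count) / (self.count - 1)
        return math.sqrt(max(variance, 0.0))


def summarise(values):
    stats = NumericStats()
    for value in values:
        stats.add(value)
    return stats


# Two hypothetical blocks of "age" values, summarised independently, then merged.
partition_a = [41.0, 23.0, 46.0]
partition_b = [70.0, 18.0]
merged = summarise(partition_a).merge(summarise(partition_b))
print(merged.count, merged.total, merged.minimum, merged.maximum, merged.mean)

# Nominal attributes: per-block frequency counts add up the same way.
class_counts = Counter(["negative", "negative", "primary_hypothyroid"]) + Counter(["negative"])
labels_with_counts = sorted(f"{label}_{count}" for label, count in class_counts.items())
print(labels_with_counts)
```

The last two lines mimic the “label, underscore, count” form described above for nominal attributes; the exact layout in the real header may differ.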
In this lesson, we’ve covered getting Distributed Weka installed; our test dataset, the hypothyroid data; the data format processed by Distributed Weka; and we’ve taken a look at a Distributed Weka job running on Spark to generate some summary statistics and an ARFF header for the hypothyroid data.

Having installed Distributed Weka, you can interact with it in the KnowledgeFlow environment. New components such as ArffHeaderSparkJob, WekaClassifierSparkJob, and WekaClassifierEvaluationSparkJob become available. In addition, example knowledge flows are provided as templates that operate “out of the box” using all the CPU’s cores as processing nodes – without having to install and configure a Spark cluster. Distributed Weka operates on header-less CSV files, because it splits data into blocks to enable distributed storage of large datasets and allow data-local processing, and it would be inconvenient to replicate the ARFF header in each block. Instead, the ArffHeaderSparkJob creates a separate header that contains a great deal of information that would otherwise have to be recomputed by each processing node.

This article is from the free online

Advanced Data Mining with Weka

Created by
FutureLearn - Learning For Life
