Skip to 0 minutes and 11 seconds Hello! In the last lesson, we installed Distributed Weka and ran our first Distributed Weka for Spark job that analyzed and computed a header for the hypothyroid dataset. In this lesson, we’ll take a closer look at how these jobs are configured, and we’ll run a few more jobs that use Weka classification algorithms, and learn classification models on the hypothyroid data. OK, here is the job that we ran last time. I’ve loaded it back up into the Knowledge Flow here. Let’s take a look at how it’s configured. So if I double-click on the ArffHeaderSparkJob component here on the canvas, it will bring up the configuration dialog, which is made up of two tabs here.
Skip to 0 minutes and 58 seconds The first tab is entitled Spark configuration, and, as the name suggests, there are a bunch of options here related to how the cluster is configured. So up at the top we have a couple of options that are related to how Spark handles or manages memory out on the cluster. We won’t go into detail about exactly how those work, but suffice it to say that the defaults that are set here work reasonably well for most situations. Under that, there is something called the InputFile parameter, and that’s most important, because it’s the dataset that we’re operating on.
Skip to 1 minute and 36 seconds You can see here that it’s preconfigured to point to the hypothyroid data, which is in this sample data directory, which is, in turn, in the package installation directory for Distributed Weka for Spark. Then we have the masterHost parameter. This is where you can specify the address of the machine that the master Spark process is running on. In our case, we don’t have a Spark cluster, we’re running locally on our desktop machine, and Spark is treating each of the processing cores in our CPU as a processing node. So that’s why we have the local host specified here, and in square brackets we have an asterisk, local[*], which tells Spark that we want to make use of all of our available processing cores.
Skip to 2 minutes and 24 seconds If we wanted to limit the number of cores that Spark uses on our desktop machine, then we could place a number inside those brackets in place of the asterisk. Similarly, the masterPort would be used if we were running against a cluster and we needed to provide the port that the Spark master process is listening on. Further down in the list here we can see something called the output directory. This is where Weka will be saving any results generated by the job. OK. The last parameter we’ll take a look at here is called minInputSlices. With this parameter, we’re telling Spark how many logical chunks to split the dataset up into.
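As a rough illustration of the local master setting just described, here is a minimal sketch in Python. It is not part of Distributed Weka or Spark: the function name `local_worker_threads` and its `available_cores` argument are hypothetical, and the sketch only interprets the three local forms that Spark documents, namely `local` (one worker thread), `local[N]` (N worker threads) and `local[*]` (one per available core).

```python
import os
import re
from typing import Optional


def local_worker_threads(master: str, available_cores: Optional[int] = None) -> int:
    """Return how many worker threads a Spark-style local master string asks for.

    Hypothetical helper for illustration only; Spark parses this string itself.
    """
    if available_cores is None:
        # Fall back to the number of logical cores reported by the OS.
        available_cores = os.cpu_count() or 1
    if master == "local":
        # Plain "local" means a single worker thread.
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master string: {master!r}")
    value = match.group(1)
    # "*" means use every available core; a number caps the thread count.
    return available_cores if value == "*" else int(value)


if __name__ == "__main__":
    print(local_worker_threads("local[*]"))   # depends on this machine
    print(local_worker_threads("local[2]"))   # limited to two threads
```

On a machine with eight cores, `local[*]` would ask for eight worker threads, while `local[2]` would ask for only two.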
Skip to 3 minutes and 10 seconds So Spark will create partitions or slices of the dataset and process those, and it uses one worker task running on a core of the CPU of a processing node in order to process a given partition. Here we can really have some control over the level of parallelism applied to our dataset. If we had a processing cluster of 25 machines where each machine had a CPU with four cores, then Spark would be working at maximum efficiency if we chose 100 input slices or fewer for our dataset. That way, we would have the entire dataset processed in one wave of tasks. Let’s take a quick look at the second configuration panel in the dialog here entitled ArffHeaderSparkJob. So we click on that.
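The arithmetic behind the 25-machine example can be sketched as follows. This is a simplified model, not Spark's scheduler: it assumes each core runs exactly one task at a time and that all tasks take about the same time. The function name `task_waves` is hypothetical.

```python
def task_waves(num_slices: int, machines: int, cores_per_machine: int) -> int:
    """Number of rounds ("waves") of tasks needed to process every slice.

    Simplified model: one task per core at a time, one task per slice.
    """
    if num_slices < 0 or machines <= 0 or cores_per_machine <= 0:
        raise ValueError("arguments must be positive (slices may be zero)")
    total_cores = machines * cores_per_machine
    # Integer ceiling division: a partly filled final wave still counts.
    return (num_slices + total_cores - 1) // total_cores


if __name__ == "__main__":
    print(task_waves(100, 25, 4))  # 100 slices on 100 cores
    print(task_waves(101, 25, 4))  # one extra slice forces a second wave
```

Under this model, 100 slices on 25 machines with four cores each fit in a single wave, while a 101st slice would need a second wave in which 99 cores sit idle.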
Skip to 4 minutes and 11 seconds This relates to how Weka will parse the CSV file of hypothyroid data. There are a lot of options here related to CSV parsing, so what the field separator is, what the date format might be if there are date attributes, and so forth.
Skip to 4 minutes and 35 seconds We can also tell Weka, since this is a headerless CSV file, what the names of the attributes are in the data, and we can do that in one of two ways: either by typing a comma-separated list of attribute names in this first text box at the top here; or we can point Weka to a file on the file system that contains the names of the attributes. We’re using that option in this case by saying there is a file called hypothyroid.names on the file system. The format of that file is a simple one. It just contains one attribute name per line in the file. We also have this option called pathToExistingHeader here.
Skip to 5 minutes and 7 seconds If we have already run this job and created an ARFF header file and computed all the summary statistics, then there is no need for the job to run again, but we may have it as a component in an overall larger job. In that case, we can provide the path to the header file that was created in a previous execution, and Weka will then realize that it does not need to regenerate that file. And the last dialog box here is one where we tell it what we want to call that ARFF header file when it gets created. In this case, we’re calling it hypo.arff. All right.
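The two ways of supplying attribute names described above (a comma-separated list, or a names file with one attribute name per line) can be sketched like this. The helper names and the sample attribute names are hypothetical, and this is not the parsing code Weka itself uses; in particular, skipping blank lines is an assumption.

```python
from pathlib import Path
from typing import List


def names_from_list(text: str) -> List[str]:
    """Split a comma-separated list of attribute names, trimming whitespace."""
    return [name.strip() for name in text.split(",") if name.strip()]


def names_from_file(path: str) -> List[str]:
    """Read a names file that holds one attribute name per line.

    Assumption: blank lines are ignored.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # Illustrative attribute names only, not the full hypothyroid list.
    print(names_from_list("age, sex, Class"))
```

Both routes produce the same ordered list of names, which is then matched by position to the columns of the headerless CSV file.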
Skip to 5 minutes and 48 seconds Now let’s try running another one of the example flows that are included with the Distributed Weka for Spark package. Up here in the templates menu, let’s choose the “Spark train and save two classifiers” flow. Let’s load that one in. All right, here it is. OK, so what do we have in this flow? Well, we have – as we can see on the left-hand side here – the ArffHeaderSparkJob again. This time, however, it is configured to make use of an existing header file, if we happen to have already run this particular job on the hypothyroid dataset.
Skip to 6 minutes and 34 seconds As we can see here, this path is now filled in, so we can take advantage of that existing header file that we may have already generated. If the header file does exist on disk, then the job will load and use it, and then the only point of this job entry in this particular example is to load the CSV data and parse it into an internal Spark format that can then be used in the downstream job entries in the rest of this flow. So that’s ready to go pretty much. What we have next in the flow is something called the WekaClassifierSparkJob, and in fact we have two entries in this flow. These components are executed in sequence.
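The reuse of an existing header file can be pictured with a small sketch. This is only an illustration of the decision described in the lesson, not Distributed Weka's implementation: the function `resolve_header` and its `compute_header` callback are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple


def resolve_header(
    path_to_existing_header: Optional[str],
    compute_header: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (header_text, reused).

    If a header path is given and the file exists, load it and skip the
    expensive computation; otherwise compute the header from scratch.
    """
    if path_to_existing_header:
        header_file = Path(path_to_existing_header)
        if header_file.is_file():
            return header_file.read_text(encoding="utf-8"), True
    # No usable existing header: fall back to computing it.
    return compute_header(), False


if __name__ == "__main__":
    header, reused = resolve_header(None, lambda: "@relation example")
    print(reused)  # False, because no existing header path was supplied
```

The callback is only invoked when no header file is found, which mirrors the point that the summary statistics need not be recomputed on later runs.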
Skip to 7 minutes and 22 seconds The ArffHeaderSparkJob will run first. When it succeeds, it triggers execution of the next component downstream in the flow, so the WekaClassifierSparkJob will then execute. We can see from this Spark job that there is a “text” connection, so it will produce a textual description of the classifier that it learns, which will then be displayed in the TextViewer here. That also gets saved out, along with the model itself, the actual Weka model, to the file system in our output directory, and we’ll take a look in there once we’ve finished looking at this flow and executing it. There’s also a second Weka Spark job here on the right-hand side, and this learns a different classifier.
Skip to 8 minutes and 9 seconds The first one learns a Naive Bayes classifier, and the second one learns a JRip rule set. In between the two, we have another job that gets executed. This is called the RandomlyShuffleDataSparkJob. We’ll discuss exactly what that’s doing a little later on. Let’s take a quick look at the configuration dialog for the first of these WekaClassifierSparkJobs, the one that trains Naive Bayes. So we open that up – make it a little bit larger here. What we can see is a whole bunch of settings we can change.
Skip to 8 minutes and 44 seconds We have some stuff at the top here related to telling the system what the class attribute is and what we want to call the serialized model file that gets written out as the output of this job to our output directory. Down at the bottom here, we can see that we have an option that allows us to choose a classifier we want to run and also configure its options, just like in regular Weka. In this case, we’ve chosen the standard Naive Bayes algorithm. You can see that we also have an option here that allows us to combine some filtering with this classifier, as well.
Skip to 9 minutes and 23 seconds So we can opt to specify one of Weka’s preprocessing filters here and have that applied to the data before it reaches the classifier. We can combine multiple filters if we so desire by using Weka’s MultiFilter, which allows us to specify several filters to apply in sequence. All right, there are a number of other options here. I won’t describe them at this point in time. Let’s go ahead and run this flow now. We have two classifiers that are going to be trained, Naive Bayes and the JRip rule set. We can start this running. Before I do so, I’ll get the log opened so we can see the activity in the log, and we launch it now. All right, it’s processing away.
Skip to 10 minutes and 12 seconds A lot of output in the log, and now it’s completed. Let’s take a look in the TextViewer, which has picked up the textual output of these two steps, and see what we have. Let’s look at the results. I’ll make that a little bit larger. The first entry here is, as expected, a Naive Bayes classifier model learned on our hypothyroid data. This is very similar to what you would see if you just ran standard desktop Weka. The second entry in our result list here is for the JRip classifier. It doesn’t actually say “JRip” here; instead it has something called weka.classifiers.meta.BatchPredictorVote. That’s a little bit interesting. We’ll take a look at this output.
Skip to 11 minutes and 8 seconds And what we have is, instead of a single rule set, we have 4 separate rule sets, and they’ve been combined into an ensemble learner, a “vote” ensemble learner, which has 4 separate sets of JRip rules as its base learners. We can see the individual rules here in this list. So what’s happened in this case? Why has it done this? Why has it learned one Naive Bayes model when we apply Naive Bayes, but it’s learned 4 sets of JRip rules when we applied the JRip classifier? To answer that question, we’ll have to learn a little bit about how Distributed Weka learns a classifier when it runs on Spark, but we’ll leave that to the next lesson.
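To give a feel for what a "vote" ensemble over several base learners does at prediction time, here is a minimal sketch. The lesson does not say how BatchPredictorVote combines its 4 rule sets, so the rule used here (averaging the class probability distributions of the base learners) is an assumption, and the class labels and probabilities are made up for illustration.

```python
from typing import Dict, List


def average_vote(distributions: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the class probability distributions of several base learners.

    Assumption: averaging is the combination rule; a missing label counts as 0.
    """
    if not distributions:
        raise ValueError("need at least one base learner distribution")
    labels = sorted({label for dist in distributions for label in dist})
    count = len(distributions)
    return {
        label: sum(dist.get(label, 0.0) for dist in distributions) / count
        for label in labels
    }


def predict(distributions: List[Dict[str, float]]) -> str:
    """Pick the class label with the highest averaged probability."""
    averaged = average_vote(distributions)
    return max(averaged, key=averaged.get)


if __name__ == "__main__":
    # Hypothetical outputs of four base learners for a single instance.
    base_outputs = [
        {"negative": 0.9, "hypothyroid": 0.1},
        {"negative": 0.6, "hypothyroid": 0.4},
        {"negative": 0.2, "hypothyroid": 0.8},
        {"negative": 0.7, "hypothyroid": 0.3},
    ]
    print(predict(base_outputs))
```

With these made-up numbers, three of the four base learners favour the same class and the averaged distribution agrees with them, so a single dissenting rule set does not change the prediction.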
Using Naive Bayes and JRip
There are many options when configuring a Distributed Weka job. The ArffHeaderSparkJob’s configuration panel has two tabs. The first, Spark configuration, holds options that relate to how the cluster is configured, including how many partitions to make from the data, which controls the level of parallelism. The second, ArffHeaderSparkJob, determines how Weka parses the CSV file containing the input data, including the names of the attributes and the name of the header file that is created. Another Distributed Weka template is “Spark: train and save two classifiers”, which trains Naive Bayes and JRip classifiers from the same dataset.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.