Skip main navigation
We use cookies to give you a better experience, if that’s ok you can close this message and carry on browsing. For more info read our cookies policy.
We use cookies to give you a better experience. Carry on browsing if you're happy with this, or read our cookies policy for more information.

Skip to 0 minutes and 11 secondsHello again! You’ll recall last time we ran the example Knowledge Flow template that built two different classifiers in the Spark environment, a Naive Bayes classifier and a JRip classifier. But we found that it did something different with the JRip classifier than it did with Naive Bayes. It actually ended up building 4 separate JRip classifiers and wrapping them up in a voted ensemble. In this lesson, we’ll take a closer look at exactly what happened there and the reasons for why there is a difference between the Naive Bayes example and the JRip example. Here is that Knowledge Flow template we ran last time, and, if we open up the TextViewer we can refresh our memories as to what the results looked like.

Skip to 0 minutes and 50 secondsAt the top we had the Naive Bayes classifier on our hypothyroid dataset, and we ended up with one model, as we expected. The second classifier that ran in this example was JRip, and as we can see from the results we ended up with 4 sets of JRip rules and these were combined in a voted meta classifier. So the question was why did this happen? Let’s take a closer look at it, and I’ll attempt to explain. OK, so here is a slide that attempts to describe how the processing occurs in Spark for our classifier job. On the left-hand side, we have the ArffHeaderSparkJob, which initially loads the data into main memory for us.

Skip to 1 minute and 39 secondsIt loads the CSV data and creates one of Spark’s resilient distributed datasets with a number of splits, or “partitions” as they’re called in Spark. Each partition is processed by a worker out on the cluster, or, in our case, by a CPU core on our desktop machine. The map tasks process these partitions and create models – or to be more precise, they create partial models. In the case of Naive Bayes, the algorithm is fairly simple, and the model is comprised of a number of probability estimators, all of which can be computed incrementally and additively.

Skip to 2 minutes and 24 secondsSo when it comes to combining these probability estimators, we can simply add together their statistics, and we end up with one final model, which is identical to what we would get if we ran Naive Bayes sequentially on the dataset on our desktop. In the case of other types of classifiers – tree learners and rule learners like JRip are an example – it’s somewhat more difficult to try and aggregate these partial models into one final model which would be the same as if you were to run sequentially.

Skip to 2 minutes and 56 secondsIn that case, Weka takes the easy route of taking the partial models or the smaller models, which are learned on the splits of the data, an d combining them by simply making a voted ensemble out of them. OK. We’re nearly finished with this example, but before we leave there are a couple more aspects to touch on. One is output. We’ve seen some output in the TextViewer here in the Knowledge Flow, but I mentioned earlier that the jobs in Distributed Weka also store output on the file system. This can be our local file system or, if we’re using Hadoop and Hadoop’s distributed file system, it could be stored in HDFS.

Skip to 3 minutes and 43 secondsAnyway, if we take a look in the source of the data here, our ArffHeaderSparkJob, recall that we saw we had some setup for our input files and also our output directory down here. This is where the jobs will store their output. Let’s take a look at that on the file system. If I find my home directory, I can see that directory that was specified in the configuration there, sparkOutput. If we go into that directory, we can see a couple of subfolders. One is where the ARFF header was stored by the ARFF header job.

Skip to 4 minutes and 22 secondsWe can look in there, and we can see hypo.arff, and, if we go back out and look in this model directory, we can see the models that were created by Distributed Weka. So we have one for Naive Bayes and one for the voted JRip model. OK. The other thing we haven’t mentioned is this job in the middle here called the RandomlyShuffleDataSparkJob. So what does this do? Well, as the name suggests, it’s a job that, in a distributed fashion, randomly shuffles the order of the rows or instances in the dataset that was being processed. In some cases, it’s advantageous to do this random shuffle. There are certain classifiers which this is beneficial for. Naive Bayes isn’t one of them.

Skip to 5 minutes and 22 secondsIt’s not affected by the order of the instances in the dataset that it learns from. However, other classifiers, like trees and rules, can be. In a worst case, we might end up with a partition of our Spark RDD where certain class values aren’t represented at all, if the data has perhaps been collected in some systematic way. For that reason, it can be beneficial to randomly shuffle the order of the instances before learning a classifier like a rule set or a decision tree. Before we finish today’s lesson, let’s take a look at one more of the example templates that come with Distributed Weka. We’ll take a look at the one that cross-validates two classifiers.

Skip to 6 minutes and 12 secondsSo this runs an evaluation, a cross-validation, inside of Distributed Weka. If we load this one – and I make it a little larger here – we can see that, apart from the ArffHeaderSparkJob and the RandomlyShuffleDataSparkJob, we now can see two components called the WekaClassifierEvaluationSparkJob. These are job entries that will perform a cross-validation out in Spark for us. There are two entries here, two WekaClassifierEvaluationSparkJob entries, because we’re comparing two classifiers under cross-validation. We’re comparing the Naive Bayes classifier again, and, in this case, the Random Forest classifier. Both of these will be evaluated under cross-validation inside of Spark. Let’s run the flow and see what happens. Switch to the log; we can watch some activity.

Skip to 7 minutes and 15 secondsThis will take a little bit longer than the previous job, as we’re running ten-fold cross-validation. Let’s take a look at the results in the TextViewer here. OK, we’ve got two entries, one for Naive Bayes, which is this first one here, and the second one is for the Random Forest classifier. As we can see in the textual output, the results look exactly the same as if you were to run a cross-validation in desktop Weka, and, similarly for the Random Forest classifier. Let’s consider how Distributed Weka performs a cross-validation. It actually involves two separate phases, or passes over the data. Phase one involves model construction, and phase two involves model evaluation. If we consider a simple three-fold cross-validation.

Skip to 8 minutes and 18 secondsWe know that the dataset gets split up into three distinct chunks during this process, and that models are created by training on two out of the three folds. So we end up with actually three models created, each of them trained on two-thirds of the data, and then we test them on the fold that was held out during the training process. In this example, our dataset in Spark is made up of two logical partitions. We can think of each of these partitions as containing part of each cross-validation fold. In this case, they would hold exactly half of each cross-validation fold, because we have two partitions. Each partition, as we know, is processed by a worker, or a map task.

Skip to 9 minutes and 4 secondsIn the model-building phase, the workers will build partial models, and there will be a model inside the worker that is created for one of the training splits of this cross-validation. So it will be a partial model. For example, we’ll have the first model created on folds two and three, or parts of folds two and three. Similarly, model two will be created on fold one and three, and model three on fold one and two. In each of these workers, these models are partial models, because they’ve only seen part of the data from those particular folds. In our example here, the map tasks will output a total of six partial models, two for each training split of the cross-validation.

Skip to 9 minutes and 58 secondsThis allows us to get parallelism involved in the reduce phase. We can run as many reduce tasks as there are models to be aggregated. Each reducer will have the goal of aggregating one of the models. So in our example here, the six partial models are aggregated to the three final models that you would expect from a three-fold cross-validation. The second phase of cross-validation is somewhat simpler than the first. It takes the models learned from the first phase and applies them to the holdout folds of our cross-validation in each of the logical partitions of our dataset. It uses them to evaluate each of those holdout folds.

Skip to 10 minutes and 51 secondsThe reduce task then takes all of the partial evaluation results coming out of the map tasks and aggregates them to one final full evaluation result, which is then written out to the file system. Over the last couple of lessons, we’ve looked at some of the example Knowledge Flow templates that come with Distributed Weka. We’ve looked at one that creates ARFF metadata and summary statistics for a dataset. We’ve looked at how Distributed Weka builds models and how it performs cross-validation. The next lesson will wrap things up and leave you with some directions for what to look at next with respect to Distributed Weka.

Map tasks and Reduce tasks

Map tasks produce models and a Reduce task aggregates them. Reduce strategies differ for Naive Bayes and other model types. We saw in the last lesson that Naive Bayes and JRip are treated differently. The reason is that Naive Bayes is easily parallelized by adding up frequency counts from the individual partitions, producing a single model. For JRip (and other classifiers), separate classifiers are learned for each partition (4 in this case), and a “vote” ensemble learner is produced that combines them. Also, for some classifiers (like JRip) it is beneficial to randomize the dataset before splitting it into partitions. Finally, we look at the “Spark: cross-validate two classifiers” template and examine how DIstributed Weka performs cross-validation.

Share this video:

This video is from the free online course:

Advanced Data Mining with Weka

The University of Waikato

Get a taste of this course

Find out what this course is like by previewing some of the course steps before you join:

Contact FutureLearn for Support