Skip to 0 minutes and 11 seconds Hello! My name is Mark Hall. I’m a software architect and data mining consultant with the Pentaho Corporation. I live here in New Zealand, not very far away from where the Weka software was originally developed. This lesson is about using Weka in a distributed processing framework, such as Spark or Hadoop. Distributed Weka is a plugin for Weka 3.7 that allows Weka algorithms to run on a cluster of machines. You would use this when your dataset is too large to load into main memory on your desktop, or when you’re applying an algorithm that would just take too long to run on a single machine. You have already covered data stream mining.
Skip to 0 minutes and 53 seconds You saw sequential online algorithms that can be used to handle large datasets in the MOA framework, and also inside of Weka using MOA. Distributed Weka works with distributed processing frameworks that use something called map-reduce. So this is a little bit different. It’s more suited to large, offline, batch-based processing scenarios. Essentially, it is a divide-and-conquer approach: your data is divided up over the nodes in the processing cluster – the machines in the cluster – and each piece is conquered independently of the other pieces. More on map-reduce shortly. The Distributed Weka plugin is actually made up of two packages. First, there is something called distributedWekaBase.
Skip to 1 minute and 38 seconds This is a package that provides general map-reduce-style tasks for machine learning that are not tied to any particular map-reduce framework implementation. We’ll discuss map-reduce in just a second. It includes tasks for training classifiers and clusterers, and computing summary statistics and correlations from the data. A second package is needed in order to apply the base package – or the algorithms in the base package – within a particular implementation of the map-reduce programming model. So in this lesson, we’re going to be looking at an implementation for the Spark distributed processing environment. So we will need to install something called distributedWekaSpark, as well. This is a wrapper for the base tasks that works on the Spark platform.
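The split between a framework-independent base package and a framework-specific wrapper can be sketched in a few lines of Python. This is only an illustration of the layering idea, not the Distributed Weka API: the names `BaseTask`, `MeanTask`, `run_sequential` and `run_threaded` are hypothetical, and a thread pool stands in for a real cluster framework such as Spark.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class BaseTask(ABC):
    """Hypothetical framework-agnostic task: it knows nothing about the cluster."""

    @abstractmethod
    def map_split(self, rows):
        """Turn one split of the data into a partial result."""

    @abstractmethod
    def reduce_partials(self, partials):
        """Combine all partial results into the final result."""


class MeanTask(BaseTask):
    """Computes the mean of a numeric column from (sum, count) partials."""

    def map_split(self, rows):
        return (sum(rows), len(rows))

    def reduce_partials(self, partials):
        total = sum(s for s, _ in partials)
        count = sum(n for _, n in partials)
        return total / count


def run_sequential(task, splits):
    """Wrapper 1 (assumed): run every map one after another in this process."""
    return task.reduce_partials([task.map_split(s) for s in splits])


def run_threaded(task, splits):
    """Wrapper 2 (assumed): run the maps in a thread pool; the task is unchanged."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(task.map_split, splits))
    return task.reduce_partials(partials)


# Three disjoint splits of one numeric column.
splits = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean_sequential = run_sequential(MeanTask(), splits)
mean_threaded = run_threaded(MeanTask(), splits)
```

The point of the sketch is that `MeanTask` is written once and both wrappers reuse it unchanged, which is the role the transcript gives to the base package and its platform-specific wrappers.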
Skip to 2 minutes and 29 seconds There is also a package – or several, actually – that work with Hadoop, depending on which version or flavor of Hadoop you have installed. Now, let’s return to map-reduce. Map-reduce is the main processing model used by distributed frameworks such as Spark and Hadoop.
Skip to 2 minutes and 47 seconds Map-reduce programs involve two phases: a map phase followed by a reduce phase. To start with, we have a dataset, probably a large dataset. This dataset is divided up into disjoint subsets. The framework takes care of doing this for us. It then feeds a split of the data, a subset of the data, into a map task. Now, map tasks do their processing independently of all other map tasks; they are not aware of any of the other data splits or what the other tasks that may be running in parallel are doing. So the kind of operations that map tasks do include sorting the data, perhaps, filtering it in some way, or computing some kind of partial result.
Skip to 3 minutes and 32 seconds The outputs of map tasks are these partial results, each associated with a key value. Now, the key values allow the framework to group together related intermediate results and pass them on to reduce tasks. The reduce tasks’ job is to take all of the values associated with one distinct key and aggregate them in some fashion. So they may count or add or do some averaging, or some kind of aggregation which produces a final result. Now, the job of the map-reduce framework itself is to provide all of this orchestration. So as I said, it handles splitting up the data for us; it handles invoking and initializing the map and reduce tasks; it provides redundancy and fault-tolerance, as well.
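The map and reduce phases just described can be sketched in plain Python, without any cluster. This is a minimal single-process illustration, not how Spark or Hadoop are implemented: the function names (`split_dataset`, `map_task`, `shuffle`, `reduce_task`) and the toy dataset are assumptions made for the example. Each row is a (key, value) pair, and the job computes the mean value per key.

```python
from collections import defaultdict


def split_dataset(rows, num_splits):
    """Divide the rows into disjoint subsets (normally the framework's job)."""
    return [rows[i::num_splits] for i in range(num_splits)]


def map_task(split):
    """Emit (key, partial result) pairs for one split, seeing no other split."""
    partial = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for key, value in split:
        partial[key][0] += value
        partial[key][1] += 1
    return [(key, tuple(sum_count)) for key, sum_count in partial.items()]


def shuffle(mapped_outputs):
    """Group the partial results from all map tasks by their key."""
    grouped = defaultdict(list)
    for output in mapped_outputs:
        for key, partial in output:
            grouped[key].append(partial)
    return grouped


def reduce_task(key, partials):
    """Aggregate every partial result for one key into a final mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return key, total / count


# Toy dataset (made up for this example).
rows = [("yes", 2.0), ("no", 10.0), ("no", 20.0),
        ("yes", 4.0), ("yes", 6.0), ("no", 30.0)]

splits = split_dataset(rows, num_splits=2)
grouped = shuffle([map_task(s) for s in splits])
means = dict(reduce_task(key, partials) for key, partials in grouped.items())
```

Each map task emits a (sum, count) pair rather than a mean, because sums and counts from different splits can be added together exactly, whereas per-split means cannot simply be averaged when the splits hold different numbers of rows for a key.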
Skip to 4 minutes and 21 seconds So if there is some failure out on the cluster which causes map tasks to abort processing before they finish, or were a reduce task to fail, the framework will take care of ensuring that there are additional map and reduce tasks that can be started up to take care of and complete the processing. OK. The design goals of Distributed Weka were to provide a similar experience to that of using standalone desktop Weka. It enables you to use any classification or regression learner in Weka and also has some support for clustering, as well. It also generates output, including evaluation output that looks just like that produced by standard desktop Weka. The models that are output from Distributed Weka are normal Weka models.
Skip to 5 minutes and 10 seconds That means they can be saved to your file system, loaded into desktop Weka at a later stage, and used for making predictions – just like any other Weka model. One thing that wasn’t a goal of the package – initially, at least – was to provide distributed implementations of every algorithm in Weka. One exception to this is k-means clustering, which was written specifically to work within a framework such as Spark or Hadoop. So we’ll see exactly how Weka handles distributing different types of models in a later lesson. That’s pretty much the end of our first lesson on Distributed Weka. We learned what Distributed Weka is.
Skip to 5 minutes and 52 seconds We learned when you would want to use it, under what conditions you would want to use it. We’ve learned what map-reduce is, and we’ve taken a look at the basic design goals of Distributed Weka.
What is distributed Weka?
Mark Hall from Pentaho introduces a plugin that runs Weka on a cluster of machines. It uses the “map-reduce” processing model, and operates with both Spark and Hadoop. It comprises two Weka packages: distributedWekaBase, which provides general map-reduce tasks for machine learning that are not tied to any particular map-reduce implementation, and distributedWekaSpark, a wrapper for the base tasks that operates on the Spark platform. (There are also packages for Hadoop.) The aim is to support all of Weka’s classification and regression algorithms without reimplementing them, generating output just like that produced by standard Weka. The k-means clustering algorithm, however, was written specifically for the distributed framework.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.