This content is taken from the University of Waikato’s online course, Data Mining with Weka.

Skip to 0 minutes and 11 seconds In this lesson, we’re going to start by looking at classification boundaries for different machine learning methods. We’re going to use Weka’s Boundary Visualizer, which is another Weka tool that we haven’t encountered yet. I’m going to use a 2-dimensional dataset. I’ve prepared iris.2d.arff.

Skip to 0 minutes and 33 seconds It’s a 2-dimensional version of the iris dataset. I took the regular iris dataset and deleted a couple of attributes – sepallength and sepalwidth – leaving me with this 2D dataset, and the class. We’re going to look at that using the Boundary Visualizer. You get that from this Visualization menu on the Weka Chooser. There are a lot of tools in Weka, and we’re just going to look at this one here, the Boundary Visualizer. I’m going to open the same file in the Boundary Visualizer, the 2-dimensional iris dataset. Here we’ve got a plot of the data. You can see that we’re plotting petalwidth on the y-axis against petallength on the x-axis.

Skip to 1 minute and 19 seconds This is a picture of the dataset with the 3 classes setosa in red, versicolor in green, and virginica in blue. I’m going to choose a classifier. Let’s begin with the OneR classifier, which is in rules.

Skip to 1 minute and 37 seconds I’m going to “plot training data” and just going to let it rip. The color diagram shows the decision boundaries, with the training data superimposed on it. Let’s look at what OneR does to this dataset in the Explorer.
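One way to think about the color diagram described above is as a grid: every cell of the 2D plane is handed to the classifier and painted with the predicted class. The following is a minimal sketch of that idea in plain Python, not Weka’s actual Boundary Visualizer code. The function names, the grid size, the axis ranges, and the stand-in classifier with its 0.8 threshold are all assumptions made for illustration.

```python
# Sketch of a boundary plot: classify every cell of a grid and
# record one letter per cell (text stands in for colors).
def render_boundaries(classify, x_range, y_range, cols=30, rows=10):
    """Return text rows (top row = largest y), one character per grid cell."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    lines = []
    for r in range(rows):
        # Sample cell centres; row 0 is the top of the plot.
        y = y_max - (r + 0.5) * (y_max - y_min) / rows
        row = ""
        for c in range(cols):
            x = x_min + (c + 0.5) * (x_max - x_min) / cols
            row += classify(x, y)[0]  # first letter of the predicted class
        lines.append(row)
    return lines


# Hypothetical stand-in classifier: one threshold on the y-axis attribute.
def toy_classifier(petallength, petalwidth):
    return "setosa" if petalwidth < 0.8 else "other"


# Axis ranges are assumed, roughly covering the iris petal measurements.
grid = render_boundaries(toy_classifier, (1.0, 7.0), (0.0, 2.5))
print("\n".join(grid))
```

Any function that maps a (petallength, petalwidth) pair to a class name can be passed in place of `toy_classifier`.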

Skip to 1 minute and 58 seconds OneR has chosen to split on petalwidth. If it’s less than a certain amount, we get a setosa; if it’s intermediate, we get a versicolor; and if it’s greater than the upper boundary, we get a virginica. It’s the same as what’s being shown here. We’re splitting on petalwidth. If it’s less than a certain amount, we get a setosa; in the middle, a versicolor; and at the top, a virginica. This is a spatial representation of the decision boundary that OneR creates on this dataset. That’s what the Boundary Visualizer does; it draws decision boundaries. It shows here that OneR chooses an attribute – in this case petalwidth – to split on.
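The rule just described can be written as a small function, which makes it clear why the picture consists of horizontal stripes: petallength never enters the decision. This is only a sketch. The two cut points below are hypothetical, since the lesson says only “a certain amount”, and the function name is made up.

```python
# Hypothetical cut points on petalwidth; the lesson does not state the values.
LOWER_CUT = 0.8
UPPER_CUT = 1.75


def one_r_predict(petallength, petalwidth):
    """One-attribute rule: petallength is accepted but deliberately ignored."""
    if petalwidth < LOWER_CUT:
        return "setosa"
    if petalwidth < UPPER_CUT:
        return "versicolor"
    return "virginica"


# Same petalwidth, very different petallength: the prediction does not change,
# so each class occupies a horizontal band across the whole plot.
predictions = {one_r_predict(length, 1.2) for length in (1.0, 3.5, 6.9)}
print(predictions)
```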

Skip to 2 minutes and 32 seconds It might have chosen petallength, in which case we’d have vertical decision boundaries. Either way, we’re going to get stripes from OneR. I’m going to go ahead and look at some boundaries for other schemes. Let’s look at IBk, which is a “lazy” classifier. That’s the instance-based learner we looked at in the last class. I’m going to run that. Here we get a different kind of pattern. I’d like to plot the training data. We’ve got diagonal lines. Down here are the setosas underneath this diagonal line; the versicolors in the intermediate region; and the virginicas, by and large, in the top right-hand corner. Remember what [IBk] does. It takes a test instance.

Skip to 3 minutes and 21 seconds Let’s say we had an instance here, just on this side of the boundary, in the red. Then it chooses the nearest instance to that. That would be this one, I guess. That’s kind of nearer than this one here. This is a red point. If I were to cross over the boundary here, it would choose a green class, because this would be the nearest instance then. If you think about it, this boundary goes halfway between this nearest red point and this nearest green point. Similarly, if I take a point up here, I guess the two nearest instances are this blue one and this green one. This blue one is closer.

Skip to 4 minutes and 5 seconds In this case, the boundary goes along this straight line here.

Skip to 4 minutes and 8 seconds You can see that it’s not just a single line: this is a piecewise linear line, so this part of the boundary goes exactly halfway between these two points quite close to it. Down here, the boundary goes exactly halfway between these two points. It’s the perpendicular bisector of the line joining these points. So we get a piecewise linear boundary made up of little pieces.
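The perpendicular-bisector argument can be checked numerically. The sketch below uses two hypothetical training points (the coordinates are invented, not taken from the iris data) and a plain 1-nearest-neighbor rule with Euclidean distance. It is an illustration of the geometry, not IBk’s implementation.

```python
import math


def nearest_neighbor(train, query):
    """train: list of ((x, y), label). Returns the label of the closest point."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


# Two hypothetical training points: one red (setosa), one green (versicolor).
red, green = (1.9, 0.4), (3.0, 1.1)
train = [(red, "setosa"), (green, "versicolor")]

# The midpoint of the segment joining them lies on the perpendicular bisector.
mid = ((red[0] + green[0]) / 2, (red[1] + green[1]) / 2)

# Unit step from the red point towards the green point.
d = math.dist(red, green)
step = ((green[0] - red[0]) / d, (green[1] - red[1]) / d)

# Nudge a query point slightly to either side of the midpoint.
just_red = (mid[0] - 0.01 * step[0], mid[1] - 0.01 * step[1])
just_green = (mid[0] + 0.01 * step[0], mid[1] + 0.01 * step[1])

print(nearest_neighbor(train, just_red), nearest_neighbor(train, just_green))
```

The midpoint is equidistant from both points, and the predicted class flips as soon as the query crosses it.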

Skip to 4 minutes and 31 seconds It’s kind of interesting to see what happens if we change the parameter: if we look at, say, 5 nearest neighbors instead of just 1. Now we get a slightly blurry picture, because whereas down here in the pure red region the 5 nearest neighbors to a point are all red points, if we look in the intermediate region here, then the nearest neighbors to a point here – this is going to be in the 5, and this might be another one in the 5, and there might be a couple more down here in the 5. So we get an intermediate color here, and IBk takes a vote.

Skip to 5 minutes and 13 seconds If we had 3 reds and 2 greens, then we’d be in the red region and that would be depicted as this darker red here. If it had been the other way round with more greens than reds, we’d be in the green region. So we’ve got a blurring of these boundaries. These are probabilistic descriptions of the boundary. Let me just change k to 20 and see what happens.
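The vote among the 5 nearest neighbors can be sketched as follows. This is plain proportional voting with Euclidean distance and is an assumption about how the blurring arises. It is not claimed to match IBk’s exact probability estimates. The training points and the function name are invented for the example.

```python
from collections import Counter
import math


def knn_class_distribution(train, query, k):
    """Return the fraction of the k nearest neighbors that belong to each class."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / k for label, n in counts.items()}


# Hypothetical points near the red/green boundary, plus one far-away blue point.
train = [
    ((2.0, 0.5), "setosa"), ((2.1, 0.6), "setosa"), ((2.2, 0.5), "setosa"),
    ((2.6, 0.9), "versicolor"), ((2.7, 1.0), "versicolor"),
    ((5.5, 2.0), "virginica"),
]

# The 5 nearest neighbors of this query are 3 reds and 2 greens.
dist = knn_class_distribution(train, (2.3, 0.7), k=5)
winner = max(dist, key=dist.get)
print(dist, winner)
```

A 3-to-2 vote still predicts red, but with less certainty than a 5-to-0 vote, which is what the intermediate shades in the plot convey.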

Skip to 5 minutes and 43 seconds Now we get the same shape, but even more blurry boundaries. The Boundary Visualizer reveals the way that machine learning schemes are thinking, if you like: their internal representation of the dataset. It helps you think about the sorts of things that machine learning methods do. Let’s choose another scheme. I’m going to choose NaiveBayes. When we talked about NaiveBayes, we only talked about discrete attributes. With continuous attributes, I’m going to choose a supervised discretization method. Don’t worry about this detail; it’s the most common way of using NaiveBayes with numeric attributes.

Skip to 6 minutes and 31 seconds Let’s look at that picture. This is interesting. When you think about NaiveBayes, it treats each of the two attributes as contributing equally and independently to the decision. It sort of decides what it should be along this dimension and decides what it should be along this dimension and multiplies the two together. Remember the multiplication that went on in NaiveBayes. When you multiply these things together, you get a checkerboard pattern of probabilities, multiplying up the probabilities. That’s because the attributes are being treated independently. That’s a very different kind of decision boundary from what we saw with instance-based learning.
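The multiplication that produces the checkerboard can be sketched with a toy model. Each attribute is assumed to have been discretized into two bins, only two classes are used to keep the table small, and every probability below is made up for illustration. None of these numbers come from the iris data or from Weka.

```python
# Hypothetical likelihoods P(bin | class) for each discretized attribute.
P_LENGTH = {
    "setosa": {"short": 0.9, "long": 0.1},
    "versicolor": {"short": 0.2, "long": 0.8},
}
P_WIDTH = {
    "setosa": {"narrow": 0.9, "wide": 0.1},
    "versicolor": {"narrow": 0.3, "wide": 0.7},
}
PRIOR = {"setosa": 0.5, "versicolor": 0.5}  # assumed equal priors


def naive_bayes_posterior(length_bin, width_bin):
    """Multiply prior and per-attribute likelihoods, then normalize."""
    scores = {
        c: PRIOR[c] * P_LENGTH[c][length_bin] * P_WIDTH[c][width_bin]
        for c in PRIOR
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# One posterior per (length bin, width bin) cell: a 2 x 2 checkerboard.
checkerboard = {
    (l, w): naive_bayes_posterior(l, w)
    for l in ("short", "long")
    for w in ("narrow", "wide")
}
for cell, posterior in checkerboard.items():
    print(cell, posterior)
```

Because each attribute contributes its own factor, the posterior is constant inside each rectangular cell and changes only at bin edges, which is why the regions line up with the axes.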

Skip to 7 minutes and 11 seconds That’s what’s so good about the Boundary Visualizer: it helps you think about how things are working inside. I’m going to do one more example. I’m going to do J48, which is in trees.

Skip to 7 minutes and 28 seconds Here we get this kind of structure. Let’s take a look at what happens in the Explorer if we choose J48.

Skip to 7 minutes and 43 seconds We get this little decision tree: split first on petalwidth; if it’s less than 0.6 it’s a setosa for sure. Then split again on petalwidth; if it’s greater than 1.7, it’s a virginica for sure. Then, in between, split on petallength and then again on petalwidth, getting a mixture of versicolors and virginicas. We split first on petalwidth; that’s this split here. Remember the vertical axis is the petalwidth axis. If it’s less than a certain amount, it’s a setosa for sure. Then we split again on the same axis. If it’s greater than a certain amount, it’s a virginica for sure. If it’s in the intermediate region, we split on the other axis, which is petallength.

Skip to 8 minutes and 34 seconds Down here, it’s a versicolor for sure, and here we’re going to split again on the petalwidth attribute. Let’s change the minNumObj parameter, which controls the minimum size of the leaves. If we increase that, we’re going to get a simpler tree. We discussed this parameter in one of the lessons of Class 3. If we run now, then we get a simpler version, corresponding to the simpler rules we get with this parameter set. Or we can set the parameter to a higher value, say 10, and run it again. We get even simpler rules, very similar to the rules produced by OneR. We’ve looked at classification boundaries.
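The tree walked through above (with the default settings, before minNumObj is increased) can be sketched as nested tests. The 0.6 and 1.7 thresholds on petalwidth come from the lesson. The petallength threshold, the final petalwidth threshold, the labels on the last two leaves, and the choice of `<=` rather than `<` are assumptions added to make the sketch runnable.

```python
# Thresholds stated in the lesson.
SETOSA_CUT = 0.6      # petalwidth below this: setosa
VIRGINICA_CUT = 1.7   # petalwidth above this: virginica

# Hypothetical values for the two splits the lesson does not quantify.
LENGTH_CUT = 4.9      # assumed petallength threshold
INNER_WIDTH_CUT = 1.5  # assumed second petalwidth threshold


def tree_predict(petallength, petalwidth):
    """Axis-parallel tests, one attribute at a time, as in a decision tree."""
    if petalwidth <= SETOSA_CUT:
        return "setosa"
    if petalwidth > VIRGINICA_CUT:
        return "virginica"
    # Intermediate petalwidth: switch to the other axis.
    if petallength <= LENGTH_CUT:
        return "versicolor"
    # Final split on petalwidth again; leaf labels here are assumed.
    if petalwidth <= INNER_WIDTH_CUT:
        return "virginica"
    return "versicolor"


print(tree_predict(1.4, 0.2), tree_predict(4.0, 1.3), tree_predict(6.0, 2.3))
```

Every test compares a single attribute with a constant, so the regions are rectangles. Raising minNumObj prunes the inner tests, leaving only the outer petalwidth splits, which gives the OneR-like stripes mentioned above.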

Skip to 9 minutes and 27 seconds Classifiers create boundaries in instance space and different classifiers have different capabilities for carving up instance space. That’s called the “bias” of the classifier – the way in which it’s capable of carving up the instance space. We looked at OneR, IBk, NaiveBayes, and J48, and found completely different biases, completely different ways they carve up the instance space. Of course, this kind of visualization is restricted to numeric attributes and 2-dimensional plots, so it’s not a very general tool, but it certainly helps you think about these different classifiers.

Classification boundaries

Different classifiers are biased towards different kinds of decision. You can see this by examining classification boundaries for various machine learning methods trained on a 2D dataset with numeric attributes. Here we use Weka’s Boundary Visualizer to plot boundaries for some example classifiers: OneR, IBk, Naive Bayes, and J48. Characteristic patterns appear.
