First of all, we’re going to see what it’s like to be a classifier. We’re going to construct a decision tree ourselves, interactively.
I’m going to open up Weka here: the Weka Explorer.
I’m going to load a dataset, the segment-challenge dataset: segment-challenge.arff – that’s the one I want.
We’re going to look at this dataset. Let’s first of all look at the class. The class values are brickface, sky, foliage, cement, window, path, and grass. It looks like this is an image analysis dataset. When we look at the attributes, we see things like the centroid of columns and rows, pixel counts, line densities, means of intensities, and various other things, saturation, hue.
And the class, as I said before, is different kinds of texture: bricks, sky, foliage, and so on. That’s the segment challenge dataset. I’m going to select the “user classifier”. The user classifier is a tree classifier. We’ll see what it does in just a minute. That’s the user classifier. Before I start – this is really quite important – I’m going to use a supplied test set. I’m going to set the test set, which is used to evaluate the classifier, to be segment-test. The training set is segment-challenge; the test set is segment-test. Now we’re all set. I’m going to start the classifier.
What we see is a window with two panels: the Tree Visualizer and the Data Visualizer. Let’s start with the Data Visualizer. We looked at visualization in the last class, how you can select different attributes for x and y. I’m going to plot the region-centroid-row against the intensity-mean.
Now we’re going to select a class. I’m going to select Rectangle.
If I draw out with my mouse a rectangle here, I’m going to have a rectangle that’s pretty well pure reds, as far as I can see. I’m going to “submit” this rectangle. You can see that that area has gone and the picture has been rescaled. I’m building up a tree here. If I look at the Tree Visualizer, I’ve got a tree.
We’ve split on these two attributes, region-centroid-row and intensity-mean. Here we’ve got “sky”, these are all sky classes. Here we’ve got a mixture of brickface, foliage, cement, window, path, and grass. We’re going to build up this tree. What I want to do is take this node and refine it a bit more. Here’s the Data Visualizer again. I’m going to select a rectangle containing these items here, and submit that. They’ve gone from this picture. You can see that here, I’ve created this split, another split on region-centroid-row and intensity-mean, and here, this is almost all “path” – 233 “path” instances – and then a mixture here. This is a pure node we’ve got over there. This is almost a pure node here.
This is the one I want to work on. I’m going to cover some of those instances now. Let’s take this lot here and submit that. Then I’m going to take this lot here and submit that. Maybe I’ll take those ones there and submit that. This little cluster here seems pretty uniform. Submit that. I haven’t actually changed the axes, but, of course, at any time, I could change these axes to better separate the remaining classes. I could mess around with these.
Actually, a quick way to do it is to click here on these bars: left click for x and right click for y. I can quickly explore different pairs of axes to see if I can get a better split.
Here’s the tree I’ve created. I’m going to fit it to the screen. It looks like this. You can see that we have successively elaborated down this branch here. When I finish with this, I can “accept” the tree. Actually, before I do that, let me just show you that we were selecting rectangles here,
but I’ve got other things I can select: a polygon or a polyline. If I don’t want to use rectangles, I can use polygons or polylines. If you like, you can experiment with those to select different shaped areas.
There’s an area I’ve got selected … I just can’t quite finish it off … ah right, I right-clicked to finish it off. I could submit that. I’m not confined to rectangles; I can use different shapes. I’m not going to do that. I’m satisfied with this tree for the moment. I’m going to “accept” the tree. Once I do this, there is no going back, so you want to be sure. If I accept the tree, “Are you sure?” – yes. Here, I’ve got a confusion matrix, and I can look at the errors. My tree classifies 78% of the instances correctly, nearly 79% correctly, and 21% incorrectly. That’s not too bad, especially considering how quickly I built that tree.
It’s over to you now. I’d like you to play around and see if you can do better than this by spending a little bit longer on getting a nice tree. I’d like you to reflect on a couple of things. First of all, what strategy you’re using to build this tree. Basically, we’re covering different regions of the instance space, trying to get pure regions, to create pure branches. This is like a bottom-up covering strategy. We cover this area, and this area, and this area. Actually, that’s not how J48 works. When it builds trees, it tries to do a judicious split through the whole dataset.
At the very top level, it’ll split the entire dataset into two in a way that doesn’t necessarily separate out particular classes, but makes it easier when it starts working on each half of the dataset, further splitting in a top-down manner in order to try and produce an optimal tree. It will produce trees much better than the one that I just produced with the user classifier. I’d also like you to reflect on what it is we’re trying to do here. Given enough time, you could produce a “perfect” tree for the dataset, but don’t forget that the dataset we’ve loaded is the training dataset.
We’re going to evaluate this tree on a different dataset, the test dataset, which hopefully comes from the same source, but is not identical to the training dataset. We’re not trying to precisely fit the training dataset; we’re trying to fit it in a way that generalizes the kinds of patterns exhibited in the dataset. We’re looking for something that will perform well on the test data. That highlights the importance of evaluation in machine learning.