One of the constantly recurring themes in this course is the necessity to get close to your data, look at it in every possible way. We’re going to use the Visualize panel. I’m going to open the Iris dataset.
I’m using it because it has numeric attributes, four numeric attributes: sepallength, sepalwidth, petallength, petalwidth.
The class is one of three kinds of iris flower: Iris-setosa, Iris-versicolor, and Iris-virginica. Let’s go to the Visualize panel and visualize this data. There is a matrix of two-dimensional plots, a 5×5 matrix of plots. I can select one of these plots; I’m going to be looking at a plot of sepalwidth on the x-axis and petalwidth on the y-axis. That’s a plot of the data. The colors correspond to the three classes. I can actually change the colors – if I don’t like those, I could select another color. But I’m going to leave them the way they are. I can look at individual data points by clicking on them.
This is talking about instance number 86 with a sepallength of 6, sepalwidth of 3.4, and so on. That’s a versicolor, which is why this spot is colored red. We can look at individual instances. We can change the x- and y-axes by changing the menus here. Better still, if we click on this little set of bars here, these represent the attributes. I’m going to click on this and the x-axis will change to sepallength. Here the x-axis is sepalwidth. Here the x-axis is petallength, and so on. If I right-click, then it will change the y-axis to sepallength. So I can quickly browse around these different plots. There is a Jitter slider.
Sometimes points sit right on top of each other, and jitter just adds a little bit of randomness to the x- and the y-axes. With a little bit of jitter on here, the darker spots represent multiple instances. If I click on one of those, I can see that that point represents three separate instances, all of class Iris-setosa, and they all have the same value of petallength and sepalwidth – both of which are being plotted on this graph. The sepalwidth and petallength are 3.0 and 1.4 for each of the three instances. If I click another one here, here are two with very similar sepalwidths and petallengths; both have the class versicolor.
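To make the idea of jitter concrete, here is a minimal sketch in plain Python. It is not Weka's code: the uniform noise, the `amount` parameter, and the function names `overlapping` and `jitter` are assumptions chosen for illustration, and the handful of (sepalwidth, petallength) pairs is only a tiny sample. It first counts points that sit exactly on top of each other, then nudges every point by a small random offset so they can be told apart.

```python
import random
from collections import Counter

# A few (sepalwidth, petallength) pairs; the repeated pair mimics
# instances that land on exactly the same spot in the plot.
points = [(3.0, 1.4), (3.0, 1.4), (3.0, 1.4), (3.2, 4.7), (3.3, 6.0)]


def overlapping(pts):
    """Return each coordinate pair that occurs more than once, with its count."""
    return {p: n for p, n in Counter(pts).items() if n > 1}


def jitter(pts, amount, rng):
    """Shift every point by a uniform random offset in [-amount, +amount] on both axes.

    Uniform noise is an assumption made for this sketch.
    """
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in pts
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
jittered = jitter(points, 0.05, rng)

print(overlapping(points))  # which spots hide several instances
print(jittered)             # the same points, slightly spread out
```

Only the plotted positions are perturbed here; the original `points` list is left unchanged, which matches the idea that jitter is a display aid and not a change to the data.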
The jitter slider helps you distinguish between points that are in fact very close together. Another thing we can do is select bits of this dataset. I’m going to choose “select rectangle” here. If I draw a rectangle now, I can select these points. If I were to “submit” this rectangle, then all other points would be excluded and just these points would appear on the graph, with the axis re-scaled appropriately. Here we go. I’ve submitted that rectangle, and you can see that there are just the red points and green points there.
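Selecting and submitting a rectangle amounts to filtering the instances by two coordinate ranges and then re-scaling the axes to what is left. The following plain-Python sketch illustrates that idea under stated assumptions: the function names `select_rectangle` and `axis_range`, the rectangle bounds, and the four sample rows are all chosen for illustration and are not taken from Weka.

```python
# Each instance: (sepalwidth, petalwidth, class); a tiny illustrative sample.
instances = [
    (3.5, 0.2, "Iris-setosa"),
    (3.2, 1.4, "Iris-versicolor"),
    (3.3, 2.5, "Iris-virginica"),
    (2.7, 1.9, "Iris-virginica"),
]


def select_rectangle(rows, x_min, x_max, y_min, y_max):
    """Keep only rows whose (x, y) lies inside the rectangle, edges included."""
    return [r for r in rows if x_min <= r[0] <= x_max and y_min <= r[1] <= y_max]


def axis_range(rows, index):
    """Smallest and largest value on one axis, used to re-scale the plot."""
    values = [r[index] for r in rows]
    return min(values), max(values)


# "Submit" a rectangle that leaves out the setosa point.
kept = select_rectangle(instances, 2.5, 3.4, 1.0, 2.6)

print(kept)
print(axis_range(kept, 0), axis_range(kept, 1))  # new x and y ranges
```

Saving `kept` as a new dataset would correspond to the "save" step; resetting simply means going back to `instances`.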
I could save that as a different dataset if I wanted, or I can reset it and maybe try another kind of selection like this, where I’m going to have some blue points, some red, and some green points, and see what that looks like. This might be a way of cleaning up outliers in your data, by selecting rectangles and saving the new dataset. That’s visualizing the dataset itself. What about visualizing the result of a classifier? Let’s get rid of this Visualize panel and go back to the Preprocess panel. I’m going to use a classifier. I’m going to use, guess what, J48. Let’s find it under “trees”. I’m going to run it.
Then if I right-click on this entry here in the “Result list” area, I can view classifier errors. Here we’ve got the class plotted against the predicted class. The square boxes represent errors. I can of course change the different axes if I want; I can change the x-axis and the y-axis. But I’m going to go back to “class” and “predictedclass”.
If I click on one of these boxes, I can see where the errors are. There are two instances where the predicted class is versicolor and the actual class is virginica. We can see these in the confusion matrix. The actual class is virginica, and the predicted class is versicolor, that’s ‘b’. This “2” entry in the confusion matrix is represented by these two instances here. If I look at another point, say this one, here I’ve got one instance, which is in fact a setosa, predicted to be a versicolor. I can look at this plot and find out where the misclassifications are actually occurring, the errors in the confusion matrix.
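The link between the error plot and the confusion matrix can be sketched in a few lines of plain Python. The `actual` and `predicted` lists below are made-up toy data (they are not J48's output on Iris), and the function names are assumptions for illustration. Rows of the matrix are actual classes and columns are predicted classes, and each off-diagonal count corresponds to specific instances that can be listed individually.

```python
from collections import Counter

labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Made-up toy predictions, chosen so that two virginicas and one setosa
# are predicted as versicolor.
actual = ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor",
          "Iris-virginica", "Iris-virginica", "Iris-virginica"]
predicted = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-versicolor",
             "Iris-versicolor", "Iris-versicolor", "Iris-virginica"]


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def errors(actual, predicted):
    """List (instance index, actual, predicted) for every misclassified instance."""
    return [(i, a, p) for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]


for row in confusion_matrix(actual, predicted, labels):
    print(row)
for index, a, p in errors(actual, predicted):
    print(f"instance {index}: actual {a}, predicted {p}")
```

Each tuple returned by `errors` plays the role of one square box in the plot: it names the instance behind an off-diagonal entry of the matrix.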
Get down and dirty with your data: visualize it. You can do all sorts of things. You can clean it up, detect outliers. You can look at the classification errors. For example, there’s a filter that allows you to add the classifications as a new attribute. Let’s just go and have a look at that. I’m going to go and find a filter. We’re going to add an attribute. It’s supervised because it uses the “class”.
Add an attribute: AddClassification. Here I get to choose in the configuration panel the machine learning scheme. I’m going to choose J48, of course, and I’m going to output the classification – make that “true”. That’s configured it, and I’m going to apply it. It will add a new attribute. It’s done it, and this attribute is the classification according to J48. Weka is very powerful. You can do all sorts of things with classifiers and filters.
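Conceptually, adding the classification as a new attribute means running a classifier over every instance and storing its prediction in an extra column. Here is a minimal plain-Python sketch of that idea. It does not use Weka: `toy_classifier` is a hypothetical hand-written threshold rule standing in for a trained model, and `add_classification` and the attribute name `classification` are names assumed for this illustration.

```python
# Three illustrative instances with two attributes and the class.
rows = [
    {"petallength": 1.4, "petalwidth": 0.2, "class": "Iris-setosa"},
    {"petallength": 4.7, "petalwidth": 1.4, "class": "Iris-versicolor"},
    {"petallength": 6.0, "petalwidth": 2.5, "class": "Iris-virginica"},
]


def toy_classifier(row):
    """Hypothetical stand-in for a trained model: thresholds on petalwidth only."""
    if row["petalwidth"] <= 0.6:
        return "Iris-setosa"
    if row["petalwidth"] <= 1.7:
        return "Iris-versicolor"
    return "Iris-virginica"


def add_classification(rows, classify, name="classification"):
    """Return copies of the rows with the classifier's prediction as a new attribute."""
    return [{**row, name: classify(row)} for row in rows]


augmented = add_classification(rows, toy_classifier)
for row in augmented:
    print(row["class"], "->", row["classification"])
```

With the prediction stored next to the true class, the instances where the two columns disagree can be filtered or plotted like any other attribute.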