Skip to 0 minutes and 11 seconds One of the constantly recurring themes in this course is the necessity to get close to your data, look at it in every possible way. We’re going to use the Visualize panel. I’m going to open the Iris dataset.
Skip to 0 minutes and 23 seconds I’m using it because it has numeric attributes, four numeric attributes: sepallength, sepalwidth, petallength, petalwidth.
Skip to 0 minutes and 34 seconds The classes are three kinds of iris flower: Iris-setosa, Iris-versicolor, and Iris-virginica. Let’s go to the Visualize panel and visualize this data. There is a matrix of two-dimensional plots, a 5×5 matrix of plots. If I select one of these plots, I’m going to be looking at a plot of sepalwidth on the x-axis and petalwidth on the y-axis. That’s a plot of the data. The colors correspond to the three classes. I can actually change the colors – if I don’t like those, I could select another color. But I’m going to leave them the way they are. I can look at individual data points by clicking on them.
Skip to 1 minute and 21 seconds This is talking about instance number 86 with a sepallength of 6, sepalwidth of 3.4, and so on. That’s a versicolor, which is why this spot is colored red. We can look at individual instances. We can change the x- and y-axes by changing the menus here. Better still, if we click on this little set of bars here, these represent the attributes. I’m going to click on this and the x-axis will change to sepallength. Here the x-axis is sepalwidth. Here the x-axis is petallength, and so on. If I right-click, then it will change the y-axis to sepallength. So I can quickly browse around these different plots. There is a Jitter slider.
Skip to 2 minutes and 15 seconds Sometimes points sit right on top of each other, and jitter just adds a little bit of randomness to the x- and the y-axes. With a little bit of jitter on here, the darker spots represent multiple instances. If I click on one of those, I can see that that point represents three separate instances, all of class Iris-setosa, and they all have the same value of petallength and sepalwidth – both of which are being plotted on this graph. The sepalwidth and petallength are 3.0 and 1.4 for each of the three instances. If I click another one here, here are two with very similar sepalwidths and petallengths, both have the class versicolor.
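The idea behind jitter can be sketched in a few lines of Python. This is a minimal illustration, not Weka's own implementation: the function name `jitter`, the `amount` parameter, and the use of uniform noise are all assumptions made for the example. It takes three instances that share the same (sepalwidth, petallength) values, as in the plot above, and nudges each one by a small random offset so that they no longer sit on top of each other.

```python
import random


def jitter(points, amount, seed=0):
    """Return a copy of `points` with uniform noise in [-amount, amount]
    added independently to the x and y coordinate of every point.

    Illustrative only: Weka's Jitter slider may compute its offsets differently.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Three instances with identical (sepalwidth, petallength) values.
points = [(3.0, 1.4), (3.0, 1.4), (3.0, 1.4)]
jittered = jitter(points, amount=0.05)

print(len(set(points)))    # number of distinct plot positions before jitter
print(len(set(jittered)))  # number of distinct plot positions after jitter
```

The noise only changes where the points are drawn; the underlying attribute values are untouched, which is why clicking a jittered point still reports the original values.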
Skip to 3 minutes and 4 seconds The jitter slider helps you distinguish between points that are in fact very close together. Another thing we can do is select bits of this dataset. I’m going to choose “select rectangle” here. If I draw a rectangle now, I can select these points. If I were to “submit” this rectangle, then all other points would be excluded and just these points would appear on the graph, with the axis re-scaled appropriately. Here we go. I’ve submitted that rectangle, and you can see that there are just the red points and green points there.
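As a rough sketch of what submitting a rectangle does, the Python below keeps only the instances that fall inside the rectangle and then recomputes the axis ranges from the points that remain. The helper names `select_rectangle` and `axis_range` are hypothetical, and the five instances are made-up values in the style of the Iris data, not rows copied from the dataset.

```python
def select_rectangle(instances, x_attr, y_attr, x_min, x_max, y_min, y_max):
    """Keep only the instances whose (x_attr, y_attr) values lie inside
    the rectangle, edges included."""
    return [
        inst for inst in instances
        if x_min <= inst[x_attr] <= x_max and y_min <= inst[y_attr] <= y_max
    ]


def axis_range(instances, attr):
    """Smallest and largest value of `attr`, used to re-scale an axis."""
    values = [inst[attr] for inst in instances]
    return min(values), max(values)


# Made-up instances for illustration only.
instances = [
    {"sepalwidth": 3.5, "petalwidth": 0.2, "class": "Iris-setosa"},
    {"sepalwidth": 3.0, "petalwidth": 1.4, "class": "Iris-versicolor"},
    {"sepalwidth": 2.8, "petalwidth": 1.5, "class": "Iris-versicolor"},
    {"sepalwidth": 3.0, "petalwidth": 2.1, "class": "Iris-virginica"},
    {"sepalwidth": 3.8, "petalwidth": 2.2, "class": "Iris-virginica"},
]

selected = select_rectangle(instances, "sepalwidth", "petalwidth",
                            x_min=2.5, x_max=3.2, y_min=1.0, y_max=2.5)

print([inst["class"] for inst in selected])
print(axis_range(selected, "sepalwidth"))
print(axis_range(selected, "petalwidth"))
```

In this toy example the rectangle leaves only versicolor and virginica instances, and both axes shrink to the range covered by the selected points.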
Skip to 3 minutes and 39 seconds I can save that if I wanted as a different dataset, or I can reset it and maybe try another kind of selection like this, where I’m going to have some blue points, some red, and some green points, and see what that looks like. This might be a way of cleaning up outliers in your data, by selecting rectangles and saving the new dataset. That’s visualizing the dataset itself. What about visualizing the result of a classifier? Let’s get rid of this Visualize panel and go back to the Preprocess panel. I’m going to use a classifier. I’m going to use, guess what, J48. Let’s find it under “trees”. I’m going to run it.
Skip to 4 minutes and 25 seconds Then if I right-click on this entry here in the “log” area, I can view classifier errors. Here we’ve got the class plotted against the predicted class. The square boxes represent errors. If I click on one of these, I can of course change the different axes if I want. I can change the x-axis and the y-axis. But I’m going to go back to “class” and “predictedclass”.
Skip to 5 minutes and 0 seconds If I click on one of these boxes, I can see where the errors are. There are two instances where the predicted class is versicolor and the actual class is virginica. We can see these in the confusion matrix. The actual class is virginica, and the predicted class is versicolor, that’s ‘b’. This “2” entry in the confusion matrix is represented by these two instances here. If I look at another point, say this one, here I’ve got one instance, which is in fact a setosa, predicted to be a versicolor. I can look at this plot and find out where the misclassifications are actually occurring, the errors in the confusion matrix.
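The link between the error plot and the confusion matrix can be shown with a small Python sketch. Each off-diagonal entry of the matrix counts the instances whose actual class (row) differs from the predicted class (column), and those are the same instances that appear as boxes in the plot. The function names and the seven toy labels below are assumptions for illustration; they are not the output of J48 on the Iris data.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def misclassified(actual, predicted):
    """Indices of the instances the classifier got wrong."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]


labels = ["setosa", "versicolor", "virginica"]

# Toy labels for illustration only.
actual = ["setosa", "setosa", "versicolor", "versicolor",
          "virginica", "virginica", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "versicolor",
             "versicolor", "versicolor", "virginica"]

for row in confusion_matrix(actual, predicted, labels):
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [0, 2, 1]

print(misclassified(actual, predicted))
# [1, 4, 5]
```

The "2" in the last row corresponds to instances 4 and 5, which are actually virginica but predicted as versicolor, mirroring the kind of entry pointed out in the video.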
Skip to 5 minutes and 54 seconds Get down and dirty with your data: visualize it. You can do all sorts of things. You can clean it up, detect outliers. You can look at the classification errors. For example, there’s a filter that allows you to add the classifications as a new attribute. Let’s just go and have a look at that. I’m going to go and find a filter. We’re going to add an attribute. It’s supervised because it uses the “class”.
Skip to 6 minutes and 18 seconds Add an attribute: AddClassification. Here I get to choose in the configuration panel the machine learning scheme. I’m going to choose J48, of course, and I’m going to output the classification – make that “true”. That’s configured it, and I’m going to apply it. It will add a new attribute. It’s done it, and this attribute is the classification according to J48. Weka is very powerful. You can do all sorts of things with classifiers and filters.
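Conceptually, a filter like this runs a classifier over every instance and stores the prediction as an extra attribute. The Python below is only a sketch of that idea: `add_classification` is a hypothetical helper, not Weka's filter, and `toy_rule` is a hand-written threshold rule with invented cut-offs standing in for a trained J48 tree. The instances are made-up values.

```python
def add_classification(instances, classify, name="classification"):
    """Return new instances, each with one extra attribute holding the
    class that `classify` predicts for it. The originals are not modified."""
    return [{**inst, name: classify(inst)} for inst in instances]


def toy_rule(inst):
    """Hypothetical stand-in for a trained classifier: thresholds on
    petalwidth chosen by hand for this example, not learned by J48."""
    if inst["petalwidth"] <= 0.6:
        return "Iris-setosa"
    if inst["petalwidth"] <= 1.7:
        return "Iris-versicolor"
    return "Iris-virginica"


# Made-up instances for illustration only.
instances = [
    {"petalwidth": 0.2, "class": "Iris-setosa"},
    {"petalwidth": 1.4, "class": "Iris-versicolor"},
    {"petalwidth": 1.6, "class": "Iris-virginica"},
    {"petalwidth": 2.1, "class": "Iris-virginica"},
]

augmented = add_classification(instances, toy_rule)

for inst in augmented:
    marker = "" if inst["class"] == inst["classification"] else "  <- error"
    print(inst["class"], "->", inst["classification"] + marker)
```

Once the prediction is an ordinary attribute, it can be plotted, filtered, or compared with the true class like any other column.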
Visualizing your data
For successful data mining you must “know your data”: examine it in detail in every possible way. Weka’s Visualize panel lets you look at a dataset and select different attributes – preferably numeric ones – for the x- and y-axes. Instances are shown as points, with different colors for different classes. You can sweep out a rectangle and focus the dataset on the points inside it. You can also apply a classifier and visualize the errors it makes by plotting the “class” against the “predicted class”.
© University of Waikato, New Zealand. CC Creative Commons Attribution 4.0 International License.