In the last lesson we looked at some different clustering algorithms, and each of them had a different evaluation metric. SimpleKMeans reported the total squared distance of each instance from its cluster center. That’s not necessarily a good way of evaluating clustering, and it certainly makes it difficult to compare the results of different clustering algorithms. One thing we can do in Weka is to visualize the clusters. Over here in Weka I’ve got the iris data open. I’ve got here the SimpleKMeans method with 3 clusters selected, and I’m going to run that.
On the right-click menu, I’m going to visualize the cluster assignments. Here they are. This would make most sense if we plot the cluster against the instance number.
Remember the iris data: the first 50 instances are one kind of iris, and the next 50 are another, and the third 50 are another. Well, this looks too good to be true. Here the first 50 are in one cluster, the second 50 are in another cluster, and the third 50 are in another cluster. And in data mining, if things look too good to be true, they probably are. The problem here, when you think about it, is that one of the attributes is the “class”, and it’s not really fair to include the class when we’re doing the clustering. On the clustering panel we can ignore attributes. I’m going to ignore the “class” attribute and try again.
Now I’ve got 61 instances in one cluster, 50 in another, and 39 in another. If I visualize the cluster assignments and choose the cluster here, I get a different picture. You can see that the first cluster looks pretty good, but there are some errors here: some green points have crept into the second group. For the last 50 items of the dataset, which all belong to one class of iris, we’ve got a whole bunch of instances coming in from another [class]. That’s not looking so good. How do you tell which instances are in which cluster? To do that, there’s an unsupervised attribute filter called “AddCluster”. In this filter, we can specify a clusterer.
Here we specified SimpleKMeans, and I’ll choose 3 clusters again. I’m going to apply the filter, and that’s going to add a new attribute. Let’s do this. You can see that we’ve got a new attribute. It’s called “cluster”, attribute 6. If we edit this dataset, we can have a look at the values for the last attribute and compare them with the class. This is an unsupervised filter, so the class was not used when running the filter. The clustering is done just on the basis of the first four attributes. You can see that the iris-setosas are all in cluster 2. The next lot of irises, versicolors, are mostly in cluster 1 – there are a couple of cluster 3’s here.
The third lot, the iris-virginicas, are mostly in cluster 3, but there are quite a lot of cluster 1’s. That’s just exactly what we saw when we visualized the cluster assignments before. Coming back to the slide, we’ve looked at the Visualize cluster assignments on the Cluster panel. We’ve learned how to ignore attributes. Typically the class attribute is a good one to ignore if you’ve got a dataset with a class. Then we’ve looked at a filter, the AddCluster unsupervised attribute filter. We looked at the result of that and how you can add a new attribute which gives a cluster number, and then look at which instances have got which cluster by clicking the Edit button.
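The comparison we just did by eye in the Edit window can also be tallied up. Here is a minimal sketch in plain Python, not Weka code. It assumes the last two columns of the filtered dataset have been copied out as (class, cluster) pairs; the handful of rows shown is made up for illustration and is not the real iris assignment.

```python
from collections import Counter

# Hypothetical (class, cluster) pairs, as they might be read off the last
# two columns after applying AddCluster. Illustrative values only.
rows = [
    ("Iris-setosa", "cluster2"),
    ("Iris-setosa", "cluster2"),
    ("Iris-versicolor", "cluster1"),
    ("Iris-versicolor", "cluster3"),
    ("Iris-virginica", "cluster3"),
    ("Iris-virginica", "cluster1"),
    ("Iris-virginica", "cluster3"),
]

def crosstab(pairs):
    """Count how many instances of each class fall in each cluster."""
    return Counter(pairs)

counts = crosstab(rows)
for (cls, cluster), n in sorted(counts.items()):
    print(f"{cls:16s} {cluster}: {n}")
```

A table like this is the same information the classes-to-clusters evaluation reports, just built by hand.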
One way of evaluating clusters in Weka is called “classes-to-clusters evaluation”. I’m going to go back to the iris data and do a classes-to-clusters evaluation. (Let me get rid of this.) I’m going to undo the filter we just did to get the original iris data back. I’m going to go to my Cluster panel, click “Classes to clusters evaluation”, and run that. Now I see I’ve got my 3 classes. There are 3 clusters, and you can see how many of each class were assigned to which cluster. You can see there are 17 incorrectly clustered instances. We’ll have a look at that in a minute, but first let me go and use the EM algorithm and see how that does.
Again, I’m going to specify 3 clusters, and I’m going to run that. I get a similar kind of thing here. Back on the slide, this is the result I saw for SimpleKMeans with 3 clusters. You can see that the majority in cluster 0 is this 47 here. That’s versicolor. So we’re going to assign versicolor to cluster 0. The majority in cluster 1, that’s the second column, are the setosas – that 50 there in the second column, the column labeled 1. The final column, there’s a 36 there, so the majority class is virginica. That’s where we get the 17 incorrectly clustered instances from. EM does quite a lot better here.
We only get 14 incorrectly clustered instances, or 9% of the dataset. That’s a classes-to-clusters evaluation. There’s a meta-classifier called ClassificationViaClustering. It works by ignoring the classes, clustering the data, assigning to each cluster its most frequent class, and that’s a classifier. It’s very similar to what we just did, but we can evaluate it like we evaluate classifiers. Let’s get back to Weka. I’m going to go to Classify, and in my meta list I’m going to choose ClassificationViaClustering. I’m going to stick to SimpleKMeans with 3 clusters. Now if I evaluate that on the training set, that’s exactly what we just did on the clustering panel. Let me start that. Here I get exactly the same matrix as I just looked at.
As you can see, there are 17 errors here. That’s evaluating on the training set. Of course, there are the 17 errors up there. We know we shouldn’t be evaluating on the training set.
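To see where the 17 comes from, here is a small sketch of the majority rule described above, in plain Python rather than Weka. The matrix is reconstructed from the numbers quoted in the lesson (47, 50 and 36, with cluster sizes 61, 50 and 39). Where the remaining counts sit is an assumption that fits those numbers; it is not copied from Weka's output.

```python
# Rows are classes, columns are clusters 0, 1, 2.
# Reconstructed from figures quoted in the lesson; the placement of the
# off-majority counts (14 and 3) is an assumption consistent with them.
matrix = {
    "Iris-setosa":     [0, 50, 0],
    "Iris-versicolor": [47, 0, 3],
    "Iris-virginica":  [14, 0, 36],
}

def classes_to_clusters(matrix):
    """Label each cluster with its majority class; count the rest as errors."""
    n_clusters = len(next(iter(matrix.values())))
    labels, errors = {}, 0
    for c in range(n_clusters):
        column = {cls: counts[c] for cls, counts in matrix.items()}
        majority = max(column, key=column.get)
        labels[c] = majority
        errors += sum(column.values()) - column[majority]
    return labels, errors

labels, errors = classes_to_clusters(matrix)
total = sum(sum(counts) for counts in matrix.values())
print(labels)
print(f"{errors} of {total} incorrectly clustered")
```

Every instance that is not in its cluster's majority class counts as an error, which is how the columns add up to 17.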
We’re going to use cross-validation, which is going to do the usual thing: take 90%, form a clustering, form a classification based on that clustering, and then see how well that does on the held-out 10% of the dataset.
In this case, I get slightly worse results, as I would expect. I’ve got 19 errors, or an 84% success rate. That’s ClassificationViaClustering. Of course, I could choose different clusterers and build classifiers based on them. It’s a very good way of comparing clusterers. It’s hard to evaluate clustering. SimpleKMeans, for instance, uses within-cluster sum of squared errors, but really clustering should be evaluated with respect to a particular application.
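For reference, the within-cluster sum of squared errors is simple to compute. This is a minimal sketch in plain Python, assuming plain Euclidean distance and a few made-up two-attribute points; it is not Weka's implementation and these are not the iris measurements.

```python
def within_cluster_sse(points, assignments, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        center = centers[cluster]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

# Hypothetical data: four points, two clusters, centers at the cluster means.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
assignments = [0, 0, 1, 1]
centers = {0: (1.0, 2.0), 1: (6.0, 5.0)}

sse = within_cluster_sse(points, assignments, centers)
print(sse)
```

A number like this depends on the scale of the attributes and the number of clusters, which is part of why it is hard to use for comparing different clusterers.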
Visualization helps: it lets you see what’s happening to your data. The AddCluster filter allows you to see which instances are in each cluster, which is often useful. The classes-to-clusters evaluation gives you a way of looking at the clusters, but, in effect, it uses the entire dataset. Counting the incorrectly assigned instances based on a classification made from the entire dataset risks overfitting; you should never evaluate on the training set. Classification via clustering uses the same kind of technique to produce a classifier that can then be evaluated in different ways, for example, 10-fold cross-validation, which is what we just did.
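The cross-validated version of the idea can be sketched in a few lines. This is plain Python under loud assumptions: a tiny one-attribute k-means written for the example (not SimpleKMeans), a made-up dataset with three well-separated classes, and 3 folds instead of 10 to keep it short. The function names are invented for the sketch.

```python
from collections import Counter

def nearest(value, centers):
    """Index of the center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

def kmeans_1d(values, k, iterations=10):
    """Tiny 1-D k-means with a deterministic start (evenly spaced sorted values)."""
    ordered = sorted(values)
    centers = [ordered[int((j + 0.5) * len(ordered) / k)] for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centers)].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def cross_validate(data, k_clusters, folds):
    """Cluster the training part without labels, name each cluster after its
    majority class, then score that mapping on the held-out part."""
    correct = 0
    for f in range(folds):
        train = [d for i, d in enumerate(data) if i % folds != f]
        test = [d for i, d in enumerate(data) if i % folds == f]
        centers = kmeans_1d([x for x, _ in train], k_clusters)
        votes = [Counter() for _ in centers]
        for x, label in train:
            votes[nearest(x, centers)][label] += 1
        # An empty cluster falls back to the overall majority class.
        fallback = Counter(label for _, label in train).most_common(1)[0][0]
        names = [v.most_common(1)[0][0] if v else fallback for v in votes]
        correct += sum(names[nearest(x, centers)] == label for x, label in test)
    return correct / len(data)

# Hypothetical one-attribute dataset with three well-separated classes.
data = ([(v, "a") for v in (1.0, 1.2, 0.8, 1.1, 0.9, 1.3)]
        + [(v, "b") for v in (5.0, 5.2, 4.8, 5.1, 4.9, 5.3)]
        + [(v, "c") for v in (9.0, 9.2, 8.8, 9.1, 8.9, 9.3)])

accuracy = cross_validate(data, k_clusters=3, folds=3)
print(accuracy)
```

The class labels are only used after the clustering is built, to name the clusters and to score the held-out instances, which is what keeps the evaluation honest.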