Skip to 0 minutes and 11 seconds Hello again! We’re going to talk about clustering, and I’m going to show you some different clustering methods.
Skip to 0 minutes and 18 seconds With clustering, there’s no “class” attribute: we’re just trying to divide the instances into natural groups, or “clusters”. For instance, imagine the iris dataset that we looked at in the last course. Imagine deleting the class attribute. Of course (you might remember) there are 3 kinds of iris, and in the iris dataset there
Skip to 0 minutes and 37 seconds are 50 of each kind: iris setosas, iris versicolors, and iris virginicas. The dataset gives the petal length, and petal width, and so on.
Skip to 0 minutes and 46 seconds If you deleted the “class” attribute, the question is: could you recover the 3 classes by clustering the data? You’ll be trying that in the activity [quiz] after this lesson. The different kinds of clustering algorithms produce different sorts of representations of clusters. One way of thinking about clusters is to imagine disjoint sets. We take the instance space and divide it into sets such that each part of the instance space is in just one cluster. Or the clusters might overlap, as shown in the second picture. You might have overlapping clusters. If you have overlapping clusters, you might have probabilistic assignment of instances. So a, b, c, d, e, are instances and there are 3 clusters here.
Skip to 1 minute and 32 seconds Instance a has 0.4 probability of belonging to cluster 1, and 0.1 probability for cluster 2, and 0.5 probability for 3. Fourthly, we might have a hierarchical clustering method. Here the instances are along the bottom and a and g get clustered together at the bottom level. In fact, you can see the clusters at the bottom level. Then these clusters join together at the next level up, and so on, until at the very top level all the dataset is just one big cluster. It’s called a dendrogram, that kind of tree.
Skip to 2 minutes and 4 seconds The first algorithm we’re going to look at is called KMeans: it does iterative distance-based clustering.
Skip to 2 minutes and 10 seconds First of all you specify the desired number of clusters: we call that k. Then the algorithm chooses k points at random as cluster centers. It assigns all the instances in the dataset to their closest cluster center. Then it takes each cluster and calculates the centroid of the instances in it – that’s the mean of all of the instances. These centroids are new cluster centers. It goes back to the beginning and carries on until the clusters centers don’t change. At some point when you re-calculate the cluster centers you get just the same numbers you had before. This algorithm minimizes the total squared distance from instances to their cluster centers. Unfortunately, it’s a local minimum, not a global minimum.
Skip to 2 minutes and 57 seconds You get different results with different random number seeds.
Skip to 3 minutes and 0 seconds We’re going to look at it in Weka: we’re going to look at KMeans clustering. I’ve opened the weather dataset, and on the Cluster panel I’m going to open SimpleKMeans.
Skip to 3 minutes and 11 seconds It’s got some parameters here: the number of clusters, we’ve set this for 2 clusters; we can have different distance functions; and the random number seed here. Let’s just run it. Here we get 2 clusters. One’s got 9 instances and the other’s got 5 instances. The total squared error is 16.2 – that’s what we’re trying to minimize. The thing is that if we run this with a different random number seed, say 11, then we’re going to get different clusters. There are 6 instances in one cluster and 8 in another cluster. Here the total squared error is 13.6. If we were to do it again with another seed, let’s say 12, we get a different clustering again.
Skip to 4 minutes and 4 seconds Going back to the slide, you can see that for each different random number seed we get a different clustering, and that doesn’t seem like a very good thing. Although maybe it’s the dataset. Maybe with a proper dataset we might get better results, more consistent results. But in KMeans, it’s always dependent on the initial assignment of the cluster centers, the initial choice of cluster centers. XMeans, also in Weka, is an extended version of KMeans. It selects the number of clusters itself, whereas with KMeans you’ve got to specify that. For XMeans, you can specify a minimum and a maximum for this number. It uses kD-trees, which are sophisticated data structures, to make it operate pretty quickly.
Skip to 4 minutes and 50 seconds Unfortunately, though, XMeans cannot handle nominal attributes. Let’s look at another method. EM is a probabilistic clustering method. It stands for Expectation Maximization. I’m going to go to the cluster panel and choose EM. There are some parameters here. We’ve got here the number of clusters. That’s set to –1; that means EM will try to determine the number of clusters itself. I’m going to set that to 2. Then I’m going to run EM. Here I get 2 clusters. In fact, these are the clusters. For each of the attribute values, I get kind of a probability that the “outlook” is going to be “sunny” and “overcast” and “rainy”, in each of the clusters.
Skip to 5 minutes and 43 seconds We get the probability by dividing this by the total here. Given those probabilities, if we had a new instance we could calculate for it the probabilities for each cluster. Actually, EM uses as an overall quality measure, a thing called the “log likelihood”. Back to the slide, we’ve got two clusters with these prior probabilities. Within each cluster, we’ve got the probability of each value for a nominal attribute; and for numeric attributes we’ve got the mean and standard deviation. Let’s look at another, final clustering method. This is a hierarchical clustering method called Cobweb. Back in Weka, let me just run Cobweb. It’s got some rather magic parameters here. I’m going to choose 0.3.
Skip to 6 minutes and 34 seconds It’s a bit of a black art, actually, using Cobweb. I’m going to run this. I’ll get a tree, which I can visualize – on the right-click method I can visualize this tree. Going back to the slide, this is the tree that we get for the weather data, with 10 clusters. You can see these clusters at the bottom level, and then these clusters at the level above, and one cluster at the very top. That’s clustering. In clustering, there’s no class value. There are different representations of clusters, and different algorithms produce different kinds of representation. KMeans is the simplest, standard clustering method. It’s an iterative, distance-based method. It can take different distance metrics.
Skip to 7 minutes and 27 seconds We were using the Euclidean distance, but you can select different distances metrics in KMeans and XMeans. It’s really hard to evaluate clustering, and we’re going to be looking at that in the next lesson.
With clustering, there’s no “class” attribute: we’re just trying to divide the instances into natural groups or “clusters”. There are different ways of representing clusters. Are they disjoint, or can they overlap? Is cluster membership precisely determined, or probabilistic? Perhaps a tree structure in which clusters are repeatedly refined into smaller ones is appropriate? Different algorithms produce different representations. In any case, it’s hard to evaluate clustering, because, lacking a class value, you can’t compute the accuracy of any particular clustering.
© University of Waikato, New Zealand. CC Creative Commons Attribution 4.0 International License.