Skip main navigation

Representing clusters

With clustering, there's no "class" attribute. Instead, as Ian Witten explains, the aim is to divide instances into natural groups.
Hello again! We’re going to talk about clustering, and I’m going to show you some different clustering methods.
With clustering, there’s no “class” attribute: we’re just trying to divide the instances into natural groups, or “clusters”. For instance, imagine the iris dataset that we looked at in the last course. Imagine deleting the class attribute. Of course (you might remember) there are 3 kinds of iris, and in the iris dataset there
are 50 of each kind: iris setosas, iris versicolors, and iris virginicas. The dataset gives the petal length, and petal width, and so on.
If you deleted the “class” attribute, the question is: could you recover the 3 classes by clustering the data? You’ll be trying that in the activity [quiz] after this lesson. The different kinds of clustering algorithms produce different sorts of representations of clusters. One way of thinking about clusters is to imagine disjoint sets. We take the instance space and divide it into sets such that each part of the instance space is in just one cluster. Or the clusters might overlap, as shown in the second picture. You might have overlapping clusters. If you have overlapping clusters, you might have probabilistic assignment of instances. So a, b, c, d, e, are instances and there are 3 clusters here.
Instance a has 0.4 probability of belonging to cluster 1, and 0.1 probability for cluster 2, and 0.5 probability for 3. Fourthly, we might have a hierarchical clustering method. Here the instances are along the bottom and a and g get clustered together at the bottom level. In fact, you can see the clusters at the bottom level. Then these clusters join together at the next level up, and so on, until at the very top level all the dataset is just one big cluster. It’s called a dendrogram, that kind of tree.
The first algorithm we’re going to look at is called KMeans: it does iterative distance-based clustering.
First of all you specify the desired number of clusters: we call that k. Then the algorithm chooses k points at random as cluster centers. It assigns all the instances in the dataset to their closest cluster center. Then it takes each cluster and calculates the centroid of the instances in it – that’s the mean of all of the instances. These centroids are new cluster centers. It goes back to the beginning and carries on until the clusters centers don’t change. At some point when you re-calculate the cluster centers you get just the same numbers you had before. This algorithm minimizes the total squared distance from instances to their cluster centers. Unfortunately, it’s a local minimum, not a global minimum.
You get different results with different random number seeds.
We’re going to look at it in Weka: we’re going to look at KMeans clustering. I’ve opened the weather dataset, and on the Cluster panel I’m going to open SimpleKMeans.
It’s got some parameters here: the number of clusters, we’ve set this for 2 clusters; we can have different distance functions; and the random number seed here. Let’s just run it. Here we get 2 clusters. One’s got 9 instances and the other’s got 5 instances. The total squared error is 16.2 – that’s what we’re trying to minimize. The thing is that if we run this with a different random number seed, say 11, then we’re going to get different clusters. There are 6 instances in one cluster and 8 in another cluster. Here the total squared error is 13.6. If we were to do it again with another seed, let’s say 12, we get a different clustering again.
Going back to the slide, you can see that for each different random number seed we get a different clustering, and that doesn’t seem like a very good thing. Although maybe it’s the dataset. Maybe with a proper dataset we might get better results, more consistent results. But in KMeans, it’s always dependent on the initial assignment of the cluster centers, the initial choice of cluster centers. XMeans, also in Weka, is an extended version of KMeans. It selects the number of clusters itself, whereas with KMeans you’ve got to specify that. For XMeans, you can specify a minimum and a maximum for this number. It uses kD-trees, which are sophisticated data structures, to make it operate pretty quickly.
Unfortunately, though, XMeans cannot handle nominal attributes. Let’s look at another method. EM is a probabilistic clustering method. It stands for Expectation Maximization. I’m going to go to the cluster panel and choose EM. There are some parameters here. We’ve got here the number of clusters. That’s set to –1; that means EM will try to determine the number of clusters itself. I’m going to set that to 2. Then I’m going to run EM. Here I get 2 clusters. In fact, these are the clusters. For each of the attribute values, I get kind of a probability that the “outlook” is going to be “sunny” and “overcast” and “rainy”, in each of the clusters.
We get the probability by dividing this by the total here. Given those probabilities, if we had a new instance we could calculate for it the probabilities for each cluster. Actually, EM uses as an overall quality measure, a thing called the “log likelihood”. Back to the slide, we’ve got two clusters with these prior probabilities. Within each cluster, we’ve got the probability of each value for a nominal attribute; and for numeric attributes we’ve got the mean and standard deviation. Let’s look at another, final clustering method. This is a hierarchical clustering method called Cobweb. Back in Weka, let me just run Cobweb. It’s got some rather magic parameters here. I’m going to choose 0.3.
It’s a bit of a black art, actually, using Cobweb. I’m going to run this. I’ll get a tree, which I can visualize – on the right-click method I can visualize this tree. Going back to the slide, this is the tree that we get for the weather data, with 10 clusters. You can see these clusters at the bottom level, and then these clusters at the level above, and one cluster at the very top. That’s clustering. In clustering, there’s no class value. There are different representations of clusters, and different algorithms produce different kinds of representation. KMeans is the simplest, standard clustering method. It’s an iterative, distance-based method. It can take different distance metrics.
We were using the Euclidean distance, but you can select different distances metrics in KMeans and XMeans. It’s really hard to evaluate clustering, and we’re going to be looking at that in the next lesson.

With clustering, there’s no “class” attribute: we’re just trying to divide the instances into natural groups or “clusters”. There are different ways of representing clusters. Are they disjoint, or can they overlap? Is cluster membership precisely determined, or probabilistic? Perhaps a tree structure in which clusters are repeatedly refined into smaller ones is appropriate? Different algorithms produce different representations. In any case, it’s hard to evaluate clustering, because, lacking a class value, you can’t compute the accuracy of any particular clustering.

This article is from the free online

More Data Mining with Weka

Created by
FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education