# K-means clustering

An article describing the K-means clustering algorithm

As we saw in Week 1, K-means is a simple unsupervised learning clustering technique.

In this article we briefly review how it works, and give an example of using it in Python. There are more details in the following video.

## The idea

The idea behind K-means clustering is that numerical data points of a similar type lie close together in clusters. If your data set has p features, you can think of each observation as a point in p-dimensional space, and the assumption is that points of a similar type or class lie closer to one another than to members of another class. In two dimensions these are clusters in a flat plane, while in three dimensions they are clusters in 3-D space – clouds of points, if you will.
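As a quick illustration of this idea, the following sketch (a hypothetical example using NumPy; the centres and spreads are made up) generates two "clouds" of 2-D points, each scattered around a different centre:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two clouds of 2-D points (p = 2 features). Each cluster is a
# Gaussian blob around its own centre, so points of the same
# class lie close to one another.
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([cluster_a, cluster_b])

print(X.shape)  # 100 points, each with 2 features
```

Plotting such a dataset would show two distinct clouds, which is exactly the structure K-means is designed to recover.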

What K-means does is take every point in your dataset and assign it to one of K clusters. All you need to do is choose a value for K: if you think there should be three clusters, set K=3, apply K-means, and it will assign every data point to one of three clusters. For example, if you have phenotyping data for three varieties of wheat, trying K=3 would seem a sensible first choice.

## The algorithm

So how does it do this? The algorithm is relatively straightforward and can be summarised as follows.

• Step 1: Provide an initial guess for the centre of each of the K clusters. These can be picked randomly, spaced equally throughout the space, or supplied by the user.
• Step 2: Assign every data point in the dataset to the nearest cluster centre. This distance is usually measured using the Euclidean distance (i.e. the square root of the sum of squared differences between the point and the cluster centre in every dimension).
• Step 3: Update the cluster centres using the mean values of all the points assigned to each cluster during step 2. This is just done by taking the mean value of every feature in turn.
• Step 4: Repeat steps 2 and 3 until there is no change in either the cluster centres or the membership of the clusters.

And that’s it! Your data will now be in K clusters, and you also have an estimate for the centre of each cluster.
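The four steps above can be sketched directly in NumPy. This is a minimal illustration of the algorithm, not a production implementation (it assumes, for instance, that no cluster ever ends up empty, and all names are our own):

```python
import numpy as np

def kmeans(X, k, rng=None):
    """Minimal K-means: X is an (n, p) array, k the number of clusters."""
    rng = np.random.default_rng(rng)
    # Step 1: initial guess - pick k distinct data points as centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    while True:
        # Step 2: assign each point to the nearest centre
        # (Euclidean distance to every centre, then take the minimum).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: update each centre to the mean of its assigned points,
        # feature by feature (assumes no cluster is empty).
        new_centres = np.array([X[new_labels == j].mean(axis=0)
                                for j in range(k)])
        # Step 4: stop when neither assignments nor centres change.
        if (np.array_equal(new_labels, labels)
                and np.allclose(new_centres, centres)):
            return labels, centres
        labels, centres = new_labels, new_centres

# Hypothetical demo: two well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
               rng.normal([8, 8], 0.3, (50, 2))])
labels, centres = kmeans(X, k=2, rng=1)
```

On data this cleanly separated, the two returned clusters match the two blobs, and `centres` lands close to the blob centres.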

The main drawback to K-means clustering is that the result can depend on your initial estimate for the cluster centres, and it doesn’t always find the ‘best’ choice of clusters: different starting points can lead to different cluster assignments. If in doubt, it is worth running K-means a few times with different start points to check for any variation.
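One way to check this sensitivity with scikit-learn is to run a single initialisation at a time (`n_init=1`) from several different random seeds and compare the resulting fits. The dataset here is a made-up example; in practice `X` would be your own (n, p) array:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical dataset: three Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([5, 5], 0.5, (50, 2)),
               rng.normal([0, 5], 0.5, (50, 2))])

# Each run uses one random initialisation; differing inertia values
# across seeds would indicate sensitivity to the starting centres.
for seed in range(3):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)
```

If the printed inertia values (and the cluster assignments) agree across seeds, the clustering is stable for this dataset.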

## K-means in scikit-learn

K-means in scikit-learn is provided by the KMeans class in sklearn.cluster. Supposing you have a dataset X and want to split it into three clusters, you can use the following code:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3).fit(X)
print(kmeans.labels_)           # displays the clusters
print(kmeans.cluster_centers_)  # displays the cluster centres
```

By default, the algorithm is run a number of times from different starting centres, and the best result is kept, judged by a measure known as inertia: the sum of squared distances from each point to its nearest cluster centre (lower is better).
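The chosen fit's inertia is available afterwards as the `inertia_` attribute, and the number of restarts is controlled by the `n_init` parameter. A short sketch, using a made-up dataset in place of your own `X`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical dataset: three Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.4, (40, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

# n_init sets how many random initialisations are tried; the fit
# with the lowest inertia is the one that is kept.
kmeans = KMeans(n_clusters=3, n_init=10).fit(X)
print(kmeans.inertia_)  # lower inertia = tighter clusters
```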

For more on K-means in scikit-learn see the documentation.