# How to determine the optimal number of clusters: scree plots and the elbow method

Watch Ethan Carragher explain the application of k-means clustering to an example customer data set.

Our objective can now be restated in more precise language: we wish to find the value of k in our k-means clustering procedure that yields the lowest total cluster distortion. On the face of it, this should be the optimal number of clusters in our data.
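To make "total cluster distortion" concrete, here is a minimal sketch in Python, assuming the usual definition: the sum of squared Euclidean distances from each point to the centroid of its assigned cluster. The points, centroids, and assignments below are made up for illustration, not taken from the customer data:

```python
def total_distortion(points, centroids, assignments):
    """Sum of squared Euclidean distances from each 2-D point
    to the centroid of the cluster it is assigned to."""
    return sum(
        (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2
        for (x, y), c in zip(points, assignments)
    )

# Toy example: two tight pairs of points, one centroid per pair.
points = [(0, 0), (1, 1), (10, 10), (11, 11)]
centroids = [(0.5, 0.5), (10.5, 10.5)]
assignments = [0, 0, 1, 1]
print(total_distortion(points, centroids, assignments))  # → 2.0
```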

Let us see how this works in detail using our example customer data. We will need to perform the k-means clustering algorithm on this dataset for all possible values of k. Starting with k=1, the algorithm trivially groups every point into a single cluster, so the centroid is simply the average position of all the data points:

In this case the distortion comes out at a little over 176,633. This is in fact the maximum possible distortion for this dataset: with only a single centroid, the average spread of the data points around it is as large as it can be.
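The k=1 case is easy to reproduce on any dataset: the lone centroid is just the coordinate-wise mean, and the distortion is the total squared distance to that mean. A small sketch with made-up points (not the customer data from the video):

```python
def k1_distortion(points):
    """Distortion for k=1: the single centroid is the mean of all points,
    and the distortion is the sum of squared distances to it."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)

pts = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(k1_distortion(pts))  # centroid is (1, 1) → 8.0
```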

Moving on to k=2 clusters lowers the total distortion to 42,868, since there is now one centroid that is close to the points in the bottom left, and another that is near the points towards the right:

Note that there is still a reasonable spread around the rightmost centroid – by eye we can see that three clusters will work better. Indeed this is the case, as running the algorithm with k=3 gives an even lower distortion of 15,174:
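Each of these runs follows the same recipe. A self-contained sketch of the standard Lloyd's-algorithm iteration might look like the following; the toy points are an assumption for illustration, since the customer data itself is not reproduced here:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """A minimal sketch of Lloyd's algorithm for 2-D points:
    alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        assign = [
            min(range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    distortion = sum(
        (x - centroids[a][0]) ** 2 + (y - centroids[a][1]) ** 2
        for (x, y), a in zip(points, assign)
    )
    return centroids, distortion

# Three visually obvious groups of toy points:
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (20, 0), (21, 0), (20, 1)]
for k in (1, 2, 3):
    _, d = kmeans(pts, k)
    print(k, round(d, 1))
```

As in the text, the distortion shrinks as k grows from 1 to 3, since each centroid has a smaller neighbourhood of points to account for.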

Continuing on in this way with k=4 clusters, we get a distortion that is lower still at 12,567:

However, at this stage we might get suspicious of overfitting. By eye, there does not seem to be a reasonable division among the points in the top right. With k=5 clusters we get a distortion of 10,210, but again the clusters do not look particularly convincing:

This pattern of overfitting will continue as we search for ever more clusters. At the extreme end of the scale, when k equals the number of instances, every point is assigned its own cluster and the distortion is a perfect 0. Does this mean that the optimal number of clusters is simply the number of points in our dataset? Emphatically not! The whole purpose of clustering is to understand our dataset in terms of relatively few groupings of customers. We should obviously stop well before reaching such an absurd number of clusters, and so we cannot rely on distortion alone to tell us the best choice of k. There is a trade-off between low cluster distortion and unnecessary cluster complexity that we must somehow balance.
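The degenerate end point is easy to verify directly: if every point serves as its own centroid, every distance is zero. The toy points below are made up for illustration:

```python
points = [(0, 0), (2, 1), (5, 4)]

# With k equal to the number of points, each point is its own centroid,
# so every point sits at zero distance from its cluster centre.
centroids = list(points)
distortion = sum(
    (x - cx) ** 2 + (y - cy) ** 2
    for (x, y), (cx, cy) in zip(points, centroids)
)
print(distortion)  # → 0
```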

How then can we tell what the optimal number of clusters is? It helps to consider our example more closely, where intuition would tell us that the dataset is most properly described as having k=3 clusters of points. If we plot the distortion values we found as we increased k, we get the following graph, known as a scree plot:
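Using the distortion values quoted above, the shape of the scree plot can be sketched even without a plotting library; here is a crude text-mode version, where each bar's length is proportional to the distortion:

```python
# Distortion values quoted in the text for k = 1 through 5.
distortions = {1: 176633, 2: 42868, 3: 15174, 4: 12567, 5: 10210}

# Crude text-mode scree plot: bar length proportional to distortion.
max_d = max(distortions.values())
for k, d in sorted(distortions.items()):
    bar = "#" * max(1, round(40 * d / max_d))
    print(f"k={k}  {d:>7,}  {bar}")
```

The sharp kink after k=3 is visible even in this rough rendering: the bars collapse between k=1 and k=3, then barely shrink afterwards.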

Notice how the distortion decreased significantly up to k=3, and thereafter decreased by only a small amount for k=4 and k=5. The marginally lower distortions offered by four or five clusters do not seem to justify the added complexity, motivating k=3 as the correct choice. This procedure for determining k is called the elbow method on account of the shape of the scree plot: the optimal value of k occurs at the “elbow” in the graph, where the distortion stops decreasing so rapidly.
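The "stops decreasing so rapidly" criterion can also be automated. One simple heuristic (an assumption on our part — there are several ways to formalise the elbow) is to accept each extra cluster only while the relative drop in distortion stays above some threshold:

```python
# Distortion values quoted in the text for k = 1 through 5.
distortions = {1: 176633, 2: 42868, 3: 15174, 4: 12567, 5: 10210}

def elbow(distortions, threshold=0.2):
    """Pick the last k whose distortion is at least `threshold` (here 20%)
    lower, in relative terms, than the distortion at k-1.
    The threshold value is an illustrative choice, not a standard."""
    ks = sorted(distortions)
    best = ks[0]
    for prev, k in zip(ks, ks[1:]):
        drop = (distortions[prev] - distortions[k]) / distortions[prev]
        if drop > threshold:
            best = k
    return best

print(elbow(distortions))  # → 3
```

On the quoted values, the drop from k=2 to k=3 is about 65%, while the drops to k=4 and k=5 are under 20%, so the heuristic agrees with the by-eye reading of the scree plot.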

When presented with a k-means clustering analysis, you should always ask your data analytics team how they decided on the number of clusters to use. Consider the following:

• Did they give any consideration to this choice?
• Were they constrained by practical limitations, such as in the billboard example above?
• Did they guess the number of clusters by eyeballing the data, or did they make a principled choice based on cluster distortion, using the elbow method (or something similar)?

It is important to make sure the number of clusters used is optimal, so that your company gets as much benefit from the analysis as possible.