# What is Data Clustering?

Have you ever watched the movie “Titanic”? It is one of the world’s most famous films, winning 11 Academy Awards!

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

## The Data

We are going to attempt to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using Titanic passenger data. Let’s take a look at the data. We have 891 rows, each holding an individual passenger’s information. The dataset is available in CSV format from our lecture materials and from the original source.

In our dataset, we have six attributes to describe the passengers. Those are passenger class, gender, age, the number of siblings, the number of parents and children aboard, and the port of embarkation. Now, let’s assume we do not know whether they survived or not, although our dataset has the survival information.
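Since Euclidean distance only works on numbers, the categorical attributes have to be encoded numerically before we can cluster. Here is a minimal sketch of that step using pandas; the column names and the four toy rows are illustrative assumptions, not the exact schema of the dataset.

```python
import pandas as pd

# A toy sample standing in for the Titanic data; column names and
# values are illustrative assumptions, not the real dataset.
df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 2],           # passenger class
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0],
    "SibSp":    [1, 1, 0, 0],           # siblings aboard
    "Parch":    [0, 0, 0, 0],           # parents/children aboard
    "Embarked": ["S", "C", "S", "S"],   # port of embarkation
})

# Map the categorical attributes to numeric codes so that
# Euclidean distance is defined over all six attributes.
features = df.copy()
features["Sex"] = features["Sex"].map({"male": 0, "female": 1})
features["Embarked"] = features["Embarked"].map({"S": 0, "C": 1, "Q": 2})
```

Note that treating such codes as coordinates is a simplification: it implicitly assigns distances between categories, which is something to keep in mind when interpreting the clusters.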

Let me remind you about our problem. We would like to know whether we can group people based on their information. We hope those groups can be “survivor” and “non-survivor” groups. But we do not know the answer. Does this ring a bell? Yes, I am trying to set the problem for unsupervised learning. And we will use clustering to solve this problem.

## What is Data Clustering?

What is clustering? Let’s assume we have the data points A to F. Are you familiar with the Euclidean distance? Using it, you can see that A and C are closest to each other, and likewise that B and E are nearest to each other. So we form one group of A and C and another group of B and E. Now let’s check D. D is closer to the group of B and E than to A and C, but it is not as close to B and E as they are to each other. So we add another layer of grouping on top of the B-and-E group. While doing this, we have to keep two things in mind: within a group, we include only the points that are closest to one another, and at the same time, we keep the different groups far apart from each other. This methodology is called ‘clustering’.
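The grouping step above can be sketched in a few lines of Python. The figure’s coordinates are not reproduced in the text, so the 2-D positions for A to F below are an assumption chosen to match the described distances (B–E closest, A–C next closest).

```python
import numpy as np

# Hypothetical 2-D coordinates for the six points A..F; the lecture's
# figure is not available, so these positions are an assumption.
points = {
    "A": (1.0, 1.0), "B": (6.0, 5.0), "C": (1.5, 1.2),
    "D": (5.0, 3.5), "E": (6.2, 5.3), "F": (9.0, 9.0),
}

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return float(np.hypot(p[0] - q[0], p[1] - q[1]))

# Compute the distance for every pair of points, then find the
# closest pair -- that pair becomes the first group.
labels = list(points)
pairs = [(euclidean(points[a], points[b]), a, b)
         for i, a in enumerate(labels) for b in labels[i + 1:]]
closest = min(pairs)  # smallest distance first in the tuple
```

With these assumed coordinates, the closest pair is (B, E) and the next closest is (A, C), mirroring the two groups formed in the text.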

We used only the first 30 rows of data in our analysis. The map I am showing here is a Euclidean distance map: we took all six attributes and calculated the pairwise Euclidean distances. More orange indicates a closer distance, while purple to blue indicates a farther distance. The x- and y-axes each list the rows in our data, that is, the individual passengers. So I can see that the passenger labeled “No. 1” is close to the passenger labeled “No. 8,” but far from the passenger labeled “No. 14”.
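A distance map like this is just the full matrix of pairwise Euclidean distances. Here is a minimal sketch of how it can be computed with SciPy; the three numeric rows are stand-ins for encoded passenger records, not the real first rows of the dataset.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy numeric passenger rows (six encoded attributes each); values
# are illustrative, not the actual first rows of the Titanic data.
rows = np.array([
    [3, 0, 22.0, 1, 0, 0],
    [1, 1, 38.0, 1, 0, 1],
    [3, 1, 26.0, 0, 0, 0],
])

# pdist returns the condensed pairwise Euclidean distances;
# squareform expands them into the full symmetric distance map,
# which is what gets drawn as the heatmap.
dist_map = squareform(pdist(rows, metric="euclidean"))
```

The matrix is symmetric with zeros on the diagonal (each passenger is at distance zero from themselves), which is why the heatmap looks mirrored across its diagonal.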

## Hierarchical Clustering

Let’s go back to the A to F figure. B and E formed a group, and when we then considered D, the pair B and E together with D formed another layer of grouping, one that contains the group of B and E. By repeating this with the Euclidean distance, we can keep adding layers of groups. This is called “hierarchical clustering.”
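This layer-by-layer merging is exactly what agglomerative hierarchical clustering does, and SciPy implements it directly. Below is a sketch using the same assumed A-to-F coordinates as before (the figure’s actual positions are not given in the text).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed coordinates for A..F, in order.
points = np.array([
    [1.0, 1.0],   # A
    [6.0, 5.0],   # B
    [1.5, 1.2],   # C
    [5.0, 3.5],   # D
    [6.2, 5.3],   # E
    [9.0, 9.0],   # F
])

# Agglomerative clustering: repeatedly merge the two closest groups.
# "single" linkage measures group distance by the closest pair of points.
Z = linkage(points, method="single", metric="euclidean")

# Cut the tree into three flat clusters; with these coordinates the
# cut recovers {A, C}, {B, D, E}, and {F} on its own.
labels = fcluster(Z, t=3, criterion="maxclust")
```

The linkage matrix `Z` records every merge and its distance, which is what a dendrogram visualizes; `scipy.cluster.hierarchy.dendrogram(Z)` would draw the layered tree described above.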

Let’s apply hierarchical clustering to our Titanic passenger data. Our result is here. You can see that at the highest level we get two clusters: the first with 12 passengers and the second with 18. The first cluster of 12 passengers is further divided into two sub-clusters of 4 and 8 passengers. It turns out that 9 of the 12 passengers in the first cluster did not survive, that is 75%. On the other hand, 9 of the 18 passengers in the second cluster did not survive, that is 50%. Strictly speaking, we are not supposed to look at this, since we assumed we do not know the target variable. I am pointing it out simply to clarify that we cannot build a perfect clustering model that cuts out clean groups, but we can at least say that one group is more likely to contain survivors than the other.
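Checking those percentages is a simple per-cluster aggregation once the survival flags are revealed. The arrays below are hypothetical: they are constructed only so the counts match the ones quoted above (12 vs. 18 passengers, 9 non-survivors in each cluster), not taken from the real clustering output.

```python
import numpy as np

# Hypothetical cluster assignments and survival flags for 30 passengers,
# built to match the counts quoted above; NOT the real clustering result.
cluster = np.array([1] * 12 + [2] * 18)
survived = np.array(
    [0] * 9 + [1] * 3      # cluster 1: 9 non-survivors out of 12
    + [0] * 9 + [1] * 9    # cluster 2: 9 non-survivors out of 18
)

# Fraction of non-survivors within each cluster.
non_survival = {c: 1.0 - survived[cluster == c].mean() for c in (1, 2)}
```

This is how the 75% and 50% figures in the text arise: the clusters do not separate survivors perfectly, but their non-survival rates differ enough to say one group was more at risk than the other.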