
Similarity and Correlation

How can we visualize similarity? Learn how to use correlation as a tool when dealing with data.
This is a correlation matrix. The darker the colour, the higher the correlation. Each cell shows the correlation coefficient between a pair of variables.

In this article, we discuss a useful tool for measuring similarity: correlation.

Look at the colorful square above. It is a visualization of how closely related the attributes are. In this example, we are using the Wisconsin cancer dataset. Along the vertical and horizontal axes, you will see all the attributes we are considering.

The colors in that square indicate the strength of the correlations. The darker the color, the higher the correlation between the two variables. I hope you remember what correlation is: a measure of how closely two variables are linearly related. Let me put this in simpler words. When one variable increases (or decreases) and the other has a strong tendency to increase (or decrease) with it, the two have a strong positive correlation. When one increases (or decreases) and the other has a strong tendency to decrease (or increase), the two have a strong negative correlation.
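To make the idea of positive and negative correlation concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The function name `pearson` and the three small lists are made up for illustration; they are not taken from the Wisconsin dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return co_movement / sqrt(spread_x * spread_y)


# Hypothetical illustrative data
a = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]    # tends to increase when a increases
falling = [9, 7, 6, 4, 1]   # tends to decrease when a increases

r_pos = pearson(a, rising)
r_neg = pearson(a, falling)
print(f"a vs rising:  r = {r_pos:.2f}")   # close to +1: strong positive correlation
print(f"a vs falling: r = {r_neg:.2f}")   # close to -1: strong negative correlation
```

The coefficient always lies between -1 and 1. Values near 0 mean there is little linear relationship between the two variables.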

Before looking at the square with thirty variables, let’s take the first step with a much simpler one. Look at the square below. This square is also a correlation plot, but with only two variables for simplicity: A and B.

(A,A): Correlation between A and A

(A,B): Correlation between A and B

(B,A): Correlation between B and A = correlation between A and B

(B,B): Correlation between B and B

Notice that this is a 2 x 2 table. Each cell shows the correlation between the corresponding pair of variables. The cells on the diagonal hold the correlation of a variable with itself, which is always the strongest possible correlation. The two off-diagonal cells, (A,B) and (B,A), hold the same value.
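The 2 x 2 table can be sketched in a few lines of plain Python. The helper names `pearson` and `correlation_matrix` and the values for A and B are hypothetical, chosen only to show the layout of the table.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)


def correlation_matrix(columns):
    """Return (names, matrix); matrix[i][j] is the correlation of names[i] with names[j]."""
    names = list(columns)
    matrix = [[pearson(columns[row], columns[col]) for col in names] for row in names]
    return names, matrix


# Hypothetical values for the two variables A and B
data = {
    "A": [2.0, 3.5, 4.0, 6.5, 8.0],
    "B": [1.0, 2.0, 2.5, 5.0, 4.5],
}

names, matrix = correlation_matrix(data)
print("   " + "      ".join(names))
for name, row in zip(names, matrix):
    print(name, "  ".join(f"{value:5.2f}" for value in row))
```

In the printed table, the diagonal cells (A,A) and (B,B) are 1, and the cells (A,B) and (B,A) show the same number, so the table is symmetric.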

Now, look at our original correlation plot. Each cell again shows the correlation between the corresponding pair of variables. Of course, the correlation between diagnosis and diagnosis itself is 100% (a coefficient of 1). The same holds for all diagonal entries in the square, which is why they have the darkest purple. All other cells indicate how closely related the corresponding two attributes are. Darker colors mean they are more closely related to each other.
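A plot of this kind could be reproduced roughly as follows. This is only a sketch under several assumptions: it uses the copy of the Wisconsin breast cancer data that ships with scikit-learn, it requires pandas and matplotlib to be installed, and the function names, the "Purples" color map, and the choice to color by absolute value are mine, not necessarily what was used for the figure in this article.

```python
def wisconsin_correlations():
    """Return the pairwise Pearson correlations as a pandas DataFrame."""
    # Imported inside the function so the file loads even without scikit-learn.
    from sklearn.datasets import load_breast_cancer

    frame = load_breast_cancer(as_frame=True).frame
    # scikit-learn names the label column "target" (0 = malignant, 1 = benign).
    # It is renamed here to match the "diagnosis" attribute in the article.
    frame = frame.rename(columns={"target": "diagnosis"})
    return frame.corr()


def plot_correlations(corr):
    """Draw a correlation matrix as a colored square."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 9))
    # Assumption: color by absolute value, so that strong negative
    # correlations appear as dark as strong positive ones.
    image = ax.imshow(corr.abs(), cmap="Purples", vmin=0, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax, label="absolute correlation")
    fig.tight_layout()
    plt.show()
```

Calling `plot_correlations(wisconsin_correlations())` should draw a square with one row and one column per attribute, with the darkest cells on the diagonal. Because scikit-learn encodes malignant as 0, the diagnosis column correlates negatively with many measurements, which is why the sketch colors cells by absolute value.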

With this in mind, look at the whole square again. You can probably see lots of dark colors: many of the attributes are closely related to each other. This makes sense, since they are all factors related to cancer; if a case has been diagnosed as malignant, the other attributes are highly likely to tell a similar story. The correlation plot shows us how closely related the attributes are. Please do not confuse correlation with causation, though. Two attributes that are strongly correlated do not necessarily cause one another.

One thing to remember is that this kind of plot is very helpful in understanding the attributes.

 

This article is from the free online

Artificial Intelligence and Machine Learning for Business

Created by
FutureLearn - Learning For Life
