
Similarity and Correlation

How can we visualize similarity? Learn how to use correlation as a tool when dealing with data.
This is a correlation matrix. The darker the colour, the higher the correlation. Each cell shows the correlation coefficient between a pair of variables.

In this article, we discuss a useful tool for measuring how closely variables are related: correlation.

Look at the colorful square above. It is a visualization of how closely related the attributes of a dataset are. In this example, we are using the Wisconsin cancer dataset. Along the vertical and horizontal axes, you will see all the attributes we are considering.

The colors in that square indicate the strength of the correlations: the darker the color, the higher the correlation between the two variables. As a reminder, correlation is a measure of how closely two variables are linearly related. In simpler words: when one variable increases (or decreases) and the other has a strong tendency to increase (or decrease) with it, the two have a strong positive correlation. When one increases (or decreases) and the other has a strong tendency to decrease (or increase), the two have a strong negative correlation.
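To make the definition concrete, here is a minimal sketch of the Pearson correlation coefficient, the usual measure of linear relationship, written in plain Python. The function name `pearson` and the toy data (`hours`, `rises`, `falls`) are made up for illustration and do not come from the cancer dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data, for illustration only.
hours = [1, 2, 3, 4, 5]
rises = [2, 4, 5, 4, 6]  # tends to increase as hours increases
falls = [9, 7, 6, 4, 1]  # tends to decrease as hours increases

r_pos = pearson(hours, rises)  # positive coefficient
r_neg = pearson(hours, falls)  # negative coefficient
print(round(r_pos, 3), round(r_neg, 3))
```

The coefficient always lies between -1 and 1. Values near 1 indicate a strong positive correlation, values near -1 a strong negative one, and values near 0 little linear relationship.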

Before looking at the square with thirty variables, let’s take a first step with a much simpler one. Look at the square below. This square is also a correlation plot, but for simplicity it has only two variables: A and B.

(A,A): Correlation between A and A

(A,B): Correlation between A and B

(B,A): Correlation between B and A = correlation between A and B

(B,B): Correlation between B and B

Notice that this is a 2 x 2 table. Each cell shows the correlation between the corresponding pair of variables. The cells on the diagonal show the correlation of a variable with itself, which is always the strongest possible correlation.
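The 2 x 2 table can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names `pearson` and `correlation_matrix` and the values chosen for A and B are assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build a table with one cell for every (row variable, column variable) pair."""
    names = list(columns)
    return {
        row: {col: pearson(columns[row], columns[col]) for col in names}
        for row in names
    }

# Hypothetical values for the two variables A and B.
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 1.0, 4.0, 3.0, 5.0],
}

matrix = correlation_matrix(data)
for row in data:
    print(row, [round(matrix[row][col], 2) for col in data])
```

The (A,A) and (B,B) cells come out as 1, and the (A,B) and (B,A) cells hold the same value, so the table is symmetric about its diagonal.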

Now, look at our original correlation plot. Each cell shows the correlation between the corresponding pair of variables. Of course, the correlation between diagnosis and itself is 100% (a coefficient of 1). The same holds for every diagonal entry in the square, so those cells have the darkest purple. All other cells indicate how closely related the corresponding two attributes are. Darker colors mean they are more closely related to each other.
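The idea of turning coefficients into shades can be sketched with text characters instead of colors. This is only a rough illustration: the variable names `x1` to `x3`, the coefficients in `corr`, and the five-step character scale are invented for the example, and the mapping assumes, as described above, that a higher coefficient gets a darker shade.

```python
# Characters ordered from lightest to darkest (an arbitrary five-step scale).
SHADES = " .:*#"

def shade(r):
    """Map a coefficient in [-1, 1] to a character; higher values are darker."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    # Rescale [-1, 1] to [0, 1], then pick one of the shade buckets.
    index = int((r + 1.0) / 2.0 * len(SHADES))
    return SHADES[min(index, len(SHADES) - 1)]

# Hypothetical correlation matrix for three made-up variables.
names = ["x1", "x2", "x3"]
corr = [
    [1.0, 0.8, -0.5],
    [0.8, 1.0, 0.1],
    [-0.5, 0.1, 1.0],
]

rows = ["".join(shade(r) for r in row) for row in corr]
for name, row in zip(names, rows):
    print(name, "|" + row + "|")
```

Every diagonal cell gets the darkest character, just as the diagonal of the plot has the darkest purple.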

With this in mind, look at the whole square again. You can probably see lots of dark colors, which means many of the attributes are closely related to each other. This makes sense, since they are all factors related to cancer: if a case has been diagnosed as malignant, the other factors are highly likely to point the same way. The correlation plot shows us how closely related the attributes are. Please do not confuse correlation with causation, though: two attributes being correlated does not necessarily mean that one causes the other.

One thing to remember is that this kind of plot is very helpful for understanding how the attributes in a dataset relate to each other.


This article is from the free online

Artificial Intelligence and Machine Learning for Business

Created by
FutureLearn - Learning For Life
