
Supervised vs. unsupervised learning

Supervised learning methods mainly deal with regression and classification problems, while a typical unsupervised learning method is clustering.
Do you care about your data? Do you care about the future? Do you want to understand well what is happening behind the data, and even make good predictions for the future? This is the main task of statistical learning, which can be divided into two areas: supervised and unsupervised learning. In this lecture we will explain both of them. Statistical learning is a strong area of statistics, aiming to reveal hidden relations between the data instances or the variables that we are measuring.

A classical example of statistical learning is regression, especially linear regression. Suppose we have a dataset of songs (say, audio clips or music notations), and we represent each song with two variables describing its complexity: unigram and bigram entropy. (We omit the details of how these are measured.) A scatter plot reveals that there is a hidden relation, in fact a linear relation, between these two variables. A natural question therefore arises: what is the relation between them? More precisely, since the dots on the scatter plot lie roughly along a line, we want to compute the best line representing this relation. Another example of statistical learning is classification.
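Before moving on: fitting the best line in the regression example above is an ordinary least-squares problem. A minimal sketch in Python, with made-up entropy values rather than the course's song data:

```python
import numpy as np

# Hypothetical unigram/bigram entropy values for a handful of songs
# (illustrative numbers, not real measurements).
unigram = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
bigram = np.array([2.1, 3.0, 4.2, 4.9, 6.1])

# Ordinary least squares: find slope a and intercept b minimising
# the sum of squared residuals sum((bigram - (a*unigram + b))**2).
a, b = np.polyfit(unigram, bigram, deg=1)
print(f"best line: bigram = {a:.2f} * unigram + {b:.2f}")
```

With real data one would also inspect the residuals; the point here is only that the "best line" minimises the sum of squared vertical distances to the dots.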
Suppose that for the song dataset introduced above we have an additional variable: song popularity, i.e. whether a song is popular or not. We can again draw a scatter plot, now with two colours: blue dots for popular songs and red dots for unpopular ones. The scatter plot naturally suggests a question: can we predict popularity from the unigram and bigram complexity? More precisely, can we compute a line that separates the red dots from the blue dots well?
Summing up: in supervised learning we have a set of variables X, called predictors (or features), and one response variable, denoted Y. We also have measurements of these variables on a so-called training set. The main goal is to find a mathematical function that relates the values of the predictors X to the response Y and fits the measurements well. As mentioned, classical examples are regression and classification problems.
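A separating line of the kind just described can be computed in many ways; one classical option (not necessarily the method used in the course) is the perceptron rule. A minimal sketch on made-up, clearly separable data:

```python
import numpy as np

# Made-up (unigram, bigram) entropies; +1 = popular, -1 = unpopular.
X = np.array([[1.0, 1.0], [1.5, 1.5], [2.0, 2.0],    # unpopular
              [2.0, 5.0], [2.5, 5.5], [3.0, 6.0]])   # popular
y = np.array([-1, -1, -1, 1, 1, 1])

# Perceptron rule: whenever a point is on the wrong side of the line
# w[0]*x1 + w[1]*x2 + w[2] = 0, nudge the weights towards that point.
Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
w = np.zeros(3)
for _ in range(200):                        # separable data -> converges
    mistakes = 0
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:              # misclassified (or on the line)
            w += yi * xi
            mistakes += 1
    if mistakes == 0:
        break

print("separated correctly:", np.array_equal(np.sign(Xb @ w), y))
```

Because the two colours here are linearly separable, the loop is guaranteed to stop with a line that puts every red dot on one side and every blue dot on the other.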
Another example of statistical learning, now unsupervised, is clustering. Suppose we consider the dataset of all Slovenian scientists who have published at least one paper in the period 1970-2015. We consider their collaboration in the years 1970, 1980, 1990 and 2010, and we visualise this collaboration with a network map. The dots represent groups of scientists that collaborate, i.e. publish joint papers. We call such dots clusters, or communities. Cluster (community) detection is the main task of cluster analysis; a later lecture will be devoted to it. Once we have detected the communities, we need an explanation for them. For example, which groups of scientists collaborate most?
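A standard clustering method (not necessarily the one used on the collaboration data) is k-means. A minimal sketch on two made-up, well-separated groups of 2D points:

```python
import numpy as np

# Two hypothetical, well-separated groups of points in 2D
# (stand-ins for, say, layout coordinates of collaborating scientists).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[5, 5], scale=0.3, size=(20, 2))
X = np.vstack([group_a, group_b])

# Minimal k-means with k=2: alternately assign each point to its
# nearest centre and recompute the centres, until assignments settle.
centres = X[[0, -1]].copy()              # start from one point per group
for _ in range(50):
    dists = np.linalg.norm(X[:, None] - centres[None, :], axis=2)
    labels = dists.argmin(axis=1)
    new = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new, centres):
        break
    centres = new

print("cluster sizes:", np.bincount(labels))
```

Here k = 2 is given in advance; in real community detection the number of clusters is itself unknown, which is part of what makes the problem hard.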
In our case it turns out that these are scientists from the same institute or the same scientific field. But note that here we do not have any valid grouping available, so we cannot check how well we have detected the clusters. Another example of unsupervised learning is dimension reduction. Consider the cancer data: we have measurements of several variables (features) describing each patient, and we want to visualise these data as well as possible in two dimensions. A 3D diagram is not very descriptive, and if we take an arbitrary pair of features, the resulting diagram is not very descriptive either. But if we choose an appropriate 2D space, the resulting diagram clearly suggests two clusters, probably corresponding to patients with and without cancer.
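Such an "appropriate 2D space" can be found with principal component analysis. A minimal sketch on made-up patient data (5 features, two hidden groups), using the SVD of the centred data matrix:

```python
import numpy as np

# Hypothetical high-dimensional "patient" data: two latent groups that
# differ along one hidden direction, embedded in 5 features.
rng = np.random.default_rng(1)
shift = np.array([3.0, 3.0, 3.0, 3.0, 3.0])
patients = np.vstack([rng.normal(0, 0.5, (25, 5)),
                      rng.normal(0, 0.5, (25, 5)) + shift])

# PCA via SVD of the centred data: the top right-singular vectors give
# the directions of largest variance; project onto the first two.
Xc = patients - patients.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords_2d = Xc @ Vt[:2].T                # each patient as a 2D point
print("reduced shape:", coords_2d.shape)
```

Plotting `coords_2d` would show the two groups as two clearly separated clouds, exactly the effect described above for the cancer data.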
Summing up: unsupervised learning is needed when we have a collection of measurements of a given list of statistical variables and we want to reveal hidden groups of similar data instances; this is known as the clustering problem. We may also want to find a few new variables that enable a better low-dimensional visualisation or a more compact representation of the data; in this case we talk about dimension reduction and may use, for example, principal component analysis or factor analysis.
There is a clear distinction between supervised and unsupervised learning. In supervised learning we know the ground truth, at least on the training set: it is encoded in the response variable Y. Therefore different approaches can be evaluated and compared. In unsupervised learning this is not the case: we have no universal measure for comparing different clustering solutions, so these methods are more prone to subjectivity.
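This evaluability of supervised learning can be made concrete: given ground-truth labels, any predictor can be scored, for instance by accuracy on held-out data. A tiny sketch with hypothetical labels and predictions:

```python
import numpy as np

# Hypothetical held-out popularity labels and a model's predictions.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

# Accuracy: the fraction of predictions matching the ground truth Y.
accuracy = (y_true == y_pred).mean()
print(f"accuracy: {accuracy:.2f}")   # 5 of 6 correct
```

No such universal score exists for clustering, which is exactly the asymmetry described above.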

In this video we explain what supervised and what unsupervised learning are. We present a few illustrative examples and list classical methods from both families: regression, classification and clustering.

This article is from the free online course Managing Big Data with R and Hadoop.

