This content is taken from the Partnership for Advanced Computing in Europe (PRACE)'s online course, Managing Big Data with R and Hadoop. Join the course to learn more.

Do you care about data? Do you want to understand what is happening behind the data, and even make good predictions about the future? That is the main task of statistical learning, which can be classified into two areas: supervised and unsupervised learning. In this lecture we explain both of them.

Statistical learning is a strong area of statistics, aiming to reveal hidden relations between the data instances or the variables that we are measuring. A classical example of statistical learning is regression, especially linear regression. Suppose we have a dataset of songs – say audio clips or music notations – and we represent each song with two variables describing its complexity: unigram and bigram entropy. (We omit the details of how to measure them.) A scatter plot reveals a hidden, in fact linear, relation between these two variables.

Therefore a natural question arises: what is the relation between these two variables? More precisely, we see a linear relation between the dots on the scatter diagram, so we want to compute the best line representing this relation. Another example of statistical learning is classification.
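Returning to the regression example for a moment: the "best line" is typically computed with an ordinary least-squares fit. A minimal sketch in Python/numpy (the course itself works in R), using made-up entropy values rather than the lecture's actual song dataset:

```python
import numpy as np

# Hypothetical entropy measurements for six songs (illustrative values,
# not the dataset shown in the lecture).
unigram = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
bigram = np.array([2.1, 3.0, 4.2, 4.9, 6.1, 6.8])

# Least-squares fit of the best line: bigram ≈ a * unigram + b.
a, b = np.polyfit(unigram, bigram, deg=1)
predicted = a * unigram + b  # values of the fitted line at each song
```

The slope `a` and intercept `b` minimise the sum of squared vertical distances between the dots and the line, which is what "the best line" means in linear regression.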

Suppose that, for the song dataset introduced above, we have another variable: song popularity, i.e. whether the song is popular or not. We can again plot a scatter diagram, with two colours: blue dots for popular songs and red dots for unpopular ones. The scatter diagram naturally suggests the question: can we predict popularity from the unigram and bigram complexity? More precisely, can we compute a line that cleanly separates the red dots from the blue dots?
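One standard way to compute such a separating line is logistic regression. A minimal numpy sketch, assuming made-up, linearly separable toy data in place of the lecture's real songs:

```python
import numpy as np

# Toy data: two "entropy" features per song and a 0/1 popularity label
# (illustrative values, not the lecture's dataset).
X = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.8],   # unpopular (0)
              [3.0, 3.2], [3.5, 3.8], [4.0, 3.9]])  # popular (1)
y = np.array([0, 0, 0, 1, 1, 1])

# Logistic regression by gradient descent: learn weights w and bias b
# so that the line w·x + b = 0 separates the two classes.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)         # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# A point's predicted class is simply which side of the line it lies on.
pred = (X @ w + b > 0).astype(int)
```

After training, `w · x + b = 0` is exactly the separating line from the scatter diagram: blue dots fall on the positive side, red dots on the negative side.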

Summing up: in supervised learning we have a set of variables X that we call predictors (also features) and one response variable, which we denote by Y. We also have measurements of these variables available on a so-called training set. The main goal is to find a mathematical function that relates the values of the predictors X to the response Y and fits the measurements well. As mentioned, classical examples are regression and classification problems.

Another example of statistical learning is clustering. Suppose we consider a dataset of all Slovenian scientists who have published at least one paper in the period 1970-2015. We consider their collaboration in the years 1970, 1980, 1990 and 2010, and we visualise this collaboration with the following network map. The dots represent groups of scientists that collaborate, i.e. publish joint papers. We call such groups clusters or communities. Cluster or community detection is the main task of cluster analysis; a later lecture will be devoted to it. Once we have detected the communities, we need an explanation for them. For example: which groups of scientists collaborate most?

In our case it turns out that these are scientists from the same institute or the same scientific fields. Note, however, that here we do not have any valid grouping available, so we cannot check how well we detected the clusters. Another example of unsupervised learning is dimension reduction. Let us consider the cancer data: we have measurements of several variables (features) describing each patient, and we want to visualise these data in two dimensions as well as possible. The 3D diagram is not very descriptive. If we take an arbitrary pair of features, the resulting diagram is not very descriptive either. But if we take an appropriate 2D space, the resulting diagram clearly suggests two clusters, probably corresponding to patients with and without cancer.
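Such an "appropriate 2D space" is typically found with principal component analysis. A minimal numpy sketch on synthetic stand-in data (not the actual patient measurements from the lecture):

```python
import numpy as np

# Synthetic 5-dimensional points forming two groups, a stand-in for the
# patient measurements mentioned in the lecture.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.3, size=(20, 5))
group_b = rng.normal(loc=3.0, scale=0.3, size=(20, 5))
X = np.vstack([group_a, group_b])

# PCA via the singular value decomposition of the centred data: the top
# two right singular vectors span the 2D subspace of maximal variance.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T  # projection of every point onto that 2D space
```

Plotting the two columns of `X2` would reproduce the effect described above: the two groups, invisible in an arbitrary pair of features, separate clearly along the first principal component.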

Summing up: unsupervised learning is needed when we have a collection of measurements of a given list of statistical variables and we want to reveal hidden groups of similar data instances, which is known as the clustering problem. We may also want to find a few new variables that enable a better low-dimensional visualisation or a more compact representation of the data; in this case we talk about dimension reduction and may use, for example, principal component analysis or factor analysis.
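The clustering problem mentioned above can be illustrated with the classical k-means algorithm. A minimal sketch on toy 2D points (hypothetical data, not the collaboration network from the lecture):

```python
import numpy as np

# Toy 2D points forming two obvious groups.
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])

# k-means with k = 2: alternately assign each point to its nearest
# centre, then move each centre to the mean of its assigned points.
centres = points[[0, 3]].astype(float)  # crude initialisation
for _ in range(10):
    d = np.linalg.norm(points[:, None] - centres[None], axis=2)
    labels = d.argmin(axis=1)           # nearest-centre assignment
    for k in range(2):
        centres[k] = points[labels == k].mean(axis=0)
```

Note that, as the lecture points out, nothing here tells us whether `labels` is the "right" grouping: without a ground truth there is no universal way to validate the detected clusters.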

There is a clear distinction between supervised and unsupervised learning. For supervised learning we know the ground truth, at least on the training set: it is coded in the response variable Y. Therefore different approaches can be evaluated and compared. For unsupervised learning this is not the case: we have no universal measure for comparing different clustering solutions, so these methods are more prone to subjectivity.

Supervised vs. unsupervised learning

In this video we explain what supervised and what unsupervised learning are. We present a few demonstrative examples and list classical methods from both families: regression, classification and clustering.
