A brief summary of the difference between classification, clustering and regression in machine learning

The previous video, ‘Machine Learning Tasks’, introduced the main types of problem to which we will apply machine learning in this course.

This article provides a quick review of the following, looking at what they do and how they differ:

• classification – binary and multiclass
• clustering
• regression.

## Classification

Classification is any task where you know your data falls into distinct categories or classes, and you have some data giving known examples of each class.

If there are just two categories it is binary classification, while if there are more than two categories it is multiclass classification.

A common form of binary classification is where the two classes represent yes and no, or true versus false. For example, ‘is this email spam or is it not spam?’, or ‘is there a flower in this image or not?’.

On the other hand, in multiclass classification there are more than two categories, and so your question cannot be reduced to a yes or no answer. For example, ‘what species is this flower?’

Since you have known examples of each category in your dataset, classification is an example of supervised learning. Generally you will use your training data to train a machine learning model, which you can then use to predict which class new, unlabelled data belongs to.
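To make the idea concrete, here is a minimal sketch of a multiclass classifier in plain Python. It uses a nearest-neighbour rule, which is just one simple choice and not a method prescribed by this course: a new flower is given the label of the most similar labelled example. The measurements, the species names and the function name `predict_class` are all invented for illustration.

```python
import math

# Hypothetical labelled training data, invented for illustration:
# (petal length in cm, petal width in cm) paired with a known species label.
training_data = [
    ((1.4, 0.2), "species_a"),
    ((1.5, 0.3), "species_a"),
    ((4.5, 1.5), "species_b"),
    ((4.7, 1.4), "species_b"),
    ((6.0, 2.5), "species_c"),
    ((5.8, 2.2), "species_c"),
]


def predict_class(features, examples):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(
        examples, key=lambda example: math.dist(features, example[0])
    )
    return nearest_label


# A new, unlabelled flower: the model predicts which class it belongs to.
new_flower = (4.6, 1.6)
predicted = predict_class(new_flower, training_data)
print(predicted)
```

Because the training data contains three labels, this is multiclass classification. With only two labels (for example ‘spam’ and ‘not spam’) the same code would perform binary classification.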

## Clustering

With clustering, in contrast to classification, while you may suspect (or even be pretty sure) your data belongs in distinct categories, you don’t actually have that information in your dataset.

Instead, the aim with clustering is to use machine learning to find patterns in your data, specifically by grouping items of data that seem alike into distinct clusters. For example, you might have a set of images of flowers, but you don’t know for sure what species any of the flowers are. Clustering would then find the images that are most alike in the dataset and place them in distinct clusters.

Often, as in the case of K-means clustering, you will need to set the number of clusters you are looking for before running the algorithm. So you might ask K-means clustering to divide your set of photos into three groups. However, without some form of manual identification and expert knowledge you can’t know for sure whether they have been divided correctly since that information is not in your dataset.

Because the algorithm has no labelled examples to learn from, clustering is a form of unsupervised learning.
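The following is a minimal sketch of K-means in plain Python, asked to find three clusters. The points are invented for illustration, and the starting centroids are passed in explicitly (here, simply the first three points) so that the run is repeatable. That is an assumption of this sketch: practical implementations usually choose starting centroids randomly, and the result can depend on that choice.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Group `points` into len(initial_centroids) clusters with basic K-means."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            for point in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the algorithm has converged.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centroids


# Hypothetical unlabelled data: two numeric features per item, no class labels.
points = [
    (1.0, 1.1), (5.0, 5.2), (9.0, 1.0),
    (1.2, 0.9), (5.1, 4.9), (9.2, 1.1),
    (0.8, 1.0), (4.9, 5.0), (8.8, 0.9),
]

cluster_ids, final_centroids = k_means(points, initial_centroids=points[:3])
print(cluster_ids)
```

Note that the output is only a cluster number for each point. Nothing in the data says what each cluster means, or whether the grouping matches the true categories.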

## Regression

Regression is a supervised machine learning technique. As with classification, your training data contains known values of some specific feature, and you want to train a model that can predict that feature for new data in the future.

Unlike classification, however, the value you wish to predict isn’t a class or a discrete category; it is a number in a continuous range. For example, this could be the yield of a crop, or the length of a plant root, or any other feature that can be represented by a number. The idea is that the model uses the other features in the dataset to predict one particular feature of interest.

A common example of regression is linear regression, the simplest version of which fits a straight line that predicts the value of some output variable (e.g. crop yield) from a single input variable (e.g. irrigation). In reality, of course, you might have data for many input variables, all of which could be incorporated into a linear regression model.
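The single-input case can be sketched in a few lines of plain Python using the standard least-squares formulas for a straight line. The irrigation and yield figures, their units and the function name `fit_line` are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: irrigation (mm per week) and
# crop yield (tonnes per hectare).
irrigation = [10.0, 20.0, 30.0, 40.0, 50.0]
crop_yield = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(irrigation, crop_yield)

# Predict a continuous value for an irrigation level not in the training data.
predicted_yield = slope * 35.0 + intercept
print(f"yield = {slope:.3f} * irrigation + {intercept:.3f}")
print(f"predicted yield at 35 mm: {predicted_yield:.3f}")
```

The prediction is a number on a continuous scale, not a class label, which is what distinguishes regression from classification.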

As well as linear regression, there are many other types of regression models, all of which can be used to predict continuous output.

Examples of machine learning models and algorithms that can be used for regression include:

• linear regression
• support vector machines
• decision trees.