# PCA and t-SNE

An article describing Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour Embedding (t-SNE) for dimensionality reduction.

In the previous video we discussed dimensionality reduction, both what it is and why it might be useful.

In this article we will look at two methods of dimensionality reduction in more detail: PCA and t-SNE.

## PCA

Principal Component Analysis or PCA is perhaps the most common method used to reduce the number of dimensions in a dataset. What it effectively does is calculate a new coordinate system for the dataset, where the axes of the new coordinate system point in the directions which account for the most variance in the data. These new axes remain at right angles (orthogonal) to one another and are the principal components after which the analysis is named.

For example, imagine your dataset has two features, which you can plot using a set of (x, y) axes. The two principal components are then the two new directions, at right angles to each other in 2D space, that account for most of the variation in the data. In other words, the principal components point in the direction of the most obvious trends in the data.
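To see these directions concretely, here is a minimal sketch using scikit-learn's `PCA` on a small hypothetical 2D dataset (the data itself is made up for illustration): the fitted `components_` attribute holds one unit-length direction per principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2D dataset: points scattered around the line y = 2x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

pca = PCA(n_components=2).fit(data)

# Each row of components_ is a principal component: a unit-length
# direction in the original feature space
print(pca.components_)
```

The first row should point roughly along the y = 2x trend, and the second at right angles to it.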

To reduce the number of dimensions from two to one we need to transform the data into this new coordinate space and discard the coordinate along the second principal component. In the figure above, think of it as flattening all the points to their nearest points on the line going through the data in the direction of the first principal component.
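This flattening can be sketched in code, again on a made-up 2D dataset: asking `PCA` for a single component reduces the data from 2D to 1D, and `inverse_transform` maps those 1D coordinates back into the original space, placing every point on the line through the first principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2D data with a strong linear trend
rng = np.random.default_rng(1)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

# Keep only the first principal component: 2D -> 1D
pca = PCA(n_components=1)
reduced = pca.fit_transform(data)           # shape (200, 1)

# Map the 1D coordinates back into 2D: this is the "flattening onto
# the line" described above
flattened = pca.inverse_transform(reduced)  # shape (200, 2)
```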

Usually, when you perform PCA there are many more than two dimensions, but the idea is the same. The data is transformed into a new coordinate system, and the most important principal components are chosen as a new representation of the dataset.

Along with the directions of the principal components (mathematically these are referred to as eigenvectors), we also get their magnitudes, referred to as eigenvalues. Adding up all the eigenvalues and dividing each one by this total gives the fraction of the total variance accounted for by each principal component.

A common way to select the number of principal components to use is to pick a threshold for the explained variance above which no further components are included. For example, you might have a dataset with twelve variables, perform PCA, and find that the first four principal components account for 90% of the variance. By using just these four components your dataset has then been reduced from twelve dimensions to four.
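The twelve-variable example above can be sketched as follows, using made-up data generated from four underlying factors so that a few components dominate; the cumulative sum of the explained variance ratios tells us how many components are needed to cross the 90% threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 12-feature dataset driven by 4 underlying factors
rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 4))
mixing = rng.normal(size=(4, 12))
data = latent @ mixing + 0.1 * rng.normal(size=(300, 12))

pca = PCA().fit(data)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained
# variance reaches 90%
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components)
```

As a shortcut, scikit-learn lets you pass the threshold directly: `PCA(n_components=0.90)` keeps however many components are needed to explain 90% of the variance.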

Before performing any PCA it’s important to standardise your data by subtracting each feature’s mean and dividing by its standard deviation (see the article on Regularisation).
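In scikit-learn this standardisation is handled by `StandardScaler`, which can be chained with `PCA` in a pipeline so the scaling is always applied first. A minimal sketch on a hypothetical dataset whose two features sit on wildly different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Hypothetical features on very different scales
rng = np.random.default_rng(4)
data = np.column_stack([rng.normal(scale=1000, size=100),
                        rng.normal(scale=0.01, size=100)])

# StandardScaler subtracts each feature's mean and divides by its
# standard deviation before PCA sees the data
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
transformed = pipeline.fit_transform(data)
```

Without the scaler, the first feature's huge variance would dominate the principal components regardless of any trend in the data.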

## t-SNE

Another method for dimensionality reduction, sometimes useful for data visualisation, is t-distributed stochastic neighbour embedding or t-SNE.

Rather than reducing the data to however many dimensions account for most of the variance, as PCA does (potentially more than 2 or 3), t-SNE reduces a dataset to two, or occasionally three, dimensions specifically for the purposes of data visualisation.
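As a taste of the scikit-learn interface, here is a minimal sketch embedding a subset of the built-in handwritten digits dataset (64 features per image) into 2D; the subset size and `random_state` are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional digit images, reduced to 2D for plotting;
# a subset keeps the run fast for illustration
digits = load_digits()
subset = digits.data[:200]
embedding = TSNE(n_components=2, random_state=0).fit_transform(subset)
print(embedding.shape)
```

Each row of `embedding` is a 2D point that can be scattered-plotted, typically coloured by the digit label to reveal clusters.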

A common application of this is in visualisation of multi-dimensional data that appears in clusters, such as in genomics, or image classification. The exact details of how it works are beyond the scope of this course but we will give a quick demonstration of how to use it using scikit-learn, along with PCA, in the following article. For more details on the algorithm itself, check out the link below.