# PCA and tSNE in Python

An article with example code demonstrating how to perform PCA and tSNE in Python with scikit-learn.

In this article we will demonstrate how to perform the two methods of dimensionality reduction we discussed previously: PCA and tSNE.

Both examples use scikit-learn and its built-in datasets. For the PCA example we will use the Iris data we’ve looked at several times previously, while the tSNE example uses a set of image data known as ‘digits’.

Digits is a reduced copy of an image dataset depicting handwritten numbers, just 8×8 pixels in size, useful for demonstrating and trying out image classification techniques. For more information on the digits dataset, and links to other datasets, see the UCI machine learning repository.

As you read through the following article and code, we would encourage you to try it out on your own system. If you have time, perhaps try PCA with the digits dataset, and vice versa.

## PCA

The first thing we need to do is to load the data in the usual way:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
print(X.shape)
```

(150, 4)

We can see from the print out above that we have 150 data items, each with four features.
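If you want to see what those four features actually are, the object returned by load_iris() also carries a feature_names attribute:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```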

The next thing to do before performing PCA is to standardise the data using the StandardScaler() function:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
X_stand = scaler.transform(X)
```
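As a quick sanity check (not a required step), we can confirm that after scaling each feature has mean zero and unit standard deviation. Note that StandardScaler uses the population standard deviation (ddof = 0), which matches NumPy’s default:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_stand = StandardScaler().fit(X).transform(X)

# Each feature should now have mean ~0 and standard deviation 1
print(np.allclose(X_stand.mean(axis=0), 0))  # True
print(np.allclose(X_stand.std(axis=0), 1))   # True
```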

To find all the principal components we can just use the PCA() function without any input arguments:

```python
from sklearn.decomposition import PCA

pca = PCA()       # set up the analysis
pca.fit(X_stand)  # actually run the PCA on the data
print(pca.explained_variance_ratio_)
```

[0.72962445 0.22850762 0.03668922 0.00517871]

Once we’ve run the PCA we can view the explained variance ratio with the attribute .explained_variance_ratio_. As we can see above, the first two components account for more than 95% of the variance in the data.
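A convenient way to see this is with a running total: np.cumsum() gives the cumulative variance explained by the first k components, which is useful when deciding how many components to keep.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_stand = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X_stand)

# Cumulative variance explained by the first k components
print(np.cumsum(pca.explained_variance_ratio_))
# The second entry (first two components) is roughly 0.958
```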

So to get just the first two PCs we can set n_components = 2 when we initialise the PCA:

```python
pca = PCA(n_components=2)
pca.fit(X_stand)  # run the PCA again with n_components = 2
print(pca.components_)
```

[[ 0.52106591 -0.26934744  0.5804131   0.56485654]
 [ 0.37741762  0.92329566  0.02449161  0.06694199]]

The attribute .components_ gives the directions (in 4D space in this case) of the two principal components. You can think of these as the directions in the original 4D feature space of the x and y axes that we will use to plot our transformed data.

Before we can plot the data in 2D though we need to transform it using the transform() function:

```python
X_pc = pca.transform(X_stand)
print(X_pc.shape)
```

(150, 2)

We can see from the print out that we now have our 150 data points in two dimensions, as we wanted.
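Under the hood (with the default whiten=False), transform() just centres the data and projects it onto the component directions with a matrix product. This isn’t something you need to do yourself, but it’s a useful check that the components really are the axes of the new space:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_stand = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_stand)
X_pc = pca.transform(X_stand)

# Centre the data, then project onto the two component directions
X_manual = (X_stand - pca.mean_) @ pca.components_.T
print(np.allclose(X_pc, X_manual))  # True
```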

Finally we can plot the data in the usual way using Matplotlib. We can still use the species values from the original data (iris.target) to colour each data point according to species:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
scatter = plt.scatter(X_pc[:, 0], X_pc[:, 1], c=iris.target)
legend = ax.legend(*scatter.legend_elements(), loc="lower right")
ax.add_artist(legend)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.show()
```

While there is some overlap between two of the species clusters, three distinct clusters are clearly visible.

## tSNE

For our tSNE example we will use the digits image dataset. The digits data is imported in exactly the same way as the Iris data, except you use load_digits() rather than load_iris():

```python
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
print(X.shape)
```

(1797, 64)

We can see that the data is 1797 examples each with 64 features. These are grayscale values, one for each pixel in an 8×8 image array. The data in digits.data has been flattened for convenience during further analysis, but we can access the individual images using digits.images.
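We can confirm that relationship directly: each row of digits.data is simply the corresponding 8×8 array from digits.images, flattened into a vector of 64 values.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# The image form is 8x8; the data form is the same values flattened
print(digits.images[0].shape)  # (8, 8)
print(np.array_equal(digits.data[0], digits.images[0].ravel()))  # True
```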

Let’s plot a few examples using the matshow() function in Matplotlib:

```python
import matplotlib.pyplot as plt

plt.set_cmap('binary')  # set the colourmap to grayscale, white to black
for i in range(10):
    for j in range(10):
        ax = plt.subplot(10, 10, 1 + 10*i + j)
        ax.axis('off')
        ax.matshow(digits.images[10*i + j])
plt.show()
plt.close()
```

To perform tSNE with the default settings we can simply import TSNE and call its fit_transform() function, which returns the data transformed into two dimensions rather than the original 64:

```python
from sklearn.manifold import TSNE

X_embedded = TSNE().fit_transform(X)
print(X_embedded.shape)
```

(1797, 2)

So what does this new two-dimensional data look like?

We can plot this in the usual way using Matplotlib, using the original target values from digits to colour each class:

```python
plt.set_cmap('tab10')
fig, ax = plt.subplots()
scatter = ax.scatter(X_embedded[:, 0], X_embedded[:, 1], c=digits.target)
legend = ax.legend(*scatter.legend_elements(), loc="upper left")
ax.add_artist(legend)
plt.show()
plt.close()
```

Hopefully if you run this code yourself you should see something similar to the plot above. What it shows is a 2D representation of a 64-dimensional dataset containing images. While not perfect for every image, there are clearly ten distinct clusters, one for each of the digits 0 to 9. Remember, the target data is only used to colour the plot, not to perform the analysis itself.

As well as showing the differences between digit classes, tSNE can also reveal similarities: for example, there seems to be some overlap between ones and eights, while fours, sixes and zeros appear mostly distinct, with a few exceptions.

These are just demos of how to use PCA and tSNE in Scikit-Learn. Every problem and every dataset is different, so you may need to experiment with the settings depending on what you are hoping to achieve with your analysis. For full details see the scikit-learn documentation for each function.
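As a starting point for that experimentation, here is a minimal sketch of trying different tSNE settings. The perplexity values and the use of a 300-image subset are illustrative choices for speed, not recommendations; perplexity roughly controls the number of effective neighbours each point considers, and random_state fixes the otherwise random layout:

```python
from sklearn import datasets
from sklearn.manifold import TSNE

# A small subset of the digits data keeps run times short while experimenting
X = datasets.load_digits().data[:300]

# Each perplexity gives a different layout of the same data
for perplexity in (5, 30):
    X_emb = TSNE(perplexity=perplexity, random_state=0).fit_transform(X)
    print(perplexity, X_emb.shape)
```

Re-running with different perplexity values (and comparing the resulting scatter plots) is a good way to get a feel for how sensitive tSNE layouts are to this parameter.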
