
PCA and tSNE in Python

An article with example code demonstrating how to perform PCA and tSNE in Python using scikit-learn.

In this article we will demonstrate how to perform the two methods of dimensionality reduction we discussed previously: PCA and tSNE.

Both examples use scikit-learn and its built in datasets. For the PCA example we will use the Iris data we’ve looked at several times previously, while the tSNE example uses a set of image data known as ‘digits’.

Digits is a reduced copy of an image dataset depicting handwritten numbers, just 8×8 pixels in size, useful for demonstrating and trying out image classification techniques. For more information on the digits dataset, and links to other datasets, see the UCI machine learning repository.

As you read through the following article and code, we would encourage you to try it out on your own system. If you have time, perhaps try PCA with the digits dataset, and vice versa.

PCA

The first thing we need to do is to load the data in the usual way:

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data
print(X.shape)
(150, 4)

We can see from the print out above that we have 150 data items, each with four features.

The next thing to do before performing PCA is to standardise the data using the StandardScaler class:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)

X_stand = scaler.transform(X)
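As a quick sanity check (not part of the original code), each feature of the standardised data should now have a mean of approximately zero and a standard deviation of one:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
scaler = StandardScaler().fit(X)
X_stand = scaler.transform(X)

# after standardisation each feature has mean ~0 and standard deviation ~1
print(np.round(X_stand.mean(axis=0), 6))
print(np.round(X_stand.std(axis=0), 6))
```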

To find all the principal components we can just use the PCA() function without any input arguments:

from sklearn.decomposition import PCA

pca = PCA() # set up the analysis

pca.fit(X_stand) # actually run the PCA on the data

print(pca.explained_variance_ratio_)
[0.72962445 0.22850762 0.03668922 0.00517871]

Once we’ve run the PCA we can view the explained variance ratio with the attribute .explained_variance_ratio_. As we can see above, the first two components account for more than 95% of the variance in the data.
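One way to see this is to take a running total of the explained variance ratios, for example with NumPy's cumsum (a quick sketch, not part of the original code):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_stand = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X_stand)

# cumulative variance explained as components are added, one entry per PC
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)
```

The second entry gives the total variance explained by the first two components together, and the final entry should be 1, since all the components together account for all the variance.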

So to get just the first two PCs we can set n_components = 2 when we initialise the PCA:

pca = PCA(n_components = 2)

pca.fit(X_stand) # run the PCA again with n_components = 2

print(pca.components_)
[[ 0.52106591 -0.26934744 0.5804131 0.56485654]
[ 0.37741762 0.92329566 0.02449161 0.06694199]]

The attribute .components_ gives the directions (in 4D space in this case) of the two principal components. You can think of these as the directions in the original 4D feature space of the x and y axes that we will use to plot our transformed data.

Before we can plot the data in 2D though we need to transform it using the transform() function:

X_pc = pca.transform(X_stand)
print(X_pc.shape)
(150, 2)

We can see from the print out that we now have our 150 data points in two dimensions, as we wanted.
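Under the hood, for PCA the transform() function is just a linear projection: subtract the mean that the PCA learned during fitting, then take the dot product with the component directions. We can verify this ourselves (a quick sketch, assuming the default whiten=False):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_stand = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_stand)

# project the mean-centred data onto the two component directions
X_manual = (X_stand - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual, pca.transform(X_stand)))  # True
```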

Finally we can plot the data in the usual way using Matplotlib. We can still use the species values from the original data (iris.target) to colour each data point according to species:

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
scatter = plt.scatter(X_pc[:,0],X_pc[:,1],c=iris.target)
legend = ax.legend(*scatter.legend_elements(),loc="lower right")
ax.add_artist(legend)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.show()

[Figure: a scatter plot with axes labelled 'Principal component 1' and 'Principal component 2'. A cluster of purple points (labelled '0' in the legend) sits to the left, with two slightly overlapping green and yellow clusters (labelled '1' and '2') to the right.]

While there is some overlap between two of the species clusters, three distinct clusters are clearly visible.

tSNE

For our tSNE example we will use the digits image dataset. The digits data is imported in exactly the same way as the Iris data, except you use load_digits() rather than load_iris():

from sklearn import datasets

digits = datasets.load_digits()

X=digits.data
print(X.shape)
(1797, 64)

We can see that the data is 1797 examples each with 64 features. These are grayscale values, one for each pixel in an 8×8 image array. The data in digits.data has been flattened for convenience during further analysis, but we can access the individual images using digits.images.
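We can confirm the relationship between the flattened data and the image arrays directly: each 64-element row of digits.data is just the corresponding 8×8 array from digits.images unravelled (a quick check, not part of the original code):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# each row of digits.data is the matching 8x8 image flattened to 64 values
print(digits.images[0].shape)  # (8, 8)
print(np.array_equal(digits.data[0], digits.images[0].ravel()))  # True
```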

Let’s plot a few examples using the matshow() function in Matplotlib:

import matplotlib.pyplot as plt

plt.set_cmap('binary') # set the colourmap to grayscale white to black

for i in range(10):
    for j in range(10):
        ax = plt.subplot(10,10,1+10*i+j)
        ax.axis('off')
        ax.matshow(digits.images[10*i+j])

plt.show()
plt.close()

[Figure: a ten-by-ten grid of images of hand-drawn digits from 0 to 9; the first three rows are in sequential order, while the remaining rows appear to be in random order. The digits are just legible, with the individual pixels clearly visible.]

To perform tSNE with the default settings we can just import TSNE and use its fit_transform() function, which gives us the transformed data in two dimensions rather than the original 64:

from sklearn.manifold import TSNE

X_embedded = TSNE().fit_transform(X)

print(X_embedded.shape)
(1797, 2)

So what does this new two-dimensional data look like?

We can plot this in the usual way using Matplotlib, using the original target values from digits to colour each class:

plt.set_cmap('tab10')

fig, ax = plt.subplots()
scatter = ax.scatter(X_embedded[:,0],X_embedded[:,1],c=digits.target)

legend = ax.legend(*scatter.legend_elements(),loc="upper left")
ax.add_artist(legend)

plt.show()
plt.close()

[Figure: a scatter plot of circular data points coloured by digit class, with the digits 0 to 9 shown in the legend. Most colour classes form distinct clusters, though some are more diffuse and a few clusters contain points belonging to other classes.]

Hopefully if you run this code yourself you should see something similar to the plot above. What it shows is a 2D representation of a 64-dimensional dataset containing images. While not perfect for every image, there are clearly ten distinct clusters, one for each of the digits 0 to 9. Remember, the target values are only used to colour the plot; they play no part in the analysis itself.

As well as showing the differences between digit classes, tSNE can also reveal similarities: for example, there seems to be some overlap between ones and eights, while fours, sixes and zeros appear mostly distinct, with a few exceptions.

These are just demos of how to use PCA and tSNE in scikit-learn. Every problem and every dataset is different, so you may need to experiment depending on what you are hoping to achieve with your analysis. For full details see the links to the documentation below.
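For instance, two TSNE parameters worth experimenting with are perplexity (roughly, the number of near neighbours considered for each point) and random_state (tSNE is stochastic, so fixing the seed makes your runs repeatable). The values below are illustrative choices, not recommendations:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data

# perplexity=30 and random_state=0 are illustrative values; try others
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)  # (1797, 2)
```

With the seed fixed, repeated runs will produce the same embedding, which makes it easier to compare the effect of changing perplexity.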

PCA: sklearn.decomposition.PCA documentation

tSNE: sklearn.manifold.TSNE documentation

This article is from the free online course Machine Learning for Image Data, created by FutureLearn.
