4.7

## The Open University

[0:01] Now, in the second principal component example exercise, we're going to look at a more complicated data set, one that's more like the sort of complexity you'll often find in real-life examples. So let's turn to the data and prepare this data set.

[0:18] This data set is the sonar data set. It has 60 columns of features, which are the sonar frequencies of different sonar readings. The final column, the class column, gives the type of object that was identified, whether it's a metal cylinder or a roughly cylindrical rock. What we want to be able to do, of course, is to classify a new sonar reading, on the basis of its frequency readings, as either a metal cylinder or a roughly cylindrical rock.
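A minimal sketch of this preparation step in R, assuming the mlbench package named in the exercise notes is installed. The variable names `features` and `labels` are chosen here for illustration and are not taken from the course code.

```r
# Load the Sonar data set that ships with the mlbench package.
library(mlbench)
data(Sonar)

# One row per sonar reading: 60 numeric feature columns plus the class column.
dim(Sonar)

# Separate the features from the class label (illustrative names).
features <- Sonar[, 1:60]
labels <- Sonar$Class

# How many readings fall into each class.
table(labels)
```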

[1:07] Now, we're going to do this using PCA. So we'll perform our principal component analysis. We'll make sure to scale the features, which is always important to do in PCA.
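One way this step might look in R, using the prcomp function from the base stats package. The object name `pca` is an assumption made for this sketch. Setting `scale. = TRUE` standardises each feature to unit variance before the components are computed.

```r
library(mlbench)
data(Sonar)

# PCA on the 60 feature columns only; the class column is left out.
# center = TRUE subtracts each column's mean, scale. = TRUE divides by its
# standard deviation, so no single feature dominates because of its range.
pca <- prcomp(Sonar[, 1:60], center = TRUE, scale. = TRUE)

# Standard deviation and proportion of variance for each component.
summary(pca)
```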

[1:26] Now, how should we choose the number of principal components to use? Well, fortunately we can actually see how much of the variance of the original data each principal component accounts for. In the principal component analysis object returned by the function prcomp that we're working with, this information is in the sdev field. And we can plot this field in a bar graph.
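A hedged sketch of such a bar graph. The sdev field of a prcomp result holds the standard deviation of each component, so this sketch squares it to get variances and then divides by the total to get proportions. That conversion is a choice made here; plotting sdev directly gives bars in the same order. The names `pca`, `var_explained`, and `prop_var` are illustrative.

```r
library(mlbench)
data(Sonar)

pca <- prcomp(Sonar[, 1:60], center = TRUE, scale. = TRUE)

# sdev holds standard deviations; squaring gives the variance of each component.
var_explained <- pca$sdev^2
prop_var <- var_explained / sum(var_explained)

# One bar per principal component, in the order prcomp returns them.
barplot(prop_var,
        names.arg = seq_along(prop_var),
        xlab = "Principal component",
        ylab = "Proportion of variance")
```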

[2:05] And here we go. So we see, of course, as we expected, they are ordered by the amount of variance that each accounts for. And we see the first principal component accounts for the most, but the second one isn't far behind. And then it drops suddenly, and so forth. Now, typically what a lot of people do is look at the elbow of the graph. That is to say, at what point does this graph pass through 45 degrees? And that's a good sort of rule of thumb for your first test model, to see how the model performs using a certain number of principal components. Of course, you can play around with it after that.

[2:52] So you might say, OK, we think we might use, say, the first five. Or you might look at this big drop here and say, oh, we'll use the first two. But once you've built that model, you might compare it, say, if you're using the first five, with one built from the first seven or first nine, and play around to see what number of principal components is the optimal number to use.
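The comparison described here could be sketched as follows. The video does not say which classifier is used, so this sketch assumes a logistic regression (glm), a single 70/30 train/test split, and an arbitrary seed. The helper `accuracy_with_k` is hypothetical and written only for this illustration.

```r
library(mlbench)
data(Sonar)

set.seed(1)  # arbitrary seed, only so the split is repeatable
train_idx <- sample(nrow(Sonar), size = round(0.7 * nrow(Sonar)))
train <- Sonar[train_idx, ]
test <- Sonar[-train_idx, ]

# Fit PCA on the training features only, then project both sets onto it.
pca <- prcomp(train[, 1:60], center = TRUE, scale. = TRUE)
train_pcs <- as.data.frame(pca$x)
test_pcs <- as.data.frame(predict(pca, newdata = test[, 1:60]))

# Hypothetical helper: test accuracy of a logistic regression built on the
# first k principal components (the classifier choice is an assumption).
accuracy_with_k <- function(k) {
  train_k <- data.frame(train_pcs[, 1:k, drop = FALSE], Class = train$Class)
  fit <- glm(Class ~ ., data = train_k, family = binomial)
  # glm models the probability of the second factor level.
  prob_second <- predict(fit, newdata = test_pcs[, 1:k, drop = FALSE],
                         type = "response")
  predicted <- ifelse(prob_second > 0.5,
                      levels(train$Class)[2],
                      levels(train$Class)[1])
  mean(predicted == test$Class)
}

# Candidate numbers of components mentioned in the video.
ks <- c(2, 5, 7, 9)
accuracies <- sapply(ks, accuracy_with_k)
data.frame(components = ks, accuracy = accuracies)
```

A single split gives a noisy estimate on a data set this small, so repeating the comparison over several splits or with cross-validation would give a steadier picture.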

[3:18] So that's how we can select principal components. Now we've seen that they work in the other example, and you guys should be all ready to use them in real life.

# PCA Exercise 2

The second exercise for principal components analysis. The associated code is in the PCA Ex2.R file. Interested students are encouraged to replicate what we go through in the video themselves in R, but note that this is an optional activity intended for those who want practical experience in R and machine learning.

In this exercise we look at how to apply PCA to more complicated data. We use the Sonar dataset, where there are 60 frequencies of sonar readings as features, and the target variable is whether the reading was from a metal cylinder or a roughly cylindrical rock. Our focus is on how to decide on the number of principal components to use based on the amount of variance in the data set that each principal component accounts for. We do this via a simple manual inspection of these values in a bar plot, looking for the ‘elbow’ of the graph. We discuss the reasoning behind such a choice, and how to improve on such an initial decision.

Note that the mlbench and utils R packages are used in this exercise. You will need to have them installed on your system. You can install packages using the install.packages function in R.