This content is taken from The Open University & Persontyle's online course, Advanced Machine Learning. Join the course to learn more.
[0:01] So in the second neural network example exercise, we’ll spice things up a little bit by doing a hyperparameter search. The first hyperparameter we’ll be looking at is how many hidden nodes we should use. But we’ll also add a regularisation parameter: L2 regularisation, or weight decay as it’s sometimes known in the neural network literature. We’ll see what weight decay value we should use to get an optimal model.
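As a reminder of what the weight decay parameter controls, L2 regularisation can be written as a penalty added to the training error. The symbols below are assumptions chosen for illustration: $E(\mathbf{w})$ is the unpenalised training error, $\lambda \ge 0$ is the weight decay value, and $w_i$ are the network weights. Some texts scale the penalty by $\tfrac{1}{2}$, so treat the exact constant as a convention rather than a fixed rule.

$$
\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_i w_i^2
$$

With $\lambda = 0$ there is no penalty; larger values of $\lambda$ push the weights towards zero.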

[0:36] So let’s turn to the code. As always, we start off by setting up our data. Here we go. This time we’re going to be using the iris dataset.

[0:52] Let’s have a look at that. It has four features: the sepal and petal length and width of the different flowers. Each row is a particular flower, and the target variable is the species. So this is a classification task. We’re trying to estimate the species of the iris flower based on its petal and sepal length and width. Now we begin as normal by splitting up the data into, in this case, training, validation, and test sets. There we go. Now we’re going to want to create a whole bunch of neural network models.
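The course code is written in R and is not reproduced here, so the following is only a minimal Python sketch of a three-way split. The 60/20/20 proportions, the seed, and the function name `split_indices` are assumptions made for illustration; the video does not state the proportions it uses. The iris dataset has 150 rows.

```python
import random

def split_indices(n_rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train/validation/test parts.

    The fractions are hypothetical; whatever is left after the training
    and validation parts becomes the test set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_train = round(n_rows * train_frac)
    n_valid = round(n_rows * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test

# Iris has 150 rows, so this gives 90 / 30 / 30 row indices.
train_idx, valid_idx, test_idx = split_indices(150)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three index lists are disjoint and together cover every row, which is the property that matters: the validation and test rows are never seen during training.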

[1:36] We’re going to want to create neural network models of different sizes, in the sense of having different numbers of hidden nodes in a single hidden layer. We’ll create models of size 4, 6, 8, all the way up to 16, in steps of 2. But we also want to create neural network models with different weight decay hyperparameter values. Remember, weight decay is simply L2 regularisation, so we want models with different L2 regularisation penalties. Now, neural network training is non-deterministic: when we start the training algorithm, we assign random initial weights close to, but not exactly, zero.

[2:23] Sometimes these random initial weights start off in a good spot and converge to a good local optimum; sometimes they don’t. So, to try to make sure that we get at least one good random initialization of the weights for each pair of hyperparameter values that we’re looking at, we’re going to create four neural networks for each combination of the size and weight decay parameters. That means we’re going to be creating quite a few models. Here we see the computer at work. To give us an idea of how many models we created, let’s get an output to the console. There we go: 84 neural network models.
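The count of 84 can be checked by enumerating the grid. This is a Python sketch rather than the course's R code. The hidden layer sizes come from the video; the weight decay values 0, 0.1 and 1 are the ones named when the plot is discussed, and treating them as the complete set is an assumption (it is consistent with the total of 84). The dictionary keys are hypothetical names.

```python
from itertools import product

hidden_sizes = list(range(4, 17, 2))  # 4, 6, ..., 16: seven sizes
weight_decays = [0.0, 0.1, 1.0]       # assumed to be the full set of values
n_restarts = 4                        # four random initialisations per pair

# One entry per model to train: every (size, decay) pair, repeated four times.
grid = [
    {"size": size, "decay": decay, "restart": restart}
    for size, decay in product(hidden_sizes, weight_decays)
    for restart in range(n_restarts)
]

print(len(grid))  # 84 = 7 sizes * 3 decay values * 4 restarts
```

Each entry would correspond to one training run; only the random starting weights differ between the four restarts of the same pair.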

[3:06] Now, if it took a long time to build each model, say an hour or a day, then the approach we’re looking at here to choose good hyperparameter values would be extremely expensive in time. Because these models can be created quickly, we’re able to use the grid search approach that we’re looking at. Now that we have this set of models, we’re going to want to calculate the validation error for each model. So here we go. We’ll do that.

[3:43] We can have a look at these misclassification errors for each of our 84 models. Now the first thing we notice when we look at these misclassification errors on the validation data is that there are a lot of zeros here. What this reflects is the fact that we really just don’t have enough data. Once again, because we wanted to be able to work with quick examples, we are using a small amount of data. And we clearly do not have enough data to be able to confidently determine what the best model is.
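The misclassification error itself is simply the fraction of validation rows whose predicted class differs from the true class. The Python sketch below uses made-up labels purely for illustration; the function name is hypothetical and this is not the course's R code.

```python
def misclassification_error(predicted, actual):
    """Return the fraction of predictions that differ from the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Toy labels, invented for this example: one of four predictions is wrong.
actual = ["setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "virginica"]
print(misclassification_error(predicted, actual))  # 0.25
```

With n validation rows the error can only take the values 0, 1/n, 2/n and so on, which is one reason a small validation set produces many exact ties between models.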

[4:19] So what that means in practice is that when the computer picks our best model on the validation data, it’s just going to pick out the first one that ends up getting zero. Well, we’ve actually just done that. We picked out the first of the models that ended up getting zero on the validation data, and we’ve seen how well it performs on the test data. Now we can output that here. And we see that the best model, or rather the first model that ended up having zero, was a neural network with four hidden nodes and zero weight decay. It ended up with zero misclassification error on the validation data.

[5:02] It also ended up with zero misclassification error on the test data, which is a good sign, but one we should be cautious about getting too excited about because of the small amount of data we’re working with. And to get a bit of an indication of what we’re facing here, these are the misclassification errors of the different models. Each little set of bars here corresponds to one set of hyperparameters. So for example, here we have four hidden nodes, zero weight decay; four hidden nodes, 0.1 weight decay; four hidden nodes, one weight decay. And of course, with four models made for each, we see a huge number of them were getting zero.

[5:46] We ended up selecting this first one simply because of how [INAUDIBLE] implements the instruction to find the minimum in a vector (which.min).
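The tie-breaking behaviour described here can be illustrated with a short Python sketch. It assumes the instruction being referred to is R's `which.min`, which returns the position of the first minimum in a vector. The helper name `first_argmin` and the error values are hypothetical, and Python positions start at 0 where R's start at 1.

```python
def first_argmin(values):
    """Return the position of the first occurrence of the smallest value."""
    smallest = min(values)
    return values.index(smallest)  # list.index returns the first match

# Toy validation errors, invented for this example: three models tie at zero.
validation_errors = [0.05, 0.0, 0.1, 0.0, 0.0]
print(first_argmin(validation_errors))  # 1: the first of the tied models
```

When several models tie at the minimum, the one chosen is therefore decided by the order in which the models were built, not by any real difference in quality.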

[6:05] Now, don’t get too hung up on the difficulties involved in this case with very little data. The important thing is the approach: we build models with different combinations of hyperparameters and work out which set of hyperparameters is going to provide us with a good model. And of course, we evaluate this using basic model selection techniques.

# ANN Exercise 2

This is the second exercise for artificial neural networks. The associated code is in the ANN Ex2.R file. Interested students are encouraged to replicate in R what we go through in the video, but note that this is an optional activity intended for those who want practical experience in R and machine learning.

In this exercise, we perform a hyper-parameter search, where we are seeking to discover (i) a good number of hidden nodes; and (ii) what a good value would be for our L2 regularization parameter. The method we employ is grid search, and we proceed so as to minimize the effect of the non-determinism present in the ANN optimization process.
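One way to picture how repeated training runs reduce the effect of non-determinism is to keep, for each hyperparameter combination, the best validation error seen across its restarts. The Python sketch below is an illustration under that assumption, not the course's R code; the function name and all the numbers are made up.

```python
def best_error_per_config(results):
    """Map each (size, decay) pair to its lowest validation error.

    `results` is a list of (size, decay, validation_error) tuples,
    with one tuple per trained model.
    """
    best = {}
    for size, decay, error in results:
        key = (size, decay)
        if key not in best or error < best[key]:
            best[key] = error
    return best

# Toy results, invented for this example: two restarts per combination.
results = [
    (4, 0.0, 0.10), (4, 0.0, 0.00),
    (4, 0.1, 0.05), (4, 0.1, 0.05),
    (6, 0.0, 0.20), (6, 0.0, 0.15),
]
best = best_error_per_config(results)
chosen = min(best, key=best.get)  # the first combination with the lowest error
print(chosen, best[chosen])
```

A single unlucky initialisation (the 0.10 run for size 4, decay 0 in the toy data) then no longer counts against an otherwise good combination.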

We will be using the well-known Iris dataset, which means this is a classification problem. We discuss the specifics of this problem, in particular the effects of our dataset being so small and of many of our models performing equally well on the validation data.

Note that the utils, nnet and datasets R packages are used in this exercise. You will need to have them installed on your system. You can install packages using the install.packages function in R.

Please note that the audio quality in parts of this video is lower than in other videos in this course.