This content is taken from The University of Waikato's online course, More Data Mining with Weka.

Skip to 0 minutes and 11 seconds In the last lesson, we looked at the basic Perceptron algorithm, and now we’re going to look at the Multilayer Perceptron. Multilayer Perceptrons are simply networks of Perceptrons, networks of linear classifiers. They have an input layer, some hidden layers perhaps, and an output layer. If we just look at the picture on the lower left, the green nodes are input nodes. This is actually for the numeric weather data. Although you probably can’t read the labels, the top one is “outlook=sunny”; underneath is “outlook=overcast”; then “outlook=rainy”; and then we have “temperature”, “humidity” and “windy” for the nodes. This is the numeric weather data, so “outlook” is the only nominal variable, and that’s been made into three binary attributes.
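The conversion of “outlook” into three binary inputs can be sketched in a few lines of Python. This is only an illustration of the idea, not Weka's own code: the function name `encode_instance` is hypothetical, and treating “windy” as a 0/1 value is an assumption of the sketch.

```python
# The three values of the nominal attribute "outlook", in the order of the input nodes.
OUTLOOK_VALUES = ["sunny", "overcast", "rainy"]

def encode_instance(outlook, temperature, humidity, windy):
    """Return the six network inputs for one weather instance."""
    # One binary input per outlook value: 1.0 for the value that is present.
    binary_outlook = [1.0 if outlook == value else 0.0 for value in OUTLOOK_VALUES]
    # Numeric attributes pass through; windy is assumed to map to 0/1.
    return binary_outlook + [float(temperature), float(humidity), 1.0 if windy else 0.0]

print(encode_instance("sunny", 85, 85, False))
# [1.0, 0.0, 0.0, 85.0, 85.0, 0.0]
```

That gives six input values, matching the six green nodes in the picture.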

Skip to 1 minute and 6 seconds These two [yellow] nodes are the output nodes, for “play” and “don’t play” respectively. Each of those two yellow nodes performs a weighted sum, and each of the connections has a weight. If we look at the more complicated picture to the right, we’ve got some red nodes here. These are three hidden layers with different numbers of neurons/nodes in each of these three hidden layers. Each node performs a weighted sum of its inputs and thresholds the result, just like in the regular, basic Perceptron. But in the basic Perceptron, you looked to see whether the result was greater than zero or less than zero. In Multilayer Perceptrons, instead of using that hard-edged function, people use what’s called a “sigmoid” function.

Skip to 1 minute and 54 seconds I’ve drawn a few sigmoid functions on the slide up in the top right. You can see that as they become steeper, they approach the step function, which corresponds to the hard-edged threshold used in the basic Perceptron. But here we’re going to use a smooth, continuous sigmoid function. Actually, there is a theoretical reason for this: the gradient-based training we’ll come to needs the function to be differentiable, and the step function is not. That’s kind of important. Anyway, that’s by the by. These nodes are often called “neurons”, the red nodes and the yellow nodes. These are not to be confused with the neurons that you have in your head. The big questions are how many layers, and how many nodes in each?
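To make the comparison between the sigmoid and the step function concrete, here is a minimal Python sketch of a single node. The `steepness` parameter and the function names are hypothetical, introduced only to show how a steeper sigmoid approaches the hard threshold; this is not how Weka parameterizes its nodes.

```python
import math

def step(x):
    """Hard-edged threshold, as used by the basic Perceptron."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x, steepness=1.0):
    """Smooth threshold; `steepness` is a hypothetical knob for this sketch."""
    return 1.0 / (1.0 + math.exp(-steepness * x))

def neuron_output(inputs, weights, bias, steepness=1.0):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum, steepness)

# As the steepness grows, the sigmoid values move towards 0 or 1, like the step function.
for k in (1, 5, 50):
    print(k, [round(sigmoid(x, k), 3) for x in (-1.0, -0.1, 0.1, 1.0)])
```

Unlike `step`, the sigmoid has a well-defined slope everywhere, which is what gradient-based training relies on.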

Skip to 2 minutes and 38 seconds We know for the input layer, we’re going to have one for each attribute, and the attributes are numeric or binary. For the output layer, we’re going to have one for each class. How many hidden layers? Well, that’s up to you. If you have zero hidden layers, that’s the standard Perceptron algorithm. That’s suitable if the data is linearly separable.

Skip to 2 minutes and 57 seconds There are theoretical results: with one hidden layer, that’s suitable for a single, convex region of the decision space; two hidden layers are enough to generate arbitrary decision boundaries. However, people don’t necessarily use two hidden layers, because that really increases the number of connections – that’s the number of weights that would have to be learned.
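A rough way to see how extra hidden layers increase the number of weights is to count them. The sketch below assumes every layer is fully connected to the next and that each non-input node also has one bias weight; both are assumptions of the illustration, and `count_weights` is a hypothetical helper name.

```python
def count_weights(layer_sizes, include_bias=True):
    """Count the weights in a fully connected network.

    layer_sizes lists the number of nodes per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out      # one weight per connection between adjacent layers
        if include_bias:
            total += n_out         # one bias weight per receiving node (assumed)
    return total

# 6 inputs and 2 outputs, as in the numeric weather data.
print(count_weights([6, 2]))        # no hidden layer: 14
print(count_weights([6, 4, 2]))     # one hidden layer of 4 nodes: 38
print(count_weights([6, 4, 4, 2]))  # two hidden layers of 4 nodes: 58
```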

Skip to 3 minutes and 18 seconds The next big question is: how big should the layers be? They are usually chosen somewhere between the sizes of the input and output layers. A common heuristic, Weka’s heuristic, is to use the mean of the number of nodes in the input and output layers. What are these weights? Well, they’re learned. They’re learned from the training set. They are learned by iteratively minimizing the error using the steepest descent method, and the gradient is determined using a backpropagation algorithm. We’re not going to talk about backpropagation here. The change in weight is computed by multiplying the gradient by a constant called the “learning rate” and adding the previous change in weight multiplied by another parameter called “momentum”.

Skip to 3 minutes and 58 seconds So Wnext (the next weight vector) is W + ΔW, where ΔW is minus the learning_rate times the gradient (minus because we want to go downhill) plus momentum times the previous change in the weight parameter. Multilayer Perceptrons can get excellent results, but they often involve a lot of experimentation with the number and size of the hidden layers and the value of the learning rate and momentum parameters. Let’s take a look in Weka. I’m going to use the numeric weather data. Over here, I’ve got it open. I’m going to go to Classify and find MultilayerPerceptron in the functions category. Here it is, and let’s just run it. We get 79%. I want to show you the network we used.
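The update rule stated at the start of this passage can be written as a short Python function. This is a sketch of the formula only: it takes the gradient as given (backpropagation is not shown), the name `momentum_step` is hypothetical, and the values 0.3 and 0.2 for the learning rate and momentum are illustrative.

```python
def momentum_step(weights, gradient, previous_delta, learning_rate=0.3, momentum=0.2):
    """One update: delta = -learning_rate * gradient + momentum * previous_delta."""
    delta = [-learning_rate * g + momentum * d
             for g, d in zip(gradient, previous_delta)]
    new_weights = [w + dw for w, dw in zip(weights, delta)]
    return new_weights, delta

# Toy example (an assumption for illustration): minimize f(w) = w^2, whose gradient is 2w.
weights, delta = [1.0], [0.0]
for _ in range(5):
    gradient = [2.0 * w for w in weights]
    weights, delta = momentum_step(weights, gradient, delta)
    print(weights)
```

The momentum term carries part of the previous step forward, so successive steps in the same direction reinforce each other.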

Skip to 4 minutes and 44 seconds Let me just switch on GUI, the graphical user interface. Now when I run it, I get a picture of the network. That is Weka’s default network. These are the input nodes that we looked at before, the green ones. Weka has chosen 4 neurons in the hidden layer. That’s the average of the number of input and output nodes: 6 inputs and 2 outputs give (6 + 2) / 2 = 4. There are 2 output neurons.
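The arithmetic behind that default can be sketched as follows. The function name is hypothetical, and rounding down when the sum is odd is an assumption of this sketch rather than something stated in the lesson.

```python
def default_hidden_size(num_inputs, num_outputs):
    """Mean of the number of input and output nodes (rounded down, by assumption)."""
    return (num_inputs + num_outputs) // 2

# Numeric weather data: 6 input nodes and 2 output nodes.
print(default_hidden_size(6, 2))  # 4
```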

Skip to 5 minutes and 16 seconds Going back to the slide: when I tried IBk, I also got 79% on this data set. J48 and so on do worse. However, it’s just a toy problem, so those results aren’t really indicative. On real problems Multilayer Perceptrons often do quite well, but they’re slow.

Skip to 5 minutes and 35 seconds There are a number of parameters: the number of hidden layers and the size of the hidden layers; the learning rate and momentum. The algorithm makes multiple passes through the data, and training continues until the error on the validation set consistently increases – that is, if we start going uphill – or the training time is exceeded – that is, the maximum number of epochs allowed. Going back to Weka, I’m going to configure this to use 5 neurons, 10 neurons, and 20 neurons in 3 hidden layers. Look at this! You can see the three hidden layers with 5, 10, and 20 neurons – an awful lot of weights here. We’ve got the learning rate and the momentum, both of which we can change.
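The stopping rule described above can be sketched in Python. This is one possible reading, not Weka's implementation: “consistently increases” is interpreted here as the validation error rising for a fixed number of consecutive epochs, and the names `stopping_epoch` and `patience`, as well as the default values, are hypothetical.

```python
def stopping_epoch(validation_errors, patience=20, max_epochs=500):
    """Return the epoch at which training would stop.

    validation_errors holds the validation-set error after each epoch.
    Training stops once the error has risen `patience` times in a row,
    or when `max_epochs` is reached, whichever comes first.
    """
    previous = float("inf")
    rises_in_a_row = 0
    for epoch, error in enumerate(validation_errors[:max_epochs], start=1):
        if error > previous:
            rises_in_a_row += 1          # still going uphill
            if rises_in_a_row >= patience:
                return epoch
        else:
            rises_in_a_row = 0           # error fell or stayed level: reset the count
        previous = error
    return min(len(validation_errors), max_epochs)

errors = [0.5, 0.4, 0.3, 0.35, 0.4, 0.45, 0.5]
print(stopping_epoch(errors, patience=3))  # 6
```

In the example the error rises at epochs 4, 5 and 6, so with a patience of 3 training stops at epoch 6.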

Skip to 6 minutes and 27 seconds We’ve got the maximum number of epochs. We can just run that. Also, in Weka, you can create your own network structure. You can add new nodes, add connections, and delete nodes and so on. I’m going to go back to Weka, and I’m just going to use the default number of hidden layers. I’ve now got my 4 neurons in the 1 hidden layer. I’m going to add another hidden layer. If I click empty space, I create a neuron. It’s yellow, which means it’s selected. I’m going to deselect it by right-clicking empty space, and create another couple. With this one here, I’m going to connect it up to this.

Skip to 7 minutes and 13 seconds If I click these, they connect the selected neuron – that is, the yellow one – to the one I click. Then I can deselect it and select this one and make connections here. You can see it’s pretty quick to add connections. I’ve added another hidden layer. Well, I need to do some things with the output here, but you can get the idea from this. We can click to select a node and right-click an empty space to deselect. We can create and delete nodes by clicking in empty space to create and right-clicking to delete. We can create and delete connections, and we can set parameters in this interface too. Are they any good?

Skip to 7 minutes and 54 seconds Well, I tried the Experimenter with 6 datasets, and I used 9 algorithms. MultilayerPerceptron gave me the best results on 2 of the 6 datasets.

Skip to 8 minutes and 7 seconds The other wins were: SMO won on another 2 datasets; J48 and IBk won on 1 dataset each. When I say “win”, I mean beat all the other methods. MultilayerPerceptron was not too bad, but in fact it was between 10 and 2000 times slower than other methods, which is a bit of a disadvantage. Here’s the summary. Multilayer Perceptrons can implement arbitrary decision boundaries given two or more hidden layers, providing you’ve got enough neurons in the hidden layers, and providing they’re trained properly. Training is done by backpropagation, which is an iterative algorithm based on gradient descent. In practice, you get quite good performance, but Multilayer Perceptrons are extremely slow. I’m still not a fan of Multilayer Perceptrons; I’m sorry about that.

Skip to 8 minutes and 58 seconds They might be a lot more impressive on more complex datasets; I don’t know. But for me, configuring Multilayer Perceptrons involves too much messing around.

Multilayer perceptrons

Multilayer perceptrons are networks of perceptrons, networks of linear classifiers. In fact, they can implement arbitrary decision boundaries using “hidden layers”. Weka has a graphical interface that lets you create your own network structure with as many perceptrons and connections as you like. A quick test showed that a multilayer perceptron with one hidden layer gave better results than other methods on two out of six data sets – not too bad. But it was 10–2000 times slower than other methods, which is a bit of a disadvantage.