Skip to 0 minutes and 11 seconds Hello! Well, we’ve come to Class 5, the last class of More Data Mining with Weka. Congratulations on having got this far. In this class, we’re going to look at some miscellaneous things. We’ll have a couple of lessons on neural networks and the Multilayer Perceptron. Then we’ll take a quick look at learning curves and performance optimization in Weka. Then we’ll come back and have another look at the ARFF file format. You’ve been listening to me talking for quite a long time now, and I just wonder if you might be interested in finding out a little bit more about me. If so, go to the web and search for “A stroll through the gardens of computer science” in quotes.
Skip to 0 minutes and 54 seconds So you’ve got to get it exactly right: “A stroll through the gardens of computer science”.
Skip to 0 minutes and 59 seconds You’ll get just one result, or, I got just one result: News from New Zealand. This, in fact, is an interview with me, an extended interview. It starts on the next page. A dialogue with me. You’ll learn where I came from and what I’ve done and what I’ve been doing and what I’ve been thinking of and some of my biases. That might be interesting or not. It’s up to you. Let’s get back to the lesson. We’re going to talk about simple neural networks. Now, a lot of people love neural networks. I’m not one of them. I think it’s a brilliant term, “neural network”, because it conjures up the image on the left of some really cool brain-like mechanism.
Skip to 1 minute and 47 seconds Actually, you should think of the rather grungy picture on the right, a linear sum. We’ll talk about that in a minute. The very name is suggestive of intelligence. However, the reality, I think, is not. In this lesson, we’re going to talk about the simplest neural network, the Perceptron. It’s a simple learning method that determines the class in a two-class dataset using a linear combination of attributes. For a test instance a, that is, one with attributes a1, a2, a3 and so on, we take the sum w0 plus w1a1 plus w2a2 and so on over all the attributes. We’ll express that as a sigma of wjaj from j=0. We’re implicitly defining a0 as 1 here just to make the notation look nice.
Skip to 2 minutes and 37 seconds If the result, x, is greater than zero, then we’re going to say that instance belongs to class 1; otherwise, we’re going to say it belongs to class 2. This, of course, works most naturally with numeric attributes. Where do the weights come from? That’s the big question. We have to learn them. Here’s the algorithm. We start by setting all weights to zero. Then, until all the instances in the training data are classified correctly, we continue: for each instance in the training data, if it’s classified correctly, then we do nothing.
Skip to 3 minutes and 8 seconds If it’s classified incorrectly, then, if it belongs to the first class we’ll add it to the weight vector, and if it belongs to the second class, we’ll subtract it from the weight vector. There’s a theorem, the Perceptron Convergence Theorem, that says that if you continue to do this, cycling repeatedly through the training data, perhaps many times, it will converge, providing the problem is linearly separable, that is, there exists a straight line (in general, a hyperplane) that separates the two classes, class 1 and class 2. Actually, we talked about linear decision boundaries before when we talked about Support Vector Machines.
Skip to 3 minutes and 42 seconds They were also restricted to linear boundaries, but they can get more complex boundaries using the “Kernel trick”, which I mentioned but did not explain back then in Data Mining with Weka. And I’m not going to explain it now, but I’m just going to tell you that the Perceptron can use the same trick to get non-linear boundaries. The Weka implementation is called the Voted Perceptron, a slightly different algorithm. It stores all versions of the weight vector and lets them vote on test examples. The weight vectors are themselves weighted according to the length of time that they survived before the weights got changed.
Skip to 4 minutes and 20 seconds You know, we’re going to use a weight vector, keep classifying training instances, and when the system makes a mistake, then we’re going to change the weight vector. The survival time is some kind of indication of how successful that version of the weight vector is. This is claimed to have many of the advantages of Support Vector Machines, but it’s faster, simpler, and nearly as good. We’ll take a look. I’m going to look at the ionosphere dataset. I’ve got it open here in Weka. I’m going to go to Classify, and the VotedPerceptron is in the functions category.
Skip to 4 minutes and 58 seconds If I select that – there’s a bunch of options, but we won’t worry about that – and just run it using cross-validation, I get 86%. If I were to choose SMO, then I would get 89%. Back to the slide. For the German credit data, we also get slightly better performance with SMO. For the breast cancer dataset, they are almost exactly the same, and for the diabetes, again SMO is a little bit better. It’s certainly true that the VotedPerceptron is faster, maybe 2 times, 5 times, perhaps up to 10 times depending on the dataset. The Perceptron’s got a long history. It was first published in 1957, the basic Perceptron algorithm. It was derived from theories about how the brain works.
Skip to 5 minutes and 46 seconds It’s an acronym for “a perceiving and recognizing automaton”, and a guy called
Skip to 5 minutes and 51 seconds Rosenblatt published a book in 1958 called “Principles of neurodynamics: Perceptrons and the theory of brain mechanisms”. Very suddenly, in 1970, it went out of fashion with a book by two well-known computer scientists, called “Perceptrons”, and they showed that there were some simple things that Perceptrons simply couldn’t do. They proved theorems about what Perceptrons could and couldn’t do. This is the cover of their famous book that basically took Perceptrons off the map. Until 1986, when they came back rebranded “connectionism”, the movement was the “connectionist movement”, and a couple of guys wrote another book “Parallel distributed processing”. Some people claim that artificial neural networks mirror brain function, just like Richard Rosenblatt did back in the 50’s.
Skip to 6 minutes and 44 seconds The main form of Perceptron the connectionists use is a Multilayer Perceptron, which is capable of drawing nonlinear decision boundaries using an algorithm called the backpropagation algorithm that we’ll look at in the next lesson. Here’s the summary. The basic Perceptron algorithm implements a linear decision boundary. It’s very reminiscent of classification by regression. It works with numeric attributes. It’s an iterative algorithm, and the result depends on the order in which it encounters the training instances. Actually, many years ago, in 1971, I described a simple improvement to the Perceptron in my Master’s thesis, but I’m still not very impressed with the Perceptron stuff; sorry about that.
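The basic Perceptron learning rule from this lesson can be sketched in a few lines of Python. This is only an illustration, not Weka code: the function names `predict` and `train_perceptron`, the `max_passes` cap, and the toy data are all assumptions made for the example. It starts with all weights at zero, adds misclassified class 1 instances to the weight vector, subtracts misclassified class 2 instances, and stops once a full pass makes no mistakes.

```python
def predict(weights, attributes):
    """Return class 1 if the weighted sum is greater than zero, else class 2."""
    a = [1.0] + list(attributes)          # a0 is defined as 1, so w0 acts as the bias
    x = sum(w * v for w, v in zip(weights, a))
    return 1 if x > 0 else 2


def train_perceptron(instances, labels, max_passes=100):
    """Learn weights with the basic Perceptron rule (max_passes is an assumed safety cap)."""
    weights = [0.0] * (len(instances[0]) + 1)   # start with all weights at zero
    for _ in range(max_passes):
        mistakes = 0
        for attributes, label in zip(instances, labels):
            if predict(weights, attributes) == label:
                continue                         # classified correctly: do nothing
            a = [1.0] + list(attributes)
            sign = 1.0 if label == 1 else -1.0   # add for class 1, subtract for class 2
            weights = [w + sign * v for w, v in zip(weights, a)]
            mistakes += 1
        if mistakes == 0:                        # every training instance is correct
            break
    return weights


# Hypothetical, linearly separable toy data (not one of the Weka datasets).
toy_instances = [(2.0, 2.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
toy_labels = [1, 1, 2, 2]
toy_weights = train_perceptron(toy_instances, toy_labels)
print([predict(toy_weights, a) for a in toy_instances])
```

The cap on the number of passes is there because, if the data is not linearly separable, the loop would otherwise never terminate. Reordering the training instances can change the weights that come out, which is the order dependence mentioned in the summary.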
Skip to 7 minutes and 30 seconds Recently, there have been some improvements: the use of the Kernel trick to get more complex boundaries, and this Voted Perceptron strategy with multiple weight vectors and voting.
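The voting idea can be sketched along the same lines. This is a simplified illustration under stated assumptions, not Weka's VotedPerceptron: the names `train_voted` and `predict_voted`, the fixed number of passes, the choice to count survival as the number of instances a weight vector classified correctly, and the toy data are all hypothetical. Every version of the weight vector is kept together with its survival count, and at prediction time each version votes with that count as its weight.

```python
def weighted_sum(weights, attributes):
    """Linear sum w0 + w1*a1 + ... with a0 implicitly equal to 1."""
    a = [1.0] + list(attributes)
    return sum(w * v for w, v in zip(weights, a))


def train_voted(instances, labels, passes=2):
    """Return a list of (weight_vector, survival_count) pairs."""
    weights = [0.0] * (len(instances[0]) + 1)
    survived = 0
    history = []
    for _ in range(passes):
        for attributes, label in zip(instances, labels):
            predicted = 1 if weighted_sum(weights, attributes) > 0 else 2
            if predicted == label:
                survived += 1                    # this version survives one more instance
                continue
            history.append((weights, survived))  # retire the old version with its count
            a = [1.0] + list(attributes)
            sign = 1.0 if label == 1 else -1.0
            weights = [w + sign * v for w, v in zip(weights, a)]
            survived = 0
    history.append((weights, survived))          # keep the final version as well
    return history


def predict_voted(history, attributes):
    """Each stored weight vector votes, weighted by how long it survived."""
    vote = 0
    for weights, survived in history:
        vote += survived if weighted_sum(weights, attributes) > 0 else -survived
    return 1 if vote > 0 else 2


# Hypothetical toy data, the same shape as a two-class numeric dataset.
voted_instances = [(2.0, 2.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
voted_labels = [1, 1, 2, 2]
voted_history = train_voted(voted_instances, voted_labels)
print([predict_voted(voted_history, a) for a in voted_instances])
```

A version of the weight vector that was replaced immediately gets a count of zero and so has no say in the vote, while long-lived versions dominate.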
Simple neural networks
Neural network learning methods invite an analogy to the brain that is seductive but entirely misleading! The simplest form of neural network, called a “Perceptron”, implements a linear decision boundary. It operates iteratively, and the result depends on the order in which the instances are presented, but there is a theorem that proves that under certain circumstances it will converge rather than cycle indefinitely. The Perceptron has been the subject of much controversy during its 60-year life span.
© University of Waikato, New Zealand. CC Creative Commons Attribution 4.0 International License.