Skip to 0 minutes and 11 seconds Hi! I’m sitting here in New Zealand. It’s on the globe behind me. That’s New Zealand, at the top of the world, surrounded by water. But that’s not where I’m from originally. I moved here about 20 years ago. Here on this map, of course, this is New Zealand – Google puts things with the north at the top, which is probably what you’re used to. I came here from the University of Calgary in Canada, where I was for many years. I used to be head of computer science for the University of Calgary. But, originally, I’m from Belfast, Northern Ireland, which is here in the United Kingdom. So, my accent actually is Northern Irish, not New Zealand.
Skip to 0 minutes and 50 seconds This is not a New Zealand accent. We’re going to talk about another machine learning method called the nearest neighbor, or instance-based, machine learning method. When people talk about rote learning, they mean just remembering stuff without really thinking about it. It’s the simplest kind of learning. Nearest neighbor implements rote learning. It just remembers the training instances, and then, to classify a new instance, it searches the training set for the one that is most like the new instance. The representation of the knowledge here is just the set of instances. It’s a kind of lazy learning: the learner does nothing until it has to make a prediction. Confusingly, it’s also called instance-based learning. Nearest neighbor learning and instance-based learning are the same thing.
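The "remember everything, do nothing until asked" idea can be sketched in a few lines of Python. This is a toy illustration, not Weka's implementation; the function names are made up for this example.

```python
# A minimal sketch of 1-nearest-neighbor "rote learning".
# Training just stores the instances; all the work happens at prediction time.

def euclidean_sq(a, b):
    """Squared Euclidean distance between two numeric attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor_classify(training_set, instance):
    """Return the class of the stored training instance closest to `instance`."""
    _, label = min(training_set, key=lambda pair: euclidean_sq(pair[0], instance))
    return label

# Two tiny clouds of points with classes "blue" and "white"
training = [((1.0, 1.0), "blue"), ((1.2, 0.8), "blue"),
            ((4.0, 4.0), "white"), ((3.8, 4.2), "white")]

print(nearest_neighbor_classify(training, (1.5, 1.1)))  # closer to the blue cloud
```

Note that "training" is just building the list; the search over stored instances only happens when a prediction is requested.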
Skip to 1 minute and 41 seconds Here is just a little picture of 2-dimensional instance space. The blue points and the white points are two different classes – yes and no, for example. Then we’ve got an unknown instance, the red one. We want to know which class it’s in. So, we simply find the closest instance in each of the classes and see which is closest. In this case, it’s the blue class. So, we would classify that red point as though it belonged to the blue class. If you think about this, that’s implicitly drawing a line between the two clouds of points. It’s a straight line here, the perpendicular bisector of the line that joins the two closest points. The nearest neighbor method produces a linear decision boundary.
Skip to 2 minutes and 26 seconds Actually, it’s a little bit more complicated than that. It produces a piecewise linear decision boundary, sometimes with a bunch of little linear pieces making up the boundary. Of course, the trick is what we mean by “most like”. We need a similarity function, and conventionally people use the regular distance function, the Euclidean distance: the square root of the sum of the squared differences between the attribute values. Since we’re only comparing distances, we don’t actually need to take the square root. Or you might use the Manhattan, or city-block, distance, which is the sum of the absolute differences between the attribute values. Of course, I’ve been talking about numeric attributes here.
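The two distance functions mentioned above are straightforward to write down. A small Python sketch (illustrative names, not Weka's API):

```python
import math

def squared_euclidean(a, b):
    # Sum of squared attribute differences. This is monotonic in the true
    # Euclidean distance, so for ranking neighbors the square root can be skipped.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # "City block" distance: sum of absolute attribute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7.0
```

Skipping the square root matters in practice because the distance is computed once per training instance for every prediction.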
Skip to 3 minutes and 13 seconds If attributes are nominal, we need to define the distance between different attribute values. Conventionally, people just say the distance is 1 if the attribute values are different and 0 if they are the same. It might be a good idea with nearest neighbor learning to normalize the attributes so that they all lie between 0 and 1, so the distance isn’t skewed by some attribute that happens to be on some gigantic scale. What about noisy instances? If we have a noisy dataset, then by accident we might find an incorrectly classified training instance as the nearest one to our test instance. You can guard against that by using the k nearest neighbors.
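Both conventions described above are easy to make concrete: min-max normalization of numeric attributes, and the 0/1 distance for nominal values. A toy sketch (illustrative names, assuming all attributes in a row are numeric for the normalization part):

```python
def normalize_columns(instances):
    # Min-max scale each numeric attribute into [0, 1], so that no attribute
    # dominates the distance just because it's on a gigantic scale.
    cols = list(zip(*instances))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi))
            for row in instances]

def nominal_distance(a, b):
    # Conventional distance for nominal values: 0 if equal, 1 if different.
    return 0 if a == b else 1

data = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
print(normalize_columns(data))  # [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
print(nominal_distance("red", "red"), nominal_distance("red", "blue"))  # 0 1
```

Without the normalization step, the second attribute here (on a scale of thousands) would completely swamp the first one in any distance calculation.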
Skip to 3 minutes and 55 seconds k might be 3 or 5, and you look for the 3 or the 5 nearest neighbors and choose the majority class amongst those when classifying an unknown point. That’s the k-nearest-neighbor method. In Weka, it’s called IBk (instance-based learning with parameter k), and it’s in the lazy class. Let’s open the glass dataset. Go to Classify and choose the lazy classifier IBk.
Skip to 4 minutes and 27 seconds Let’s just run it. We get an accuracy of 70.6%. The model is not really printed here, because there is no model: it’s just the set of training instances. We’re using 10-fold cross-validation, of course. Let’s change the value of k; the kNN parameter is the k value. It’s set by default to 1 (the number of neighbors to use). We’ll change that to, say, 5 and run that. In this case, we get a slightly worse result, 67.8% with k as 5. This is not such a noisy dataset, I guess. If we change it to 20 and run it again, we get 65% accuracy, slightly worse again.
Skip to 5 minutes and 20 seconds If we had a noisy dataset, we might find that the accuracy figures improved as k got a little bit larger. Then it would always start to decrease again. If we set k to an extreme value, close to the size of the whole dataset, then we’re taking nearly all of the points in the dataset and averaging their classes, which will probably give us something close to the baseline accuracy. Here, if I set k to a ridiculous value like 100, I’m going to take the 100 nearest instances and vote on their classes. We get an accuracy of 35%, which, I think, is pretty close to the baseline accuracy for this dataset.
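The extreme case can be seen in miniature: when k covers the whole training set, the vote ignores distance entirely and simply returns the majority class, which is exactly what the ZeroR baseline predicts. A toy sketch in Python (illustrative names, not Weka code):

```python
from collections import Counter

def knn_classify(training_set, instance, k):
    # Majority vote over the k training instances closest to `instance`.
    dist = lambda pair: sum((x - y) ** 2 for x, y in zip(pair[0], instance))
    k_nearest = sorted(training_set, key=dist)[:k]
    return Counter(label for _, label in k_nearest).most_common(1)[0][0]

training = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"),
            ((9.0,), "b"), ((10.0,), "b")]

# With k = 5 (the whole set), every query just gets the majority class "a",
# the same prediction the ZeroR baseline would make.
print(knn_classify(training, (9.5,), k=5))   # "a"
print(knn_classify(training, (9.5,), k=1))   # "b": the genuine nearest neighbor
```

The query point sits right next to the "b" instances, yet with k equal to the dataset size the majority class "a" wins regardless: distance no longer matters at all.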
Skip to 6 minutes and 4 seconds Let me just check that with ZeroR: the baseline accuracy is indeed 35%. Nearest neighbor is a really good method. It’s often very accurate. It can be slow: a simple implementation involves scanning the entire training dataset to make each prediction, because we’ve got to calculate the distance of the unknown test instance from all of the training instances to see which is closest. There are more sophisticated data structures that can make this faster, so you don’t need to scan the whole dataset every time. It assumes all attributes are equally important. If that isn’t the case, you might want to look at schemes for selecting or weighting attributes depending on their importance.
Skip to 6 minutes and 47 seconds If we’ve got noisy instances, then we can use a majority vote over the k nearest neighbors, or we might weight instances according to their prediction accuracy. Or we might try to identify reliable prototypes, one for each of the classes. This is a very old method. Statisticians have used k-nearest-neighbor since the 1950s. There’s an interesting theoretical result: if the number n of training instances approaches infinity, and k grows in such a way that k itself approaches infinity but k/n approaches 0, then the error of the k-nearest-neighbor method approaches the theoretical minimum error for that dataset. So there is a theoretical guarantee that with a huge dataset and large values of k, you’re going to get good results from nearest neighbor learning.
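The asymptotic result just mentioned can be stated compactly. This is a standard formulation of the classical consistency result; the notation (n for training-set size, k for neighbors) follows the transcript:

```latex
% Let k = k(n) depend on the training-set size n such that
%   k(n) \to \infty  and  k(n)/n \to 0  as  n \to \infty.
% Then the error rate of the k-nearest-neighbor classifier converges to the
% theoretical minimum achievable error for the problem:
\lim_{n \to \infty} \Pr\!\bigl[\text{$k(n)$-NN error}\bigr] \;=\; E^{*},
\qquad \text{where } k(n) \to \infty,\ \ \frac{k(n)}{n} \to 0,
```

where \(E^{*}\) denotes that minimum achievable error rate. Intuitively, k/n going to 0 keeps the neighborhoods local, while k going to infinity makes the vote within each neighborhood statistically reliable.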
“Nearest neighbor” (equivalently, “instance-based”) is a classic method that often yields good results. Just stash away the training instances. Be lazy! – do nothing until you have to make a prediction. Then, to classify a test instance, search the training set for the closest training instance and use its class. A similarity function is needed to measure “closeness”; you might want to normalize the attributes first; and you might want to use several neighbors and let them vote on the result.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.