Hi! Now, we’re going to actually build a classifier. We’re going to use a system called J48 (I’ll tell you why it’s called J48 in a minute) to analyze the “glass” dataset that we looked at in the last lesson. I’ve got the glass dataset open here. I’m going to go to the Classify panel. I choose a classifier here. There are different kinds of classifiers. Weka has bayes classifiers, functions classifiers, lazy classifiers, meta classifiers, and so on. We’re going to use a tree classifier. J48 is a tree classifier. I’m going to open trees and click J48. Here is the J48 classifier. Let’s run it. If we just press Start, we’ve got the dataset, we’ve got the classifier, and lo and behold, it’s done it.
It’s a bit of an anticlimax, really. Weka makes things very easy for you to do. The problem is understanding what it is that you have done. Let’s take a look. Here is some information about the dataset, the glass dataset: the number of instances and attributes. Then it’s printed out a representation of a tree here. We’ll look at these trees later on, but just note that this tree has 30 leaves and 59 nodes altogether. The overall accuracy is 66.8%. So it’s done pretty well. Down at the bottom, we’ve got a confusion matrix. Remember there were about seven different kinds of glass. This is “building windows made of float glass”.
You can see that 50 of these have been classified as ‘a’, which is correctly classified. 15 of them have been classified as ‘b’, which is “building windows non-float glass”, so those are errors, and 3 have been classified as ‘c’, and so on. This is a confusion matrix. Most of the weight is down the main diagonal, which we like to see because that indicates correct classifications. Everything off the main diagonal indicates a misclassification. That’s the confusion matrix. Let’s investigate this a bit further. We’re going to open a configuration panel for J48. Remember I chose it by clicking the Choose button. Now, if I click it here, I get a configuration panel.
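To make the confusion matrix reading concrete, here is a minimal sketch in Python of how the overall accuracy falls out of the main diagonal. The matrix is hypothetical: only the first row (50, 15, 3) echoes the numbers read out above, the other rows are made up for illustration, and the function names are assumptions rather than anything Weka provides.

```python
# Hypothetical 3-class confusion matrix.
# Rows are the actual classes, columns are the predicted classes.
# Only the first row (50, 15, 3) comes from the walkthrough; the rest is invented.
confusion = [
    [50, 15, 3],   # actual 'a'
    [16, 47, 5],   # actual 'b' (made-up counts)
    [5, 5, 7],     # actual 'c' (made-up counts)
]


def accuracy(matrix):
    """Fraction of instances on the main diagonal (correct classifications)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """For each actual class, the fraction of its instances classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


if __name__ == "__main__":
    print(f"overall accuracy: {accuracy(confusion):.3f}")
    for label, recall in zip("abc", per_class_recall(confusion)):
        print(f"class {label}: {recall:.3f} of its instances correctly classified")
```

Everything off the diagonal only enters the total, which is why a matrix with most of its weight on the diagonal gives a high accuracy.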
I clicked J48 in this menu, and I get a configuration panel, which gives a bunch of parameters. I’m not really going to talk about these parameters. Let’s just look at one of them, the “unpruned” parameter, which by default is False. What we’ve just done is to build a pruned tree, because “unpruned” is False. We can change this to True and build an unpruned tree. We’ve changed the configuration. We can run it again. It just ran again, and now we have a potentially different result. Let’s just have a look. We have 67% correct classification. What did we have before? These are the runs. This is the previous run, and there we had 66.8%.
Now, in this run that we’ve just done with the unpruned tree, we got 67% accuracy, and the tree is the same size. That’s one option. I’m just going to look at another option, and then we’ll look at some trees. I’m going to click the configuration panel again, and I’m going to change the “minNumObj” parameter. What is that? That is the minimum number of instances per leaf. I’m going to change that from 2 up to 15 to have larger leaves. These are the leaves of the tree here, and these numbers in brackets are the number of instances that get to that leaf.
When there are two numbers, this means that one incorrectly classified instance got to this leaf and five correctly classified instances got there. You can see that all of these leaves are pretty small, with sometimes just two or three instances, although here is one with 31 instances. We’ve now constrained this number: when the tree is generated, it is always going to be 15 or more. Let’s run it again. Now we’ve got a worse result, 61% correct classification, but a much smaller tree, with only 8 leaves. Now, we can visualize this tree.
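The bracketed numbers on each leaf can be read mechanically. The sketch below, in Python, assumes the usual J48 text layout in which a leaf line ends in "(reached)" or "(reached/misclassified)", so "(6.0/1.0)" means six instances reached the leaf and one of them was misclassified. The leaf lines, attribute thresholds, and function names are hypothetical, chosen only to illustrate the notation and the effect of asking for larger leaves.

```python
import re

# Matches the count suffix at the end of a leaf line, e.g. "(6.0/1.0)" or "(31.0)".
LEAF_COUNTS = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def leaf_counts(line):
    """Return (reached, misclassified) for a leaf line, or None if it has no counts."""
    match = LEAF_COUNTS.search(line)
    if match is None:
        return None
    reached = float(match.group(1))
    misclassified = float(match.group(2)) if match.group(2) else 0.0
    return reached, misclassified


def small_leaves(lines, min_instances):
    """Return the leaf lines that fewer than min_instances instances reach."""
    result = []
    for line in lines:
        counts = leaf_counts(line)
        if counts is not None and counts[0] < min_instances:
            result.append(line)
    return result


# Hypothetical leaf lines (thresholds invented for illustration).
EXAMPLE_LINES = [
    "|   K <= 0.03: tableware (6.0/1.0)",
    "Ba > 0.27: headlamps (31.0)",
    "|   Mg > 2.41: vehic wind float (2.0)",
]

if __name__ == "__main__":
    for line in EXAMPLE_LINES:
        reached, wrong = leaf_counts(line)
        print(f"{line!r}: {reached - wrong:g} correct, {wrong:g} incorrect")
    print("leaves smaller than 15:", small_leaves(EXAMPLE_LINES, 15))
```

With the threshold set to 15, two of the three example leaves would count as too small, which is the kind of leaf the larger "minNumObj" setting is meant to avoid.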
If I right-click on the line (these are the lines that describe each of the runs that we’ve done, and this is the third run), I get a little menu, and I can visualize the tree. There it is. If I right-click on empty space, I can fit this to the screen. This is the decision tree. This says first look at the Barium (Ba) content. If it’s large, then it must be headlamps. If it’s small, then look at Magnesium (Mg). If that’s small, then let’s look at potassium (K), and if that’s small, then we’ve got tableware. That sounds like a pretty good thing to me; I don’t want too much potassium in my tableware.
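Reading a decision tree from the root down is the same as following nested if-statements. Here is a minimal sketch in Python of just the path described above. The numeric thresholds are hypothetical (the walkthrough only says "large" and "small"), and the branches that were not read out return None rather than a guessed class.

```python
# Hypothetical thresholds: the walkthrough does not give the actual split values.
BA_THRESHOLD = 0.27
MG_THRESHOLD = 2.4
K_THRESHOLD = 0.03


def classify(sample):
    """Follow the described path: Ba first, then Mg, then K.

    `sample` is a dict of attribute name -> value. Returns a class label,
    or None for branches the walkthrough did not describe.
    """
    if sample["Ba"] > BA_THRESHOLD:
        return "headlamps"          # large barium content
    if sample["Mg"] <= MG_THRESHOLD:
        if sample["K"] <= K_THRESHOLD:
            return "tableware"      # small Ba, small Mg, small K
    return None                     # remaining branches are not described


if __name__ == "__main__":
    print(classify({"Ba": 1.0, "Mg": 0.0, "K": 0.0}))
    print(classify({"Ba": 0.0, "Mg": 1.0, "K": 0.0}))
    print(classify({"Ba": 0.0, "Mg": 3.5, "K": 0.5}))
```

Each test on an attribute corresponds to an internal node of the tree, and each returned label corresponds to a leaf.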
This is a visualization of the tree, and it’s the same tree that you can see by looking here. This is a different representation of the same tree. I’ll just show you one more thing about this configuration panel, the “More” button. This gives you more information about the classifier, about J48. It’s always useful to look at that to see where these classifiers have come from. In this case, let me explain why it’s called J48. It’s based on a famous system called C4.5, which was described in a book. The book is referenced here. In fact, I think I’ve got it on my shelf here.
This book here, “C4.5: Programs for Machine Learning”, is by an Australian computer scientist called Ross Quinlan. He started out with a system called ID3 (I think that might have been in his PhD thesis), which morphed through various versions into C4.5. That became quite famous; the book came out, and so on. He continued to work on this system. It went up to C4.8, and then he went commercial. Up until then, these were all open source systems. When we built Weka, we took the latest version of C4.5, which was C4.8, and we rewrote it. Weka’s written in Java, so we called it J48. Maybe it’s not a very good name, but that’s the name that stuck.
There’s a little bit of history for you. We’ve talked about classifiers in Weka. I’ve shown you where you find the classifiers. We classified the “glass” dataset. We looked at how to interpret the output from J48, in particular the confusion matrix. We looked at the configuration panel for J48.
We looked at a couple of options: pruned versus unpruned trees and the option to avoid small leaves. I told you how J48 really corresponds to the machine learning system that most people know as C4.5. C4.5 and C4.8 were really pretty similar, so we just talk about J48 as if it’s synonymous with C4.5. See you again soon!
Building a classifier
© University of Waikato, New Zealand. CC Creative Commons Attribution 4.0 International License.