This content is taken from The University of Waikato's online course, More Data Mining with Weka.

Skip to 0 minutes and 11 seconds Hello again! And now, as they say, for something completely different. The second half of this class is about document classification, this lesson and the next two. And the only thing it has to do with the first half of the class is that both use the Filtered Classifier. Let’s look at some documents. Here are 6 documents. They are very short documents (we’ll look at a much larger example in a minute), just a single sentence each, and they’re classified into “yes” and “no” classes. You can see when you read these that they are all about oil.

Skip to 0 minutes and 41 seconds The “yes” documents are about oil coming from the ground, and the “no” documents are about oil as used in cooking, “the food was very oily,” for example. We code this training set into ARFF in the standard way, with string attributes. For string attributes we just take the text and surround it by quotes, just as I’ve shown in the bottom here. I’ve loaded this dataset into Weka. We can just have a look at it here. There it is, just what you saw on the slide. And of course we can’t do anything with this at the moment. There are 6 distinct values for the text attribute, and no learning system can learn anything from these 6 different values.
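As a concrete illustration, a string-attribute ARFF file for this toy dataset might look roughly like the fragment below. Only the "oily" sentence is quoted from the lesson; the relation name, attribute names, and the first document text are illustrative assumptions.

```
@relation oil-documents

@attribute text string
@attribute type {yes, no}

@data
'Crude oil flows from wells in the ground', yes
'The food was very oily', no
```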

Skip to 1 minute and 23 seconds What we’re going to do is use a filter, the StringToWordVector filter – Unsupervised, Attribute, StringToWordVector – which is here. It’s got a bunch of options, but let’s just apply it.

Skip to 1 minute and 40 seconds Wow! Look at all these attributes. We’ve got 34 attributes. They’re words like “Crude” and “Demand” and “The”. When you look at it, these are just the words that appeared in the training documents. Actually, the type, the “yes” or “no” thing, has been moved to the first attribute, not the last attribute. When we look at the individual word attributes, like the one for “crude”, it’s just

Skip to 2 minutes and 8 seconds a number, it’s a numeric attribute with two values, 0 or 1: 0 if it doesn’t appear in that document, and 1 if it does appear in that document. Let’s go and classify this. Let’s use J48.
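The transformation just described can be sketched in a few lines of illustrative Python. This is not Weka's implementation, and the sample sentences are made up (apart from the "oily" one), but it shows the same idea: one 0/1 numeric attribute per distinct word.

```python
# Illustrative sketch (not Weka's code) of what StringToWordVector
# produces in its default binary mode.
docs = ["crude oil flows from the ground",  # hypothetical "yes" sentence
        "the food was very oily"]           # "no" sentence from the lesson

# One attribute per distinct word, analogous to the word
# attributes Weka created.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector: 1 if the word appears in it, else 0.
vectors = [[1 if word in doc.split() else 0 for word in vocab]
           for doc in docs]
```

With this encoding, a learner such as J48 can branch on whether a particular word is present, exactly as the tree in the lesson does.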

Skip to 2 minutes and 26 seconds It’s in gray, actually. I can still select it, but I can’t start it. The reason why I can’t start it is that, by default, Weka is predicting the last attribute, and the last attribute is numeric, the word “was”. So I’ll just change this to predict the “type”. Then I can run J48, but there’s a problem evaluating it, because there are only 6 instances and we’re trying to do 10-fold cross-validation, which isn’t going to work. Let’s just evaluate this on the training set for the moment. The most useful thing to look at in the result here is the decision tree that’s produced, which is here. Let’s look at the tree. You can see that it tests on the single word “crude”.

Skip to 3 minutes and 10 seconds If “crude” does not appear, then it’s a “no” document – that is, it’s about food. If “crude” does appear, it’s a “yes” document – that is, it’s about oil coming out of the ground. It makes kind of sense; it’s a pretty trivial example, I guess. I’ll just go back to the slide. This is what we’ve done. We loaded the dataset into Weka. We looked at the string attributes. We applied this filter, which created a lot of new attributes, one for each word. They were binary (2-valued) numeric attributes. We used J48, had to set the class attribute, and evaluated on the training set. Then we looked at the tree. I want to evaluate this on a “Supplied test set”.

Skip to 3 minutes and 48 seconds I want to see what the predictions are on this test set. These are the documents in the test set. I’ve coded them as Unknown, that is, a question mark in the ARFF file. We’ve never done this before. We haven’t ever looked at predictions for individual test documents or test instances. Let me now go and get the Supplied test set, which I have here. Now I’ve got that test set. I can start this running. Well, it’s obvious really – there’s a problem evaluating the classifier, because, you know, when I look at the test documents, it’s an ARFF file with string attributes, and the training documents are an ARFF file with word attributes.
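For reference, the unknown-class encoding mentioned above is just a question mark in place of the class value. Only the "Iraq" sentence below is taken from the lesson; the header is an assumption that mirrors the training file.

```
@relation oil-documents-test

@attribute text string
@attribute type {yes, no}

@data
'Iraq has significant oil reserves', ?
```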

Skip to 4 minutes and 39 seconds Of course, I can take these test documents and convert them using the StringToWordVector filter, but that still wouldn’t solve the problem, because I might have different words in a different order here, so I’d still have a different structure to the ARFF file. We’ve got to do something different. That’s where the FilteredClassifier comes in.
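The fix can be sketched in illustrative Python (again, not Weka's code; the training sentences are hypothetical). This is the idea behind the FilteredClassifier: the filter is built from the training set only and then applied, unchanged, to the test set, so both share one structure.

```python
# The vocabulary is derived from the training documents alone.
train_docs = ["crude oil flows from the ground",   # hypothetical "yes" text
              "the food was very oily"]            # "no" text from the lesson
test_docs = ["iraq has significant oil reserves"]  # test document from the lesson

vocab = sorted({w for d in train_docs for w in d.split()})

def to_vector(doc, vocab):
    """Map a document onto a fixed vocabulary; unseen words are dropped."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

# Test documents get exactly the same attributes as the training documents;
# words like "iraq" that never occurred in training are simply ignored.
test_vectors = [to_vector(d, vocab) for d in test_docs]
```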

Skip to 5 minutes and 1 second Just going back to the slide, there’s a problem evaluating the classifier: we can’t simply apply StringToWordVector to the test file. The solution is the FilteredClassifier. As we saw previously, the FilteredClassifier will create a filter from the training set and use it for the test set. That’s exactly what we’re going to do here. Coming back to Weka, I’m going to undo the effect of this filter, so I’ve got the original string attribute. I’m going to find the FilteredClassifier (meta>FilteredClassifier). I’m going to configure that to use J48 as the classifier, which is done by default,

Skip to 5 minutes and 41 seconds and I’m going to use the StringToWordVector filter: it’s an Unsupervised Attribute filter. Let me just run this.

Skip to 5 minutes and 59 seconds Here we get the result.

Skip to 6 minutes and 3 seconds That’s actually not very interesting, because these documents had question marks instead of classifications. What I wanted to do was output the predictions, and I can do that in the More options menu. If I click Output predictions and run it again, now I can see the predictions for the test instances. As you can see, there’s 1 “yes” and 3 “no” predictions. The actual class is a question mark in each case. Coming back to the slide. That’s not exactly what I wanted. The first instance is certainly “yes”, oil coming out of the ground, but so is the third. That should have been a “yes”, and, in fact, J48 has predicted a “no” for that document “Iraq has significant oil reserves”.

Skip to 6 minutes and 51 seconds Obviously, it doesn’t contain the word “crude”, which is the test that J48 is doing. Well, these are tiny little documents. Let’s look at something a bit more substantial. I’m going to take a big dataset, ReutersCorn-train.arff. Let’s just look at it in a minute. I’m going to open it now.

Skip to 7 minutes and 14 seconds There are 1,554 documents. This is a lot bigger. If I apply the StringToWordVector filter, then – it just takes a second – I get a lot of attributes, corresponding to words. Actually, there are 2,234 attributes. Again, the class attribute has been moved to the top, attribute number 1. I’m going to undo the effect of this [filter], because we’re going to classify this using the FilteredClassifier. I’m going to set a different test set. I’m going to open ReutersCorn-test.arff. Then I’m going to run this with J48. The FilteredClassifier. It’s just going to take a second. It’s finished now. I get 97% accuracy. Before we go on, let’s have a look at what this dataset looks like.

Skip to 8 minutes and 16 seconds I’m going to open up the file, the training file. Here it is.

Skip to 8 minutes and 22 seconds There are two attributes: a string attribute and a class attribute which is 0 or 1. Here’s the beginning of the first string, and it’s a long string. In fact, this open quote, right down to the closing quote here, this whole bit of text is one string attribute value. It’s followed by a 0, which means the classification of that document is 0. For this dataset, that means it’s not about corn. You can see this is regular text except for these “\n”s, which are newlines. If we just had a regular newline in a string, then Weka would get confused when you tried to load in that ARFF file, because it would think that the continuation of the line was the next instance.

Skip to 9 minutes and 10 seconds So we just encode newlines as “\n”. This is one instance, classified as 0. The next thing starts with a quote – this is the string – and it ends here. That’s a 1 document; this document is about corn. That doesn’t necessarily mean it just contains the word “corn”, it means that a human has decided whether this document is about corn or not about corn. I don’t know a lot about corn, but an expert will have made that decision. These are the documents, and, like I said, there are 1,554 of them. Each instance contains this extensive string. If I now go back and have a look, well, I’ve got really high accuracy, 97%, which sounds really good.
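So an instance in the training file looks roughly like the fragment below, with the embedded newline escaped as "\n". The text here is invented for illustration; the real documents are much longer.

```
@data
'CORN FUTURES RISE\nChicago - Corn futures rose sharply today.', 1
```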

Skip to 9 minutes and 54 seconds Unfortunately, though, when I look at this, the documents that are about corn, the “1” documents – there’s only 24 of them – and the accuracy there is 15 correct out of 24, which is not so good. For the “0” documents, the ones which aren’t about corn, then I’ve got 573 correct out of 580, which is very good. When I combine those two, that’s what gives me this rather high-looking 97% accuracy. When I look at the tree – here it is – it’s a little bit more complicated. We’re going to branch on the word “corn”. If the document contains the word “corn”, then we’re going to look for the word “planted”. If it contains the word “planted”, then it’s a “0”.

Skip to 10 minutes and 43 seconds If it doesn’t contain the word “planted”, then it’s a “1”, that is, it’s about corn. Down here, we’re looking for the word “1986/87”, which is a very strange thing to be looking for. We’re looking for the word “maize”. Here we’re looking for the word “the”. This tree doesn’t look like it makes a huge amount of sense. And yet it does get 97% accuracy. This is what we’ve done here. We looked at this dataset. We applied the StringToWordVector filter. We just had a look, and we found that there were 2,234 attributes. Then we used the FilteredClassifier to get 97% classification accuracy, but we discovered that the accuracy on the 24 corn-related documents was only 62%.
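The arithmetic behind those figures is worth spelling out. A quick sketch using the per-class counts reported above shows how the large majority class dominates the overall accuracy:

```python
# Per-class counts reported in the lesson (ReutersCorn test set).
correct_corn, total_corn = 15, 24        # documents about corn ("1")
correct_other, total_other = 573, 580    # documents not about corn ("0")

overall_acc = (correct_corn + correct_other) / (total_corn + total_other)
corn_acc = correct_corn / total_corn

print(f"overall accuracy: {overall_acc:.1%}")           # roughly 97%
print(f"accuracy on corn documents: {corn_acc:.1%}")    # roughly 62%
```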

Skip to 11 minutes and 28 seconds That’s a shame, because those are probably the documents we’re most interested in. These are the ones that aren’t about corn, and we get very high accuracy on those. Which makes you wonder whether the overall classification accuracy is really the right thing to optimize. This is what we’ve done in this lesson. We looked at string attributes. We looked at the StringToWordVector filter, which creates one attribute for each different word. We looked at the options for the StringToWordVector – no, we didn’t look at the options. Let’s have a really quick look back in Weka here at the options for the StringToWordVector filter.

Skip to 12 minutes and 2 seconds Suffice it to say, there are a lot of options: it’s a pretty comprehensive kind of filter. We’ll look at those options in a subsequent lesson. We looked at J48 models for text data. J48 is not necessarily a very sensible learning scheme to use on text data. Then we looked at the overall classification accuracy. Is it really what we care about? Perhaps not. That’s what we’re going to look at in the next lesson.

Document classification

A document classification problem can be represented in the ARFF format with two attributes per instance, the document text in a “string” attribute and the document class as a nominal attribute. But what can Weka do with a string? Every value – every document’s text – is different from all the others. To accomplish something meaningful, you need to look inside the string. And that’s exactly what the StringToWordVector filter does: it creates an attribute for each word. But – hey! – test documents will contain different words, some of which aren’t in any of the training documents. This conundrum is solved by using the FilteredClassifier, which you’ve already encountered. Cool!
