Hello again! This is the last lesson in Class 2, and we’re going to get back to some actual document classification. In fact, we’re going to introduce a new classifier, Multinomial Naive Bayes, designed for document classification. I’d like you to recall the Naive Bayes classifier. We talk about the probability of the event H, that is, the probability of a particular class, given evidence E, that is, a particular set of attribute values for an instance. Naive Bayes starts with the prior probability of H, that is, the probability of the class before knowing anything about the instance. In the weather data there are 9 “play” instances and 5 “don’t play” instances, so the prior probability of “play” is 9/14.

Naive Bayes updates that prior with information about the instance, that is, the attribute values, to get the probability of H, the class, given the instance. The “naive” part is that it takes these attribute values, this evidence, and splits it into independent parts, one for each attribute, and multiplies these together. This is a good thing to do if the attributes really are independent. So E1 is the first attribute value, E2 is the second attribute value, and so on. That’s how Naive Bayes works. There are a few problems here for document classification. First of all, the non-appearance of a word counts just as much in Naive Bayes as the appearance of a word.
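
In symbols, this is the standard Naive Bayes decomposition (a sketch using the notation above, with the evidence E split into attribute values E1, …, En):

```latex
\Pr[H \mid E] \;=\;
  \frac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H]\cdots\Pr[E_n \mid H]\;\Pr[H]}{\Pr[E]}
```

Here Pr[H] is the prior (9/14 for “play” in the weather data), and writing the evidence term as a product over the individual E_i is exactly the independence, or “naive”, assumption.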

It makes intuitive sense that the class of a document is more determined by the words that are in it than the words that aren’t in it. Secondly, Naive Bayes doesn’t account for the fact that a word might occur multiple times in a document. A word that occurs many times probably should have a greater influence on the class of the document than a word that only appears once. Thirdly, it treats all words the same. The word “and” or “the” is treated the same as an unusual word like “weka” or “breakfast”, and that doesn’t sound reasonable, either. Multinomial Naive Bayes is an enhancement of Naive Bayes that solves these problems.

We take that complicated formula and replace it by the thing at the bottom. Just forget about those exclamation marks for the moment. This is basically a product over all the words in the document of p_i, that is, the probability of word i, to the power n_i, that is, the number of times that word appears in the document. It’s like treating each word appearance as an independent event and multiplying them all together. And those n-factorials are just a technicality that accounts for the possibility of different word orderings. That’s the theory; you don’t have to understand that. It’s very easy to use Multinomial Naive Bayes in Weka. This is what we’re going to do. I’m going to open a training set.
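
As a sketch of the formula on the slide: for a document E containing word i a total of n_i times, with p_i the probability of word i under class H, the multinomial model gives

```latex
\Pr[E \mid H] \;=\; N!\,\prod_{i} \frac{p_i^{\,n_i}}{n_i!},
\qquad N = \sum_{i} n_i .
```

The factorial terms are the “technicality” mentioned above: they account for the number of different orderings in which the same bag of words could appear.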

We’re going to use ReutersGrain, which is like the “corn” dataset we used previously, only here the documents are classified by whether they relate to grain. I’m going to open that training file. Then I’m going to use a Supplied test set, that is, the corresponding test file. Then I’m going to use J48. When I try to choose J48, well, it’s grayed out.

We know why it’s grayed out: it’s grayed out because the training file contains a string attribute, and J48 can’t deal with string attributes. We know that what we’re supposed to do here is to use the FilteredClassifier, which is here. Configure that to have J48 as the classifier, which is the default; and for the filter we’re going to choose the unsupervised attribute filter called StringToWordVector. There it is. Let me just run that. Here I get 96% accuracy, but if I look at the accuracy in the minority class, the one that we’re most interested in, the “grain” class, the accuracy is not very good. I get 38 correct out of a total of 57 (19 + 38).
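
For reference, here is a minimal sketch of the same experiment using Weka’s Java API instead of the Explorer. The file names, and the assumptions that the class is the last attribute and that class index 1 is the grain-related class, are mine rather than from the video.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class GrainJ48 {
    public static void main(String[] args) throws Exception {
        // Load the training set and the supplied test set (file names assumed)
        Instances train = new DataSource("ReutersGrain-train.arff").getDataSet();
        Instances test = new DataSource("ReutersGrain-test.arff").getDataSet();
        // Assume the class is the last attribute
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // FilteredClassifier applies StringToWordVector before J48 sees the data,
        // so the string attribute is turned into word attributes
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new J48());
        fc.buildClassifier(train);

        // Evaluate on the supplied test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
        // Class index 1 is assumed to be the grain-related (minority) class
        System.out.println("ROC area: " + eval.areaUnderROC(1));
    }
}
```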

That’s not very good accuracy at all. We know now that I should be looking at the ROC area, which is 0.906.

Going back to the slide: I’ve summarized that information. I could run NaiveBayes; I won’t do that, but let me just tell you that I would get quite a bit worse classification accuracy but a better success rate on the grain-related documents, 46/57, and a slightly worse ROC Area (0.885). I’m going to run Multinomial Naive Bayes. I’m going to go back to my FilteredClassifier and configure it to choose NaiveBayesMultinomial.

Run that. It’s very quick. I don’t get very good classification accuracy, but I get rather a good ROC area, and not a bad accuracy on the minority class, 52 out of 57. That’s not too bad, a definite improvement in terms of ROC Area and minority class accuracy over J48. I can actually mess around with some of the parameters in the StringToWordVector filter. There are a lot of parameters here, and they’re very useful. One of the parameters is to output word counts. By default, the filter outputs a 1 if the document contains that word and a 0 otherwise. But we can output the number of appearances of that word in the document, which is suitable for Multinomial Naive Bayes.

I’m going to do a few other things at the same time. I can change all the tokens, all the words, into lower case. I’m going to do that, so that it doesn’t matter whether a word is expressed in uppercase or lowercase, it’s going to count as the same word. Also, I’m going to use a “stoplist”. “Stop words” are those common words, like “and” and “the”, and there’s a standard stoplist for English. If I set this to True, then it’s going to disregard common words, words on the stoplist in Weka. Let me run that again and see what I get. Here I get a slightly better accuracy, a pretty good accuracy actually.
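
The same configuration through the Java API might look like the following sketch. Note that, as the note at the end of this page explains, recent Weka versions replace the old “stopList” flag with a “stopwordsHandler”, so the Rainbow stoplist is set that way here; file names and the class-attribute position are again assumptions.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stopwords.Rainbow;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class GrainMultinomialNB {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("ReutersGrain-train.arff").getDataSet();
        Instances test = new DataSource("ReutersGrain-test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // StringToWordVector options discussed above: word counts instead of 0/1,
        // lower-case all tokens, and drop common words using the Rainbow stoplist
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setOutputWordCounts(true);
        s2wv.setLowerCaseTokens(true);
        s2wv.setStopwordsHandler(new Rainbow());

        // Multinomial Naive Bayes as the base classifier inside FilteredClassifier
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(s2wv);
        fc.setClassifier(new NaiveBayesMultinomial());
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}
```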

I get a much better ROC Area, and I get phenomenal accuracy on the minority class: just 1 error out of 57 here.

Going back to my slide: with J48 I got really good classification accuracy; now I’m not quite at the same level with NaiveBayesMultinomial. When I first did NaiveBayesMultinomial, it wasn’t too bad, but then when I set outputWordCounts, well, it got slightly worse, actually. I got a worse ROC Area, which is a little bit surprising; better accuracy on the minority class, 54 out of 57. Then when I set lowerCaseTokens and the stoplist as well, I got very good accuracy on the minority class, and a very good ROC Area of 0.978. That’s it. Multinomial Naive Bayes is a machine learning method that’s designed for use with text. It takes into account word appearance, rather than word non-appearance.

It accounts for multiple repetitions of a word, and it treats common words differently from unusual ones by looking at the frequency with which they appear in the document collection. It’s actually a lot faster in Weka than plain Naive Bayes. For one thing, it ignores words that don’t appear in a document, and when you think about it, most words don’t appear in any given document! Internally, Weka uses what’s called a “sparse representation” of the data. The StringToWordVector filter has many interesting options, and we looked at some of them; it actually outputs its results in sparse format, which Multinomial Naive Bayes takes advantage of.
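
To illustrate, here is a hypothetical fragment of data in Weka’s sparse ARFF format: each instance lists only “attribute-index value” pairs, and any attribute not mentioned is taken to be 0, which is exactly the situation for the many words that don’t appear in a given document. The indices and class labels below are made up for illustration.

```
@data
% sparse instances: only non-zero attributes are listed as "index value" pairs
{3 2, 17 1, 245 1, 1000 grain}
{12 1, 52 3, 1000 not-grain}
```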

Multinomial Naive Bayes

Naive Bayes has three flaws when applied to document classification. First, a word’s non-appearance counts just as much as its appearance, whereas surely a document’s class is determined by the words that are in it rather than those that aren’t? Second, Naive Bayes doesn’t take account of the number of appearances of a word, whereas surely frequently occurring words should have a greater influence on the class than ones that only appear once? Third, it treats all words the same, whereas surely unusual words like “weka” and “breakfast” should count more than common ones like “and” and “the”? Multinomial Naive Bayes is a classification method that solves these problems and is generally better and faster than plain Naive Bayes.

(Note: Ian sets “stopList” to “True” in this video. In the version of Weka you are using you should set “stopwordsHandler” to “Rainbow”.)

This video is from the free online course “More Data Mining with Weka”, The University of Waikato.