Multinomial Naive Bayes

Multinomial Naive Bayes is a classification method designed for text, and is generally better and faster than plain Naive Bayes, as Ian Witten shows.
10.9
Hello again! This is the last lesson in Class 2, and we’re going to get back to some actual document classification. In fact, we’re going to introduce a new classifier, Multinomial Naive Bayes, designed for document classification. I’d like you to recall the Naive Bayes classifier. We talk about the probability of the event H, that is, the probability of a particular class, given evidence E, that is, a particular set of attribute values for an instance. Naive Bayes starts from the prior probability of H, which is the probability of the class without knowing anything about the instance. In the weather data there are 9 “play” instances and 5 “don’t play” instances, so the prior probability of “play” is 9/14.
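As a rough illustration, the prior and the way Naive Bayes then updates it can be sketched in a few lines of Python. The class counts (9 and 5) come from the lesson; the per-attribute likelihoods are not given there and are filled in here purely as assumptions for one hypothetical instance.

```python
from fractions import Fraction

# Class counts quoted in the lesson: 9 "play" and 5 "don't play" instances.
class_counts = {"play": 9, "dont_play": 5}
total = sum(class_counts.values())

# Prior probability of each class, before looking at the instance.
prior = {c: Fraction(n, total) for c, n in class_counts.items()}

# ASSUMPTION: illustrative values of P(E_i | H) for the four attribute
# values of one hypothetical instance. They are not taken from the lesson.
likelihoods = {
    "play": [Fraction(2, 9), Fraction(3, 9), Fraction(3, 9), Fraction(3, 9)],
    "dont_play": [Fraction(3, 5), Fraction(1, 5), Fraction(4, 5), Fraction(3, 5)],
}


def posterior(prior, likelihoods):
    """Multiply the prior by each per-attribute likelihood, then normalize."""
    scores = {}
    for c in prior:
        score = prior[c]
        for p in likelihoods[c]:
            score *= p  # the "naive" step: treat each attribute separately
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


post = posterior(prior, likelihoods)
print("prior:", {c: str(p) for c, p in prior.items()})
print("posterior:", {c: round(float(p), 3) for c, p in post.items()})
```

With these assumed likelihoods the evidence pulls the probability of “play” well below its prior of 9/14.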
54.5
Naive Bayes updates that with information about the instance, that is, the attribute values, to get the probability of H, the class, given the instance. The “naive” part is that it takes these attribute values, this evidence, and splits it into independent parts, one for each attribute, and multiplies these together. This is exactly the right thing to do if the attributes really are independent given the class. So E1 is the first attribute value, E2 is the second attribute value, and so on. That’s how Naive Bayes works. There are a couple of problems here for document classification. First of all, the non-appearance of a word counts just as much in Naive Bayes as the appearance of a word.
97.3
It makes intuitive sense that the class of a document is more determined by the words that are in it than the words that aren’t in it. Secondly, Naive Bayes doesn’t account for the fact that a word might occur multiple times in a document. A word that occurs lots probably should have a greater influence on the class of the document than a word that only appears once. Thirdly, it treats all words the same. The word “and” or “the” is treated the same as an unusual word like “weka” or “breakfast”, and that doesn’t sound reasonable, either. Multinomial Naive Bayes is an enhancement of Naive Bayes that solves these problems.
135.9
We take that complicated formula and replace it by the thing at the bottom. Just forget about those exclamation marks (the factorials) for the moment. This is basically a product over all the words in the document of p_i, that is the probability of word i, to the power n_i, that is the number of times that word appears in that document. It’s like treating each word appearance as an independent event and multiplying them all together. And those n-factorials are just a technicality that accounts for the possibility of different word orderings. That’s the theory; you don’t have to understand that. It’s very easy to use Multinomial Naive Bayes in Weka. This is what we’re going to do. I’m going to open a training set.
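A minimal sketch of that product, assuming the standard multinomial form N! × Π (p_i^n_i / n_i!), where N is the total number of word occurrences in the document. The word probabilities and the tiny document below are hypothetical, chosen only to make the arithmetic easy to follow. The computation is done in log space, which is a common way to avoid underflow on long documents.

```python
import math


def log_multinomial_likelihood(word_counts, word_probs):
    """log of N! * prod_i(p_i ** n_i / n_i!) for the words in one document."""
    n_total = sum(word_counts.values())
    log_l = math.lgamma(n_total + 1)  # log(N!)
    for word, n in word_counts.items():
        log_l += n * math.log(word_probs[word]) - math.lgamma(n + 1)
    return log_l


def log_product_only(word_counts, word_probs):
    """Same quantity without the factorial terms: sum_i n_i * log(p_i)."""
    return sum(n * math.log(word_probs[w]) for w, n in word_counts.items())


# ASSUMPTION: hypothetical per-class word probabilities (each sums to 1).
probs_grain = {"wheat": 0.5, "price": 0.3, "bank": 0.2}
probs_other = {"wheat": 0.1, "price": 0.3, "bank": 0.6}

# A document in which "wheat" appears twice and "price" once.
doc = {"wheat": 2, "price": 1}

full_diff = (log_multinomial_likelihood(doc, probs_grain)
             - log_multinomial_likelihood(doc, probs_other))
product_diff = (log_product_only(doc, probs_grain)
                - log_product_only(doc, probs_other))
print(full_diff, product_diff)
```

The factorial terms depend only on the document, not on the class, so they cancel when two classes are compared. That is why they can be treated as a technicality: both differences printed above are the same.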
181
We’re going to use ReutersGrain, which is like the “corn” dataset we used previously, only it’s about documents that are about grain. I’m going to open that training file. Then I’m going to use a Supplied test set, that is, the corresponding test file. Then I’m going to use J48. When I try to choose J48, well, it’s grayed out.
214.9
We know why it’s grayed out: it’s grayed out because the training file contains a string attribute, and J48 can’t deal with string attributes. We know that what we’re supposed to do here is to use the FilteredClassifier, which is here. Configure that to have J48 as the classifier, which is the default; and for the filter we’re going to choose the unsupervised attribute filter called StringToWordVector. There it is. Let me just run that. Here I get 96% accuracy, but if I look at the accuracy in the minority class, the one that we’re most interested in, the “grain” class, the accuracy is not very good. I get 38 correct out of a total of 57 (19 + 38).
266.1
That’s not very good accuracy at all. We know now that I should be looking at the ROC area, which is 0.906.
275.5
Going back to the slide: I’ve summarized that information. I could run NaiveBayes; I won’t do that, but let me just tell you that I would get quite a bit worse classification accuracy but a better success rate on the grain-related documents, 46/57, and a slightly worse ROC Area (0.885). I’m going to run Multinomial Naive Bayes. I’m going to go back to my FilteredClassifier and configure it to choose NaiveBayesMultinomial.
314.5
Run that. It’s very quick. I don’t get very good classification accuracy, but I get rather a good ROC area, and not a bad accuracy on the minority class, 52 out of 57. That’s not too bad, a definite improvement in terms of ROC Area and minority class accuracy over J48. I can actually mess around with some of the parameters in the StringToWordVector filter. There are a lot of parameters here, and they’re very useful. One of the parameters is to output word counts. By default, the filter outputs a 1 if the document contains that word and a 0 otherwise. But we can output the number of appearances of that word in the document, which is suitable for Multinomial Naive Bayes.
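To make the difference between presence flags and word counts concrete, here is a small Python sketch. It is not the Weka filter: the function, its parameter names (loosely modelled on the option names mentioned in the lesson), the tokenizer and the tiny stoplist are all assumptions made for illustration. The lower-casing and stop-word options discussed next are included as parameters so that their effect can be seen too.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny illustrative stoplist, not the one shipped with Weka.
STOPWORDS = {"and", "the", "a", "of", "is"}


def word_vector(text, output_word_counts=False, lower_case=False, stopwords=None):
    """Turn a string into a {word: value} mapping.

    By default the value is 1 for every word present (presence only).
    With output_word_counts=True the value is the number of occurrences.
    """
    tokens = re.findall(r"[A-Za-z]+", text)  # simplistic tokenizer
    if lower_case:
        tokens = [t.lower() for t in tokens]
    if stopwords:
        # Stop words are matched case-insensitively in this sketch.
        tokens = [t for t in tokens if t.lower() not in stopwords]
    counts = Counter(tokens)
    if output_word_counts:
        return dict(counts)
    return {word: 1 for word in counts}


doc = "Wheat and corn: the wheat price and the Wheat harvest"

binary = word_vector(doc)
counted = word_vector(doc, output_word_counts=True)
tuned = word_vector(doc, output_word_counts=True, lower_case=True,
                    stopwords=STOPWORDS)
print(binary)
print(counted)
print(tuned)
```

In the default output “Wheat” and “wheat” are separate attributes and “and” looks as important as “corn”. With counts, lower-casing and the stoplist, the three occurrences of “wheat” are merged into a single attribute with value 3.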
369.6
I’m going to do a few other things at the same time. I can change all the tokens, all the words, into lower case. I’m going to do that, so that it doesn’t matter whether a word is expressed in uppercase or lowercase, it’s going to count as the same word. Also, I’m going to use a “stoplist”. “Stop words” are those common words, like “and” and “the”, and there’s a standard stoplist for English. If I set this to True, then it’s going to disregard common words, words on the stoplist in Weka. Let me run that again and see what I get. Here I get a slightly better accuracy, a pretty good accuracy actually.
416.1
I get a much better ROC Area, and I get phenomenal accuracy on the minority class: just 1 error out of 57 here.
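The minority-class figures quoted so far can be put side by side with a few lines of Python. This sketch only does the arithmetic on numbers mentioned in the lesson (correct “grain” documents out of 57). Calling the ratio “recall” is a labelling choice made here, not a term used in the video.

```python
GRAIN_TOTAL = 57  # number of "grain" documents in the test set

# Correctly classified "grain" documents, as quoted in the lesson.
# "1 error out of 57" is recorded as 56 correct.
correct_on_grain = {
    "J48": 38,
    "NaiveBayes": 46,
    "NaiveBayesMultinomial (default filter)": 52,
    "NaiveBayesMultinomial (counts, lower case, stoplist)": 56,
}

# Fraction of the minority class that each configuration gets right.
recall = {name: n / GRAIN_TOTAL for name, n in correct_on_grain.items()}

for name, value in recall.items():
    print(f"{name}: {correct_on_grain[name]}/{GRAIN_TOTAL} = {value:.1%}")
```

J48 has the highest overall accuracy but finds only about two thirds of the “grain” documents, which is why overall accuracy alone is a poor guide here.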
425.5
Going back to my slide: with J48 I got really good classification accuracy; now I’m not quite at the same level with NaiveBayesMultinomial. When I first did NaiveBayesMultinomial, it wasn’t too bad, but then when I set outputWordCounts, well, it got slightly worse, actually. I got a worse ROC Area, which is a little bit surprising; better accuracy on the minority class, 54 out of 57. Then when I set lowerCaseTokens and the stoplist as well, I got very good accuracy on the minority class, and a very good ROC Area of 0.978. That’s it. Multinomial Naive Bayes is a machine learning method that’s designed for use with text. It takes into account word appearance, rather than word non-appearance.
477.8
It accounts for multiple repetitions of a word, and it treats common words differently from unusual ones by looking at the frequency with which they appear in the document collection. It’s actually a lot faster in Weka than plain Naive Bayes. For one thing, it ignores words that don’t appear in a document – when you think about it, most words don’t appear in a document! Internally, Weka uses what’s called a “sparse representation” of the data: the StringToWordVector filter outputs its results in sparse format, and Multinomial Naive Bayes takes advantage of that. The StringToWordVector filter has many interesting options; we looked at some of those.
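The speed argument can be illustrated with a short Python sketch. It is not Weka’s implementation: the vocabulary, the word probabilities and the class priors are hypothetical, and a plain dictionary stands in for a sparse instance. The point is only that words with a count of zero contribute nothing to the multinomial score, so a sparse document can be scored by visiting just the words it contains.

```python
import math

# ASSUMPTION: a tiny vocabulary with hypothetical per-class word
# probabilities (each list sums to 1) and hypothetical class priors.
vocabulary = ["wheat", "corn", "price", "bank", "oil", "harvest"]
log_word_probs = {
    "grain": dict(zip(vocabulary, map(math.log, [0.30, 0.20, 0.15, 0.05, 0.05, 0.25]))),
    "other": dict(zip(vocabulary, map(math.log, [0.05, 0.05, 0.30, 0.30, 0.25, 0.05]))),
}
log_priors = {"grain": math.log(0.1), "other": math.log(0.9)}


def dense_score(dense_doc, cls):
    """Visit every word in the vocabulary, including those with count 0."""
    return log_priors[cls] + sum(
        n * log_word_probs[cls][w] for w, n in zip(vocabulary, dense_doc)
    )


def sparse_score(sparse_doc, cls):
    """Visit only the words that actually occur in the document."""
    return log_priors[cls] + sum(
        n * log_word_probs[cls][w] for w, n in sparse_doc.items()
    )


def classify(sparse_doc):
    """Pick the class with the highest log score."""
    return max(log_priors, key=lambda cls: sparse_score(sparse_doc, cls))


# The same document in both forms: "wheat" three times, "price" once.
dense_doc = [3, 0, 1, 0, 0, 0]
sparse_doc = {"wheat": 3, "price": 1}

print(classify(sparse_doc))
```

With a realistic vocabulary of thousands of words, the dense loop does thousands of multiplications by zero per document, while the sparse loop touches only the handful of words present.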

Naive Bayes has three flaws when applied to document classification. First, a word’s non-appearance counts just as much as its appearance, whereas surely a document’s class is determined by the words that are in it rather than those that aren’t? Second, Naive Bayes doesn’t take account of the number of appearances of a word, whereas surely frequently occurring words should have a greater influence on the class than ones that only appear once? Third, it treats all words the same, whereas surely unusual words like “weka” and “breakfast” should count more than common ones like “and” and “the”? Multinomial Naive Bayes is a classification method that solves these problems and is generally better and faster than plain Naive Bayes.

(Note: Ian sets “stopList” to “True” in this video. In the version of Weka you are using you should set “stopwordsHandler” to “Rainbow”.)

This article is from the free online

More Data Mining with Weka

Created by
FutureLearn - Learning For Life
