# Using probabilities

Bayes Theorem is a statistical result that underpins a simple classification method called “Naïve Bayes,” as Ian Witten explains.
10.5
Hi! This is the one bit of Data Mining with Weka where we’re going to see a little bit of mathematics, but don’t worry, I’ll take you through it gently. The OneR strategy that we’ve just been studying assumes that one of the attributes does all the work, that it takes the responsibility for the decision. That’s a simple strategy. Another simple strategy is the opposite, to assume all of the attributes contribute equally and independently to the decision. This is called the “Naive Bayes” method – I’ll explain the name later on.
42.5
There are two assumptions that underlie Naive Bayes: that the attributes are equally important; and that they are statistically independent, that is, knowing the value of one of the attributes doesn’t tell you anything about the value of any of the other attributes. This independence assumption is never actually correct, but the method based on it often works well in practice. There’s a theorem in probability called “Bayes Theorem” after this guy Thomas Bayes from the 18th century. It’s about the probability of a hypothesis H given evidence E. In our case, the hypothesis is the class of an instance and the evidence is the attribute values of the instance.
91.1
The theorem is that Pr[H | E] – the probability of the class given the instance, the hypothesis given the evidence – is equal to Pr[E | H] times Pr[H] divided by Pr[E]. Pr[H] by itself is called the prior probability of the hypothesis H. That’s the probability of the event before any evidence is seen. That’s really the baseline probability of the event. For example, in the weather data, I think there are 9 yes’s and 5 no’s, so the baseline probability of the hypothesis “play equals yes” is 9/14 and “play equals no” is 5/14. What this equation says is how to update the probability Pr[H] when you see some evidence, to get what’s called the “a posteriori” probability of H – that means “after the evidence”.
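Written out as an equation, with the vertical bar read as “given”, the theorem stated in words above is:

$$
\Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]}
$$

For the weather data the priors quoted above are $\Pr[\text{play}=\text{yes}] = 9/14$ and $\Pr[\text{play}=\text{no}] = 5/14$.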
152.1
The evidence in our case is the attribute values of an unknown instance; that’s E. That’s Bayes Theorem. Now, what makes this method “naive”? The naive assumption is – I’ve said it before – that the evidence splits into parts that are statistically independent. The parts of the evidence in our case are the four different attribute values in the weather data. When you have independent events, the probabilities multiply, so Pr[H | E] according to the top equation is the product of Pr[E | H], times the prior probability Pr[H], divided by Pr[E].
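As a sketch of what the independence assumption does to the formula, the evidence term becomes a product with one factor per attribute. The symbol $n$ for the number of attributes is introduced here for illustration; it is 4 in the weather data.

$$
\Pr[H \mid E] = \frac{\Pr[E_1 \mid H] \times \Pr[E_2 \mid H] \times \cdots \times \Pr[E_n \mid H] \times \Pr[H]}{\Pr[E]}
$$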
194.2
Pr[E | H] splits up into these parts: Pr[E1 | H], the first attribute value; Pr[E2 | H], the second attribute value; and so on, for all of the attributes. That’s maybe a bit abstract. Let’s look at the actual weather data. On the right-hand side is the weather data. In the large table at the top, we’ve taken each of the attributes. Let’s start with “outlook”. Under the “yes” hypothesis and the “no” hypothesis, we’ve looked at how many times the outlook is “sunny”. It’s sunny twice under yes and 3 times under no. That comes straight from the data in the table. Overcast. When the outlook is overcast, it’s always a “yes” instance, so there were 4 of those, and zero “no” instances.
239
Then, rainy is 3 “yes” instances and 2 “no” instances. Those numbers just come straight from the data table giving the instance values. Then we take those numbers and underneath we make them into probabilities. Let’s say
258.3
we know the hypothesis: let’s say we know it’s a “yes”. Then the probability of it being “sunny” is 2/9ths, “overcast” is 4/9ths, and “rainy” 3/9ths – simply because when you add up 2 plus 4 plus 3 you get 9. Those are the probabilities. If we know that the outcome is “no”, the probabilities are “sunny” 3/5ths, “overcast” 0/5ths, and “rainy” 2/5ths. That’s for the “outlook” attribute. That’s what we’re looking for, you see, the probability of each of these attribute values given the hypothesis H. The next attribute is temperature, and we just do the same thing with that to get the probabilities of the 3 values – hot, mild, and cool – under the “yes” hypothesis or the “no” hypothesis.
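The counting step can be sketched in a few lines of Python. The 14 rows below are an assumption: they are the standard nominal weather data as usually distributed with Weka, typed in here because the transcript does not list them. The helper name `conditional_table` is hypothetical. The counts these rows give for “outlook” agree with the ones quoted above.

```python
from collections import Counter
from fractions import Fraction

# Assumed data: (outlook, temperature, humidity, windy, play).
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def conditional_table(rows, attr_index, class_index=4):
    """Return {class: {value: Pr[value | class]}} from raw counts (no smoothing)."""
    class_totals = Counter(row[class_index] for row in rows)
    pair_counts = Counter((row[class_index], row[attr_index]) for row in rows)
    values = sorted({row[attr_index] for row in rows})
    return {
        cls: {v: Fraction(pair_counts[(cls, v)], total) for v in values}
        for cls, total in class_totals.items()
    }


outlook = conditional_table(WEATHER, attr_index=0)
for cls in ("yes", "no"):
    # Fraction reduces automatically, so 3/9 is printed as 1/3.
    print(cls, {v: str(p) for v, p in outlook[cls].items()})
```

Calling `conditional_table(WEATHER, attr_index=1)` would give the same kind of table for temperature, and indices 2 and 3 for humidity and windy.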
305.2
The same with humidity and windy. Play, that’s the prior probability – Pr[H]. It’s “yes” 9/14ths of the time, “no” 5/14ths of the time – even if you don’t know anything about the attribute values. The equation we’re looking at is this one below, and we just need to work it out. Here’s an example. Here’s an unknown day, a new day. We don’t know what the value of “play” is, but we know it’s sunny, cool, high, and windy. We can just multiply up these probabilities. If we multiply for the “yes” hypothesis, we get 2/9ths times 3/9ths times 3/9ths times 3/9ths – those are just the numbers on the previous slide, Pr[E1 | H], Pr[E2 | H], Pr[E3 | H], Pr[E4 | H] – and finally Pr[H], that is, 9/14ths.
358.6
That gives us a likelihood of 0.0053 when you multiply them. Then, for the “no” class we do the same, to get a likelihood of 0.0206. These numbers are not probabilities. Probabilities have to add up to 1. They are likelihoods. But we can get the probabilities from them by using the straightforward technique of normalization. Take those likelihoods for “yes” and “no” and we normalize them as shown below to make them add up to 1. That’s how we get the probability of “play” on a new day, with different attribute values. Just to go through that again. The evidence is “outlook” is “sunny”, “temperature” is “cool”, “humidity” is “high”, “windy” is “true” – and we don’t know what “play” is.
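Here is a minimal sketch of that arithmetic in Python. The “yes” factors are the ones read out in the lecture. The “no” factors (3/5, 1/5, 4/5, 3/5, then the prior 5/14) are not spelled out in the transcript; they are taken from the standard weather data as an assumption, and they do reproduce the 0.0206 quoted above.

```python
from fractions import Fraction as F
from math import prod

# Pr[E_i | H] for the new day (sunny, cool, high, windy = true), then the prior Pr[H].
factors = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9), F(9, 14)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5), F(5, 14)],  # assumed from the data
}

# Multiply the factors: these are likelihoods and do not add up to 1.
likelihood = {cls: prod(fs) for cls, fs in factors.items()}

# Normalize: divide each likelihood by their sum to get probabilities.
total = sum(likelihood.values())
probability = {cls: lk / total for cls, lk in likelihood.items()}

for cls in ("yes", "no"):
    print(f"{cls}: likelihood={float(likelihood[cls]):.4f}, "
          f"probability={float(probability[cls]):.3f}")
# yes: likelihood=0.0053, probability=0.205
# no: likelihood=0.0206, probability=0.795
```

Dividing by the sum of the two likelihoods stands in for dividing by Pr[E], which is why Pr[E] never has to be computed on its own.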
406.8
The probability of a “yes” given the evidence is the product of those 4 probabilities – one for outlook, temperature, humidity and windy – times the prior probability, which is just the baseline probability of a “yes”. That product of fractions is divided by Pr[E]. We don’t know what Pr[E] is, but it doesn’t matter, because we can do the same calculation for Pr[no | E], which gives us another equation just like this, and then we can calculate the actual probabilities by normalizing them so that the two probabilities add up to 1. Pr[yes | E] plus Pr[no | E] equals 1.
453.6
It’s actually quite simple when you look at it in numbers, and it’s simple when you look at it in Weka, as well. I’m going to go to Weka here, and I’m going to open the nominal weather data, which is here. We’ve seen that before, of course, many times. I’m going to go to Classify. I’m going to use the NaiveBayes method. It’s under this bayes category here. There are a lot of implementations of different variants of Bayes; I’m just going to use the straightforward NaiveBayes method here. I’ll just run it. This is what we get. There’s the success rate, calculated according to cross-validation. More interestingly, we get the model.
497.8
The model is just like the table I showed you before, divided under the “yes” class and the “no” class. We’ve got the four attributes – outlook, temperature, humidity, and windy – and then, for each of the attribute values, we’ve got the number of times that attribute value appears. Now, there’s one little but important difference between this table and the one I showed you before. Let me go back to my slide and look at these numbers. You can see that for outlook under “yes” on my slide I’ve got 2, 4, and 3, and Weka has got 3, 5, and 4. That’s 1 more each time, for a total of 12 instead of a total of 9.
542.4
Weka adds 1 to all of the counts. The reason it does this is to get rid of the zeros. In the original table under outlook, under “no”, the probability of overcast given “no” is zero, and we’re going to be multiplying that into things. What that would mean in effect, if we took that zero at face value, is that the probability of the class being “no” given any day for which the outlook was overcast would be zero. Anything multiplied by zero is zero. These zeros in probability terms have sort of a veto over all of the other numbers, and we don’t want that.
585.1
We don’t want to categorically conclude that it can’t be a “no” day on the basis that it’s overcast, just because we’ve never seen an overcast outlook on a “no” day before. That’s called the “zero-frequency problem,” and Weka’s solution – the most common solution – is
602
very simple: just add 1 to all the counts. That’s why all those numbers in the Weka table are 1 bigger than the numbers in the table on the slide. Aside from that, it’s all exactly the same. We’re avoiding zero frequencies by effectively starting all counts at 1 instead of starting them at 0, so they can’t end up at 0. That’s the Naive Bayes method. The assumption is that all attributes contribute equally and independently to the outcome. That works surprisingly well, even in situations where the independence assumption is clearly violated. Why does it work so well when the assumption is wrong? That’s a good question. Basically, classification doesn’t need accurate probability estimates.
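The add-one fix can be sketched as follows. The function name `smoothed` and its parameter `k` are hypothetical, chosen for illustration; this is not Weka’s code, just the arithmetic described above applied to the outlook counts from the slide.

```python
from fractions import Fraction


def smoothed(counts, k=1):
    """Add k to every count, then divide by the new total so the estimates sum to 1."""
    total = sum(counts.values()) + k * len(counts)
    return {value: Fraction(n + k, total) for value, n in counts.items()}


# Raw outlook counts from the slide.
outlook_yes = {"sunny": 2, "overcast": 4, "rainy": 3}
outlook_no = {"sunny": 3, "overcast": 0, "rainy": 2}

# Counts become 3, 5, 4 out of 12 for "yes" and 4, 1, 3 out of 8 for "no".
for name, counts in (("yes", outlook_yes), ("no", outlook_no)):
    print(name, {v: str(p) for v, p in smoothed(counts).items()})
```

With the raw counts, overcast given “no” is 0/5 and vetoes everything it is multiplied with; after adding 1 it becomes 1/8, which is small but no longer zero.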
647.7
We’re just going to choose as the class the outcome with the largest probability. As long as the greatest probability is assigned to the correct class, it doesn’t matter if the probability estimates aren’t all that accurate. This actually means that if you add redundant attributes you get problems with Naive Bayes. The extreme case of dependence is where two attributes have the same values, identical attributes. That will cause havoc with the Naive Bayes method. However, Weka contains methods for attribute selection to allow you to select a subset of fairly independent attributes, after which you can safely use Naive Bayes.
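To see why an identical attribute causes trouble, here is a small, hypothetical experiment on the same new day (sunny, cool, high, windy). It reuses the unsmoothed factors from the slide, with the “no” factors again assumed from the standard weather data, and pretends that an exact copy of “outlook” has been added, so the sunny factor is multiplied in twice.

```python
from math import prod


def posterior_yes(yes_factors, no_factors):
    """Normalize the two likelihoods and return the probability of "yes"."""
    yes, no = prod(yes_factors), prod(no_factors)
    return yes / (yes + no)


# Per-attribute factors for the new day, followed by the prior.
yes = [2/9, 3/9, 3/9, 3/9, 9/14]
no = [3/5, 1/5, 4/5, 3/5, 5/14]  # assumed from the data

base = posterior_yes(yes, no)

# Hypothetical duplicate of "outlook": its factor is counted a second time.
duplicated = posterior_yes(yes + [2/9], no + [3/5])

print(f"Pr[yes | E] without duplicate: {base:.3f}")
print(f"Pr[yes | E] with duplicate:    {duplicated:.3f}")
```

The copied attribute adds no new information, yet the estimate for “yes” moves further from its original value because the same evidence is counted twice. In this example the predicted class stays “no”, but with other numbers the extra factor could be enough to flip the decision.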

OneR assumes that there is one attribute that does all the work. Another simple strategy is the opposite: to assume that all attributes contribute equally and independently to the decision. This is called “Naive Bayes,” and is based on a classical statistical result called Bayes Theorem. The method makes two assumptions that are generally violated in practice: that the attributes are equally important, and that they are statistically independent. Despite this naivety, the method often works surprisingly well.