Hi! Welcome back to Data Mining with Weka. In the last lesson, we looked at classification by regression, how to use linear regression to perform classification tasks. In this lesson we’re going to look at a more powerful way of doing the same kind of thing. It’s called “logistic regression”. It’s fairly mathematical, and we’re not going to go into the dirty details of how it works, but I’d like to give you a flavor of the kinds of things it does and the basic principles that underlie logistic regression. Then, of course, you can use it yourself in Weka without any problem. One of the things about data mining is that you can sometimes do better by using prediction probabilities rather than actual classes.
Instead of predicting whether it’s going to be a “yes” or a “no”, you might do better to predict the probability with which you think it’s going to be a “yes” or a “no”. For example, the weather is 95% likely to be rainy tomorrow, or 72% likely to be sunny, instead of saying it’s definitely going to be rainy or it’s definitely going to be sunny. Probabilities are really useful things in data mining. Naïve Bayes produces probabilities; it works in terms of probabilities. We’ve seen that in an earlier lesson. I’m going to open “diabetes” and run Naïve Bayes.
I’m going to use a percentage split with 90%, so that leaves 10% as a test set.
Then I’m going to make sure I output the predictions on those 10%, and run it. I want to look at the predictions that have been output. This is a 2-class dataset, the classes are “tested_negative” and “tested_positive”, and these are the instances – number 1, number 2, number 3, etc. This is the actual class – tested_negative, tested_positive, tested_negative, etc. This is the predicted class – tested_negative, tested_negative, tested_negative, tested_negative, etc. This is a plus under the error column to say where there’s an error, so there’s an error with instance number 2. These are the actual probabilities that come out of NaiveBayes. So for instance 1 we’ve got a 99% probability that it’s negative, and a 1% probability that it’s positive.
So we predict it’s going to be negative; that’s why that’s tested_negative. And in fact we’re correct; it is tested_negative. For this instance, which is actually classified incorrectly, we’re predicting 67% for negative and 33% for positive, so we decide it’s a negative, and we’re wrong. We might have been better off saying that here we’re really sure it’s going to be a negative, and we’re right; here we think it’s going to be a negative, but we’re really not sure, and it turns out that we’re wrong. Sometimes it’s a lot better to think of the output in terms of probabilities, rather than being forced to make a binary, black-or-white classification. Other data mining methods produce probabilities, as well.
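To make the prediction table concrete, here is a minimal Python sketch (not Weka's own output code) that turns per-class probabilities into a predicted class, a confidence, and the "+" error flag. The two rows are the two instances quoted above; the function name `predict` and the row layout are assumptions made for illustration.

```python
# Class order matches the probability columns: negative first, positive second.
CLASSES = ("tested_negative", "tested_positive")

def predict(probs):
    """Return (label, confidence) for the most probable class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return CLASSES[best], probs[best]

# (actual class, (P(negative), P(positive))) for instances 1 and 2 in the lesson.
rows = [
    ("tested_negative", (0.99, 0.01)),
    ("tested_positive", (0.67, 0.33)),
]

report = []
for number, (actual, probs) in enumerate(rows, start=1):
    label, confidence = predict(probs)
    error = "+" if label != actual else ""  # "+" marks a misclassification
    report.append((number, actual, label, error, confidence))
    print(number, actual, label, error, confidence)
```

Both instances are predicted tested_negative, but the confidence column shows the difference: 0.99 for the correct prediction and only 0.67 for the wrong one.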
If I look at ZeroR, and run that, these are the probabilities – 65% versus 35%. All of them are the same. Of course, it’s ZeroR! – it always produces the same thing. In this case, it always says tested_negative and always has the same probabilities. The reason why the numbers are like that, if you look at the slide here, is that we’ve chosen a 90% training set and a 10% test set, and the training set contains 448 negative instances and 243 positive instances. Remember the “Laplace Correction” in [the Simplicity first video, Week 3]? – we add 1 to each of those counts to get 449 and 244. That gives us 449 out of 693, a 65% probability for being a negative instance.
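The arithmetic behind those ZeroR figures can be checked with a few lines of Python. This is only a sketch of the calculation described above, using the training counts quoted in the lesson; the function name `laplace_probabilities` is made up for the example.

```python
def laplace_probabilities(counts):
    """Add 1 to every class count, then normalize to probabilities."""
    smoothed = {label: n + 1 for label, n in counts.items()}
    total = sum(smoothed.values())
    return {label: n / total for label, n in smoothed.items()}

# Class counts in the 90% training split, as quoted in the lesson.
train_counts = {"tested_negative": 448, "tested_positive": 243}

probs = laplace_probabilities(train_counts)
for label, p in probs.items():
    print(f"{label}: {p:.3f}")
```

The smoothed counts are 449 and 244, so the probabilities are 449/693 and 244/693, which round to the 65% and 35% shown for every test instance.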
That’s where these numbers come from. If we look at J48 and run that, then we get more interesting probabilities here – the negative and positive probabilities, respectively. You can see where the errors are. These probabilities are all different. Internally, J48 uses probabilities in order to do its pruning operations. We talked about that when we discussed J48’s pruning, although I didn’t explain explicitly how the probabilities are derived. The idea of logistic regression is to make linear regression produce probabilities, too. This gets a little bit hairy. Remember, when we use linear regression for classification, we calculate a linear function using regression and then apply a threshold to decide whether it’s a 0 or a 1.
It’s tempting to imagine that you can interpret these numbers as probabilities, instead of thresholding like that, but that’s a mistake. They’re not probabilities. These numbers that come out on the regression line are sometimes negative, and sometimes greater than 1. They can’t be probabilities, because probabilities don’t work like that. In order to get better probability estimates, a slightly more sophisticated technique is used. In linear regression, we have a linear sum. In logistic regression, we have the same linear sum down here – the same kind of linear sum that we saw before – but we embed it in this kind of formula. This is called a “logit transform”. A logit transform – this is multi-dimensional with a lot of different a’s here.
If we’ve got just one dimension, one variable a1, then if this is the input to the logit transform, the output looks like this: it’s between 0 and 1. It’s sort of an S-shaped curve that applies a softer function. Rather than just 0 and then a step function, it’s a soft version of a step function that never gets below 0, never gets above 1, and has a smooth transition in between. When you’re working with a logit transform, instead of minimizing the squared error (remember, when we do linear regression we minimize the squared error), it’s better to choose weights to maximize a probabilistic function called the “log-likelihood function”, which is this pretty scary looking formula down at the bottom. That’s the basis of logistic regression.
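The slide formulas are not reproduced in this transcript, so here is a minimal Python sketch of the two ideas, written in the standard textbook form: the S-shaped function 1 / (1 + exp(-z)) applied to the linear sum, and the log-likelihood that the weights are chosen to maximize. The toy data, the weight values and the function names are assumptions made for illustration; this is not Weka's implementation, and it does no optimization, it only compares two candidate weight settings.

```python
import math

def logistic(z):
    """S-shaped squashing function: output is always strictly between 0 and 1.
    (Plain version; very large negative z would overflow math.exp.)"""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, a):
    """Probability of class 1: the linear sum embedded in the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, a))
    return logistic(z)

def log_likelihood(weights, bias, instances, labels):
    """Sum over instances of log P(actual class); larger (closer to 0) is better."""
    total = 0.0
    for a, x in zip(instances, labels):
        p = predict_proba(weights, bias, a)
        total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total

# Hypothetical one-attribute data: negative a1 -> class 0, positive a1 -> class 1.
instances = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]

flat = log_likelihood([0.0], 0.0, instances, labels)   # predicts 0.5 everywhere
steep = log_likelihood([2.0], 0.0, instances, labels)  # follows the data
print(flat, steep)
```

The weight that follows the data gets the higher log-likelihood, which is the sense in which logistic regression picks its weights; a real learner searches over the weights rather than comparing two guesses.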
We won’t talk about the details any more: let me just do it. We’re going to use the “diabetes” dataset. In the last lesson we got 76.8% with classification by regression. Let me tell you if you do ZeroR, NaiveBayes, and J48, you get these numbers here. I’m going to find the logistic regression scheme. It’s in “functions”, and called “Logistic”. I’m going to use 10-fold cross-validation. I’m not going to output the predictions. I’ll just run it – and I get 77.2% accuracy. That’s the best figure in this column, though it’s not much better than Naïve Bayes, so you might be a bit skeptical about whether it really is better.
I did this 10 times and calculated the means myself, and we get these figures for the mean of 10 runs. ZeroR stays the same, of course, at 65.1%; it produces the same accuracy on each run. NaiveBayes and J48 are different, and here logistic regression gets an average of 77.5%, which is appreciably better than the other figures in this column. You can extend the idea to multiple classes. When we did this in the previous lesson, we performed a regression for each class, a multi-response regression. That actually doesn’t work well with logistic regression, because you need the probabilities to sum to 1 over the various different classes, and separate per-class regressions don’t guarantee that. That introduces more computational complexity and needs to be tackled as a joint optimization problem.
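A small Python sketch shows why the classes have to be handled jointly. Squashing each class's linear sum separately gives numbers between 0 and 1 that need not add up to 1, whereas normalizing them together does. The softmax normalization used here is one common way of coupling the classes and is an assumption of this example, not a description of what Weka's Logistic does internally; the scores are made up.

```python
import math

def independent_logistic(scores):
    """One separate logistic output per class: each is in (0, 1), but the
    outputs are not tied together."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def softmax(scores):
    """Joint normalization: exponentiate and divide by the total, so the
    class probabilities sum to 1 by construction."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear sums for one instance in a 3-class problem.
scores = [2.0, 1.0, -1.0]

separate = independent_logistic(scores)
joint = softmax(scores)
print(sum(separate))  # well above 1: not a valid probability distribution
print(sum(joint))     # 1 (up to floating-point rounding)
```

Because every class's probability depends on all the scores at once, the weights for the different classes can no longer be fitted one class at a time, which is the joint optimization mentioned above.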
The result is logistic regression, a popular and powerful machine learning method that uses the logit transform to predict probabilities directly. It works internally with probabilities, like Naïve Bayes does. We also learned in this lesson about prediction probabilities that can be obtained from other methods, and how to calculate probabilities from ZeroR.
© University of Waikato, New Zealand. CC Creative Commons Attribution 4.0 International License.