3.2

## The University of Waikato

Skip to 0 minutes and 11 seconds: We're going to start out emphasizing the message that simple algorithms often work very well. In data mining, maybe in life in general, you should always try simple things before you try more complicated things. There are many different kinds of simple structure. For example, it might be that one attribute in the dataset does all the work: everything depends on the value of one of the attributes. Or it might be that all of the attributes contribute equally and independently. Or a simple structure might be a decision tree that tests just a few of the attributes. We might calculate the distance from an unknown sample to the nearest training sample, or a result may depend on a linear combination of attributes.

Skip to 0 minutes and 59 seconds: We're going to look at all of these simple structures in the next few lessons. There's no universally best learning algorithm. The success of a machine learning method depends on the domain. Data mining really is an experimental science. We're going to look at the OneR rule learner, where one attribute does all the work. It's extremely simple, very trivial, actually, but we're going to start with simple things and build up to more complex things. OneR learns what you might call a one-level decision tree, or a set of rules that all test one particular attribute.

Skip to 1 minute and 37 seconds: That is, a tree that branches only at the root node, depending on the value of a particular attribute, or, equivalently, a set of rules that test the value of that particular attribute. In the basic version of OneR, there's one branch for each value of the attribute. We choose which attribute first, and we make one branch for each possible value of the attribute. Each branch assigns the most frequent class that comes down that branch. The error rate is the proportion of instances that don't belong to the majority class of their corresponding branch. We choose the attribute with the smallest error rate. Let's look at what this actually means. Here's the algorithm. For each attribute, we're going to make some rules.

Skip to 2 minutes and 26 seconds: For each value of the attribute, we're going to make a rule that counts how often each class appears, finds the most frequent class, and assigns that most frequent class to this attribute-value combination. Then we're going to calculate the error rate of this attribute's rules. We're going to repeat that for each of the attributes in the dataset, and choose the attribute with the smallest error rate. Here's the weather data again. What OneR does is look at each attribute in turn (outlook, temperature, humidity, and wind) and form rules based on that.
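The steps just described can be sketched in a few lines of Python. This is a minimal illustration, not Weka's implementation: the function name `one_r`, the tuple layout (attribute values first, class last) and the four-row toy dataset are all assumptions made for the example. Ties are broken by taking whichever class or attribute was seen first, which is one arbitrary choice among several.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes):
    """Return (attribute_name, rules, errors) for the attribute whose
    one-level rules make the fewest errors on `rows`.
    Each row is a tuple of nominal values with the class in the last position."""
    best = None
    for col, name in enumerate(attributes):
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[col]][row[-1]] += 1
        # One rule per value: predict the most frequent class for that value.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances outside the majority class of their branch.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        # Strict "<" keeps the first attribute when two attributes tie.
        if best is None or errors < best[2]:
            best = (name, rules, errors)
    return best

# Hypothetical toy data: "first" predicts the class perfectly, "second" does not.
toy_rows = [("a", "x", "pos"), ("a", "y", "pos"),
            ("b", "x", "neg"), ("b", "y", "neg")]
name, rules, errors = one_r(toy_rows, ["first", "second"])
print(name, rules, errors)  # first {'a': 'pos', 'b': 'neg'} 0
```

The error count rather than the error rate is compared here; since every attribute is evaluated on the same instances, the two give the same ranking.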

Skip to 3 minutes and 2 seconds: For outlook, there are three possible values: sunny, overcast, and rainy. We just count: out of the 5 sunny instances, 2 of them are yeses and 3 of them are nos.

Skip to 3 minutes and 29 seconds: We're going to choose a rule: if it's sunny, choose no. We're going to get 2 errors out of 5. For overcast, all of the 4 overcast values of outlook lead to yes values for the class play. So, we're going to choose the rule, if outlook is overcast, then yes, giving us 0 errors. Finally, for outlook is rainy, we're going to choose yes as well, and that would also give us 2 errors out of the 5 instances. If we branch on outlook, we've got a total of 4 errors. We can branch on temperature and do the same thing. When temperature is hot, there are 2 nos and 2 yeses.

Skip to 4 minutes and 12 seconds: We just choose arbitrarily in the case of a tie, so if it's hot, let's predict no, getting 2 errors out of 4. If temperature is mild, we'll predict yes, getting 2 errors out of 6, and if the temperature is cool, we'll predict yes, getting 1 out of the 4 instances as an error. And the same for humidity and wind. We look at the total error values and choose the attribute with the lowest total, which is either outlook or humidity. That's a tie, so we'll just choose arbitrarily, and choose outlook. That's how OneR works, it's as simple as that. Let's just try it. Here's Weka. I'm going to open the nominal weather data.
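The counting above can be checked with a short script. The 14 rows below are the standard nominal weather dataset typed in by hand; the transcript does not list them, so treat the table as an assumption, although its counts agree with every figure quoted above. The attribute is named `windy` here, where the lecture says "wind", and the helper `attribute_errors` is a name invented for this sketch.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]

def attribute_errors(rows, col):
    """Map each value of column `col` to (errors, instances), where errors
    counts the instances outside that value's majority class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[col]][row[-1]] += 1
    return {value: (sum(c.values()) - max(c.values()), sum(c.values()))
            for value, c in counts.items()}

errors = {}
for col, name in enumerate(ATTRIBUTES):
    per_value = attribute_errors(WEATHER, col)
    errors[name] = sum(e for e, _ in per_value.values())
    print(name, per_value, "total errors:", errors[name])
```

Under these assumptions the totals come out as 4 for outlook and humidity and 5 for temperature and windy, which is the tie between outlook and humidity mentioned above.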

Skip to 4 minutes and 59 seconds: I'm going to go to Classify. This is such a trivial dataset that the results aren't very meaningful, but if I just run ZeroR to start off with, I get a success rate of 64%. If I now choose OneR and run that, I get a rule, and the rule I get branches on outlook: if it's sunny then choose no, overcast choose yes, and rainy choose yes. We get 10 out of 14 instances correct on the training set. We're evaluating this using cross-validation, which doesn't really make much sense on such a small dataset. Interesting, though, that the success rate we get, 42%, is pretty bad, worse than ZeroR.
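To see where the 64% and the 10 out of 14 come from, here is a small sketch that scores both schemes on the training data only. It does not call Weka. The (outlook, play) pairs are taken from the standard nominal weather dataset, which is an assumption since the transcript does not list the rows, and `RULE` is the outlook rule quoted above.

```python
from collections import Counter

# (outlook, play) for the 14 instances of the nominal weather data.
OUTLOOK_PLAY = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

# ZeroR ignores every attribute and always predicts the majority class.
majority = Counter(play for _, play in OUTLOOK_PLAY).most_common(1)[0][0]
zero_r_correct = sum(play == majority for _, play in OUTLOOK_PLAY)

# OneR's rule on outlook, as printed in the lecture.
RULE = {"sunny": "no", "overcast": "yes", "rainy": "yes"}
one_r_correct = sum(RULE[outlook] == play for outlook, play in OUTLOOK_PLAY)

n = len(OUTLOOK_PLAY)
print(f"ZeroR: {zero_r_correct}/{n} = {zero_r_correct / n:.1%}")  # ZeroR: 9/14 = 64.3%
print(f"OneR:  {one_r_correct}/{n} = {one_r_correct / n:.1%}")    # OneR:  10/14 = 71.4%
```

These are training-set figures. The 42% quoted for OneR is a cross-validation estimate, so it is measured on instances the rule was not built from and can come out lower.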

Skip to 5 minutes and 45 seconds: Actually, with any 2-class problem, you would expect to get a success rate of at least 50%. Tossing a coin would give you 50%. This OneR scheme is not performing very well on this trivial dataset. Notice the rule it finally prints out. Since we're using 10-fold cross-validation, it does the whole thing 10 times and then, on the 11th time, calculates a rule from the entire dataset, and that's what it prints out. That's where this rule comes from. OneR, one attribute does all the work.
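The "evaluate on held-out folds, then build the printed rule from everything" procedure can be sketched as follows. This is an illustration under stated assumptions, not what Weka runs: it uses leave-one-out folds instead of Weka's stratified 10-fold split, so its accuracy need not match the 42% above. The data table is the standard nominal weather dataset typed in by hand, and `train_one_r` and `predict` are names invented for the sketch.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]

def train_one_r(rows):
    """Return (column, rules, default_class) for the attribute with the
    fewest training errors; ties go to the earlier attribute."""
    default = Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = None
    for col in range(len(ATTRIBUTES)):
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[col]][r[-1]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(rules[r[col]] != r[-1] for r in rows)
        if best is None or errors < best[0]:
            best = (errors, col, rules)
    return best[1], best[2], default

def predict(model, row):
    col, rules, default = model
    # Fall back to the training majority class for a value never seen in training.
    return rules.get(row[col], default)

# Evaluation: each instance is held out once and predicted by a rule
# learned from the other 13.
correct = 0
for i, held_out in enumerate(WEATHER):
    model = train_one_r(WEATHER[:i] + WEATHER[i + 1:])
    correct += predict(model, held_out) == held_out[-1]
accuracy = correct / len(WEATHER)

# The rule that gets reported is learned once more, from the entire dataset.
final_col, final_rules, _ = train_one_r(WEATHER)
print(f"held-out accuracy: {accuracy:.1%}")
print("final rule on", ATTRIBUTES[final_col], final_rules)
```

The rules built inside the loop are used only to estimate performance and are then discarded; they may even branch on a different attribute than the final rule does.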

Skip to 6 minutes and 22 seconds: This is a very simple method of machine learning described in 1993, 20 years ago, in a paper called "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets" by a guy called Rob Holte, who lives in Canada. He did an experimental evaluation of the OneR method on 16 commonly used datasets. He used cross-validation, just like we've told you to, to evaluate these things, and he found that the simple rules from OneR often outperformed far more complex methods that had been proposed for these datasets. How can such a simple method work so well? Some datasets really are simple, and others are so small, noisy, or complex that you can't learn anything from them.

Skip to 7 minutes and 11 seconds: So, it's always worth trying the simplest things first.

# Simplicity first

In data mining you should always try simple things before you try more complicated things. There are many different kinds of simple structure. One is to create a one-level decision tree, or a set of rules that all test one particular attribute. Here one attribute does all the work: all you need to do is to choose which attribute. The method is called OneR, and it’s easy to see how it operates on the weather data. Sometimes, simple methods like this work surprisingly well.