
What about real-life classification methods?

This week we encounter some of the most important classification methods. Although the principles are not difficult, actual, working, industrial-strength implementations often involve many nit-picky little details, and you’ll probably …

Classification boundaries

Different classifiers are biased towards different kinds of decision. You can see this by examining classification boundaries for various machine learning methods trained on a 2D dataset with numeric attributes. …
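One way to examine a boundary without any plotting tool is to classify every point of a coarse 2D grid and look at where the predicted label changes. A minimal sketch (not Weka's visualizer, and the training points are made up): with 1-NN and two training instances, the boundary falls on the perpendicular bisector between them.

```python
# Sketch: reveal a classifier's decision boundary by classifying every
# point of a 2D grid. With 1-NN and two training points, the boundary
# is the perpendicular bisector between them (here, the line x = 2).

train = [((0.0, 0.0), "a"), ((4.0, 0.0), "b")]  # (point, class) pairs

def classify_1nn(x, y):
    # Predict the class of the nearest training point (squared distance;
    # ties go to the first training instance).
    return min(train, key=lambda p: (p[0][0] - x) ** 2 + (p[0][1] - y) ** 2)[1]

# Label a coarse 5x3 grid; each row shows the boundary's position.
grid = [[classify_1nn(float(x), float(y)) for x in range(5)] for y in range(3)]

for row in grid:
    print("".join(row))
```

Swapping in a different classifier for `classify_1nn` changes the shape of the region each class claims, which is exactly the bias the text describes.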

Nearest neighbor

“Nearest neighbor” (equivalently, “instance-based”) is a classic method that often yields good results. Just stash away the training instances. Be lazy! – do nothing until you have to make …
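The laziness is visible in code: "training" is just storing the instances, and all the work happens at prediction time. A minimal sketch with invented toy data (Weka's IBk adds many refinements this omits, such as k > 1 and attribute normalization):

```python
import math

# Minimal instance-based (nearest-neighbor) classifier: training just
# stores the instances; all work is deferred to prediction time.

def train(instances):
    return list(instances)              # be lazy: stash the training data

def predict(model, query):
    # Find the stored instance closest to the query (Euclidean distance)
    # and return its class label.
    nearest = min(model, key=lambda inst: math.dist(inst[0], query))
    return nearest[1]

model = train([((1.0, 1.0), "yes"), ((5.0, 5.0), "no"), ((6.0, 5.0), "no")])
print(predict(model, (1.5, 0.5)))       # nearest stored point is (1, 1)
print(predict(model, (5.5, 5.0)))
```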

Pruning decision trees

Decision trees run the risk of overfitting the training data. One simple counter-measure is to stop splitting when the nodes get small. Another is to construct a tree and then …
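The first counter-measure ("stop splitting when the nodes get small") can be sketched in a few lines. This toy recursion only counts leaves and splits the data in half rather than choosing a best attribute, so it is an illustration of the stopping rule, not a real tree learner:

```python
# Sketch of pre-pruning: refuse to split any node holding fewer than
# `min_size` instances. Returns the number of leaves in the grown tree.

def grow(instances, min_size):
    # Stop when the node is pure or too small: it becomes one leaf.
    labels = {label for _, label in instances}
    if len(labels) == 1 or len(instances) < min_size:
        return 1
    # Toy split: halve the data (a real learner picks the best attribute).
    mid = len(instances) // 2
    return grow(instances[:mid], min_size) + grow(instances[mid:], min_size)

data = [(i, "yes" if i % 2 else "no") for i in range(16)]
print(grow(data, 2), grow(data, 8))   # larger min_size -> fewer leaves
```

The other strategy mentioned, growing a full tree and then cutting it back (post-pruning, as J48 does), generally works better because a seemingly poor split can enable a good one below it.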

Decision trees

Another simple method is to build a decision tree from the training data. Start at the top, with the whole training dataset. Select which attribute to split on first; then …
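The key step is selecting which attribute to split on. A standard criterion, and the one C4.5/J48 builds on, is information gain: split on the attribute that reduces the entropy of the class distribution most. A small sketch on invented weather-style data:

```python
import math

# Sketch: choose the root split of a decision tree by information gain,
# i.e. the reduction in class entropy achieved by partitioning on an
# attribute.

def entropy(labels):
    total = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(instances, attr):
    labels = [label for _, label in instances]
    before = entropy(labels)
    # Partition the instances by the attribute's value...
    parts = {}
    for attrs, label in instances:
        parts.setdefault(attrs[attr], []).append(label)
    # ...and weight each partition's entropy by its relative size.
    after = sum(len(p) / len(instances) * entropy(p) for p in parts.values())
    return before - after

# Toy data: "outlook" predicts the class perfectly, "windy" not at all.
data = [({"outlook": "sunny", "windy": "no"}, "no"),
        ({"outlook": "sunny", "windy": "yes"}, "no"),
        ({"outlook": "rainy", "windy": "no"}, "yes"),
        ({"outlook": "rainy", "windy": "yes"}, "yes")]

best = max(["outlook", "windy"], key=lambda a: info_gain(data, a))
print(best)
```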

Using probabilities

OneR assumes that there is one attribute that does all the work. Another simple strategy is the opposite: all attributes contribute equally and independently to the decision. This is called …
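"Equally and independently" means multiplying a per-attribute likelihood for each attribute into the class prior and picking the class with the highest product – the Naive Bayes rule. A bare-bones sketch on invented data (a real implementation, like Weka's NaiveBayes, would smooth the zero counts):

```python
from collections import Counter, defaultdict

# Sketch of Naive Bayes: every attribute contributes a likelihood
# factor, independently; multiply them by the class prior and take the
# class with the highest score.

def train_nb(instances):
    priors = Counter(label for _, label in instances)
    likelihoods = defaultdict(Counter)        # (class, attr) -> value counts
    for attrs, label in instances:
        for attr, value in attrs.items():
            likelihoods[(label, attr)][value] += 1
    return priors, likelihoods

def predict_nb(model, attrs):
    priors, likelihoods = model
    total = sum(priors.values())
    scores = {}
    for label, prior in priors.items():
        score = prior / total                 # class prior P(class)
        for attr, value in attrs.items():
            # Relative frequency of this value within the class;
            # an unseen value contributes 0 (no smoothing here).
            score *= likelihoods[(label, attr)][value] / prior
        scores[label] = score
    return max(scores, key=scores.get)

data = [({"outlook": "sunny"}, "no"), ({"outlook": "sunny"}, "no"),
        ({"outlook": "rainy"}, "yes"), ({"outlook": "overcast"}, "yes")]
model = train_nb(data)
print(predict_nb(model, {"outlook": "sunny"}))
```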

Overfitting

“Overfitting” is a problem that plagues all machine learning methods. It occurs when a classifier fits the training data too tightly and doesn’t generalize well to independent test data. It …

Simplicity first

In data mining you should always try simple things before you try more complicated things. There are many different kinds of simple structure. One is to create a one-level decision …
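That one-level structure is OneR: build a single-attribute "decision stump" for each attribute and keep whichever one makes the fewest errors on the training data. A minimal sketch with made-up data (it ignores OneR's handling of numeric attributes and missing values):

```python
from collections import Counter, defaultdict

# Sketch of OneR: for each attribute, map each of its values to the
# most frequent class seen with that value, then keep the single
# attribute whose rules make the fewest training errors.

def one_r(instances):
    best_attr, best_rules, best_errors = None, None, None
    for attr in instances[0][0]:
        # Count class labels per value of this attribute.
        by_value = defaultdict(Counter)
        for attrs, label in instances:
            by_value[attrs[attr]][label] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # Errors: instances not covered by the majority class per value.
        errors = sum(sum(c.values()) - max(c.values())
                     for c in by_value.values())
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

data = [({"outlook": "sunny", "windy": "yes"}, "no"),
        ({"outlook": "sunny", "windy": "no"}, "no"),
        ({"outlook": "rainy", "windy": "yes"}, "yes"),
        ({"outlook": "rainy", "windy": "no"}, "yes"),
        ({"outlook": "overcast", "windy": "yes"}, "yes")]

attr, rules = one_r(data)
print(attr, rules)
```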

How do simple classification methods work?

Simplicity first! That’s the underlying theme of this whole course on data mining. Always start by checking how well simple methods work on your dataset, before progressing to more complicated …

Cross-validation results

Cross-validation is better than randomly repeating percentage split evaluations. The reason is that each instance occurs exactly once in a test set, and is tested just once. Repeated random splits …

Cross-validation

Cross-validation, a standard evaluation technique, is a systematic way of running repeated percentage splits. Divide a dataset into 10 pieces (“folds”), then hold out each piece in turn for testing …
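The mechanics of the fold rotation can be sketched directly. This toy version assigns indices round-robin rather than stratifying by class, as Weka's cross-validation does:

```python
# Sketch of k-fold cross-validation: split the instance indices into k
# folds, then hold out each fold in turn for testing while training on
# the other k-1. Every instance lands in exactly one test fold.

def k_fold_splits(n_instances, k=10):
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]      # round-robin folds
    for i, test_fold in enumerate(folds):
        train_part = [idx for j, fold in enumerate(folds)
                      if j != i for idx in fold]
        yield train_part, test_fold

tested = []
for train_part, test_fold in k_fold_splits(30, k=10):
    tested.extend(test_fold)                       # tested exactly once

print(sorted(tested) == list(range(30)))
```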

Baseline accuracy

The diabetes dataset has several attributes and a class that is either tested_negative or tested_positive (for diabetes). With Percentage split evaluation (66% training set, 34% test set), J48 yields 76% …
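A figure like 76% only means something relative to a baseline: ZeroR, which always predicts the most frequent class. A minimal sketch, using a toy class column with roughly the diabetes dataset's 65/35 tested_negative/tested_positive split:

```python
from collections import Counter

# Sketch of the ZeroR baseline: always predict the majority class.
# A learned classifier's accuracy is only impressive if it beats this.

def zero_r_accuracy(labels):
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

# Toy class column with a 65/35 split (roughly the class distribution
# of the diabetes dataset).
labels = ["tested_negative"] * 65 + ["tested_positive"] * 35
print(zero_r_accuracy(labels))
```

So on this kind of class distribution, the interesting part of J48's 76% is the margin over the roughly 65% that blind majority-class guessing already achieves.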

Repeated training and testing

You can evaluate a classifier by splitting the dataset randomly into training and testing parts; train it on the former and test it on the latter. Of course, different splits …
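Repeating the split with different random seeds and averaging the accuracies gives a more stable estimate than any single split. A sketch, using a stand-in "classifier" that just predicts the training majority class (invented toy data; Weka's Percentage split works the same way per run):

```python
import random

# Sketch of repeated percentage-split evaluation: shuffle with a fresh
# seed each run, train on 66% of the data, test on the remaining 34%,
# and average the accuracies across runs.

def percentage_split_eval(instances, seed, train_fraction=0.66):
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train_part, test_part = shuffled[:cut], shuffled[cut:]
    # Stand-in classifier: predict the training set's majority class.
    labels = [label for _, label in train_part]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test_part) / len(test_part)

data = [((i,), "yes" if i % 3 else "no") for i in range(100)]
accuracies = [percentage_split_eval(data, seed) for seed in range(10)]
mean = sum(accuracies) / len(accuracies)
print(round(mean, 3))
```

The spread of `accuracies` across seeds is exactly the split-to-split variation the text warns about.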

Training and testing

How can you evaluate how well a classifier does? Training set performance is misleading. It’s like asking a child to memorize 1+1=2, 1+2=3 and then testing them on exactly the …