Data Mining with Weka: Q&A

In this article, we answer some basic questions about using Weka in practice. I’ve trained a classifier. How can I use it to classify …

Review of terms

Let’s review some of the key terms we’ll be using. A dataset is a set of instances. In Weka, it’s stored in what’s called an ARFF file. This is just …

Growing random numbers from seeds

In the preceding video I talk about changing the random number seed in the Weka Explorer and getting a different result. Were you mystified? An explanation follows. Here’s the issue. …

Farewell

Thanks for taking this course. We hope you’ve enjoyed it. We’ve introduced you to practical data mining using the Weka workbench. We explained the basic principles of several popular algorithms …

Summary

There’s no magic in data mining! In fact, perhaps Weka makes things too easy. It is important to understand, and evaluate, what you’re doing, not just click around looking for …

Data mining and ethics

Data mining is a powerful technology, and I urge you to be ethical in its use. Data is sensitive stuff and should be treated with care. Personal data is particularly …

Pitfalls and pratfalls

Be skeptical, and wary of overfitting. Always use fresh data for evaluation. Datasets often have missing values, which can mean different things – and different classifiers treat them in different …

The data mining process

If your vision of data mining is to get some data, apply Weka, get a cool result, and everyone’s happy – think again! Before you even begin to apply a …

What else is there to know?

You’ve learned lots in this course about machine learning and its use in data mining. Most importantly, you’ve learned that there’s no magic in data mining, just a bunch of …

Ensemble learning

Sometimes committees make better decisions than individuals. An ensemble of different classification methods can be applied to the same problem and vote on the classification of test instances. Bagging, randomization, …

Support vector machines

In essence, support vector machines drive a straight line between two classes, right down the middle of the channel – which you can see using Weka’s boundary visualizer. If the …

Logistic regression

Many classification methods produce probabilities rather than black-or-white classifications. Naive Bayes is an obvious example, but other methods do too. The numbers between 0 and 1 produced by linear regression …

Classification by regression

Linear regression can be used for classification too. On the diabetes data, use the NominalToBinary filter to convert the two classes, which are nominal, to the numeric values 0 and …

Linear regression

Classification involves a nominal class value, whereas regression involves a numeric class. Linear regression is a classical statistical method that computes the coefficients or “weights” of a linear expression, and …