Skip to 0 minutes and 11 seconds This might be your vision of the data mining process. You’ve got some data or someone gives you some data. You’ve got Weka. You apply Weka to the data, you get some kind of cool result from that, and everyone’s happy. If so, I’ve got bad news for you. It’s not going to be like that at all. Really, this would be a better way to think about it. You’re going to have a circle; you’re going to go round and round the circle. It’s true that Weka is important – it’s in the very middle of the circle here. It’s going to be crucial, but it’s only a small part of what you have to do.
Skip to 0 minutes and 49 seconds Perhaps the biggest problem is going to be to ask the right kind of question. You need to be answering a question, not just vaguely exploring a collection of data. Then you need to get together the data that you can get hold of that gives you a chance of answering this question using data mining techniques. It’s hard to collect the data. You’re probably going to have an initial dataset, but you might need to add some demographic data, or some weather data, or some data about other stuff. You’re going to have to go to the web and find more information to augment your dataset.
Skip to 1 minute and 29 seconds Then you’ll merge all that together: do some database hacking to get a dataset that contains all the attributes that you think you might need – or that you think Weka might need. Then you’re going to have to clean the data. The bad news is that real world data is always very messy. That’s a long and painstaking process of looking around, looking at the data, trying to understand it, trying to figure out what the anomalies are and whether it’s good to delete them or not. That’s going to take a while. Then you’re going to need to define some new features, probably. This is the feature engineering process, and it’s the key to successful data mining.
Skip to 2 minutes and 9 seconds Then, finally, you’re going to use Weka, of course. You might go around this circle a few times to get a nice algorithm for classification, and then you’re going to need to deploy the algorithm in the real world. Each of these processes is difficult. You need to think about the question that you want to answer. “Tell me something cool about this data” is not a good enough question. You need to know what you want to know from the data. Then you need to gather it. There’s a lot of data around, like I said at the very beginning, but the trouble is that we need classified data to use classification techniques in data mining.
Skip to 2 minutes and 52 seconds We need expert judgements on the data, expert classifications, and there’s not so much data around that includes expert classifications, or correct results. They say that more data beats a clever algorithm. So rather than spending time trying to optimize the exact algorithm you’re going to use in Weka, your time might be better employed getting more and more data. Then you’ve got to clean it, and like I said before, real data is very mucky. That’s going to be a painstaking matter of looking through it and looking for anomalies. Feature engineering, the next step, is the key to data mining. We’ll talk about how Weka can help you a little bit in a minute. Then you’ve got to deploy the result.
Skip to 3 minutes and 39 seconds Implementing it – well, that’s the easy part. The difficult part is to convince your boss to use this result from this data mining process that he probably finds very mysterious and perhaps doesn’t trust very much. Getting anything actually deployed in the real world is a pretty tough call. The key technical part of all this is feature engineering, and Weka has a lot of filters that will help with this. Here are just a few of them. It might be worthwhile defining a new feature, a new attribute, that’s a mathematical expression involving existing attributes. Or you might want to modify an existing attribute. With AddExpression, you can use any kind of mathematical formula to create a new attribute from existing ones.
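To make the idea of a derived attribute concrete, here is a minimal sketch in plain Python. It is not Weka's AddExpression filter and does not use its expression syntax; the function name `add_expression`, the attribute names, and the sample rows are all hypothetical, chosen only for illustration.

```python
# Hypothetical rows; the attribute names are illustrative, not from the lesson.
rows = [
    {"length": 2.0, "width": 3.0},
    {"length": 4.0, "width": 0.5},
]


def add_expression(rows, name, formula):
    """Return new rows, each with one extra attribute computed from existing ones.

    `formula` is any function that maps a row (a dict) to a number.
    The input rows are left unchanged.
    """
    return [{**row, name: formula(row)} for row in rows]


# New attribute "area" defined as a mathematical expression of two existing ones.
with_area = add_expression(rows, "area", lambda r: r["length"] * r["width"])
```

The point is only that the new attribute is a pure function of attributes already in each instance, so it can be recomputed for any new data at deployment time.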
Skip to 4 minutes and 27 seconds You might want to normalize or center your data, or standardize it statistically. Transform a numeric attribute to have a zero mean – that’s “center”. Or transform it into a given numeric range – that’s “normalize”. Or give it a zero mean and unit variance, that’s a statistical operation called “standardization”. You might want to take those numeric attributes and discretize them into nominal values. Weka has both supervised and unsupervised attribute discretization filters. There are a lot of other transformations. For example, the PrincipalComponents transformation involves a matrix analysis of the data to select the principal components in a linear space. That’s mathematical, and Weka contains a good implementation. RemoveUseless will remove attributes that don’t vary at all, or vary too much.
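The three rescaling operations, plus one simple form of unsupervised discretization, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Weka's implementation: the function names are hypothetical, standardization here uses the population standard deviation (a filter may use the sample version instead), and equal-width binning is just one common unsupervised scheme.

```python
from statistics import mean, pstdev


def center(values):
    """Shift values so that their mean is zero."""
    m = mean(values)
    return [v - m for v in values]


def normalize(values, lo=0.0, hi=1.0):
    """Rescale values linearly into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:  # constant attribute: map everything to the lower bound
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / span for v in values]


def standardize(values):
    """Give values zero mean and unit variance.

    Assumption: uses the population standard deviation.
    """
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:  # constant attribute: nothing to scale
        return [0.0 for _ in values]
    return [(v - m) / sd for v in values]


def discretize_equal_width(values, bins):
    """Replace each value by the index of its equal-width bin (0 .. bins - 1)."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin `bins`, so clamp it into the last bin.
    return [min(int((v - vmin) / width), bins - 1) for v in values]


# Hypothetical numeric attribute.
temperatures = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
centered = center(temperatures)
scaled = normalize(temperatures)
standardized = standardize(temperatures)
binned = discretize_equal_width(temperatures, bins=2)
```

Note that all four need statistics from the whole attribute (mean, minimum, maximum, standard deviation), so in practice those statistics should come from the training data and be reused unchanged on test data.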
Skip to 5 minutes and 21 seconds Then there are a couple of filters that help you deal with time series, when your instances represent a series over time. You probably want to take the difference between one instance and the next, or a difference with some kind of lag – one instance and the one 5 before it, or 10 before it. These are just a few of the filters that Weka contains to help you with your feature engineering. The message of this lesson is that Weka is only a small part of the entire data mining process, and it’s the easiest part. In this course, we’ve chosen to tell you about the easiest part of the process! I’m sorry about that.
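The lagged difference mentioned for time series can be sketched as follows. This is plain Python, not one of Weka's time series filters; the function name `lagged_difference` and the sample readings are hypothetical, and marking the first few positions as missing is an assumption about how to handle instances that have no predecessor.

```python
def lagged_difference(series, lag=1):
    """Return series[i] - series[i - lag] for each position i.

    The first `lag` positions have no earlier instance to compare with,
    so they are marked as missing (None).
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return [
        None if i < lag else series[i] - series[i - lag]
        for i in range(len(series))
    ]


# Hypothetical readings, one per time step.
readings = [10, 12, 15, 11, 18, 20, 19]
step_change = lagged_difference(readings)          # one instance and the next
five_back = lagged_difference(readings, lag=5)     # one instance and the one 5 before it
```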
Skip to 5 minutes and 58 seconds The other bits are, in practice, much more difficult.
Skip to 6 minutes and 1 second There’s an old programmer’s blessing: “May all your problems be technical ones”. It’s the other problems – the political problems in getting hold of the data, and deploying the result – those are the ones that tend to be much more onerous in the overall data mining process. So good luck!
The data mining process
If your vision of data mining is to get some data, apply Weka, get a cool result, and everyone’s happy – think again! Before you even begin to apply a classifier you’re going to have to ask the right question, find suitable data, clean it, devise new features … Weka is only part of the entire data mining process – the easiest part. Other aspects, including political problems in getting hold of the data and deploying the result, are often more onerous in the overall data mining process.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.