One of the main points I’ve been trying to convey is that there’s no magic in data mining. There’s a huge array of alternative techniques, and they’re all fairly straightforward algorithms. We’ve seen the principles of many of them. Perhaps we don’t understand the details, but we’ve got the basic idea of the main methods of machine learning used in data mining. And there is no single, universal best method. Data mining is an experimental science. You need to find out what works best on your problem. Weka makes it easy for you. Using Weka you can try out different methods, you can try out different filters, different learning methods. You can play around with different datasets. It’s very easy to do experiments in Weka.
Perhaps you might say it’s too easy, because it’s important to understand what you’re doing, not just blindly click around and look at the results. That’s what I’ve tried to emphasize in this course – understanding and evaluating what you’re doing. There are many pitfalls you can fall into if you don’t really understand what’s going on behind the scenes. It’s not a matter of just blindly applying the tools in the workbench. We’ve stressed in the course the focus on evaluation, evaluating what you’re doing, and the significance of the results of the evaluation. Different algorithms differ in performance, as we’ve seen. In many problems, it’s not a big deal.
The differences between the algorithms are really not very important in many situations, and you should perhaps be spending more time on looking at the features and how the problem is described and the operational context that you’re working in, rather than stressing about getting the absolute best algorithm. It might not make all that much difference in practice. Use your time wisely. There’s a lot of stuff that we’ve missed out. I’m really sorry I haven’t been able to cover more of this stuff. There’s a whole technology of filtered classifiers, where you want to filter the training data, but not the test data.
That’s especially true when you’ve got a supervised filter, where the results of the filter depend on the class values of the training instances. You want to filter the training data, but not the test data, or maybe take a filter designed for the training data and apply the same filter to the test data without re-optimizing it for the test data, which would be cheating. You often want to do this during cross-validation. The trouble in Weka is that you can’t get hold of those cross-validation folds; it’s all done internally. Filtered classifiers are a simple way of dealing with this problem. We haven’t talked about costs of different decisions and different kinds of errors, but in real life different errors have different costs.
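As a rough illustration of the filtered-classifier idea, here is a minimal Python sketch, not Weka code. It assumes a hypothetical `StandardizeFilter` and a toy `NearestCentroid` classifier, both invented for this example. The filter here is unsupervised, which keeps the sketch short, but the same discipline applies to supervised filters: inside each cross-validation fold the filter is fitted on the training instances only and then applied, unchanged, to the test instances.

```python
from statistics import mean, pstdev


class StandardizeFilter:
    """Hypothetical filter: learns per-attribute mean and std from training data."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.means = [mean(c) for c in cols]
        # Fall back to 1.0 so a constant attribute does not divide by zero.
        self.stds = [pstdev(c) or 1.0 for c in cols]
        return self

    def transform(self, rows):
        return [[(v - m) / s for v, m, s in zip(r, self.means, self.stds)]
                for r in rows]


class NearestCentroid:
    """Toy classifier: predicts the class whose training centroid is closest."""

    def fit(self, rows, labels):
        groups = {}
        for r, y in zip(rows, labels):
            groups.setdefault(y, []).append(r)
        self.centroids = {y: [mean(c) for c in zip(*rs)]
                          for y, rs in groups.items()}
        return self

    def predict(self, row):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return min(self.centroids, key=lambda y: sq_dist(self.centroids[y]))


def cross_validate(rows, labels, k=3):
    """k-fold cross-validation with the filter re-fitted inside every fold."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(rows), k))
        train_X = [r for i, r in enumerate(rows) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]

        flt = StandardizeFilter().fit(train_X)  # fitted on training fold only
        model = NearestCentroid().fit(flt.transform(train_X), train_y)

        for i in sorted(test_idx):
            # The same fitted filter is applied to the test instance as-is.
            pred = model.predict(flt.transform([rows[i]])[0])
            correct += pred == labels[i]
    return correct / len(rows)


# Made-up data, purely for illustration.
X = [[1.0, 200.0], [1.2, 220.0], [0.9, 210.0],
     [3.0, 800.0], [3.2, 790.0], [2.9, 820.0]]
y = ["a", "a", "a", "b", "b", "b"]
accuracy = cross_validate(X, y, k=3)
```

The point of the design is that `fit` is never called on test instances, so nothing about the test fold leaks into the filter.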
We’ve talked about optimizing the error rate, or the classification accuracy, but really, in most situations, we should be talking about costs, not raw accuracy figures, and these are different things. There’s a whole panel in the Weka Explorer for attribute selection, which helps you select a subset of attributes to use when learning, and in many situations it’s really valuable, before you do any learning, to select an appropriate small subset of attributes to use. There are a lot of clustering techniques in Weka.
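To see why cost and raw accuracy are different things, consider this small Python sketch. The confusion matrices and the cost matrix are made-up numbers chosen for illustration, and the function names are hypothetical. Both models get the same accuracy, yet their total costs differ widely once one kind of error is priced higher than the other.

```python
def accuracy(confusion):
    """Fraction of instances on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def total_cost(confusion, cost):
    """Sum of (number of instances in each cell) x (cost of that cell)."""
    n = len(confusion)
    return sum(confusion[i][j] * cost[i][j]
               for i in range(n) for j in range(n))


# Rows are actual classes, columns are predicted classes,
# in the order [negative, positive]. All numbers are invented.
model_a = [[85, 5],
           [5, 5]]
model_b = [[80, 10],
           [0, 10]]

# Assumed costs: a false positive costs 1, a false negative costs 10.
cost = [[0, 1],
        [10, 0]]

print(accuracy(model_a), total_cost(model_a, cost))
print(accuracy(model_b), total_cost(model_b, cost))
```

Under these assumed costs the second model is clearly preferable, even though an accuracy figure alone cannot tell the two apart.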
Clustering is where you want to learn something even when there is no class value: you want to cluster the instances according to their attribute values. Association rules are another kind of learning technique where we’re looking for associations between attributes. There’s no particular class, but we’re looking for any strong associations between any of the attributes. Again, that’s another panel in the Explorer. Text classification. There are some fantastic text filters in Weka which allow you to handle textual data as words, or as characters, or n-grams (sequences of three, four, or five consecutive characters). You can do text mining using Weka. Finally, we’ve focused exclusively on the Weka Explorer, but the Weka Experimenter is also worth getting to know.
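For readers who have not met character n-grams before, the following Python sketch shows the basic idea of turning a string into counts of overlapping character sequences. It is not how Weka's text filters are implemented; the function names and the default sizes of three, four and five are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, sizes=(3, 4, 5)):
    """Count character n-grams of each requested size in one Counter."""
    counts = Counter()
    for n in sizes:
        counts.update(char_ngrams(text, n))
    return counts


print(char_ngrams("weka", 3))
print(ngram_counts("banana", sizes=(3,)).most_common(1))
```

Counts like these can then serve as attribute values for a learning method, in the same way that word counts can.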
We’ve done a fair amount of rather boring, tedious calculations of means and standard deviations manually by changing the random-number seed and running things again. That’s very tedious to do by hand. The Experimenter makes it very easy to do this automatically. So, there’s a lot more to learn. Let me just finish off here with a final thought. We’ve been talking about data, data mining. Data is recorded facts, a change of state in the world, perhaps.
That’s the input to our data mining process, and the output is information, the patterns – the expectations – that underlie that data: patterns that can be used for prediction in useful applications in the real world. We’re going from data to information. Moving up in the world of people, not computers, “knowledge” is the accumulation of your entire set of expectations, all the information that you have and how it works together – a large store of expectations and the different situations where they apply. Finally, I’d like to define “wisdom” as the value attached to knowledge. I’d like to encourage you to be wise when using data mining technology. You’ve learned a lot in this course.
You’ve got a lot of power now that you can use to analyze your own datasets. Use this technology wisely for the good of the world. That’s my final thought for you.
There’s no magic in data mining! In fact, perhaps Weka makes things too easy. It is important to understand, and evaluate, what you’re doing, not just click around looking for good results. You’ve learned lots, but we’ve missed out plenty. Finally, I’d like to encourage you to be wise when using data mining technology. You’ve gained the power to analyze your own datasets. Use this technology wisely, for the good of the world.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.