Skip to 0 minutes and 11 seconds Hi! Well, they say all good things come to an end, and this is the end of More Data Mining
Skip to 0 minutes and 17 seconds with Weka: the last class. Let’s just summarize a few things here. This summary is actually from the previous course, Data Mining with Weka. These are the main messages I wanted to convey there, and they’re the same main messages this time. There’s no magic in data mining. There’s no single universal “best method”. It’s an experimental science. Weka makes it easy to experiment, especially now that you know how to use the Experimenter. But there are many pitfalls, and many ways to go wrong. You really need to understand what it is that you’re doing, and focus on evaluation and statistical significance using the Experimenter. I talked more about all of these points at the end of the last course.
Skip to 0 minutes and 58 seconds You could go back and look at that video if you’d like some more expansion on these. This slide is also from the last course.
Skip to 1 minute and 5 seconds This is what we missed from Data Mining with Weka: filtered classifiers, working with cost matrices, selecting attributes, clustering, association rules, text classification, and using the Experimenter. These should all sound very familiar to you, because we’ve talked about them all extensively in this course. Plus more besides! We talked about big data. You experienced big data. We talked about the Command Line Interface; the Knowledge Flow Interface; streaming data using the Command Line Interface through NaiveBayesUpdateable; discretization and discretization filters; the difference between rules and trees – the similarities and differences; Multinomial Naive Bayes for text classification.
Skip to 1 minute and 54 seconds We had a little look at neural networks: the simple Perceptron and the Multilayer Perceptron. We learned about ROC curves, and learning curves, and some more stuff about the ARFF format and the XML version of it. We’ve done a lot. You’ve done a lot, actually, and I congratulate you on having got this far. This has been pretty intensive stuff. You’ve learned a lot about a lot of important things. Of course, there’s always more!
Skip to 2 minutes and 23 seconds Time series analysis is a really important area: how to do data mining on time series.
Skip to 2 minutes and 30 seconds Stream-oriented algorithms: NaiveBayesUpdateable is stream-oriented, but there exist stream-oriented versions of other algorithms, like decision tree methods. They’re in the MOA package (Massive Online Analysis), also from the University of Waikato. Multi-instance learning, where it’s not single instances, but bags containing several instances that are labeled positive or negative. One-class classification, where you don’t have any information about the negative class, just about the positive class. That makes things very difficult, but there are some things you can do. Other data mining packages. There’s a package called R, which has a lot of excellent resources. Actually, you can interface to this from Weka, so Weka can take advantage of those resources.
Skip to 3 minutes and 16 seconds Also, there’s the LibSVM package for support vector machines and the LibLinear package for linear classification. They can all be reached through the Weka interface with the appropriate wrapper package. There’s a distributed version of Weka with the Hadoop system for distributing processing. Finally, there’s a technique called “latent semantic analysis” that you really need to know about to work on text classification. All of these things are available as packages for Weka. Here are just a few final remarks. Data mining is really important. They’re talking about data as the new oil. The economic and social importance of data mining will rival that of the oil economy – some people say by 2020; it might be happening as we speak. You’re right in there.
Skip to 4 minutes and 5 seconds Data mining is a wonderful thing to know about. It’s an exploding field. It will continue to explode. Personal data is becoming a new economic asset class. You know, it used to be that the data revolution, the internet revolution, was about our ability to learn stuff from the internet. You know, Wikipedia and all the things you can learn. But a lot of it now is about personal data, our own personal data and the economic importance of that. We need a lot more trust than we have at the moment between individuals and governments and the private sector in order to take full advantage of this new economic asset. We had a lesson on ethics in the last course.
Skip to 4 minutes and 49 seconds We haven’t had a lesson on ethics here, but it’s just as important. I would urge you to think ethically whenever you’re working with data. “A person without ethics is a wild beast loosed upon this world,” Albert Camus said. I don’t want to loose a whole bunch of wild beasts through this course. So please think of ethics and what is ethical and the right kind of thing to do when you’re working with other people’s data.
Skip to 5 minutes and 18 seconds Finally, wisdom: you know, the value attached to knowledge. This is the really important thing.
Skip to 5 minutes and 24 seconds Jimi Hendrix is supposed to have said: “knowledge speaks, but wisdom listens”, which is worth pondering. I’ve enjoyed giving this course, and I hope maybe I’ll meet you again in another version of this course. But for now, I’m just going to relax and play some music while you do the assessment. Bye for now!
There’s no magic in data mining – no universal “best” method. It’s an experimental science. This video reviews what this course has covered, and points out many things that it hasn’t covered. Finally, data mining is a powerful technology – please use it wisely.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.