Hi! Welcome to the course Data Mining with Weka. I’m Ian Witten from the University of Waikato in New Zealand and I’m presenting the videos for this course which is being prepared by the Department of Computer Science at the University of Waikato. Data mining is a mature technology that a lot of people are beginning to take very seriously, and a lot of other people find it mysterious. The real aim of this course is to take the mystery out of data mining. This is a practical course on how to use the Weka workbench, which you will download as part of the course, for data mining.
We explain the basic principles of several popular data mining algorithms and how to use them in practical applications. In the world today, we’re overwhelmed with data. Every time you swipe your credit card, every item you check out at the supermarket, every time you send a text, make a phone call, send an email, or type a key on a computer, even every time you walk past a security camera – it all generates a little bit of data in a database. Data mining is about going from the raw data to information, information that can be used to make predictions, predictions that are useful in the real world. Let me give you an example. You’re at the supermarket checkout.
The till records every item you bought. At the end, you hand over your loyalty card, and they give you a couple of percent off, and you give them your name and address, and, indirectly, access to all sorts of demographic information about you and people like you. Everybody likes a good bargain. It’s been a good day today, because, thanks to those coupons they sent you in the mail last week, you’ve been able to stock up on some things you wouldn’t normally have bought, but you bought today because they’re such a good deal. Next week they’ll send you some more coupons, and you’ll go shopping again and buy some more stuff.
They do little experiments on you, you know, they try to figure out how much more you would buy if the price was just that little bit less. These coupons are a mechanism for personalized pricing. They’ve got access to all sorts of data from you, and people like you, in order to do these experiments and figure these things out.
Everybody wins: you get your bargains; they sell more stuff. It sounds like a good deal to me. Here’s another application. Suppose you and your partner want a child, but you can’t have one. It’s fun trying, but it can get a little bit frustrating, and, ultimately, very frustrating, perhaps even tragic. With in vitro fertilization (IVF), they take some eggs from the woman’s ovaries, and they fertilize them with partner or donor sperm, and then they select from amongst the embryos that are produced some to implant back into the womb. You want to select the ones with the best chance of producing a live birth, but you don’t want too many live births.
The embryologist has access to all sorts of data on these embryos. I think there are 50–100 pieces of information that they record about individual embryos, and they have historical data on which ones produced a live birth – a success. So here’s an ideal situation for data mining. We have lots of historical data; we have data on the present situation; and we want to select those embryos that have the best chance of success. Now, that’s a good application for data mining, bringing a live child to a couple who wants one. I talk about “data mining” and “machine learning”. Data mining is the application, and machine learning is the algorithms we use.
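To make the selection idea concrete, here is a minimal sketch in Python. It is not how any clinic or Weka does it: it assumes a simple k-nearest-neighbour score, two hypothetical numeric features per record, and a handful of invented toy values. The names `HISTORY`, `success_score`, and `select_best` are made up for this illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Toy historical records: (feature vector, outcome), where 1 = success.
# All numbers are invented for illustration; they are not real measurements.
HISTORY = [
    ((0.9, 0.8), 1),
    ((0.8, 0.9), 1),
    ((0.7, 0.7), 1),
    ((0.3, 0.2), 0),
    ((0.2, 0.4), 0),
    ((0.1, 0.1), 0),
]


def success_score(candidate, history=HISTORY, k=3):
    """Fraction of successes among the k historical records nearest to candidate."""
    nearest = sorted(history, key=lambda rec: dist(rec[0], candidate))[:k]
    return sum(outcome for _, outcome in nearest) / k


def select_best(candidates, n, history=HISTORY, k=3):
    """Return the n candidates with the highest estimated chance of success."""
    ranked = sorted(candidates, key=lambda c: success_score(c, history, k), reverse=True)
    return ranked[:n]


# Data on the present situation: three hypothetical candidates.
current = [(0.85, 0.75), (0.25, 0.3), (0.6, 0.65)]
chosen = select_best(current, n=2)
print(chosen)
```

The historical records play the role of the training data, the `current` list is the present situation, and the score is the prediction. Which learning algorithm produces that score is a design choice; nearest neighbours is used here only because it fits in a few lines.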
We’re talking about using machine learning algorithms for the purposes of data mining. The next question – this is Data Mining with Weka – “What’s Weka?” This is a weka here, this little bird. It’s a flightless bird, like its better known cousin the kiwi, found only in the islands of New Zealand. This is what it sounds like, coming to you from New Zealand. However, in our context, Weka is a data mining workbench. It’s an acronym for the Waikato Environment for Knowledge Analysis. We just call it Weka. It contains a large number of algorithms for classification, and a lot of algorithms for data preprocessing, feature selection, clustering, finding association rules, things like that.
It’s a very comprehensive workbench, and it’s free open source software that you will download as part of this course in the next lesson. It runs on any computer. It’s written in Java, and runs on Linux, Windows, and Mac. You’ll be able to download it and run it on your workstation and use it during the course. You’re going to learn how to load data into Weka and look at it. You’re going to learn about preprocessing, cleaning up data using filters, exploring it using visualizations, applying classification algorithms, interpreting the output, and understanding evaluation methods – evaluation is very important in this area. You’ll also come to understand various representations for models and how popular machine learning algorithms work, and become aware of common pitfalls with data mining.
The ultimate goal really is to empower you to use Weka on your own data, and, most importantly, to understand what it is you are doing. That’s it. I just thought I’d show you where I am. I’m in New Zealand, that’s where Weka is from. That’s where I’m sitting right now. This is the world as we see it in New Zealand. We’re at the top, you’re probably down at the bottom somewhere. We’re at the top, in the center, and that arrow to the North Island of New Zealand is where the University of Waikato is. I’ll see you again in the next lesson. I’m looking forward to that. Goodbye for now.