Hi! It’s me, back again. I’ve probably got a bit of a tan since I saw you last. I’ve been out sailing around the coast of New Zealand in my beautiful yacht, Beulah – here she is – for a couple of weeks while the other guys have been recording the lessons. Anyway, I’m just back here to close out the course. We’ve done a lot in this course. We’ve covered a lot of ground, and you’ve learned a lot. Congratulations for getting this far, and double congratulations if you’ve managed to do all the activities [quizzes]. This is what we’ve done in Advanced Data Mining with Weka. A couple of things have been missed out: multi-instance learning and latent semantic analysis.
You’ll have to learn those yourself, I’m afraid. We didn’t do a lesson on one-class classification, but there was a good activity [quiz] on that. We’ve done some extra things. We’ve done some scripting in the Python and Groovy languages, and we’ve done some applications. The applications we’ve looked at have been particularly enlightening, I think. The first one was Geoff Holmes talking about infrared data from soil samples. He explained that it was hard to achieve sufficiently good performance for practical application. In the activity [quiz], you didn’t get there. You need to do more work on those datasets.
You need to investigate dealing with outliers and improving the quality of the data and some more tweaking of the classifiers and filters in that huge space of experimentation. Then Tony Smith talked about bioinformatics, the problem of signal peptide prediction, and he emphasized that domain knowledge is vital. You need to collaborate with experts. That’s true, of course, for all applications. You need to know whether you’re looking for an accurate prediction or an explanatory model, and overfitting, of course, is a big issue in all applications. Then Pamela talked about functional MRI neuroimaging data. You know, what’s going on in your brain! It was a 3D – a 4D – dataset, the 3 dimensions of your head plus an extra dimension of time.
Again, the performance we got in the activity [quiz] was not all that high, and there were various things that we might consider doing to improve that, most of which would involve domain experts to help interpret the data. This is a common thread through all the applications. A very interesting finding was, in an early competition, just the demographic data alone did well – in fact, it won the competition! It’s extremely important to evaluate what you’re doing and try the simple models first. We’ve been saying that all along. Finally, Mike told us about image classification and the specialist feature extraction techniques for images.
In fact, when I asked him to do this lesson, we didn’t have the feature extraction package that we now have in Weka. He created it in order to do the lesson. This is typical in applications. You need different extraction techniques for different kinds of data. I’m interested in enabling you to carry on learning, to keep learning in the future. One really good way is to look at data mining competitions. There’s a website called Kaggle. Let me just find it for you. Just do a Google search for Kaggle, and here we have it. Kaggle Competitions. There are a large number of competitions here. The first group, this group here, these are the featured competitions, and here you can win money.
This AI Science Challenge is worth $80,000, for example, and the Home Depot Product Search Relevance for $40,000. You can win real money doing data mining with these competitions. The second group of competitions are for recruitment purposes. You can get jobs if you do well with the Airbnb challenge or the Telstra challenge, or the Yelp challenge. They’ll offer you a job in data mining, so that’s pretty cool. Here are some featured datasets. Actually, the Iris dataset you’re very familiar with from the first courses, but here are some interesting ones: the Ocean Ship Logbooks, and Salaries in the San Francisco area.
And some datasets for playing around: here’s the San Francisco crime classification dataset; sounds very interesting. And this last group, “Getting Started”, contains tutorial/educational datasets. You can play around with these and look at what other people have done. These are all current competitions. You can find past competitions by looking for … “completed competitions”, that’s the phrase. Let’s just look for those. Here we’ve got competitions from two years ago. $500,000, two years ago. Sorry, you’re too late for that, but anyway, someone won half a million two years ago. There’s big money in competitions. Here’s $250,000, again a couple of years old. So there are a lot of past competitions.
On the Kaggle website, we have not just those competitions, but information about completed competitions, past solutions, interviews with winners on the Kaggle blog, and descriptions of winners’ solutions. So there’s a lot of information there. If you want to keep learning about data mining, Kaggle would be a good place to start. I have to finish with a little word on ethics. Don’t forget! I’m always saying this. Ethics of data mining is very much in the news these days. Here are just a few web quotes I got with a very quick search. “More than ever, knowingly or unknowingly, consumers disseminate personal data in daily activities.” Well, we all know that. “As companies seek to capture data about consumer habits, privacy concerns have flared.” Yes. “Data mining: where legality and ethics rarely meet.” That’s an interesting little title, and the point of that article was that just because you’re doing things legally, in accordance with the law, doesn’t necessarily mean you’re doing things ethically. I would like you to do things ethically, because you’re an ethical person. It’s the right thing to do. You have personal integrity. But if that’s not enough for you, there are good business reasons for doing things ethically. “Big data might be big business, but overzealous data mining can seriously destroy your brand.” You have to be very careful when you’re doing data mining.
And, the final one, “What big data needs: A code of ethical practices”. So please be aware of ethical issues when you do your data mining. Well, that’s it. This is the end of the course. I hope to meet you again in some other place, some other time. I look forward to that. Meanwhile, enjoy your data mining. And while you’re doing that, I’ll go back to doing something I really love, and play some music. Bye for now!
We’ve covered a lot of ground in this course. Congratulations for getting this far, and double congratulations if you’ve managed to do all the Quizzes! I encourage you to keep learning; one good way is to challenge yourself by tackling data mining competitions. Kaggle offers many competitions, past and present, some with attractive prizes! Finally, a little word on ethics. I urge you to be ethical in your use of data mining, partly because there are good business reasons for doing things ethically, but mainly because I want to encourage you to have personal integrity and conduct yourself in an ethical way.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.