Discover practical data mining and its applications using the popular Weka workbench.

37,955 enrolled on this course

  • Duration

    5 weeks
  • Weekly study

    3 hours
  • Accreditation

    Available
The CPD Certification Service

The CPD Certification Service was established in 1996 and is the leading independent CPD accreditation institution operating across industry sectors to complement the CPD policies of professional and academic bodies.

Learn how to mine data using Weka, with the University of Waikato

On this five-week course, you’ll discover how to mine data using the Weka workbench, a powerful tool for machine learning and data mining.

Guided by experts at the University of Waikato, the original developers of Weka, you’ll learn the basics of data visualisation, classification algorithms, and data interpretation and evaluation.

Explore the basics of data interpretation and evaluation

Beginning with an introduction to data mining concepts, you’ll discover the various applications of data mining in personal and professional contexts.

You’ll examine how to evaluate a classifier’s performance and use training, testing, and cross-validation to gauge the accuracy of the classifiers you build.

With these skills, you’ll be able to improve the quality of your data and develop meaningful answers to the questions you’re asking.
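
The course itself uses Weka's graphical tools and needs no programming, but the idea behind cross-validation can be sketched in a few lines of Python. The example below is a minimal illustration, not Weka's implementation: the function names and the toy labels are hypothetical, the folds are contiguous rather than shuffled or stratified, and the "classifier" is a simple majority-class baseline in the spirit of ZeroR.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k nearly equal, contiguous test folds (assumes k <= n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_class(labels):
    """Baseline 'classifier': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(labels, k):
    """Mean accuracy of the majority-class baseline over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        # Train only on instances outside the current test fold.
        train = [y for i, y in enumerate(labels) if i not in held_out]
        test = [labels[i] for i in test_idx]
        prediction = majority_class(train)
        accuracies.append(sum(y == prediction for y in test) / len(test))
    return sum(accuracies) / k


# Hypothetical toy class labels (not one of the course datasets).
labels = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"]
print(cross_validate(labels, k=5))
```

Each instance is used for testing exactly once and never appears in the training data for its own fold, which is what keeps the estimate from being overly optimistic.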

Organise your data using classifiers

Exploring both simple and more complex classifiers, you’ll learn how different classification methods can be used to interpret datasets.

You’ll investigate the applications of concepts including decision trees, linear regression, and support vector machines, learning how to apply the correct classification method to your problem.

Examine the full data mining process

In the final week of this course, you’ll put your learning into context by exploring the full data mining process.

You’ll address common pitfalls and challenges in accessing data and assess the ethics of data mining, giving you a broader understanding of how and when data mining should be used in different contexts.

You’ll finish this course understanding what Weka is and how to gather and interpret big data. You’ll be aware of the full data mining process and be able to explain and apply Weka within your own data mining work.


Hello! My name’s Ian Witten, I’m from the University of Waikato here in New Zealand, and I want to tell you about our new, free, online course – Data Mining with Weka. We’re overwhelmed by data in the world today. Every time we check out an item at the supermarket, every time we swipe our credit card, every time we send an email, every time we type a keystroke on our computer, every time we make a phone call, send a text, walk past a security camera – we all generate a little bit of data.

Data mining is about taking this raw data, and transforming it into something more useful: information, perhaps; or predictions, predictions about what might happen next, predictions that can be used in the real world. The real aim of this course is to take the mystery out of data mining, to give you some practical experience actually using the Weka toolkit to do some mining on the data sets that we provide, to set you up so that, later on, you can use Weka to work on your own data sets and do your own data mining. It doesn’t involve any programming or anything like that. You’re going to be using the tools that we provide, the Weka tools.

It might help to know a little bit of elementary statistics, like means, variances, standard deviations, and so on. You might see a couple of mathematical formulae, but I’ll explain those, so don’t worry about that. You don’t really need any specific mathematical background. So that’s it – Data Mining with Weka, coming soon to a computer near you. I’m looking forward to it, and I hope to see you there. Bye for now!

Syllabus

  • Week 1

    A little bit of everything

    • What's data mining? What's Weka? What's the course about?

      Everybody talks about data mining and "big data" nowadays. This course introduces you to practical data mining. Weka is a powerful yet easy to use tool for machine learning and data mining that can also tackle large problems.

    • What's it like to do data mining?

      Each week we’ll focus on a “Big Question” relating to data mining. This is the question for this week.

    • Exploring the Explorer

      In this activity you will install Weka on your computer, start the Weka Explorer, and load, view, and edit datasets. (Note: You may need admin access to install Weka.)

    • Exploring datasets

      Classification (also called "supervised learning") is a common kind of data mining problem. Datasets contain "instances," which are described in terms of a fixed set of features, or "attributes".

    • Building a classifier

      Now you will learn how to use Weka's popular J48 classifier, which builds decision trees. J48 is a reimplementation of a classic classifier algorithm called C4.5.

    • Using a filter

      Weka contains "filters" that help with cleaning and preparing your data. Some filters operate on attributes; others operate on instances.

    • Visualizing your data

      With Weka's visualization tool you can clean the data and remove anomalous instances (outliers). You can also visualize the errors that classifiers make.

  • Week 2

    Evaluation

    • How do I evaluate a classifier’s performance?

      This week's Big Question!

    • Be a classifier!

      Weka incorporates many different classification algorithms. One, called the “UserClassifier,” enables you to build your own decision tree for classification. How well can you do? It’s a challenge!

    • Training and testing

      Evaluating what has been learned is an essential part of data mining. You should never evaluate on the training set: the results will be overly optimistic. If you have a single dataset, hold some data back for testing.

    • Repeated training and testing

      Ideally, training and test sets are sampled independently from a large population. Different samples give slightly different performance estimates. More reliable results are obtained by averaging over several experimental runs.

    • Baseline accuracy

      How do you know how well your machine learning method is doing? You should always compare it with the “baseline accuracy” obtained by simple methods. ZeroR is an extremely simple method that serves as a useful baseline.

    • Cross-validation

      Cross-validation, a standard evaluation technique, is a systematic way of running repeated percentage splits. In “stratified” cross-validation, training and test sets have the same class distribution as the full dataset.

    • Cross-validation results

      Cross-validation is better than randomly repeating percentage split evaluations. It gives a more reliable performance estimate – that is, one with lower variance. Ten-fold cross-validation is a standard evaluation method.

    • How are you getting on?

      We're well into the course now. Let's just take stock.

  • Week 3

    Simple classifiers

    • How do simple classification methods work?

      This week's Big Question!

    • Simplicity first

      Always try simple methods before complex ones! (A good maxim for life in general, not just data mining.) Sometimes, simple algorithms perform really well. We learn about OneR, a simple method that is sometimes quite effective.

    • Overfitting

      “Overfitting” is a general problem that plagues all machine learning methods. It’s when a classifier fits the training data too tightly. The classifier works well on the training data but not on independent test data.

    • Using probabilities

      Why not use all attributes, equally weighted, instead of a single one as OneR does? Bayes' Theorem provides a sound probabilistic foundation for this. "Naive" Bayes assumes that attributes are equally important and independent.

    • Decision trees

      Decision trees are another simple classification method, based on a top-down, recursive, divide-and-conquer strategy. J48 (aka C4.5) finds a good attribute to split on at each stage using a measure called "information gain."

    • Pruning decision trees

      Decision trees can easily overfit the training data, and pruning techniques are needed to guard against overfitting. There are various different methods. Unfortunately, this is where elegant algorithms get messy!

    • Nearest neighbor

      How about storing the training instances and giving new instances the same classification as their nearest neighbor? A similarity function is needed to select the closest instance. Using several neighbors can improve performance.

  • Week 4

    More classifiers

    • What about real-life classification methods?

      This week's Big Question!

    • Classification boundaries

      Different classifiers are biased towards different kinds of decision, which you can explore by visualizing the classification boundaries. We look at classification boundaries for OneR, IBk, NaiveBayes, and J48.

    • Linear regression

      "Regression" problems are where the class is numeric, and "linear regression" is a standard mathematical technique for predicting numeric classes. In addition, there are non-linear methods that build trees of linear models.

    • Classification by regression

      Linear regression can be used for classification as well. For two-valued nominal classes, just convert them to 0 and 1. For more class labels, either "multi-response linear regression" or "pairwise linear regression" can be used.

    • Logistic regression

      Sometimes it’s best to predict class probabilities instead of predicting the classes themselves. Linear regression can be made to work with probabilities, resulting in logistic regression, a popular classification technique.

    • Support vector machines

      Support vector machines separate the classes using the "maximum margin hyperplane." This is defined by a few instances, called "support vectors," from each class. The boundary depends on a few points, which reduces overfitting.

    • Ensemble learning

      Many of us dislike committees, but nevertheless they often make good, unbiased, decisions. Several machine learning methods use "committees" of different classifier algorithms: Bagging, Random forests, Boosting, and Stacking.

  • Week 5

    Putting it all together

    • What else is there to know?

      This week's Big Question!

    • The data mining process

      Producing classifiers is just a small part of the overall data mining process – perhaps the easiest part! Other parts involve formulating the question, gathering data, cleaning it, defining new features, and deploying the result.

    • Pitfalls and pratfalls

      Be skeptical, and particularly wary of overfitting. Missing values can signify various things; classifiers treat them differently. There’s no single "best learner"; all methods have biases. Data mining is an experimental science!

    • Data mining and ethics

      It’s far harder to anonymize data than you think! The purpose of data mining is to discriminate, but some kinds of discrimination are unethical, and illegal. Data mining discovers correlations, but these do not imply causation.

    • There's no magic in data mining

      There’s no magic in data mining! – in fact, perhaps Weka makes things too easy. You’ve learned lots, but don’t be smug: this course has missed out plenty. And you've learned a powerful technology: please use it wisely.

    • Farewell

      It's time to say goodbye.
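
The Week 3 item on decision trees mentions choosing a split attribute by "information gain". As a rough illustration of that measure only (not of J48 itself, which adds refinements such as pruning), here is a small Python sketch. The function names and the 14-instance toy dataset are hypothetical and are not taken from the course materials.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy, in bits, of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on one nominal attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: two nominal attributes and a yes/no class.
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
windy = [False, True, False, False, True,
         False, True, True, False,
         False, False, False, True, True]
play = ["no", "no", "no", "yes", "yes",
        "yes", "yes", "yes", "yes",
        "yes", "yes", "yes", "no", "no"]

gains = {
    "outlook": information_gain(outlook, play),
    "windy": information_gain(windy, play),
}
# A top-down tree learner would split first on the attribute with the largest gain.
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Applying the same calculation recursively to each resulting subset gives the divide-and-conquer strategy the syllabus describes.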

Who is this accredited by?

The CPD Certification Service

The CPD Certification Service was established in 1996 and is the leading independent CPD accreditation institution operating across industry sectors to complement the CPD policies of professional and academic bodies.

Learning on this course

On every step of the course you can meet other learners, share your ideas and join in with active discussions in the comments.

What will you achieve?

By the end of the course, you’ll be able to...

  • Demonstrate use of Weka for key data mining tasks
  • Evaluate the performance of a classifier on new, unseen, instances
  • Explain how data miners can unwittingly overestimate the performance of their system
  • Identify learning methods that are based on different flavors of simplicity
  • Apply many different learning methods to a dataset of your choice
  • Interpret the output produced by classification methods
  • Describe the principles behind many modern machine learning methods
  • Compare the decision boundaries produced by different classification algorithms
  • Debate ethical issues raised by mining personal data

Who is the course for?

This course is designed for anyone considering a career in data science or those currently working in the data sector wanting to further their knowledge of data mining software.

What software or tools do you need?

You will download the free Weka software during Week 1. It runs on any computer, under Windows, Linux, or Mac. It has been downloaded millions of times and is being used all around the world.

(Note: Depending on your computer and system version, you may need admin access to install Weka.)

Who will you learn with?

I grew up in Ireland, studied at Cambridge, and taught computer science at the Universities of Essex in England and Calgary in Canada before moving to paradise (aka New Zealand) 25 years ago.

Who developed the course?

The University of Waikato

Sitting among the top 3% of universities world-wide, The University of Waikato prepares students to think critically and to show initiative in their learning.

  • Established

    1964
  • Location

    Waikato, New Zealand
  • World ranking

    Top 380 (Source: QS World University Rankings 2021)

Learning on FutureLearn

Your learning, your rules

  • Courses are split into weeks, activities, and steps to help you keep track of your learning
  • Learn through a mix of bite-sized videos, long- and short-form articles, audio, and practical activities
  • Stay motivated by using the Progress page to keep track of your step completion and assessment scores

Join a global classroom

  • Experience the power of social learning, and get inspired by an international network of learners
  • Share ideas with your peers and course educators on every step of the course
  • Join the conversation by reading, @ing, liking, bookmarking, and replying to comments from others

Map your progress

  • As you work through the course, use notifications and the Progress page to guide your learning
  • Whenever you’re ready, mark each step as complete; you’re in control
  • Complete 90% of course steps and all of the assessments to earn your certificate

You can use the hashtag #FLdatamining to talk about this course on social media.