How do you classify documents?

Ian Witten introduces this week's second Big Question

Document classification is a popular and important application of data mining.

But how can it be done? Weka allows string attributes, and it’s simple to load an entire document into a single “string” attribute, but what then? We need to be able to get inside the document, to somehow look at its content, in order to stand a chance of classifying it.

It’s easy once you know how. That’s what you’re about to find out.
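As a preview of the general idea, one common way to "get inside" a document is to break the string into words and count them, so that each word becomes an ordinary numeric attribute. The following is a minimal plain-Python sketch of that idea, not Weka code; the two example documents and the helper names `bag_of_words` and `to_vector` are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(tokens)

def to_vector(counts, vocabulary):
    """Turn word counts into a fixed-length list, one entry per vocabulary word."""
    return [counts[word] for word in vocabulary]

# Two hypothetical documents, each held as a single string.
docs = ["The cow jumped over the moon", "The moon is bright"]

bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))          # every distinct word seen
vectors = [to_vector(b, vocabulary) for b in bags]

print(vocabulary)
for v in vectors:
    print(v)
```

Each document is now a row of numbers rather than one opaque string, which is a form that a standard classifier can work with.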

You’ll also learn about a new classifier, Multinomial Naïve Bayes, which is particularly appropriate for text classification. And you’ll learn ways of evaluating two-class classification that are more nuanced than the “percent correct” we have been using so far.
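To give a feel for what a multinomial Naïve Bayes classifier computes, here is a small sketch of the textbook version with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not Weka's implementation: the training messages, the class names "spam" and "ham", and the functions `train` and `predict` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """labelled_docs is a list of (tokens, label) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, word_counts, vocabulary

def predict(model, tokens):
    """Return the class with the highest log posterior score."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for word in tokens:
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
training = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train(training)
prediction = predict(model, "cheap pills on sale".split())
print(prediction)
```

The classifier uses how often each word occurs in each class, not just whether it occurs, which is why word counts suit it well.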

This content is taken from The University of Waikato online course More Data Mining with Weka.

Aside: I learned about the problems of evaluating correctness years ago, when we were trying to apply machine learning to detect when a cow is on heat from various behavioral attributes. (Really! This is an important economic issue for artificial insemination. Believe it or not, semen is expensive.) Cows have a reproductive cycle comparable in length to a woman’s menstrual cycle, and remain in estrus for about a day. If you always predict “This cow is not in estrus today”, you’ll be right on 27 days out of 28, about 96% of the time. That’s an impressive correctness figure. But it’s not necessarily what you want: a predictor like this never identifies the one day that matters.
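The arithmetic behind that figure can be checked directly. The sketch below assumes a 28-day window with a single day in estrus, as in the 27/28 figure, and a predictor that always answers "not in estrus". The second measure, recall (the fraction of true estrus days that are detected), is a standard name introduced here for illustration.

```python
# One day in estrus out of 28 (True = in estrus).
actual = [True] + [False] * 27
# A predictor that always says "not in estrus".
predicted = [False] * 28

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall: of the days the cow really was in estrus, how many were caught?
true_positives = sum(a and p for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)

print(f"accuracy = {accuracy:.1%}")
print(f"recall   = {recall:.1%}")
```

High accuracy and zero recall sit side by side, which is why percent correct alone can be misleading when one class is rare.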

At the end of this week you’ll be able to use Weka to classify documents. And you’ll be able to use “threshold curves” to show different tradeoffs between error types (e.g. predicting “in estrus” for a cow that is not in estrus is a different type of error than predicting “not in estrus” for a cow that is in estrus).
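The tradeoff a threshold curve displays can be sketched in a few lines. The example assumes a classifier that outputs a score for "in estrus" and predicts positive whenever the score reaches a chosen threshold. The scores, labels, thresholds and the helper `rates_at` are invented for illustration and do not come from Weka or from real data.

```python
def rates_at(threshold, scores, labels):
    """Return (true positive rate, false positive rate) at this threshold."""
    predictions = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Hypothetical classifier scores and the true labels (True = in estrus).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, False, True, False, False, False]

# Each threshold gives one point on the curve.
curve = [(t, *rates_at(t, scores, labels)) for t in (0.95, 0.7, 0.5, 0.0)]
for threshold, tpr, fpr in curve:
    print(f"threshold {threshold:.2f}: TP rate {tpr:.2f}, FP rate {fpr:.2f}")
```

Lowering the threshold catches more true cases but raises more false alarms; the curve lets you pick the balance that suits the cost of each error type.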
