This content is taken from The University of Waikato's online course, More Data Mining with Weka.

Skip to 0 minutes and 11 seconds Hello again! And now, as they say, for something completely different. The second half of this class is about document classification, this lesson and the next two. And the only thing it has to do with the first half of the class is that both use the Filtered Classifier. Let’s look at some documents. Here are 6 documents. They are very short documents (we’ll look at a much larger example in a minute), just a single sentence each, and they’re classified into “yes” and “no” classes. You can see when you read these that they are all about oil.

Skip to 0 minutes and 41 seconds The “yes” documents are about oil coming from the ground, and the “no” documents are about oil as used in cooking, “the food was very oily,” for example. We code this training set into ARFF in the standard way, with string attributes. For string attributes we just take the text and surround it by quotes, just as I’ve shown in the bottom here. I’ve loaded this dataset into Weka. We can just have a look at it here. There it is, just what you saw on the slide. And of course we can’t do anything with this at the moment. There are 6 distinct values for the text attribute, and no learning system can learn anything from these 6 different values.
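As a concrete illustration, a string-attribute ARFF file for this toy dataset might look roughly like the fragment below. Only the "oily" sentence is quoted from the lesson; the relation name, attribute names, and the first document text are illustrative assumptions.

```
@relation oil-documents

@attribute text string
@attribute type {yes, no}

@data
'Crude oil flows from wells in the ground', yes
'The food was very oily', no
```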

Skip to 1 minute and 23 seconds What we’re going to do is use a filter, the StringToWordVector filter – Unsupervised, Attribute, StringToWordVector – which is here. It’s got a bunch of options, but let’s just apply it.

Skip to 1 minute and 40 seconds Wow! Look at all these attributes. We’ve got 34 attributes. They’re words like “Crude” and “Demand” and “The”. When you look at it, these are just the words that appeared in the training documents. Actually, the type, the “yes” or “no” thing, has been moved to the first attribute, not the last attribute. When we look at the individual word attributes, like the one for “crude”, it’s just

Skip to 2 minutes and 8 seconds a number, it’s a numeric attribute with two values, 0 or 1: 0 if it doesn’t appear in that document, and 1 if it does appear in that document. Let’s go and classify this. Let’s use J48.
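The transformation just described can be sketched in a few lines of illustrative Python. This is not Weka's implementation, and the sample sentences are made up (apart from the "oily" one), but it shows the same idea: one 0/1 numeric attribute per distinct word.

```python
# Illustrative sketch (not Weka's code) of what StringToWordVector
# produces in its default binary mode.
docs = ["crude oil flows from the ground",  # hypothetical "yes" sentence
        "the food was very oily"]           # "no" sentence from the lesson

# One attribute per distinct word, analogous to the word
# attributes Weka created.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector: 1 if the word appears in it, else 0.
vectors = [[1 if word in doc.split() else 0 for word in vocab]
           for doc in docs]
```

With this encoding, a learner such as J48 can branch on whether a particular word is present, exactly as the tree in the lesson does.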

Skip to 2 minutes and 26 seconds It’s in gray, actually. I can still select it, but I can’t start it. The reason why I can’t start it is that, by default, Weka is predicting the last attribute, and the last attribute is numeric, the word “was”. So I’ll just change this to predict the “type”. Then I can run J48, but there’s a problem evaluating it, because there are only 6 instances and we’re trying to do 10-fold cross-validation, which isn’t going to work. Let’s just evaluate this on the training set for the moment. The most useful thing to look at in the result here is the decision tree that’s produced, which is here. Let’s look at the tree. You can see that it tests on the single word “crude”.

Skip to 3 minutes and 10 seconds If “crude” does not appear, then it’s a “no” document – that is, it’s about food. If “crude” does appear, it’s a “yes” document – that is, it’s about oil coming out of the ground. It makes kind of sense; it’s a pretty trivial example, I guess. I’ll just go back to the slide. This is what we’ve done. We loaded the dataset into Weka. We looked at the string attributes. We applied this filter, which created a lot of new attributes, one for each word. They were binary (2-valued) numeric attributes. We used J48, had to set the class attribute, and evaluated on the training set. Then we looked at the tree. I want to evaluate this on a “Supplied test set”.

Skip to 3 minutes and 48 seconds I want to see what the predictions are on this test set. These are the documents in the test set. I’ve coded them as Unknown, that is, a question mark in the ARFF file. We’ve never done this before. We haven’t ever looked at predictions for individual test documents or test instances. Let me now go and get the Supplied test set, which I have here. Now I’ve got that test set. I can start this running. Well, it’s obvious really – there’s a problem evaluating the classifier, because, you know, when I look at the test documents, it’s an ARFF file with string attributes, and the training documents are an ARFF file with word attributes.
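For reference, the unknown-class encoding mentioned above is just a question mark in place of the class value. Only the "Iraq" sentence below is taken from the lesson; the header is an assumption that mirrors the training file.

```
@relation oil-documents-test

@attribute text string
@attribute type {yes, no}

@data
'Iraq has significant oil reserves', ?
```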

Skip to 4 minutes and 39 seconds Of course, I can take these test documents and convert them using the StringToWordVector filter, but that still wouldn’t solve the problem, because I might have different words in a different order here, so I’d still have a different structure to the ARFF file. We’ve got to do something different. That’s where the FilteredClassifier comes in.
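The fix can be sketched in illustrative Python (again, not Weka's code; the training sentences are hypothetical). This is the idea behind the FilteredClassifier: the filter is built from the training set only and then applied, unchanged, to the test set, so both share one structure.

```python
# The vocabulary is derived from the training documents alone.
train_docs = ["crude oil flows from the ground",   # hypothetical "yes" text
              "the food was very oily"]            # "no" text from the lesson
test_docs = ["iraq has significant oil reserves"]  # test document from the lesson

vocab = sorted({w for d in train_docs for w in d.split()})

def to_vector(doc, vocab):
    """Map a document onto a fixed vocabulary; unseen words are dropped."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

# Test documents get exactly the same attributes as the training documents;
# words like "iraq" that never occurred in training are simply ignored.
test_vectors = [to_vector(d, vocab) for d in test_docs]
```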

Skip to 5 minutes and 1 second Just going back to the slide, there’s a problem evaluating the classifier: we can’t simply apply StringToWordVector to the test file. The solution is the FilteredClassifier. As we saw previously, the FilteredClassifier will create a filter from the training set and use it for the test set. That’s exactly what we’re going to do here. Coming back to Weka, I’m going to undo the effect of this filter, so I’ve got the original string attribute. I’m going to find the FilteredClassifier (meta>FilteredClassifier). I’m going to configure that to use J48 as the classifier, which is done by default,

Skip to 5 minutes and 41 seconds and I’m going to use the StringToWordVector filter: it’s an Unsupervised Attribute filter. Let me just run this.

Skip to 5 minutes and 59 seconds Here we get the result.

Skip to 6 minutes and 3 seconds That’s actually not very interesting, because these documents had question marks instead of classifications. What I wanted to do was output the predictions, and I can do that in the More options menu. If I click Output predictions and run it again, now I can see the predictions for the test instances. As you can see, there’s 1 “yes” and 3 “no” predictions. The actual class is a question mark in each case. Coming back to the slide. That’s not exactly what I wanted. The first instance is certainly “yes”, oil coming out of the ground, but so is the third. That should have been a “yes”, and, in fact, J48 has predicted a “no” for that document “Iraq has significant oil reserves”.

Skip to 6 minutes and 51 seconds Obviously, it doesn’t contain the word “crude”, which is the test that J48 is doing. Well, these are tiny little documents. Let’s look at something a bit more substantial. I’m going to take a big dataset, ReutersCorn-train.arff. Let’s just look at it in a minute. I’m going to open it now.

Skip to 7 minutes and 14 seconds There are 1,554 documents. This is a lot bigger. If I apply the StringToWordVector filter, then – it just takes a second – I get a lot of attributes, corresponding to words. Actually, there are 2,234 attributes. Again, the class attribute has been moved to the top, attribute number 1. I’m going to undo the effect of this [filter], because we’re going to classify this using the FilteredClassifier. I’m going to set a different test set. I’m going to open ReutersCorn-test.arff. Then I’m going to run this with J48. The FilteredClassifier. It’s just going to take a second. It’s finished now. I get 97% accuracy. Before we go on, let’s have a look at what this dataset looks like.

Skip to 8 minutes and 16 seconds I’m going to open up the file, the training file. Here it is.

Skip to 8 minutes and 22 seconds There are two attributes: a string attribute and a class attribute which is 0 or 1. Here’s the beginning of the first string, and it’s a long string. In fact, this open quote, right down to the closing quote here, this whole bit of text is one string attribute value. It’s followed by a 0, which means the classification of that document is 0. For this dataset, that means it’s not about corn. You can see this is regular text except for these “\n”s, which are newlines. If we just had a regular newline in a string, then Weka would get confused when you tried to load in that ARFF file, because it would think that the continuation of the line was the next instance.

Skip to 9 minutes and 10 seconds So we just encode newlines as “\n”. This is one instance, classified as 0. The next thing starts with a quote – this is the string – and it ends here. That’s a 1 document; this document is about corn. That doesn’t necessarily mean it just contains the word “corn”, it means that a human has decided whether this document is about corn or not about corn. I don’t know a lot about corn, but an expert will have made that decision. These are the documents, and, like I said, there are 1,554 of them. Each instance contains this extensive string. If I now go back and have a look, well, I’ve got really high accuracy, 97%, which sounds really good.
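So an instance in the training file looks roughly like the fragment below, with the embedded newline escaped as "\n". The text here is invented for illustration; the real documents are much longer.

```
@data
'CORN FUTURES RISE\nChicago - Corn futures rose sharply today.', 1
```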

Skip to 9 minutes and 54 seconds Unfortunately, though, when I look at this, the documents that are about corn, the “1” documents – there’s only 24 of them – and the accuracy there is 15 correct out of 24, which is not so good. For the “0” documents, the ones which aren’t about corn, then I’ve got 573 correct out of 580, which is very good. When I combine those two, that’s what gives me this rather high-looking 97% accuracy. When I look at the tree – here it is – it’s a little bit more complicated. We’re going to branch on the word “corn”. If the document contains the word “corn”, then we’re going to look for the word “planted”. If it contains the word “planted”, then it’s a “0”.

Skip to 10 minutes and 43 seconds If it doesn’t contain the word “planted”, then it’s a “1”, that is, it’s about corn. Down here, we’re looking for the word “1986/87”, which is a very strange thing to be looking for. We’re looking for the word “maize”. Here we’re looking for the word “the”. This tree doesn’t look like it makes a huge amount of sense. And yet it does get 97% accuracy. This is what we’ve done here. We looked at this dataset. We applied the StringToWordVector filter. We just had a look, and we found that there were 2,234 attributes. Then we used the FilteredClassifier to get 97% classification accuracy, but we discovered that the accuracy on the 24 corn-related documents was only 62%.
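The arithmetic behind those figures is worth spelling out. A quick sketch using the per-class counts reported above shows how the large majority class dominates the overall accuracy:

```python
# Per-class counts reported in the lesson (ReutersCorn test set).
correct_corn, total_corn = 15, 24        # documents about corn ("1")
correct_other, total_other = 573, 580    # documents not about corn ("0")

overall_acc = (correct_corn + correct_other) / (total_corn + total_other)
corn_acc = correct_corn / total_corn

print(f"overall accuracy: {overall_acc:.1%}")           # roughly 97%
print(f"accuracy on corn documents: {corn_acc:.1%}")    # roughly 62%
```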

Skip to 11 minutes and 28 seconds That’s a shame, because those are probably the documents we’re most interested in. These are the ones that aren’t about corn, and we get very high accuracy on those. Which makes you wonder whether the overall classification accuracy is really the right thing to optimize. This is what we’ve done in this lesson. We looked at string attributes. We looked at the StringToWordVector filter, which creates one attribute for each different word. We looked at the options for the StringToWordVector – no, we didn’t look at the options. Let’s have a really quick look back in Weka here at the options for the StringToWordVector filter.

Skip to 12 minutes and 2 seconds Suffice it to say, there are a lot of options: it’s a pretty comprehensive kind of filter. We’ll look at those options in a subsequent lesson. We looked at J48 models for text data. J48 is not necessarily a very sensible learning scheme to use on text data. Then we looked at the overall classification accuracy. Is it really what we care about? Perhaps not. That’s what we’re going to look at in the next lesson.

Document classification

A document classification problem can be represented in the ARFF format with two attributes per instance, the document text in a “string” attribute and the document class as a nominal attribute. But what can Weka do with a string? Every value – every document’s text – is different from all the others. To accomplish something meaningful, you need to look inside the string. And that’s exactly what the StringToWordVector filter does: it creates an attribute for each word. But – hey! – test documents will contain different words, some of which aren’t in any of the training documents. This conundrum is solved by using the FilteredClassifier, which you’ve already encountered. Cool!
