Document classification

Weka represents documents as "string" attributes, but Ian Witten shows how to use the StringToWordVector filter to create an attribute for each word.

A document classification problem can be represented in the ARFF format with two attributes per instance: the document text in a “string” attribute and the document class in a nominal attribute. But what can Weka do with a string? Every value – every document’s text – is different from all the others. To accomplish something meaningful, you need to look inside the string. And that’s exactly what the StringToWordVector filter does: it creates an attribute for each word. But – hey! – test documents will contain different words, some of which aren’t in any of the training documents. This conundrum is solved by using the FilteredClassifier, which you’ve already encountered: it builds the word attributes from the training documents alone and then applies the same filter to each test document, so the test data always has the same attributes as the training data. Cool!
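To make the idea concrete, here is a minimal Python sketch of the word-vector approach. It is not Weka's implementation: the function names, the tiny example documents, the simple lowercase tokenizer, and the 0/1 word-presence encoding are all assumptions chosen for illustration. What it shows is that the word attributes are fixed from the training documents, so a test document is mapped onto those same attributes and any unseen word is simply ignored.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (illustrative tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(train_texts):
    """Collect the distinct words of the training documents, in sorted order."""
    return sorted({word for text in train_texts for word in tokenize(text)})


def to_word_vector(text, vocabulary):
    """Map a document to one 0/1 value per vocabulary word (1 = word is present)."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical training data: (document text, class label) pairs,
# mirroring the two-attribute layout described above.
train = [
    ("the oil price rose", "yes"),
    ("the grain harvest fell", "no"),
]

# The word attributes come from the training documents only.
vocabulary = build_vocabulary(text for text, _ in train)
train_vectors = [to_word_vector(text, vocabulary) for text, _ in train]

# A test document is converted with the SAME vocabulary;
# "crude" never appeared in training, so it has no attribute.
test_vector = to_word_vector("crude oil price", vocabulary)

print(vocabulary)
print(test_vector)
```

Because the vocabulary is built once and reused, every training and test vector has the same length and the same meaning in each position, which is the property a classifier needs.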

This article is from the free online course More Data Mining with Weka

Created by
FutureLearn - Learning For Life
