
Review of terms

Ian Witten reviews the terms introduced in the preceding video

Let’s review some of the key terms we’ll be using.

A dataset is a set of instances

In Weka, it’s stored in what’s called an ARFF file. This is just a text file: a short header declares the attributes, and after that each line represents one instance. For example, weather.nominal.arff is an ARFF file; so is weather.numeric.arff. In the next video (“The Glass data”) we’ll see what an ARFF file looks like.
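As a rough illustration of that layout, the Python sketch below embeds an abbreviated ARFF text in a string and splits it into attribute names and instances. The relation name, the attribute declarations, and the second and third data rows are assumptions written from memory of the standard weather.nominal.arff, not taken from this article. The helper parse_arff is hypothetical and handles only this simple case (no quoting, sparse rows, or missing values).

```python
# Abbreviated ARFF text: an assumed excerpt, not the full 14-instance file.
ARFF_TEXT = """\
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
"""


def parse_arff(text):
    """Return (attribute_names, instances) for a very simple ARFF text."""
    attribute_names = []
    instances = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lowered = line.lower()
        if in_data:
            # Every line after @data is one instance.
            instances.append([value.strip() for value in line.split(",")])
        elif lowered.startswith("@attribute"):
            # The second token on the line is the attribute name.
            attribute_names.append(line.split()[1])
        elif lowered.startswith("@data"):
            in_data = True
    return attribute_names, instances


attribute_names, instances = parse_arff(ARFF_TEXT)
print(attribute_names)
print(len(instances), "instances; first:", instances[0])
```

Real ARFF files can be more involved than this, so for actual work it is safer to let Weka load the file.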

An instance is a single example

In Weka, each line in the data section of an ARFF file represents a different instance. For example, the weather data contains 14 instances that are supposed to represent 14 different days. The instances are the rows of the ARFF file.

An attribute is a characteristic of an instance

… or you might call it a “feature” of the instance. For example, the weather data has 5 attributes: outlook, temperature, humidity, windy, and play. For every instance there is a value for each attribute – for the first instance (day) in the weather data it’s outlook = sunny, temperature = hot, humidity = high, windy = FALSE, play = no. The attributes are the columns of the ARFF file.
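One way to picture rows and columns is sketched below in Python. The attribute names and the first instance come from the text above; the second row is an assumption added only so that a column has more than one value.

```python
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]

# Each row is one instance (one day).
rows = [
    ["sunny", "hot", "high", "FALSE", "no"],  # first instance, from the text
    ["sunny", "hot", "high", "TRUE", "no"],   # assumed second instance
]

# Pair attribute names with values so an instance can be read by attribute.
first_instance = dict(zip(ATTRIBUTES, rows[0]))
print(first_instance["outlook"])

# A column collects one attribute's value from every instance.
windy_index = ATTRIBUTES.index("windy")
windy_column = [row[windy_index] for row in rows]
print(windy_column)
```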

In Weka, attributes can be nominal or numeric. The value of a nominal attribute is represented by a word: sunny, overcast, and rainy for the outlook attribute; yes and no for the play attribute. As you might expect, the value of a numeric attribute is a number: 85, 72, 55.34, whatever.
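The distinction can be made concrete with a small Python sketch. Weka reads each attribute's type from the declarations in the ARFF header; guessing the type from the values, as the hypothetical helper attribute_kind does here, is only an illustration of the difference between the two kinds.

```python
def attribute_kind(values):
    """Return "numeric" if every value parses as a number, else "nominal"."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return "nominal"
    return "numeric"


print(attribute_kind(["sunny", "overcast", "rainy"]))
print(attribute_kind(["85", "72", "55.34"]))
```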

The class of an instance is what you’re trying to predict

It’s one of the attributes. In the weather data the goal is to predict the value of the attribute play (yes or no) from the values of the other attributes outlook, temperature, humidity, and windy. Given these attribute values for the weather, should you play the game?
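In code, singling out the class amounts to separating one attribute from the rest. The Python sketch below does this for the first instance of the weather data; split_class is a hypothetical helper, not part of Weka.

```python
CLASS_ATTRIBUTE = "play"

instance = {
    "outlook": "sunny",
    "temperature": "hot",
    "humidity": "high",
    "windy": "FALSE",
    "play": "no",
}


def split_class(instance, class_attribute):
    """Separate the attributes used for prediction from the class value."""
    inputs = {
        name: value
        for name, value in instance.items()
        if name != class_attribute
    }
    return inputs, instance[class_attribute]


inputs, class_value = split_class(instance, CLASS_ATTRIBUTE)
print(sorted(inputs))  # the four attributes used to make the prediction
print(class_value)     # the value we would like to predict
```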

The goal is to determine the class of new instances

The goal is to create a classifier – something that can determine the classes of instances. A classifier is a model – like some kind of a formula – that allows the class attribute to be determined from the other attributes. The classifier is produced automatically from a “training” dataset. “New” instances are ones that aren’t in the training set.
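To make the idea of producing a classifier from training data concrete, here is a minimal Python sketch. It is not one of Weka's algorithms: it learns a single rule that maps each value of one chosen attribute to the class seen most often with it, loosely in the spirit of one-rule learners. The five training rows are an assumed subset of the weather data (only the first is confirmed by the text above), and train_single_attribute_rule is a hypothetical helper.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]

# Assumed training rows; only the first is given in the article text.
TRAINING_ROWS = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["sunny", "hot", "high", "TRUE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
    ["rainy", "mild", "high", "FALSE", "yes"],
    ["rainy", "cool", "normal", "FALSE", "yes"],
]
training_set = [dict(zip(ATTRIBUTES, row)) for row in TRAINING_ROWS]


def train_single_attribute_rule(training_set, attribute, class_attribute):
    """Learn a rule mapping each value of `attribute` to its commonest class."""
    counts = defaultdict(Counter)
    overall = Counter()
    for instance in training_set:
        counts[instance[attribute]][instance[class_attribute]] += 1
        overall[instance[class_attribute]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    default = overall.most_common(1)[0][0]

    def classify(instance):
        # Fall back to the overall majority class for values never seen.
        return rule.get(instance[attribute], default)

    return classify


classify = train_single_attribute_rule(training_set, "outlook", "play")

# A "new" instance: not in the training set, and its class is unknown.
new_instance = {
    "outlook": "overcast",
    "temperature": "mild",
    "humidity": "normal",
    "windy": "TRUE",
}
print(classify(new_instance))
```

The returned classify function plays the role of the model: once trained, it needs only the other attributes of an instance to produce a class.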

This article is from the free online course Data Mining with Weka.

Created by
FutureLearn - Learning For Life
