
Review of terms

Ian Witten reviews the terms introduced in the preceding video

Let’s review some of the key terms we’ll be using.

A dataset is a set of instances

In Weka, a dataset is stored in what’s called an ARFF file. This is just a text file in which, after a short header that names the dataset and declares its attributes, each line represents one instance. For example, weather.nominal.arff is an ARFF file; so is weather.numeric.arff. In the next video (“The Glass data”) we’ll see what an ARFF file looks like.
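To make the file layout concrete, here is a minimal Python sketch that separates the header of an ARFF-style text from its data lines. The embedded text is a short excerpt assumed for illustration, not the complete weather file, and `split_arff` is a hypothetical helper rather than anything provided by Weka.

```python
# Abbreviated ARFF-style text, assumed for illustration (not the full file).
ARFF_TEXT = """\
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
"""


def split_arff(text):
    """Return (header_lines, data_lines) for a simple ARFF-style text."""
    header, data = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if line.lower() == "@data":
            in_data = True          # everything after this marker is data
        elif in_data:
            data.append(line)       # one line per instance
        else:
            header.append(line)     # @relation and @attribute declarations
    return header, data


header, data = split_arff(ARFF_TEXT)
print(len(header), "header lines,", len(data), "instances")
```

The sketch only handles the simple comma-separated layout shown here; a real ARFF reader has to cope with more cases, such as quoted values.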

An instance is a single example

In Weka, each line in the data section of an ARFF file represents a different instance. For example, the weather data contains 14 instances that are supposed to represent 14 different days. The instances are the rows of the ARFF file.

An attribute is a characteristic of an instance

… or you might call it a “feature” of the instance. For example, the weather data has 5 attributes: outlook, temperature, humidity, windy, and play. For every instance there is a value for each attribute – for the first instance (day) in the weather data it’s outlook = sunny, temperature = hot, humidity = high, windy = FALSE, play = no. The attributes are the columns of the ARFF file.
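The row-and-column picture can be sketched in a few lines of Python. The first row is the instance quoted above; the second row is assumed for illustration, and `to_instance` and `column` are hypothetical helper names.

```python
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]

ROWS = [
    "sunny,hot,high,FALSE,no",   # first instance, as quoted in the text
    "sunny,hot,high,TRUE,no",    # second row, assumed for illustration
]


def to_instance(row):
    """Pair each attribute name with the value in the same position."""
    return dict(zip(ATTRIBUTES, row.split(",")))


def column(instances, attribute):
    """Collect one attribute's value from every instance (a column)."""
    return [inst[attribute] for inst in instances]


instances = [to_instance(row) for row in ROWS]
print(instances[0])                 # one row: a value for every attribute
print(column(instances, "windy"))   # one column: a value for every instance
```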

In Weka, attributes can be nominal or numeric. The value of a nominal attribute is represented by a word: sunny, overcast, and rainy for the outlook attribute; yes and no for the play attribute. As you might expect, the value of a numeric attribute is a number: 85, 72, 55.34, whatever.
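As a rough illustration of the distinction, the following Python sketch inspects an ARFF-style attribute declaration and reports whether it is nominal or numeric. It assumes the common declaration forms, a brace-enclosed list of values for nominal attributes and the keyword `numeric` (or `real` or `integer`) for numeric ones; `attribute_type` is a hypothetical helper, not a Weka function.

```python
def attribute_type(declaration):
    """Return (name, kind, values) for a simple '@attribute' line."""
    # Split into three parts: the keyword, the attribute name, the type spec.
    _, name, spec = declaration.strip().split(None, 2)
    if spec.startswith("{") and spec.endswith("}"):
        # Nominal: the allowed values are listed between the braces.
        values = [v.strip() for v in spec[1:-1].split(",")]
        return name, "nominal", values
    if spec.lower() in ("numeric", "real", "integer"):
        # Numeric: any number is allowed, so there is no value list.
        return name, "numeric", None
    raise ValueError(f"unsupported attribute type: {spec}")


print(attribute_type("@attribute outlook {sunny, overcast, rainy}"))
print(attribute_type("@attribute temperature numeric"))
```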

The class of an instance is what you’re trying to predict

It’s one of the attributes. In the weather data the goal is to predict the value of the attribute play (yes or no) from the values of the other attributes outlook, temperature, humidity, and windy. Given these attribute values for the weather, should you play the game?
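In code, singling out the class simply means treating one attribute differently from the rest. A minimal Python sketch, using the first weather instance and a hypothetical helper `split_class`:

```python
CLASS_ATTRIBUTE = "play"

instance = {
    "outlook": "sunny",
    "temperature": "hot",
    "humidity": "high",
    "windy": "FALSE",
    "play": "no",
}


def split_class(instance, class_attribute):
    """Separate the class value from the attributes used to predict it."""
    features = {k: v for k, v in instance.items() if k != class_attribute}
    return features, instance[class_attribute]


features, label = split_class(instance, CLASS_ATTRIBUTE)
print(features)  # the four attributes the prediction is based on
print(label)     # the value to be predicted
```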

The goal is to determine the class of new instances

The goal is to create a classifier – something that can determine the classes of instances. A classifier is a model – like some kind of a formula – that allows the class attribute to be determined from the other attributes. The classifier is produced automatically from a “training” dataset. “New” instances are ones that aren’t in the training set.
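To show what “produced automatically from a training dataset” can mean, here is a deliberately simple Python sketch. It learns a rule from a single attribute: for each value of that attribute, predict the class seen most often with it in the training set. This is a stand-in chosen for brevity, not a description of how any particular Weka classifier works; the training rows are a small excerpt assumed for illustration, and `train_one_attribute_rule` is a hypothetical name.

```python
from collections import Counter, defaultdict

# (attribute values, class) pairs; a small excerpt assumed for illustration.
TRAINING = [
    ({"outlook": "sunny", "windy": "FALSE"}, "no"),
    ({"outlook": "sunny", "windy": "TRUE"}, "no"),
    ({"outlook": "overcast", "windy": "FALSE"}, "yes"),
    ({"outlook": "rainy", "windy": "FALSE"}, "yes"),
    ({"outlook": "rainy", "windy": "FALSE"}, "yes"),
    ({"outlook": "rainy", "windy": "TRUE"}, "no"),
    ({"outlook": "overcast", "windy": "TRUE"}, "yes"),
]


def train_one_attribute_rule(training, attribute):
    """Build a classifier that looks only at one attribute."""
    counts = defaultdict(Counter)
    for features, label in training:
        counts[features[attribute]][label] += 1
    # For each attribute value, remember its most frequent class.
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fall back to the overall majority class for unseen values.
    default = Counter(label for _, label in training).most_common(1)[0][0]

    def classify(features):
        return rule.get(features[attribute], default)

    return classify


classify = train_one_attribute_rule(TRAINING, "outlook")

# A "new" instance: its class is unknown and has to be predicted.
new_instance = {"outlook": "sunny", "windy": "FALSE"}
print(classify(new_instance))
```

The returned `classify` function plays the role of the model: once it has been built from the training set, it can be applied to instances that were never part of that set.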

This article is from the free online

Data Mining with Weka

