Review of terms
Let’s review some of the key terms we’ll be using.
A dataset is a set of instances
In Weka, it’s stored in what’s called an ARFF file. This is just a text file: after a short header that names the dataset and its attributes, each line represents one instance. In the next video (“The Glass data”) we’ll see what an ARFF file looks like. For example, weather.nominal.arff is an ARFF file; so is weather.numeric.arff.
An instance is a single example
In Weka, each data line of an ARFF file represents a different instance. For example, the weather data contains 14 instances that are supposed to represent 14 different days. The instances are the rows of the ARFF file.
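To make the correspondence between rows and instances concrete, here is a minimal Python sketch that reads a small ARFF-style text. The excerpt is hand-written for illustration: its first data row is the first instance of the weather data, while the header and the other two rows are assumptions about how such a file is typically laid out. The helper name read_instances is hypothetical, and this is not Weka’s own loader.

```python
# Hand-written, abbreviated ARFF-style excerpt (illustrative, not the real file).
ARFF_TEXT = """\
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
"""


def read_instances(text):
    """Return the data rows of an ARFF-style text, one list of values per instance."""
    instances = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if line.lower() == "@data":  # everything after this marker is data
            in_data = True
            continue
        if in_data:
            instances.append([value.strip() for value in line.split(",")])
    return instances


instances = read_instances(ARFF_TEXT)
print(len(instances))  # 3
print(instances[0])
```

Each element of instances is one row, that is, one instance; the full weather data would yield 14 of them.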
An attribute is a characteristic of an instance
… or you might call it a “feature” of the instance. For example, the weather data has 5 attributes: outlook, temperature, humidity, windy, and play. For every instance there is a value for each attribute – for the first instance (day) in the weather data it’s outlook = sunny, temperature = hot, humidity = high, windy = FALSE, play = no. The attributes are the columns of the ARFF file.
In Weka, attributes can be nominal or numeric. The value of a nominal attribute is represented by a word: sunny, overcast, and rainy for the outlook attribute; yes and no for the play attribute. As you might expect, the value of a numeric attribute is a number: 85, 72, 55.34, whatever.
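The difference between the two kinds of attribute can be sketched in a few lines of Python. The ATTRIBUTES table and the parse_value helper are hypothetical names chosen for this example, not part of Weka: a nominal attribute is described by its set of allowed words, and a numeric one simply by the marker "numeric".

```python
# Hypothetical attribute descriptions: a set of allowed words for a nominal
# attribute, or the string "numeric" for a numeric one.
ATTRIBUTES = {
    "outlook": {"sunny", "overcast", "rainy"},
    "temperature": "numeric",
    "play": {"yes", "no"},
}


def parse_value(attribute, raw):
    """Convert the raw text of a value according to the attribute's kind."""
    kind = ATTRIBUTES[attribute]
    if kind == "numeric":
        return float(raw)  # numeric values are numbers
    if raw not in kind:  # nominal values must be one of the listed words
        raise ValueError(f"{raw!r} is not a valid value for {attribute}")
    return raw


print(parse_value("outlook", "sunny"))   # sunny
print(parse_value("temperature", "85"))  # 85.0
```

A nominal value outside the listed words, such as parse_value("outlook", "foggy"), raises a ValueError in this sketch.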
The class of an instance is what you’re trying to predict
It’s one of the attributes. In the weather data the goal is to predict the value of the attribute play (yes or no) from the values of the other attributes outlook, temperature, humidity, and windy. Given these attribute values for the weather, should you play the game?
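As a small illustration, the following Python sketch separates the class value from the other attribute values for the first instance of the weather data. The names ATTRIBUTE_NAMES, CLASS_ATTRIBUTE and split_instance are assumptions made for this example.

```python
ATTRIBUTE_NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
CLASS_ATTRIBUTE = "play"  # the attribute we are trying to predict

# First instance (day) of the weather data.
first_instance = ["sunny", "hot", "high", "FALSE", "no"]


def split_instance(values, names=ATTRIBUTE_NAMES, class_name=CLASS_ATTRIBUTE):
    """Return (other attribute values, class value) for one instance."""
    record = dict(zip(names, values))
    label = record.pop(class_name)  # remove the class from the inputs
    return record, label


features, label = split_instance(first_instance)
print(features)
print(label)  # no
```

Here features holds what is known about the day, and label is the answer to “should you play the game?”.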
The goal is to determine the class of new instances
The goal is to create a classifier – something that can determine the classes of instances. A classifier is a model – like some kind of a formula – that allows the class attribute to be determined from the other attributes. The classifier is produced automatically from a “training” dataset. “New” instances are ones that aren’t in the training set.
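To show what “produced automatically from training data” can mean, here is a deliberately simple Python sketch. It is not one of Weka’s algorithms: it looks only at outlook and predicts whichever class was most common for that value in the training set, falling back to the overall most common class for values it has never seen. The training pairs are illustrative (only the first comes from the weather data described above), and the names TRAINING, train and classify are hypothetical.

```python
from collections import Counter, defaultdict

# Illustrative training set of (outlook, play) pairs. Only the first pair is
# taken from the text; the rest are made up for this example.
TRAINING = [
    ("sunny", "no"),
    ("sunny", "no"),
    ("overcast", "yes"),
    ("overcast", "yes"),
    ("rainy", "yes"),
    ("rainy", "yes"),
    ("rainy", "no"),
]


def train(training):
    """Build a classifier: a function mapping an outlook value to a class."""
    by_outlook = defaultdict(Counter)
    overall = Counter()
    for outlook, play in training:
        by_outlook[outlook][play] += 1
        overall[play] += 1
    default = overall.most_common(1)[0][0]  # most common class overall

    def classify(outlook):
        counts = by_outlook.get(outlook)
        # Unseen outlook values fall back to the overall most common class.
        return counts.most_common(1)[0][0] if counts else default

    return classify


classify = train(TRAINING)
print(classify("overcast"))  # yes
print(classify("sunny"))     # no
```

The function returned by train is the model; calling it on an instance that was not in TRAINING is what determining the class of a new instance amounts to in this sketch.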