## Want to keep learning?

This content is taken from the The University of Waikato's online course, Data Mining with Weka. Join the course to learn more.
1.13

## The University of Waikato

Skip to 0 minutes and 10 seconds Hi! Welcome back for another five minutes in New Zealand with Data Mining with Weka. We looked at this data file in the last lesson. It’s the weather data, a toy dataset of course. It has 14 days, or instances, and each instance, each day, is described by five attributes, four to do with the weather, and the last attribute, which we call the “class” value – the thing that we’re trying to predict, whether or not to play this unspecified game. This is called a classification problem. We’re trying to predict the class value. Let’s open up Weka. It’s here on my desktop. I’m going to go into the Explorer. We always use the Explorer. I’m going to open the file.

Skip to 0 minutes and 56 seconds I put the datasets in the My Documents folder, so I can see them here. Just open the Weka datasets and the nominal weather data. There’s the weather data in Weka. As we saw last time, you can see the size of the dataset, the number of instances (14), you can see the attributes, you can click any of these attributes and get the values for those attributes up here in this panel. You also get at the bottom a histogram of the attribute values with respect to the different class values. The different class values are blue for “yes”, play, and red for “no”, don’t play. By default, the last attribute in Weka is always the class value.

Skip to 1 minute and 47 seconds You can change this if you like. If you change it here you can decide to predict a different one other than the last attribute. That’s the weather dataset, and we’ve already explored that. As I said, it’s a classification problem, sometimes called a “supervised learning” problem – “supervised” because you get to know the class values of the training instances. We take as input a data set as classified examples; these examples are independent examples with a class value attached. The idea is to produce automatically some kind of model that can classify new examples. That’s the “classification” problem. Here is what the examples look like.

Skip to 2 minutes and 33 seconds This is an “instance”, with the different attribute values a fixed set of features; and then we add to that the class to get the classified example. That’s what we have to have in our training dataset. These attributes, or features, can be discrete or continuous. What we looked at in the weather data were discrete; we call them “nominal” attribute values when they belong to a certain fixed set. Or they can be numeric, or “continuous”, values. Also, the class can be discrete or continuous. We’re looking at a discrete class, “yes” or “no”, in the case of the weather data. Another kind of machine learning problem would involve continuous classes, where you’re trying to predict a number.

Skip to 3 minutes and 22 seconds I’m going to have a look at a similar dataset to the weather dataset: the numeric weather dataset. Let me just open that in Weka, weather.numeric.arff. Here it is. It’s very similar, almost identical in fact, with 14 instances, 5 attributes, the same attributes. Maybe I should just look at this dataset in the edit panel. You can see here that two of the attributes – temperature and humidity – are numeric attributes, whereas previously they were nominal attributes. So here there are numbers. What we see when we look at the attributes values for outlook, just as before, we have sunny, overcast and rainy. For temperature, though, we can’t enumerate the values, there are too many numbers to enumerate.

Skip to 4 minutes and 16 seconds We have the minimum and maximum value, mean, and standard deviation. That’s what Weka gives you for numeric values.

# More weather

Key concepts when talking about datasets are instances, attributes, and the class (which is conventionally the last attribute). Attributes can be nominal or numeric (there are also other types). In a classification problem the goal is to produce automatically some kind of model that can determine the class of new instances. These concepts are illustrated using the weather dataset.