I’m going to look at a different dataset. I’m going to look at the “glass” dataset, which is a rather more extensive dataset. It’s a real world dataset, not a terribly big one. Let’s open it. Here we’ve got 214 instances and 10 attributes. Here are the 10 attributes; it’s not clear what they are. Let’s look at the “class”, by default the last attribute shown. There are seven values for the class, and the labels of these values give you some indication of what this dataset is about. We have “headlamps”, “tableware” (starting from the bottom), “containers”. Then we have “building” and “vehicle” windows, both “float” and “non-float”.
You may not know this, but there are different ways of making glass, and the floating process is a way of making glass. These are seven different kinds of glass. What are the attribute values? I don’t know what you remember about physics, and I guess it doesn’t matter if you don’t remember. RI stands for the refractive index. It’s always a good idea to check for reasonableness when you’re looking at datasets. It’s really important to get down and dirty with your data. Here we’re looking at the values of the refractive index: a minimum of 1.511, a maximum of 1.534. It’s good to think about whether these are reasonable values for refractive index.
If you go to the web and have a look around, you’ll find that these are good values for the refractive index. Next is Na. If you did chemistry, you’ll recognize Na as sodium. Here, it looks like these are percentages, the different percentages of sodium (Na), magnesium (Mg), and so on. We would expect silicon (Si) to make up the majority of glass. It varies between 69.81% and 75.41%. These are percentages of different elements in the glass. We can confirm our guesses here by looking at the data file itself. Let me just find the “glass” data. It’s in Weka datasets, and it’s glass.arff. This is the ARFF file format. It starts with a bunch of comments about the glass database.
These lines beginning with percent signs (%) are comments. You can read about this. We don’t have time to read it now. You can see the description of the attributes, and it does say that the attributes are refractive index, sodium, magnesium, and so on. And the type of glass, just like I said, is about windows, containers, and tableware, and so on. We get down to the end of the comments, and here we have stuff for Weka. This is the ARFF format. The relation has a name; you’ll see it printed in the interface when you look. The attributes are defined; they are real-valued attributes, numeric attributes. The “type” attribute is nominal, and the different values of type are enumerated here in quotes.
That defines the relation and the attributes. Then we have an ‘@data’ line, and following that, in the ARFF format, are simply the instances, one after the other, with the attribute values all on one line, ending with the class by default. This is the class value for the first instance. I think there are 214 instances here. There’s the last one. That’s the ARFF format. It is a very simple, textual file format. Now we’ve confirmed our guesses about these numbers being percentages and different elements. We can think about this some more. It’s important, then, that these numbers are reasonable. If they went negative, for example, that would indicate some kind of corrupted value; you can’t have a negative percentage.
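The layout just described (comment lines starting with %, a relation name, attribute declarations, then an ‘@data’ line followed by one instance per line) can be read with a few lines of Python. This is only a minimal sketch, not how Weka itself reads files: the function name parse_arff is hypothetical, the toy file and its values are invented for illustration and are not rows from glass.arff, and the sketch assumes attribute names contain no spaces.

```python
import csv

# Toy text in the ARFF layout. The values are invented for illustration;
# they are NOT taken from the real glass.arff file.
SAMPLE_ARFF = """\
% Lines starting with a percent sign are comments.
@relation toy-glass
@attribute RI numeric
@attribute Na numeric
@attribute Si numeric
@attribute Type {'build wind float', headlamps}
@data
1.52,13.6,71.8,'build wind float'
1.51,14.2,73.0,headlamps
"""


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple ARFF-style text.

    Hypothetical helper: handles only unquoted attribute names and
    comma-separated data lines with optional single-quoted values.
    """
    relation = None
    attributes = []  # list of (name, type specification) pairs
    rows = []        # list of lists of strings, one per instance
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # Each instance is one comma-separated line; nominal values
            # containing spaces are wrapped in single quotes.
            reader = csv.reader([line], quotechar="'", skipinitialspace=True)
            rows.append(next(reader))
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip()
        elif lower.startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec.strip()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation, len(attributes), len(rows))
```

With the toy text above, the last value on each data line is the class, matching the default described in the video.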
We’re expecting silicon to be the majority component; we’re expecting the refractive index to be in this kind of range. It’s always a good idea when you get a dataset to just click around in the Weka interface and make sure things look real. Rather small amounts of aluminum in glass; I guess that’s not surprising; I don’t know very much about glass myself. We’re just checking for reasonableness here, a very good thing to do. That’s it then. In this lesson, we’ve looked at the classification problem. We’ve looked at the nominal weather data and the numeric weather data. We’ve talked about nominal versus numeric attributes, and we’ve talked about the ARFF file format.
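The same kind of reasonableness check can also be written down as a small script, which helps when a dataset is too large to click through. The following is a rough sketch under stated assumptions: the function name sanity_check is hypothetical, the three rows are invented rather than taken from glass.arff, and the refractive index bounds of 1.4 to 1.7 are an assumed plausibility window, not a figure from the lesson.

```python
def sanity_check(rows, ri_range=(1.4, 1.7)):
    """Return a list of (row index, attribute, value) for suspicious values.

    Assumptions: "RI" is a refractive index that should fall inside
    ri_range (an assumed window), and every other attribute is a
    percentage that must lie between 0 and 100.
    """
    problems = []
    for i, row in enumerate(rows):
        for name, value in row.items():
            if name == "RI":
                low, high = ri_range
                if not low <= value <= high:
                    problems.append((i, name, value))
            elif not 0.0 <= value <= 100.0:
                # A negative percentage, or one above 100, suggests
                # a corrupted value.
                problems.append((i, name, value))
    return problems


# Invented rows for illustration only.
rows = [
    {"RI": 1.52, "Na": 13.6, "Si": 71.8},  # plausible
    {"RI": 1.52, "Na": -2.0, "Si": 71.8},  # negative percentage
    {"RI": 15.2, "Na": 13.6, "Si": 71.8},  # misplaced decimal point
]

problems = sanity_check(rows)
for index, name, value in problems:
    print(f"row {index}: suspicious {name} = {value}")
```

A check like this only flags values that are impossible or far outside the expected range; it does not replace looking at the data yourself.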
We’ve looked at the glass.arff dataset, and I’ve talked about sanity checking of attributes, and the importance of getting down and dirty with your data. We’ll see you soon. Bye!
The glass data
The glass dataset is a more realistic dataset with 214 instances and 10 attributes. Each instance represents a piece of glass, and its class is the type of the glass. There are 7 possible types, corresponding to different uses of the glass (headlamps, tableware, containers, building windows and vehicle windows) and, for windows, to whether the glass was made by the float process. We interpret the attributes, and check their values for reasonableness. Weka datasets use a format called ARFF, and we take a look at the raw glass.arff data file.
© University of Waikato, New Zealand. CC Creative Commons Attribution 4.0 International License.