I just wanted to go right back to the beginning and talk about the ARFF format a little bit more. Remember, an ARFF file starts out with an @relation line to name the relation, and then some @attribute statements, one for each attribute. Each statement declares the attribute to be either nominal, in which case it lists the possible values, or numeric. Integer or real, it’s the same thing: they’re all numeric for Weka. There are also string attributes. Then there’s an @data line, and following that there’s one data line for each instance. We use a question mark for missing values. And of course there are comment lines, beginning with %. Well, you know all that, but there are a few more things that you don’t know.
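As a quick recap of that layout, here is a minimal sketch in Python. The `parse_arff` helper is hypothetical and written only for illustration: it is not Weka's reader, the relation name is just a label chosen here, and it assumes simple unquoted values and dense data lines.

```python
# The first three instances of the nominal weather data as a small ARFF string.
# The relation name "weather" is an illustrative label.
ARFF_TEXT = """\
% Comment lines begin with a percent sign
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny, hot, high, FALSE, no
sunny, hot, high, TRUE, no
overcast, hot, high, FALSE, yes
"""


def parse_arff(text):
    """Hypothetical helper: parse a simple dense ARFF string.

    Returns (relation, attributes, rows). Nominal attributes map to a list
    of labels, other attributes to their type name. A "?" becomes None.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                labels = [s.strip() for s in spec.strip("{}").split(",")]
                attributes.append((name, labels))
            else:
                attributes.append((name, spec.lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))
```

The order of the labels in each @attribute line matters for what comes next, because the sparse format treats the first declared label as the default.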
First of all, we can have sparse ARFF files. There’s a sparse format, and there are two filters, NonSparseToSparse and SparseToNonSparse, to convert between the two. Here’s an example using the weather data. Both the sparse and the regular format have the same header. On the left is the regular format, on the right is the sparse format. Take the first instance, which is “sunny, hot, high, false, no”. In the sparse format, if an attribute has its first value, the first one declared in the header, that’s considered the default and is left out. “Sunny”, “hot”, and “high” are all defaults. So the first instance in the sparse format is just “3 false, 4 no”: attribute number 3 (we count from 0) is “false”, and attribute number 4 is “no”.
In the second instance, “sunny, hot, high, true, no”, the values “sunny”, “hot”, “high”, and “true” are all defaults. Those are all the first possible values as declared in the ARFF header, so we don’t need to specify them. We just specify that attribute 4, numbering again from 0, is “no”. The third instance is “overcast, hot, high, false, yes”. Well, “overcast” is not the first value for “outlook”, so we’ve got to specify it: we say attribute 0 is “overcast”. Then “hot”, “high”, and “yes” are all defaults, but “false” isn’t, so we say attribute 3 is “false”.
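The rule walked through above can be sketched in a few lines of Python. This is only an illustration of the idea for nominal attributes, not Weka's NonSparseToSparse filter. The `to_sparse` helper is a hypothetical name, and the curly brackets around each sparse instance follow the sparse ARFF syntax.

```python
# Attribute declarations for the nominal weather data, in header order.
# The first label of each attribute is the default that sparse format omits.
ATTRIBUTES = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", ["hot", "mild", "cool"]),
    ("humidity", ["high", "normal"]),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]


def to_sparse(instance):
    """Keep only values that differ from the first declared label.

    Attribute indices are counted from 0, as in the lecture.
    """
    kept = [
        f"{index} {value}"
        for index, ((_, labels), value) in enumerate(zip(ATTRIBUTES, instance))
        if value != labels[0]
    ]
    return "{" + ", ".join(kept) + "}"


print(to_sparse(["sunny", "hot", "high", "FALSE", "no"]))      # {3 FALSE, 4 no}
print(to_sparse(["sunny", "hot", "high", "TRUE", "no"]))       # {4 no}
print(to_sparse(["overcast", "hot", "high", "FALSE", "yes"]))  # {0 overcast, 3 FALSE}
```

The three printed lines match the three instances discussed above.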
And so we go on: just specify those attributes that do not have the first value. All classifiers accept sparse data as input, but some of them just nullify the savings by expanding the sparse data internally. Others actually use the sparsity to speed up the computation. Good examples are NaiveBayesMultinomial and SMO. There are a couple of filters, StringToWordVector for example, that produce sparse output. So if you use the StringToWordVector filter in combination with NaiveBayesMultinomial, you get a very fast system, and you probably noticed that when you were doing document classification. There are a couple of other features, like weighted instances. We’ve talked now and again about instances being weighted internally to Weka.
You can specify instance weights explicitly in an ARFF file, in curly brackets at the end of the instance’s data line. Again, with the weather data, we’ve got a couple of instances: the first has a weight of 0.5 and the second has a weight of 2.0. If the weight is missing, of course, it’s assumed to be 1.0. There are also date attributes. I won’t go into the format. You can have relational attributes, which are really intended for multi-instance learning, which we haven’t touched upon in this course. There’s an XML version of the ARFF format called XRFF (I don’t know how to pronounce that). The Explorer can read and write XRFF files.
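Returning to instance weights for a moment, here is a small Python sketch of reading a weight off the end of a data line. The `split_weight` helper is hypothetical, and it assumes dense data lines only: a sparse line also ends in a curly bracket, so it would need different handling.

```python
def split_weight(data_line):
    """Return (values, weight) for a dense ARFF data line.

    A trailing {w} is taken as the instance weight. If there is none,
    the weight defaults to 1.0.
    """
    line = data_line.strip()
    weight = 1.0
    if line.endswith("}"):
        body, _, weight_text = line.rpartition("{")
        weight = float(weight_text.rstrip("}"))
        line = body.rstrip().rstrip(",")  # drop the comma before the weight
    values = [v.strip() for v in line.split(",")]
    return values, weight


print(split_weight("sunny, hot, high, FALSE, no, {0.5}"))
print(split_weight("sunny, hot, high, TRUE, no, {2.0}"))
print(split_weight("overcast, hot, high, FALSE, yes"))
```

The first two lines carry the weights 0.5 and 2.0 from the example, and the third falls back to the default of 1.0.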
XRFF is very verbose. Here’s an example. We’ve got a header, and then after the header, we’ve got the body. The header contains the same information as the ARFF header, and the body contains the data, the instances. In the header, there’s an entry for each attribute that specifies the name of the attribute, the type of the attribute, and, if it’s a nominal attribute, its possible labels.
In the body, we say <instances>, and within <instances>, we have <instance>, which defines the first instance by giving its attribute values. Then we would follow that with the second instance, defined in a second <instance>, and so on. It’s the same information as in ARFF files. It’s clearly very verbose. You can have instance weights, as you can with ARFF files. But you can also do a little bit more than you can with ARFF files. In the XML format, you can specify which is the “class” attribute. Remember, Weka assumes by default that the last attribute is the class. There’s no way to change that in an ARFF file, but there is in an XRFF file. You can also specify attribute weights, to have weighted attributes.
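To make the header and body structure concrete, here is a minimal Python sketch that builds such a document with the standard library. Treat the details as assumptions to check against the Weka XRFF documentation: the element names (`dataset`, `header`, `attributes`, `attribute`, `labels`, `label`, `body`, `instances`, `instance`, `value`), the `class="yes"` marker, and the `weight` attribute on an instance are written from my understanding of the format, and `build_xrff` is a hypothetical helper. It handles nominal attributes only and leaves out attribute weights.

```python
import xml.etree.ElementTree as ET

ATTRIBUTES = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", ["hot", "mild", "cool"]),
    ("humidity", ["high", "normal"]),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]


def build_xrff(relation, attributes, instances, class_name):
    """Build an XRFF-style element tree (assumed element names, see lead-in).

    `instances` is a list of (values, weight) pairs.
    """
    dataset = ET.Element("dataset", name=relation)

    # Header: one entry per attribute, with name, type, and labels.
    header = ET.SubElement(dataset, "header")
    attrs_el = ET.SubElement(header, "attributes")
    for name, labels in attributes:
        attr_el = ET.SubElement(attrs_el, "attribute", name=name, type="nominal")
        if name == class_name:
            attr_el.set("class", "yes")  # explicit class attribute
        labels_el = ET.SubElement(attr_el, "labels")
        for label in labels:
            ET.SubElement(labels_el, "label").text = label

    # Body: one <instance> per data line, one <value> per attribute.
    body = ET.SubElement(dataset, "body")
    instances_el = ET.SubElement(body, "instances")
    for values, weight in instances:
        inst_el = ET.SubElement(instances_el, "instance")
        if weight != 1.0:
            inst_el.set("weight", str(weight))
        for value in values:
            ET.SubElement(inst_el, "value").text = value
    return dataset


root = build_xrff(
    "weather",
    ATTRIBUTES,
    [(["sunny", "hot", "high", "FALSE", "no"], 0.5)],
    class_name="play",
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Even for a single instance the output is many times longer than the corresponding ARFF line, which is the verbosity mentioned above.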
There’s a compressed version of this: .xrff.gz. The Explorer can read and write those files, as well. So you should know about that. That’s it.
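As a rough sketch of what the compressed variant amounts to, assuming a .xrff.gz file is simply the XRFF text run through gzip, the following Python writes and reads one with the standard library. The helper names, the file name, and the tiny placeholder document are hypothetical. The placeholder is not a complete XRFF file.

```python
import gzip
import os
import tempfile

# Placeholder XML standing in for a real XRFF document.
XRFF_TEXT = '<dataset name="weather"><header/><body/></dataset>'


def write_xrff_gz(xml_text, path):
    """Write XML text to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(xml_text)


def read_xrff_gz(path):
    """Read the XML text back from a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "weather.xrff.gz")
    write_xrff_gz(XRFF_TEXT, target)
    round_trip = read_xrff_gz(target)

print(round_trip == XRFF_TEXT)
```

Because XRFF repeats the same tags for every value, it tends to compress well, which goes some way toward offsetting the verbosity.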
ARFF has some extra features that you didn’t know about: the sparse format, instance weights, date attributes, and relational attributes. Some filters and classifiers take advantage of the sparsity to operate more efficiently in both time and space. XRFF is an XML equivalent of ARFF, plus some additional features.