Aggregate data

This step introduces one more operation to use with SELECT queries, before we move on to CONSTRUCT queries.

If we want to report the total duration of an album, we need to aggregate the duration of each track on that album first.

Suppose that the dataset contains triples specifying the tracks on each album, and associating a duration in milliseconds with each track, represented by an integer literal. (In fact the MusicBrainz dataset associates records with albums, and tracks with records, which complicates the query slightly – see below.)

Fully listed this is a lot of data, and you might wish instead to report, for each album, the total duration obtained by summing the durations of the tracks.

This is an example of aggregate data, and it requires two operations: first, we must segment the variable bindings into groups, corresponding in this case to all bindings relating to a given album; second, for each group, we must submit the values of a specified variable to an aggegation function – in this case, sum the track durations.

For instance, suppose the dataset has just two albums matching the query, namely Revolver and Abbey Road; and suppose that for Revolver just three tracks are included, namely “Eleanor Rigby”, “I’m only sleeping”, and “Doctor Robert”, with durations respectively of 200000, 240000, and 160000 milliseconds. (These are just round numbers made up for the example.)

This will mean that for Revolver we have three bindings of the variables ?album, ?track, and ?track_duration; and if Abbey Road has four specified tracks, it will correspondingly have four bindings.

We now separate the bindings into groups – three for Revolver, four for Abbey Road – and within each group we want to sum the track durations, and return only a table that gives albums along with their total durations.

This will mean that the first row of the table will specify Revolver in the ?album column, and 600000 in the second column for which we need a new name – perhaps ?album_duration.

Here is a query that achieves this result. In addition it imposes a condition on the total duration of the album, reporting only albums with duration exceeding 3600000 milliseconds (i.e., one hour), selecting (and hence grouping by) album title rather than album.

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

SELECT ?album_title (SUM(?track_duration) AS ?album_duration)
WHERE { ?album mo:record ?record .
        ?album dc:title ?album_title .
        ?record mo:track ?track .
        dbpedia:The_Beatles foaf:made ?track .
        ?track mo:duration ?track_duration .
  } 
 GROUP BY ?album_title
 HAVING (?album_duration > 3600000)

Note two new keywords here: AS introducing a variable name for the sum of durations (this becomes the heading of the second column of the output table); and HAVING introducing a filter over the group of variable bindings that has just been specified by the GROUP BY component.


Share this article:

This article is from the free online course:

Introduction to Linked Data and the Semantic Web

University of Southampton

Get a taste of this course

Find out what this course is like by previewing some of the course steps before you join: