1.7

# An overview of Web standards

In this step, we introduce the standard protocols behind the World Wide Web. It is important to understand these before we discuss protocols that are more specific to the Semantic Web, which are covered in the next step.

## Background standards

The technologies behind the Web are implemented through a number of standard protocols and languages, with acronyms like HTTP, URI, XML, RDF, RDFS, OWL, SPARQL. You may be familiar with some of these already.

You can look up further details of these standards as needed (the W3C website contains a lot of detailed information), but as background it is useful to know a little about each one, and in particular what it is for.

The later standards build on the earlier ones, so together they are often described as a stack of languages, as shown in Figure 1.3 below.

Figure 1.3 The Semantic Web Language Stack

### HTTP

From using the World Wide Web, most people are familiar with the HTTP prefix in front of web addresses such as http://musicbrainz.org/.

The meaning of this acronym is HyperText Transfer Protocol, and it refers to a set of conventions governing communication between a client and a server.

More precisely, these conventions define the structure of request messages from client to server, and response messages from server to client.

Message structure varies from one protocol to another: thus a different protocol such as FTP (File Transfer Protocol) will define a different message structure.

A request message in HTTP consists essentially of a method to be applied to a resource. The fundamental method is GET, which requests the server to send back a representation of the resource, typically an HTML file that can be displayed in a browser pane. However, there are several other methods including DELETE, which deletes the resource, and POST, which submits data to be processed with respect to the resource.

The resource, specified through a relative document ID (often a filename/path on the server), may be a document, or picture, or an executable that will generate data for the response.
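
To make the shape of these messages concrete, the following minimal sketch builds the text of a request and reads the first line of a response. The helper names `build_request` and `parse_status_line` are hypothetical, the set of headers is reduced to a bare minimum, and the sample response is made up for illustration; nothing is sent over the network.

```python
def build_request(method: str, path: str, host: str) -> str:
    """Return the text of a minimal HTTP/1.1 request message."""
    lines = [
        f"{method} {path} HTTP/1.1",  # request line: method, resource, protocol version
        f"Host: {host}",              # names the server the request is addressed to
        "Connection: close",          # ask the server to close the connection afterwards
        "",                           # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_status_line(response: str) -> tuple[str, int, str]:
    """Split the first line of a response into version, status code and reason."""
    status_line = response.split("\r\n", 1)[0]
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


# A GET request for the root resource of the server mentioned above.
print(build_request("GET", "/", "musicbrainz.org"))

# An invented response, used only to show the structure of the first line.
sample_response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
print(parse_status_line(sample_response))
```

Swapping `"GET"` for `"DELETE"` or `"POST"` changes only the request line, which reflects the idea that a request is essentially a method applied to a resource.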

### URI

A Uniform Resource Identifier (URI) is defined in the standard [1] as

a compact sequence of characters that identifies an abstract or physical resource.

The word ‘compact’ here means that the string must contain no space characters (or other white-space padding).

‘Abstract or physical’ means that the URI may refer to an abstract resource such as the concepts ‘Beethoven’ and ‘symphony’, as well as to a document or other file that can be retrieved from the WWW.

A URI that is linked to a retrievable resource is also known as a Uniform Resource Locator, or URL. For instance, the following URI for the MusicBrainz FAQ page is a URL:

http://musicbrainz.org/doc/Frequently_Asked_Questions

The definition of a correctly formed URI is quite complicated, with constituents that vary according to the scheme (the initial constituent before the colon), which specifies the relevant internet protocol, such as HTTP.

For an HTTP URI, the other constituents most relevant for our purposes are the authority, and the path, which occur in that order. The authority specifies the server where the resource (if it really exists) is located.

Finally, the path locates the resource precisely within the server’s directory structure. Thus for the URL given above, ‘http’ is the scheme, ‘musicbrainz.org’ is the authority, and ‘/doc/Frequently_Asked_Questions’ is the path; the other characters such as the colon are punctuation separating these constituents.
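
The decomposition just described can be checked with a short sketch that uses `urlsplit` from Python's standard `urllib.parse` module. The URL is assembled from the constituents named above. Note that this module calls the authority `netloc`.

```python
from urllib.parse import urlsplit

# The MusicBrainz FAQ URL, built from the scheme, authority and path given in the text.
url = "http://musicbrainz.org/doc/Frequently_Asked_Questions"

parts = urlsplit(url)
print(parts.scheme)  # http
print(parts.netloc)  # musicbrainz.org (the authority)
print(parts.path)    # /doc/Frequently_Asked_Questions
```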

Note that the constituents following the scheme will be different for different schemes: thus the ‘tel’ scheme, for example, is followed simply by a telephone number. Here are some examples indicating this variety:


ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
http://dbpedia.org/resource/Karlsruhe
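
As a rough illustration of this variety, the sketch below separates each of the example URIs into its scheme and the scheme-specific remainder by splitting at the first colon. The helper `split_scheme` is hypothetical and does not validate the URI in any way.

```python
def split_scheme(uri: str) -> tuple[str, str]:
    """Split a URI at its first colon into (scheme, remainder)."""
    scheme, colon, rest = uri.partition(":")
    if not colon:
        raise ValueError(f"no scheme found in {uri!r}")
    return scheme, rest


examples = [
    "ldap://[2001:db8::7]/c=GB?objectClass?one",
    "mailto:John.Doe@example.com",
    "news:comp.infosystems.www.servers.unix",
    "tel:+1-816-555-1212",
    "telnet://192.0.2.16:80/",
    "urn:oasis:names:specification:docbook:dtd:xml:4.1.2",
    "http://dbpedia.org/resource/Karlsruhe",
]

for uri in examples:
    scheme, rest = split_scheme(uri)
    # The remainder has a different internal structure for each scheme.
    print(f"{scheme:8} {rest}")
```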



Since URIs are typically long, and hence difficult to read and write, it is convenient to make use of abbreviated forms known as ‘compact URIs’ or ‘CURIEs’.

A compact URI consists simply of a namespace and a local name, separated by a colon. Typically, the namespace includes the scheme, the authority, and perhaps the early part of the path; the local name contains the remainder of the URI, chosen so as to convey intuitively what the URI means, while observing some syntactic restrictions (e.g., there should be no further use of the characters ‘/’ and ‘#’).

Thus in the example just given, one could introduce a namespace ‘dbp’ for http://dbpedia.org/resource/, so reducing the URI to ‘dbp:Karlsruhe’, where the local name preserves the substring that is significant to human readers. We will use this convenient method of abbreviation often in the rest of this course.
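
The abbreviation can be sketched in a few lines of code. The table `PREFIXES` and the helpers `expand` and `compact` are hypothetical names introduced for illustration; the sketch handles only the simple case described above and is not a full implementation of any CURIE specification.

```python
# Mapping from a short name to the namespace it abbreviates.
PREFIXES = {"dbp": "http://dbpedia.org/resource/"}


def expand(curie: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Turn a compact URI such as 'dbp:Karlsruhe' back into a full URI."""
    prefix, colon, local_name = curie.partition(":")
    if not colon or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {curie!r}")
    return prefixes[prefix] + local_name


def compact(uri: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Abbreviate a full URI if a known namespace matches; otherwise return it unchanged."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            local_name = uri[len(namespace):]
            # The local name should not contain further '/' or '#' characters.
            if local_name and "/" not in local_name and "#" not in local_name:
                return f"{prefix}:{local_name}"
    return uri


print(compact("http://dbpedia.org/resource/Karlsruhe"))  # dbp:Karlsruhe
print(expand("dbp:Karlsruhe"))  # http://dbpedia.org/resource/Karlsruhe
```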

### XML

The eXtensible Markup Language (XML) is a refinement of the Standard Generalised Markup Language (SGML), which was introduced in the 1980s as a meta-language suitable for defining particular mark-up languages, for instance languages for adding formatting information to documents.

The basic concept, now well known from widespread use of HTML, is that labeled tags are placed around spans of text, thus indicating perhaps that the span should be formatted in italics:

<i>text in italics</i>


The italic tag ‘i’ is part of HTML, not SGML, but the convention of placing tags within angle brackets, and distinguishing the closing tag by a forward slash character, comes from SGML, as does the syntax for adding attributes to the opening tag, as in the following example which yields blue text:

<font color="blue">blue text</font>


SGML is versatile because it can be used simply for encoding data, as well as for adding structure to text.

In the mid-1990s, the newly formed World Wide Web Consortium (abbreviated W3C) set up a working group to simplify and rework SGML to meet the requirements of the WWW.

The result was the first XML specification, which became a W3C recommendation in 1998, and has since become a widely used standard for data exchange over the Web.

The essential advance on SGML is that XML is simpler and stricter: to give just one example, it is permissible in SGML (but not in XML) to omit closing tags, as in the common practice of inserting <p> without a closing </p> when writing HTML.
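
The strictness of XML can be observed with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module as one possible choice; `is_well_formed` is a hypothetical helper introduced for illustration. It first reads the tag, attribute and text of the `font` example shown earlier, then shows that an element without its closing tag is rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# A tagged span with an attribute on the opening tag.
element = ET.fromstring('<font color="blue">blue text</font>')
print(element.tag)     # font
print(element.attrib)  # {'color': 'blue'}
print(element.text)    # blue text

# With the closing tag the markup is accepted; without it, the parser reports an error.
print(is_well_formed("<p>first paragraph</p>"))  # True
print(is_well_formed("<p>first paragraph"))      # False
```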

We have now introduced the fundamental protocols and standards upon which the Web runs. In the next step, we move on to look at protocols specific to the Semantic Web.

## Reference

1. T. Berners-Lee, R. Fielding and L. Masinter (2005) “Uniform Resource Identifier (URI): Generic Syntax”. Published on-line at http://tools.ietf.org/html/rfc3986.