What is a corpus?

In the previous historical overview, we talked about index cards and their prominent role in pre-computer lexicography.

Index cards display snippets of text from books, journals and other sources. Each snippet is selected because it shows a usage of a word or construction that is considered interesting for the purposes of a dictionary. One risk of this practice is that readers tend to gather citations with a bias towards uncommon usages, whereas a dictionary should also tell us about the common properties of words.

This is where a corpus comes in. If we had access to a large electronic collection of texts in a given language, we could search for any word we are interested in and retrieve the texts in which that word is found. We could choose how many words of context we want to read, and we could even compare different contexts of the same word, whether in the same text or in different texts. We could also count how many times a word is used in certain constructions.

Corpus (plural: corpora) is a term from the field of linguistics and refers to a large set of texts (usually in electronic format) which is considered to be representative of a language (or language variety, to be more precise) and is used to analyse it. Corpora provide evidence of how a language is used in real situations and are a key resource for lexicography, as we will see in the rest of this week.

There are many different types of corpora. For example, depending on the number of languages represented, a corpus can be monolingual (with content in a single language) or multilingual (two or more languages). Bilingual corpora that present the same text in two languages (often called parallel corpora), for instance, are widely used by lexicographers to create bilingual dictionaries. A corpus can contain written texts or transcriptions of spoken language. It can collect texts from the same time period (synchronic) or from several consecutive eras (diachronic). It can also contain texts from a specific domain, such as medicine or agriculture (in which case we call it a specialised corpus), or it can represent the general language (a general corpus).

Given their size and design, corpora are rarely read from start to end. One of the main tools lexicographers use to investigate a word or phrase is the ‘concordance’. A concordance is a list of lines of text showing every instance of a given word in a corpus, each with the context in which it occurs. A typical concordance line shows the target word with a number of words to its left and to its right, a format commonly known as ‘Key Word In Context’ or KWIC.
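To make the idea of a concordance concrete, here is a minimal sketch in Python of how KWIC lines could be produced. The function name `kwic`, the `width` parameter and the three-sentence sample text are all invented for illustration. Real corpus tools use far more careful tokenisation and indexing than this.

```python
import re

def kwic(text, keyword, width=4):
    """Return (left context, keyword, right context) for every hit.

    Hypothetical helper: splits the text into simple word tokens and
    keeps `width` words on each side of each match.
    """
    tokens = re.findall(r"\w+(?:['’]\w+)*", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# A tiny made-up "corpus"; a real one would hold millions of words.
sample = (
    "The robin began to tweet at dawn. "
    "She posted a tweet about the robin. "
    "Every tweet from the garden woke the cat."
)

# Right-align the left context so the keyword lines up in a column.
for left, key, right in kwic(sample, "tweet", width=3):
    print(f"{left:>20}  {key}  {right}")
```

Because this sketch ignores punctuation, the context can run across sentence boundaries. Lining up the keyword in a single column is what lets a lexicographer scan many lines quickly and spot recurring patterns on either side of the word.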

Corpora are great resources for observing language as it is used in real communicative situations, and are therefore an essential tool for lexicographers when they are preparing to draft an entry. Bear in mind, however, that depending on the characteristics of the corpus you work with, the language behaviour you observe in it may differ from that of the general language. For example, if you are defining the word ‘tweet’, a corpus of ornithology articles will probably contain many instances of ‘tweet’ in the sense related to the sound of birds, but not necessarily many in the sense related to social media communication. Also, if a word or property does not occur in a corpus, this does not necessarily mean that it does not exist in the language. A very large corpus covering a variety of domains, registers, geographical sources and so on will make it easier to find even rare occurrences, such as words used in specialised contexts or very recent neologisms.

How did corpora enter the world of dictionaries?

The rapid advances in computer technologies in the 1960s and 1970s made it possible to store and analyse increasingly large collections of texts. This created a great opportunity for linguists to build electronic corpora that could be automatically searched to analyse language. One famous example is the Brown Corpus, developed at Brown University in the US, which contains one million words of written American English. However, it took a few more years before corpora of sufficient size could be fruitfully used by lexicographers.

In the 1980s, the COBUILD dictionary project, a collaboration between the publisher Collins and the University of Birmingham in the UK, became the first dictionary project to be based entirely on a corpus. The COBUILD Main Corpus contained more than seven million words, which is small by today’s standards but was groundbreaking at the time. This was a pivotal point in the history of lexicography. For the first time, a corpus was built with the specific purpose of supporting the creation of a dictionary, in this case one aimed at learners of English as a foreign language. This meant that lexicographers could, for example, count the frequency of words (i.e. the number of times words were found in the corpus) and use this as a basis to decide whether a word was common enough to be included in the dictionary (as illustrated in Week 2, Step 2.7). You will find out more about the COBUILD project in Step 3.6.
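The frequency-based decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the COBUILD team used. The function names, the toy corpus and the cut-off of two occurrences are all hypothetical.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count how often each lowercased word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

def headword_candidates(counts, min_count):
    """Return, in alphabetical order, the words seen at least min_count times."""
    return sorted(word for word, n in counts.items() if n >= min_count)

# Toy corpus invented for this example.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog met.",
]

counts = word_frequencies(toy_corpus)
print(counts["the"])                              # 4
print(headword_candidates(counts, min_count=2))   # ['a', 'cat', 'dog', 'on', 'sat', 'the']
```

In practice the threshold would depend on the size of the corpus and the intended audience of the dictionary, and raw counts would usually be supplemented by a lexicographer's judgement.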

This article is from the free online course:

Understanding English Dictionaries

Coventry University