Learn more about the role of corpora in today’s lexicographic practice.
Compared to paper slips, a corpus gives lexicographers access to a much wider context of use for a word.
The electronic format of corpora makes it possible for lexicographers to find all the occurrences of a particular word or structure and observe the nuances of its meaning in real examples. This way, they can also see whether the word is used differently depending on the register of the text (formal, informal, slang, etc.), the geographical location, the author, the subject matter and so on.
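To make the idea of "finding all the occurrences" concrete, here is a minimal sketch of a keyword-in-context (KWIC) listing, the kind of display a concordancer typically produces. It is only an illustration: the function name `kwic`, the `width` parameter and the three-sentence sample text are assumptions made for this example, not part of any particular lexicographic tool.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Hypothetical helper for illustration; real concordancers work on
    tokenised, annotated corpora rather than on a raw string.
    """
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Tiny invented sample standing in for a corpus.
sample = ("She sat on the bank of the river. The bank approved the loan. "
          "He rowed to the far bank.")

for line in kwic(sample, "bank"):
    print(line)
```

Reading down the aligned column, a lexicographer can compare the contexts side by side and notice, for instance, that some occurrences of "bank" relate to rivers and others to money.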
We can think of two main ways in which corpora help the dictionary-making process. On the one hand, the so-called ‘corpus-based’ approach sees the corpus as a source of examples. In this approach, we can imagine the lexicographer relying on their intuition of what a word means and how it is used, and then resorting to the corpus to find examples of each meaning and usage. On the other hand, the so-called ‘corpus-driven’ approach sees the corpus as the starting point of the process, in a bottom-up way. In this approach, we can imagine the lexicographer starting from the collection of occurrences of the word in the corpus, analysing them and grouping them into categories (for example, one for each sense), and then drafting the dictionary entry based on this analysis.
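As a rough illustration of the bottom-up, corpus-driven starting point, the sketch below counts the words that appear near a target word. Frequent neighbours can hint at how occurrences might be grouped into senses, although the grouping itself remains the lexicographer's analysis. The function name `collocates`, the `window` size and the sample text are assumptions chosen for this example only.

```python
import re
from collections import Counter

def collocates(text, keyword, window=3):
    """Count words occurring within `window` tokens of `keyword`.

    Hypothetical helper for illustration; it uses a crude lowercase
    tokeniser and does no lemmatisation or stop-word filtering.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - window)
            # Neighbours on both sides, excluding the keyword itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Tiny invented sample standing in for a corpus.
sample = ("She sat on the bank of the river. The bank approved the loan. "
          "He rowed to the far bank.")

for word, count in collocates(sample, "bank").most_common(5):
    print(word, count)
```

In a real corpus of billions of words, function words such as "the" would dominate raw counts like these, so practical tools typically rely on statistical association measures rather than plain frequency.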
Corpus-driven approaches have become increasingly popular in lexicography following recent technological advances, which have made it possible for computer programs to process vast amounts of data very rapidly. It’s probably safe to say that a combination of corpus-based and corpus-driven approaches is, today, a very common practice. As we will see in more detail at the end of this course, applied research in corpus linguistics and computational linguistics is pushing the boundaries and has made it possible for dictionary publishers to mine very large corpora in search of evidence of words’ behaviour. The typical size of corpora used today by dictionary publishers is a few billion words. These corpora are usually drawn from the web and from a variety of other sources, including news, academic content and social media. One example is the Oxford English Corpus, which contains 21bn words and is used by lexicographers working on the OED and other Oxford dictionaries.
For more information on the Oxford English Corpus, see the Wikipedia article on the Oxford English Corpus.
© Barbara McGillivray. CC BY-NC 4.0