
The role of corpora in dictionary making

Learn more about the role of corpora in today’s lexicographic practice.
© Barbara McGillivray. CC BY-NC 4.0

Compared to paper slips, a corpus gives lexicographers access to a much wider context of use for a word.

The electronic format of corpora makes it possible for lexicographers to find all the occurrences of a particular word or structure and observe examples of the nuances of its meaning. This way, they can also see whether the word is used differently depending on the register of the text (formal, informal, slang, etc), the geographical location, the author, the subject matter and so on.
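To make the idea of finding all the occurrences of a word concrete, here is a minimal sketch of a keyword-in-context (KWIC) concordance. It is only an illustration: the function name `concordance`, the `width` parameter and the sample text are assumptions made for this example, not part of any tool mentioned in the article, and real corpus software works on far larger, annotated data.

```python
import re

def concordance(text, word, width=30):
    """Return one keyword-in-context line per occurrence of `word`.

    Hypothetical helper for illustration: matches whole words,
    ignoring case, and shows `width` characters on either side.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, standing in for a corpus.
sample = (
    "She sat on the bank of the river. "
    "The bank raised its interest rates. "
    "He went to the bank to open an account."
)

for line in concordance(sample, "bank"):
    print(line)
```

Reading the aligned lines side by side is what lets a lexicographer notice that some occurrences concern rivers and others concern money.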

We can think of two main ways in which corpora help the dictionary-making process. On the one hand, the so-called ‘corpus-based’ approach sees the corpus as a source of examples. In this approach, we can imagine the lexicographer relying on their intuition of what a word means and how it is used, and then resorting to the corpus to find examples of each meaning and usage. On the other hand, the so-called ‘corpus-driven’ approach sees the corpus as the starting point of the process, in a bottom-up way. In this approach, we can imagine the lexicographer starting from the collection of occurrences of the word in the corpus, analysing them and grouping them into categories (for example, one for each sense), and then drafting the dictionary entry based on this analysis.
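The bottom-up flavour of the corpus-driven approach can be sketched, very roughly, by letting the data suggest groupings instead of starting from intuition. The example below counts the words that appear near a target word; the function name `collocates`, the window size, the stopword list and the sentences are all assumptions chosen for illustration, and this is a crude stand-in for the analysis a lexicographer would actually carry out.

```python
import re
from collections import Counter

def collocates(sentences, target, window=3, stopwords=frozenset()):
    """Count words occurring within `window` tokens of `target`.

    Hypothetical helper for illustration: tokenisation is a simple
    lowercase regular expression, not a real linguistic tokenizer.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                context = tokens[start:i] + tokens[i + 1:i + 1 + window]
                counts.update(t for t in context if t not in stopwords)
    return counts

# Invented sentences, standing in for corpus occurrences of "bank".
sentences = [
    "The bank raised its interest rates.",
    "She sat on the river bank.",
    "The bank cut interest rates again.",
    "They walked along the river bank.",
]
stop = frozenset({"the", "its", "on", "she", "they"})

for word, count in collocates(sentences, "bank", stopwords=stop).most_common():
    print(word, count)
```

In this toy sample, neighbours such as "interest" and "river" recur, hinting at two candidate senses that the lexicographer would then examine and describe in the entry.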

Corpus-driven approaches have become increasingly popular in lexicography following recent technological advances, which have made it possible for computer programs to process vast amounts of data very rapidly. It’s probably safe to say that a combination of corpus-based and corpus-driven approaches is, today, a very common practice. As we will see in more detail at the end of this course, applied research in corpus linguistics and computational linguistics is pushing the boundaries and has made it possible for dictionary publishers to mine very large corpora in search of evidence of words’ behaviour. The typical size of corpora used today by dictionary publishers is a few billion words. These corpora are usually drawn from the web and a variety of other sources, including news, academic content and social media. One example is the Oxford English Corpus, which contains 21bn words and is used by lexicographers working on the OED and other Oxford dictionaries.

Further reading

For more information on the Oxford English Corpus, see the Wikipedia article on the Oxford English Corpus.

This article is from the free online course Understanding English Dictionaries, created by FutureLearn.
