The added value
Examples taken from corpora, unlike artificially constructed examples, show evidence for how language is actually used by people.
This is one of the main strengths of corpus-based and corpus-driven lexicography. But corpora add another very important aspect to dictionary making: they provide quantitative information.
Imagine you are drafting the dictionary entry for ‘chairperson’ and you are in doubt about whether the spelling ‘chair-person’ is more common than ‘chairperson’. How would you know? A corpus will tell you, and if you choose an up-to-date general language corpus that is large enough, you can generally rely on it to make a decision about your entry. Of course, corpora may also contain mistakes or typos, and it’s always important to keep this in mind when dealing with them.
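To make the ‘chairperson’ question concrete, here is a minimal sketch in Python of how two spelling variants could be counted. The sample text and the helper name `count_variants` are invented for illustration; a real check would run over a large general language corpus, usually through a corpus query tool rather than a script like this.

```python
import re
from collections import Counter

def count_variants(text, variants):
    """Count case-insensitive whole-word occurrences of each spelling variant."""
    counts = Counter()
    for variant in variants:
        # The lookarounds stop a variant from matching inside a longer word
        # or inside another hyphenated form.
        pattern = r"(?<![\w-])" + re.escape(variant) + r"(?![\w-])"
        counts[variant] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Toy text standing in for a corpus (an assumption, not real corpus data)
sample_text = (
    "The chairperson opened the meeting. A new chairperson was elected. "
    "The chair-person resigned, and the chairperson's deputy took over."
)

counts = count_variants(sample_text, ["chairperson", "chair-person"])
for variant, hits in counts.most_common():
    print(variant, hits)
```

In this toy text ‘chairperson’ is found three times and ‘chair-person’ once. With real data, the same comparison gives the lexicographer evidence for choosing the headword spelling, bearing in mind that some hits may be typos.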
Frequency information offers lexicographers empirical evidence against which they can compare their intuitions, and helps them make decisions about which words or senses to include, and how to present them to the users.
Linguists have studied frequency effects in language for a long time. One well-known fact about language is that word frequencies have a skewed distribution. You probably know already that some words like ‘take’ or ‘and’ are more frequent than others like ‘diagonalise’ or ‘amoxicillin’. What is more, a relatively small number of words (the most frequent ones) tend to cover a very large proportion of the words found in a text. This pattern is usually known as Zipf’s Law, after the linguist George Kingsley Zipf, who popularised it. In simple terms, following Zipf’s Law we can expect the most frequent word in a text to occur about twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. As a result, coverage grows very quickly at first and then levels off: roughly speaking, the 15 most frequent words will account for about 25% of the text, the 100 most frequent for about 60%, the 1,000 most frequent for about 85%, and the 4,000 most frequent for about 97.5%.
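The rank and frequency relationship described above can be checked with a few lines of Python. This is only a sketch: the token list is a toy example constructed to follow an idealised Zipf pattern, and the function names `rank_frequency` and `coverage` are illustrative choices, not part of any particular corpus tool.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()

def coverage(tokens, top_n):
    """Proportion of all tokens accounted for by the top_n most frequent words."""
    ranked = rank_frequency(tokens)
    covered = sum(count for _, count in ranked[:top_n])
    return covered / len(tokens)

# Toy corpus built by hand so that counts fall as 1/rank (12, 6, 4, 3)
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["take"] * 3

ranked = rank_frequency(tokens)
top_count = ranked[0][1]
for rank, (word, count) in enumerate(ranked, start=1):
    # Under an ideal Zipf pattern, top_count / count equals the rank
    print(rank, word, count, top_count / count)

print("Coverage of top 2 words:", coverage(tokens, 2))
```

Here the two most frequent words already cover 72% of the 25 tokens. Real corpora only approximate this pattern, so the ratios printed for actual data will not be whole numbers.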
A consequence of Zipf’s Law is that lexicographers working on rare words need very large corpora, typically in the range of several hundred million or a few billion words. This is because in a medium-sized corpus rare words may simply never be seen. Certain dictionaries have used corpus frequency information to highlight the most common words in a language. For example, as we saw in Week 2, the Macmillan Dictionary presents a core vocabulary of the 7,500 most frequent words in the English language as ‘red words’, thus signalling that these are the most important words to learn for people whose first language isn’t English.
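The point about rare words and corpus size comes down to simple arithmetic, sketched below in Python. The rate of 0.05 occurrences per million words is a made-up figure for a hypothetical rare word, and `expected_occurrences` is an illustrative helper, not an established measure.

```python
def expected_occurrences(per_million, corpus_size):
    """Expected number of hits for a word with the given frequency per million tokens."""
    return per_million * corpus_size / 1_000_000

# Hypothetical rare word: 0.05 occurrences per million tokens (an assumed figure)
rare_word_rate = 0.05

corpus_sizes = {
    "1 million": 1_000_000,
    "100 million": 100_000_000,
    "1 billion": 1_000_000_000,
}

expected = {
    label: expected_occurrences(rare_word_rate, size)
    for label, size in corpus_sizes.items()
}

for label, hits in expected.items():
    print(f"{label} words: about {hits:g} expected occurrences")
```

Under this assumption, a one-million-word corpus would most likely contain no example of the word at all, while a billion-word corpus would be expected to contain around 50, enough to start describing how the word is used.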
For more information about ‘red words’, see the following link to the Macmillan Dictionary, which explains how they mark the most frequent words in the language.
© Barbara McGillivray. CC BY-NC 4.0