Finding good examples automatically

An account of the GDEX tool for finding good examples in a corpus, explaining how it works and how successful it is.
© Michael Rundell. Barbara McGillivray. CC BY-NC 4.0

The previous exercise will have given you an idea of what makes a good dictionary example.

Evaluating sentences like the ones in Step 6.7 helps us get a clearer understanding of the factors that go into making good (and bad) examples. Similar thinking informed the development of a software tool called ‘GDEX’ (standing for ‘Good Dictionary EXamples’) which is designed to automatically find good examples in a corpus.

The developers started from a set of characteristics which are typical of good examples, including:

  • They shouldn’t be too long.

  • They should be easy to understand, avoiding rare or technical vocabulary and distracting names of people or organisations.

  • They should illustrate the most typical ways a word is used, such as its normal grammar patterns and collocates (the words it most often occurs with).

  • They should be as self-contained as possible – for example, by avoiding pronouns which refer back to something in a previous sentence.

These characteristics were then translated into specific, measurable features: sentence length (not too short, not too long); the frequencies of the other words in the sentence (to avoid anything too rare); the number of pronouns in the sentence (these can be confusing if you don’t know what they refer to); and the presence of common collocates (using data from the Word Sketches). Each feature was given a ‘weighting’. The system then goes through the sentences that include the search word and gives each one a score based on these features. The sentences with the best scores are ‘promoted’, so that in a concordance for ‘demonstrate’, for example, the ‘best’ examples appear at the top. This gives the lexicographer a set of candidate dictionary examples to choose from.
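To make the idea of weighted features more concrete, here is a minimal sketch in Python of how a scorer of this kind might work. It is not the actual GDEX code: the weights, the length limits, the pronoun list and the function names are all assumptions chosen for illustration, and the word list and collocates would in practice come from corpus data.

```python
import re

# Hypothetical weightings (assumed for illustration; not the real GDEX values).
WEIGHTS = {"length": 0.3, "common_words": 0.3, "pronouns": 0.2, "collocates": 0.2}

# A small, illustrative set of pronouns that may point outside the sentence.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}


def tokenize(sentence):
    """Split a sentence into lower-case word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def score_sentence(sentence, common_words, collocates, min_len=6, max_len=20):
    """Return a score between 0 and 1; higher means a more promising example."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    features = {
        # 1.0 if the sentence is neither too short nor too long.
        "length": 1.0 if min_len <= len(tokens) <= max_len else 0.0,
        # Share of tokens found in a list of common (non-rare) words.
        "common_words": sum(t in common_words for t in tokens) / len(tokens),
        # 1.0 with no pronouns, 0.5 with one, 0.0 with two or more.
        "pronouns": 1.0 - min(1.0, sum(t in PRONOUNS for t in tokens) / 2),
        # 1.0 if at least one known collocate of the search word appears.
        "collocates": 1.0 if any(t in collocates for t in tokens) else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


def rank_examples(sentences, keyword, common_words, collocates):
    """Keep sentences containing the keyword and sort them best-first.

    Simplification: only the exact form of the keyword is matched,
    so inflected forms such as 'demonstrated' are ignored.
    """
    candidates = [s for s in sentences if keyword in tokenize(s)]
    return sorted(
        candidates,
        key=lambda s: score_sentence(s, common_words, collocates),
        reverse=True,
    )
```

Calling `rank_examples` with a list of corpus sentences, the search word ‘demonstrate’, a set of common words and a set of collocates would return the matching sentences with the highest-scoring ones first, which mirrors the way the ‘best’ examples are promoted to the top of the concordance.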

The GDEX algorithm has been used in a number of dictionary projects, with considerable success. The system doesn’t always get it right, and in a given set of 10 best examples, there will usually be two or three which are definitely not suitable. But it is still being improved, and it is already a more cost-effective way of finding examples than simply asking lexicographers to scan dozens or hundreds of corpus sentences.

Further reading

If you want to learn more about the GDEX tool and how it works, there is an article about it from the EURALEX conference in 2008: ‘GDEX: Automatically finding good dictionary examples in a corpus’.

