Can definition-writing be automated?
On the face of it, automating the process of defining seems a daunting task.
In general, people don’t explain the meanings of the words they use while they’re communicating, so we are unlikely to find definitions in our corpus data. But there are some important exceptions. In certain types of text – especially encyclopedic articles or educational textbooks designed for undergraduates or high-school students – writers will sometimes provide a ‘gloss’ to explain some new concept that they are discussing. For example, a textbook on economics, having introduced the term ‘sustainable economy’, continues like this:
A sustainable economy can be defined as an economy that results in improved human well-being and social equity, while significantly reducing environmental risks and ecological scarcities.
This is effectively a definition. It opens up the possibility of finding definition-like sentences in corpora, and providing these for the dictionary user, in place of conventional definitions written by lexicographers.
How would this work? One useful approach is to identify the various formulae which writers use when they provide explanatory glosses of this type. We can see some examples in the following sentences, which were extracted from a concordance – from a corpus of Wikipedia articles – for ‘algal bloom’ (a term which is used in fields like marine biology):
An algal bloom is a rapid increase or accumulation in the population of algae (typically microscopic) in an aquatic system.
Extra nutrients are also supplied by treatment plants, golf courses, fertilizers, and farms. These nutrients result in an excessive growth of plant life known as an algal bloom.
When certain conditions are present, such as high nutrient or light levels, these organisms reproduce explosively. The resulting dense swarm of phytoplankton is called an algal bloom.
The most obvious formula is the first one: where the writer says ‘X is a …’. But there are several others, including ‘X refers to …’, ‘X can be defined as …’, ‘the term X is used to describe …’, ‘…is known as X’, and ‘… is called X’.
So when you create a concordance for a term like ‘algal bloom’, the corpus system can be programmed to search only for sentences where one of these formulae occurs close to the term you are searching for. These ‘proto-definitions’ could then be copied into your dictionary-writing system, and the lexicographer’s job would simply be to identify the best ones. Two or three sentences of this type, if carefully chosen from a longer set of candidates, could be even more useful for a dictionary user than a single traditional definition. And they have the added advantage that they come from texts written by people with specialist knowledge, in fields in which the average lexicographer may have little or no expertise.
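To make the idea concrete, here is a minimal sketch in Python of how such a filter might work. It assumes the corpus has already been split into sentences, and it turns the formulae listed above into regular expressions. The function names (`build_patterns`, `find_proto_definitions`) are invented for this illustration and do not come from any real corpus tool. Note that the ‘known as’ and ‘called’ patterns deliberately do not require a preceding ‘is’, so that they also catch sentences like the second Wikipedia example.

```python
import re

def build_patterns(term):
    """Compile one regex per definitional formula for the given term.

    The formulae are the ones listed in the text; the exact regexes are
    an assumption made for this sketch, not a tested inventory.
    """
    t = re.escape(term)
    det = r"(?:(?:an?|the)\s+)?"  # optional article before the term
    formulae = [
        rf"\b{t}\s+(?:is|are)\s+(?:an?|the)\b",      # X is a ...
        rf"\b{t}\s+refers?\s+to\b",                  # X refers to ...
        rf"\b{t}\s+can\s+be\s+defined\s+as\b",       # X can be defined as ...
        rf"\bthe\s+term\s+\W?{t}\W?\s+is\s+used\s+to\s+describe\b",
        rf"\b(?:known\s+as|called)\s+{det}{t}\b",    # ... known as / called X
    ]
    return [re.compile(f, re.IGNORECASE) for f in formulae]

def find_proto_definitions(term, sentences):
    """Return the sentences in which any formula occurs next to the term."""
    patterns = build_patterns(term)
    return [s for s in sentences if any(p.search(s) for p in patterns)]
```

Calling `find_proto_definitions("algal bloom", sentences)` on a list of concordance lines would return only the definition-like candidates, leaving the lexicographer to pick the best ones. A filter this simple will also let through some sentences that are not really definitions, so human selection remains part of the process.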
This is a promising line of inquiry, but it has one big drawback: it is likely to work well for technical terms like ‘algal bloom’ or ‘sustainable economy’, but what about all the other (more mainstream) words in a language? People don’t generally produce definitions of ordinary words when they’re speaking or writing, so the pattern-matching approach described above is unlikely to help us automate definition-writing for everyday vocabulary.
As with word sense disambiguation, clues in the context may help us to some extent. For example, if we want to know what a cat is, a word sketch for the word ‘cat’ tells us quite a lot. Word sketches (which we discussed in Week 3) list the other words – the ‘collocates’ – which most often occur alongside the word you are investigating. The lists in a word sketch are based on ‘grammatical relations’, so that you can see, for example, which adjectives typically modify a particular noun, or which nouns are the most frequent objects of a particular verb. In the case of ‘cat’, a word sketch will tell us:
What cats do: the list of verbs that have ‘cat’ as their subject includes ‘purr’, ‘miaow’, and ‘scratch’.
What physical characteristics cats have: the list of nouns that follow ‘cat’ in a possessive relation includes ‘fur’, ‘paws’, and ‘whiskers’.
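The grouping behind lists like these can be sketched in a few lines of Python. This is an illustration only: it assumes the corpus has already been parsed into (word, grammatical relation, collocate) triples, the relation labels `subject_of` and `possession` are made up for the example, and raw counts stand in for the statistical association scores that real corpus tools use to rank collocates.

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword, top_n=3):
    """Group a headword's collocates by grammatical relation.

    `triples` is an iterable of (word, relation, collocate) tuples,
    assumed to come from an already parsed corpus. Collocates are
    ranked by raw frequency, a simplification for this sketch.
    """
    by_relation = defaultdict(Counter)
    for word, relation, collocate in triples:
        if word == headword:
            by_relation[relation][collocate] += 1
    return {
        relation: [c for c, _ in counts.most_common(top_n)]
        for relation, counts in by_relation.items()
    }
```

Given triples such as `("cat", "subject_of", "purr")` and `("cat", "possession", "whiskers")`, the function returns a dictionary with one ranked list of collocates per relation, which is the raw material a definition-building program would have to work from.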
Smart search algorithms can extract information like this to build up a picture of what cats are like. But it is clear that, despite the progress being made towards defining specialist terms, automated definitions of everyday vocabulary items are still a long way off.
This is a big challenge. A possible solution is for dictionaries to place less emphasis on definitions and – instead – to elevate the role of example sentences. After all, this is mostly how we learned our own first language: we may have sometimes asked someone to tell us the meaning of an unfamiliar word, but in most cases we developed our understanding of words through repeated exposure to them in real communication. A dictionary could replicate this process by giving users access to numerous examples of words in their typical contexts and environments. That would require a system for automatically extracting the ‘best’ example sentences from a corpus, and in the next two steps, we will look at the feasibility of such an approach.
Read the following questions.
Do you agree or disagree that automation is a positive step in compiling dictionaries?
Could the process of creating a dictionary entry be fully automated?
What problems could automation cause in the long term?
Discuss your thoughts in the comments area.
© Michael Rundell. CC BY-NC 4.0