We have seen that corpora have become an essential tool for lexicographers.
Corpora are often subjected to a series of analyses to make them more readily usable for linguistic analysis and lexicography. For example, imagine that you want to define the noun ‘book’. If you search a corpus for ‘book’, you will not be able to automatically rule out the verbal usages of ‘book’ as in the sentence ‘she wanted to book a flight to Vienna’, and would run the risk of being flooded by thousands of irrelevant examples. What is the solution?
The solution is called ‘annotation’ and is a process that adds linguistic information to a corpus. Annotation can be of different kinds, depending on the type of linguistic features we want to focus on.
Suppose first that we want to be able to say that both the singular form ‘book’ and the plural form ‘books’ have the same base form ‘book’. This base form is called a ‘lemma’ in linguistics, and the process of assigning a lemma to a word form is called ‘lemmatisation’. The details of lemmatisation depend on the specific language being analysed. Lemmatisation applies not just to nouns, but also to other classes of words. For instance, what do you think the lemma of the verb form ‘running’ is? Yes, it is ‘run’, and the lemma of ‘ran’ and ‘runs’ is also ‘run’.
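To make the idea of lemmatisation concrete, here is a minimal sketch in Python. It assumes a tiny hand-written lookup table (the name LEMMA_TABLE and the function lemmatise are purely illustrative); real lemmatisers rely on much larger, language-specific resources and rules.

```python
# Toy lemma lookup table. The entries are illustrative only; a real
# lemmatiser would use far larger, language-specific resources.
LEMMA_TABLE = {
    "book": "book",
    "books": "book",
    "run": "run",
    "runs": "run",
    "ran": "run",
    "running": "run",
}


def lemmatise(word_form: str) -> str:
    """Return the lemma of a word form.

    Unknown forms fall back to their lower-cased spelling, which is a
    simplifying assumption of this sketch.
    """
    key = word_form.lower()
    return LEMMA_TABLE.get(key, key)


for form in ["books", "running", "ran", "runs"]:
    print(form, "->", lemmatise(form))
```

Note that the irregular form ‘ran’ has to be listed explicitly: no simple rule such as ‘remove the ending’ would map it to ‘run’.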
How about the earlier example, where we wanted to rule out the verbal usages of ‘book’, as in ‘she wanted to book a flight to Vienna’? In this case, we need to be able to distinguish between the usage of ‘book’ as a noun and as a verb. In English we have the following word classes: nouns (such as the word ‘thing’), verbs (such as the word ‘compose’), adjectives (eg ‘funny’), adverbs (eg ‘incredibly’), pronouns (eg ‘we’), conjunctions (eg ‘because’), determiners (eg ‘the’ or ‘a’) and prepositions (eg ‘in’ or ‘of’). These classes are called ‘parts of speech’ and the annotation process that assigns a part of speech to a word form is called ‘part of speech tagging’.
There are different ways to represent annotation in a corpus. A simple way is to add annotation tags and some special delimiter (a special character that shows the beginning and end of units in a text). For example, suppose we wanted to add part of speech information to the following sentences:
They went ahead and booked the tickets.
You always buy books on Sundays.
We could use the tags ‘Pr’ for ‘pronoun’, ‘V’ for ‘verb’, ‘Adv’ for ‘adverb’, ‘C’ for ‘conjunction’, ‘D’ for ‘determiner’, ‘P’ for ‘preposition’, and ‘N’ for ‘noun’ and add them straight after each word form, using the delimiter ‘_’:
They_Pr went_V ahead_Adv and_C booked_V the_D tickets_N.
You_Pr always_Adv buy_V books_N on_P Sundays_N.
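A program can easily read text annotated in this way by splitting each token at the delimiter. The following is a minimal sketch in Python; the function name parse_tagged is our own, and it assumes that tokens are separated by spaces, that every token contains the delimiter, and that the only punctuation is a sentence-final full stop.

```python
def parse_tagged(sentence: str, delimiter: str = "_") -> list[tuple[str, str]]:
    """Split an annotated sentence into (word form, tag) pairs."""
    pairs = []
    for token in sentence.split():
        token = token.rstrip(".")  # drop the sentence-final full stop
        # Split at the LAST delimiter, so the tag is whatever follows it.
        word, _, tag = token.rpartition(delimiter)
        pairs.append((word, tag))
    return pairs


tagged = "They_Pr went_V ahead_Adv and_C booked_V the_D tickets_N."
pairs = parse_tagged(tagged)

# Keep only the word forms tagged as verbs.
verbs = [word for word, tag in pairs if tag == "V"]
print(verbs)
```

Once the pairs are available, filtering by tag (here, keeping only verbs) is a one-line operation.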
If we wanted to add lemma information, we could use a similar approach:
They_they went_go ahead_ahead and_and booked_book the_the tickets_ticket.
You_you always_always buy_buy books_book on_on Sundays_Sunday.
In practice, annotation in corpora is nowadays very commonly represented in structured formats such as XML, though we won’t go into the details here.
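Without going into the details, the following sketch gives a rough idea of what an XML representation might look like, using Python’s standard xml.etree.ElementTree module. The element names ‘s’ (sentence) and ‘w’ (word) and the attribute names ‘pos’ and ‘lemma’ are assumptions made for illustration; actual corpora follow their own encoding schemes.

```python
import xml.etree.ElementTree as ET

# Each token carries its word form, part of speech tag and lemma.
tokens = [
    ("They", "Pr", "they"),
    ("went", "V", "go"),
    ("ahead", "Adv", "ahead"),
    ("and", "C", "and"),
    ("booked", "V", "book"),
    ("the", "D", "the"),
    ("tickets", "N", "ticket"),
]

# Build one <s> element containing one <w> element per token.
# The element and attribute names are illustrative assumptions.
sentence = ET.Element("s")
for form, pos, lemma in tokens:
    word = ET.SubElement(sentence, "w", pos=pos, lemma=lemma)
    word.text = form

xml_string = ET.tostring(sentence, encoding="unicode")
print(xml_string)
```

An advantage of this kind of format is that several layers of annotation (here, part of speech and lemma) can be attached to the same word form at once.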
Annotation can involve other levels of linguistic analysis, for example syntactic information (which specifies the role of elements such as verb subjects or verb objects in a sentence) or semantic information (which specifies aspects of the meaning of words).
Annotation can be performed by humans (manual annotation) or by computer programs (automatic annotation), or by a combination of the two (semi-automatic annotation), where the results of automatic annotation are checked and corrected by humans. You can imagine that automatic processes have the advantage of allowing large corpora to be equipped with annotation; although the results are not perfect, automatically annotated corpora are widely used in lexicography. The accuracy of automatic part of speech annotation is very high for English corpora, typically over 97%. More generally, however, the accuracy of automatic annotation systems varies widely depending on the task and the language being considered, and this is still an active area of research in the branch of computational linguistics called Natural Language Processing or NLP.
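Accuracy figures of this kind are usually obtained by comparing the automatic tags with tags assigned manually to the same text. Here is a minimal sketch of that calculation in Python; the ‘automatic’ tags below are invented for illustration and are not the output of any real tagger.

```python
def tagging_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Proportion of tokens whose predicted tag matches the manual (gold) tag."""
    if len(gold) != len(predicted):
        raise ValueError("Both tag sequences must have the same length")
    if not gold:
        raise ValueError("Cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Manual tags for 'They went ahead and booked the tickets.'
gold_tags = ["Pr", "V", "Adv", "C", "V", "D", "N"]
# Hypothetical automatic output in which 'booked' is wrongly tagged as a noun.
auto_tags = ["Pr", "V", "Adv", "C", "N", "D", "N"]

accuracy = tagging_accuracy(gold_tags, auto_tags)
print(f"Accuracy: {accuracy:.1%}")
```

In this invented example, six of the seven tags agree with the manual annotation.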
What we want to stress here is that most corpus query tools nowadays allow searches on annotation tags (following some pre-defined annotation encoding) and would make it possible, for example, to search only for noun instances of the lemma ‘book’. In the rest of this week’s activities, you will get a chance to explore a corpus and find potentially surprising patterns of usage.
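To give a rough idea of what such a search does behind the scenes, here is a minimal sketch in Python. It assumes a toy corpus stored as lists of (word form, part of speech, lemma) triples; the function name find_instances is hypothetical, and real corpus query tools use their own query languages and far more efficient indexing.

```python
# A toy annotated corpus: each sentence is a list of
# (word form, part of speech tag, lemma) triples.
corpus = [
    [("They", "Pr", "they"), ("went", "V", "go"), ("ahead", "Adv", "ahead"),
     ("and", "C", "and"), ("booked", "V", "book"), ("the", "D", "the"),
     ("tickets", "N", "ticket")],
    [("You", "Pr", "you"), ("always", "Adv", "always"), ("buy", "V", "buy"),
     ("books", "N", "book"), ("on", "P", "on"), ("Sundays", "N", "Sunday")],
]


def find_instances(corpus, lemma: str, pos: str) -> list[tuple[int, str]]:
    """Return (sentence index, word form) for every token matching lemma and tag."""
    hits = []
    for index, sentence in enumerate(corpus):
        for form, tag, token_lemma in sentence:
            if token_lemma == lemma and tag == pos:
                hits.append((index, form))
    return hits


noun_hits = find_instances(corpus, "book", "N")
verb_hits = find_instances(corpus, "book", "V")
print("Noun instances of 'book':", noun_hits)
print("Verb instances of 'book':", verb_hits)
```

Searching on the lemma finds ‘books’ and ‘booked’ alike, while the part of speech tag separates the noun from the verb.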
The following linked paper, from the COBUILD days, gives a useful overview of corpus linguistics and the grammatical annotation of large corpora. It was given as the Opening Address (keynote talk) at the 5th Congress of EURALEX, Tampere, August 4, 1992, on Lexicography and Corpus Linguistics.
© Barbara McGillivray. CC BY-NC 4.0