The COBUILD corpus

Learn about the background to corpus-based approaches and their revolutionary impact on lexicography.
Today I’m interviewing my friend and former colleague Ramesh Krishnamurthy. Ramesh is a lexicographer and corpus linguist who worked for many years with Professor John Sinclair on the COBUILD project, where he was a senior editor. So Ramesh is one of the very first lexicographers to have worked with corpus data, back in the early 1980s. Ramesh, could you tell us a little bit about your first experience of working with the corpus?
Well, as you can imagine, the first few days were really daunting and overwhelming. There were so many words and phrases, and so many examples to look at, and trying to organise them and group them and categorise them felt impossible.
It was also very slow, because back in the early 1980s everything was printed out on paper. We didn’t actually get to use the computers; only the techies used the computers. We had to have everything printed out. But it was also extremely exciting and liberating. As someone who had always enjoyed studying languages at school and university, I’d always been a bit frustrated by all the rules that I was being taught. I would keep coming across examples that didn’t fit those rules, but if I mentioned them, the teacher would probably make me go and stand in the corner or stand outside. So I felt there was some kind of hypocrisy going on here.
If the rules weren’t the rules, then why tell us they’re rules? So suddenly seeing the data myself, seeing how thousands of people actually used the words and phrases in the language, was wonderful, because I could then say: ‘Ah, there are no rules; there is no being marked right and wrong in essays or exams.’ It was simply that most of us, most of the time, seemed to use one or two patterns and phrases, whereas in certain circumstances, or among certain groups, people might use different ones. But it was all just to do with how often: whether something was common or whether it was rare.
The final stage, I suppose, was difficult in a different way, because with all this information the problem for me was deciding what was the appropriate level of detail. You know, with so much information available, what is it that the particular audience for a dictionary would need? And this was my first time not only working on a dictionary but working on a dictionary for language learners. So that was a big learning curve for me: how can I explain the same things but a bit more simply, and perhaps leave a few things out, which I’ve always been reluctant to do? But that’s where editing comes in, and as other people edited my work I gradually started to understand how to do that, and then I became an editor myself.
In your view, how did the COBUILD project change the way in which dictionaries are created?
Well, in the past most dictionaries were usually written by one person, usually an educated middle-class person. So they only had their personal memories, and we all know how fragile memory is, and their personal experiences to work with, and each of us only has a very limited experience of any language. And of course, like all of us, they had their own prejudices and biases. All of that meant that the only words and phrases that would be explained were those used by well-educated middle-class people. And of course they were explained in terms of those prejudices and beliefs.
So you got a very biased form of dictionary that didn’t cater for the masses. But once corpora came into use, we could see the usage of thousands and thousands of people all at the same time. So this meant we could get a view of the language not just as it was used by educated middle-class people but by the whole population. So basically it enabled us to have a more democratic view of how language is used in our society. We can also look at different types of text, what we call genres.
So, for example, we can look at writing of all kinds, whether it’s on websites, in newspapers, in published books, or in more private areas like letters, emails or text messages. Some of my forensic linguistic colleagues are using text messaging in order to pin down criminals. And we can also look at spoken language of lots of different kinds: whether it’s Hollywood films, TV programs, radio programs or YouTube, we can look at the language used in all of those. Computers developed so quickly that very soon we were able to make these collections of texts, these corpora, bigger and bigger. And now there are billions of words in modern corpora.
The other thing that corpora added was a degree of scientific principle. For example, you didn’t have to rely on memory; all the data could be stored for as long as you liked. So you could look at it again and again to make sure. It could also be looked at by more than one person. So if you did an analysis and I didn’t agree with it, I could go back and check the data and challenge you on it. So not only were corpora democratic in the type of language they covered, but also in the way that analyses could be done.
The software is fairly simple, so corpora can actually be used by anyone, and I’ve spent a large part of my career going around the world helping language teachers to learn how to use corpora and indeed to teach their students how to use corpora. Because I think that once you see the patterns on the screen, they will affect your learning in a much more reliable way than rules and examples. So at COBUILD we were fortunate to be in the time and place where corpora came together with new ideas about how to study language. And these ideas are now being used in most modern dictionaries. I can’t think of any dictionary that would not bother to use a corpus.
And not only in English but in most other widely used languages all over the world, and especially in emerging countries, newly independent countries, where most of them use corpora in order to create a dictionary of their new national language. So the COBUILD project, and corpora especially, have had a very far-reaching impact on how we create dictionaries.

In this step, you will learn about the background to corpus-based approaches and their revolutionary impact on lexicography.

In this interview with lexicographer Ramesh Krishnamurthy, you will learn about his experience working on the COBUILD project, jointly run by Collins Dictionaries and the University of Birmingham in the 1980s and 1990s.

Further reading

You can also learn more about the COBUILD corpus by viewing COBUILD: The Early Years:

Part 1: Where it all began

Part 2: A dictionary from a corpus.

This article is from the free online course Understanding English Dictionaries.
