Corpus linguistics

Introduction to corpus linguistics.
Hello. In this lecture, we will explore the story of corpus linguistics, a story of language analysis using computers. As you will see, it is a story full of ingenuity and innovation. As with every narrative, the lecture will highlight certain aspects of corpus linguistics and focus on a particular story line. I’m at Lancaster University, and behind me is Bailrigg House, the oldest building on the campus. It stands as a great reminder of the early history of Lancaster University, which is closely connected with corpus linguistics. In fact, Lancaster University is one of the places where corpus linguistics was born. So, let’s travel back in time to the year 1970, six years after Lancaster University was founded.
If you were in the UK at that time, you would have watched the BBC coverage of Apollo 13’s dramatic, unsuccessful attempt at landing on the Moon, which followed two successful missions the previous year. We know from a corpus that the Apollo space programme was mentioned in The Times newspaper that year 833 times in 264 different articles, almost on a daily basis. In June 1970, you would have been surprised by the election results in the UK, with Edward Heath replacing Harold Wilson against all predictions. From The Times corpus we know that the name of the new prime minister, Edward Heath, was mentioned more frequently than the Apollo space programme, with over 3,000 mentions in almost 300 articles.
In the same year, 1970, The Times newspaper mentioned the word “computer”, in the singular and plural, 6,323 times, more than twice the mentions of the new prime minister and more than seven times the mentions of the famous Apollo programme. The 1970s is really when computer technology became prevalent at institutions such as the Post Office, the Met Office, various businesses, and the universities. These are the large events which the newspapers record. The year 1970, however, was key for yet another reason: it was key for the emerging discipline of corpus linguistics. In 1970 at Lancaster, Geoffrey Leech started collecting a dataset modelled on one compiled at Brown University in Providence, Rhode Island, a few years before.
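The frequency figures quoted above come from counting word forms across a corpus. A minimal sketch of the idea in Python follows; the three sample sentences and the simple tokeniser are invented for illustration only, and real corpus tools (and the licensed Times corpus itself) are far larger and more sophisticated:

```python
from collections import Counter
import re

# A tiny illustrative "corpus" standing in for a real newspaper archive.
corpus = [
    "The computer was delivered to the Post Office today.",
    "A new computer will support the Apollo tracking team.",
    "Computers are now common in universities and businesses.",
]

def word_frequencies(texts):
    """Count case-insensitive word forms across a list of texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenisation: lowercase alphabetic word forms only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

freqs = word_frequencies(corpus)
# Singular and plural are counted as separate word forms, just as the
# lecture's "computer"/"computers" figures were combined by hand.
print(freqs["computer"] + freqs["computers"])
```

Summing the singular and plural counts, as in the lecture’s figure of 6,323, is a deliberate analytical choice; many corpus queries instead distinguish every word form.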
The Lancaster dataset would later become the first large corpus of written British English, the Lancaster-Oslo/Bergen Corpus, or LOB. The corpus consists of 1 million words, which was a major achievement at that time. The project was completed in 1978, after an enormous effort and a collaboration with two Norwegian universities. We didn’t really realise what a dreadful time we were going to have over the next 10 years. A big task. And the biggest task was actually the copyright problem we ran into. We were trying to negotiate copyright with the British publishers for about 500 texts. And eventually, I sort of gave up the ghost, and I said, I can’t do this anymore. We’ve run out of money.
We’ve got rather primitive computing resources. We can’t get permission from the publishers or the agents or the authors. Let’s give it up. But at that stage, an angel from on high appeared in the form of Stig Johansson. From Oslo. Yes, he was… As you can see, building corpora is no easy task. It involves expertise, determination, and a lot of hard work. After Brown and LOB, many other corpora were developed for different languages. This timeline summarises some of the milestones for the English language.
For example, you can see on the timeline the London-Lund Corpus, the COBUILD corpora, the British National Corpus and the new British National Corpus 2014 for British English, the American National Corpus and the COCA corpus for American English, ICE, the International Corpus of English, for international varieties of English, and also the Wellington Corpus of New Zealand English. In sum, the development of corpora was enabled by the emergence of computer technology that can process ever larger amounts of linguistic data. The contrast could not be more striking: this was the first mainframe computer when corpus linguistics started. Nowadays, we have more computational power in the palm of our hand and can use it productively for linguistic analysis.
So now it’s really an exciting time to be a corpus linguist, to be able to build on the past achievements in the field and use the advances in computational technology. So far, I have been using the terms corpus and corpora in a common-sense way, assuming that everyone understands them. Of course, this is not the case. I have also been quoting numbers from the historical Times corpus as if information about that corpus were common knowledge. Again, this is not a fair assumption. So let me provide a formal definition of what a corpus is and explain a bit more about the Times corpus.
A useful definition of a corpus is provided by McEnery, Xiao, and Tono. They say that “a corpus is a collection of electronic naturally-occurring texts, written or spoken, which are selected to be representative of a particular language or language variety.”
There are four keywords in this definition: electronic, naturally-occurring, selected, and representative. A corpus needs to be available in an electronic format for computers to be able to handle it using specialised software. Corpus linguists collect naturally-occurring data rather than experimental data, as some other disciplines do. These data can be spoken or written, and they show how language is used in real life. Because there is so much linguistic data out there, we need to select, or sample, the data in such a way that it actually reflects the general use of language. In other words, a corpus needs to be representative of a language or a language variety.
Turning to the Times corpus: it is a very large historical corpus representing all issues of The Times newspaper since 1785. It consists of many billions of words; the 1970 subcorpus alone contains almost half a billion words. This is just to demonstrate the sheer size of some of the corpora available nowadays. Corpus linguistics has come a long way since its beginnings, from early corpora, which were no larger than 1 million words, a major achievement in the 1970s, to the multi-billion-word datasets which we have at our disposal today. So the message to remember is this: corpus linguistics is a versatile methodology of language analysis.
It builds on advances in computational technology and draws on linguistic expertise to analyse language use. It employs the empirical evidence available in corpora to provide insights into how language is used out there, in different social contexts and situations. Thank you very much for listening to the lecture.

This short lecture introduces the discipline of corpus linguistics in the context of its history and present-day achievements. It is the first lecture in a series from a new module, Fundamentals of Corpus Linguistics, offered by Lancaster University as part of an MA programme, a PGCert, and also individually for credit.

The module offers a practical introduction to key corpus linguistic techniques such as concordance analysis, the analysis of wordlists and n-gram lists, keyword analysis, and collocation analysis. It also provides an overview of practical applications of corpus methods in a wide range of areas of linguistic and social research.
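One of the techniques named above, concordance analysis, displays every occurrence of a search word together with its immediate context, often called a keyword-in-context (KWIC) display. The following is a minimal Python sketch of the idea, using an invented sample text; dedicated concordancers used in corpus linguistics additionally handle large files, annotation, and sorting of the context:

```python
import re

# An invented sample text for illustration only.
text = ("Corpus linguistics analyses language use. A corpus is a collection "
        "of naturally-occurring texts, and every corpus is sampled to be "
        "representative of a language variety.")

def kwic(text, keyword, window=3):
    """Return keyword-in-context lines with `window` words on each side."""
    tokens = re.findall(r"\w+(?:-\w+)*", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines

for line in kwic(text, "corpus"):
    print(line)
```

Aligning the search word in a centre column, as here, is what lets an analyst scan the left and right contexts for recurring patterns.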

This article is from the free online course Corpus Linguistics: Method, Analysis, Interpretation, created by FutureLearn.
