
A drop in the ocean: Corpora as samples of language

Corpora should be understood as samples of language.
Hello. Today, we will be talking about corpora as samples of language. We will be looking at the process of building corpora and also at different corpus types. I am standing on a beach near Lancaster, and behind me is a lighthouse at the end of Stone Jetty and Morecambe Bay, which opens up to the Irish Sea. We are here to demonstrate how corpora work as samples of the vast amount of language that is produced every day. The title of the lecture is “A Drop In The Ocean”. This is because corpora, especially general corpora representing the whole language, are like small drops in the large sea of language production that is out there.
To put things into perspective, the following quote points to the sheer amount of English produced every day. “It has been estimated that on average, a person utters 16,000 words per day. With about 400 million people who speak English as their first language and hundreds of millions more who speak English as their second language, the daily spoken production of English alone can be estimated to be in the order of trillions of words.” A trillion is a number that starts with 1, followed by 12 zeros. This is also the estimated number of grains of sand on a beach such as the one I am standing on.
So the daily spoken production of words in English can be compared to the number of grains of sand on this beach. Most of this production will be spoken language produced in the moment and never recorded. To this, we also have to add the written production through various media, online and offline. So given the vast amounts of language out there, analysing language might seem like an impossible task. We are dealing with a sea of words that we can easily drown in. But it is corpora that come to our rescue. The key notion here is a corpus as a sample.
“Sample” is a statistical term; a sample gives us great power to make inferences about the whole of a language based on a small but carefully selected subset. So how does this work? I’m holding a test tube with water from Morecambe Bay. This is a tiny sample, which I can use to determine the properties of the water in the sea, such as its acidity. When I drop in a litmus paper, I can see that the pH of the water is around eight, which is slightly basic. Interestingly, as we know from regular measurements by scientists, coastal waters are becoming more acidic due to pollution, lowering the pH value, with negative implications for the whole ecosystem.
In a similar way, we can use corpora, samples of language, to measure frequencies of linguistic features, their distributions across genres and registers, and their development over time. For example, by using five small samples of written British English of 1 million words each, we can observe a decline in the use of modal verbs such as “must”. In this graph, we trace the mean frequencies of the modal “must” in British texts from 1931 to 2016. The graph shows a pattern of steady decline with the overall decrease in the frequencies of “must” being 55% between 1931 and 2016. This is evidence of the fact that, over time, English started preferring less direct ways of expressing that something should be done.
For instance, we now might be more likely to say, “You might like to do it”, instead of, “You must do it.” So, we have used five samples with a total size of only 5 million words to observe a major linguistic process in English, the decline in modal verbs such as “must”, happening over a period of 85 years. As with the test tube in our illustration, we don’t need to pour the whole ocean into the test tube to measure the quality of the water. So in corpus linguistics, we don’t need to include all the language that is out there. That wouldn’t be practicable. We just need to create a fair, representative sample, which will help us answer our research questions.
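The comparison behind such a graph relies on normalised frequencies, so that counts from samples of different sizes are put on a common base such as occurrences per million words. Here is a minimal sketch in Python; the raw counts and intermediate years are invented for illustration, chosen only so that the overall drop matches the 55% decline between 1931 and 2016 reported in the lecture:

```python
# Hypothetical raw counts of "must" in five 1-million-word samples of
# written British English (illustrative numbers, not the real data).
samples = {1931: 1200, 1961: 1000, 1991: 820, 2006: 650, 2016: 540}
SAMPLE_SIZE = 1_000_000  # running words per sample

def per_million(count, total):
    """Normalise a raw count to occurrences per million words."""
    return count / total * 1_000_000

freqs = {year: per_million(c, SAMPLE_SIZE) for year, c in samples.items()}
decline = (freqs[1931] - freqs[2016]) / freqs[1931] * 100
print(f"'must': {freqs[1931]:.0f} -> {freqs[2016]:.0f} per million words "
      f"({decline:.0f}% decline between 1931 and 2016)")
```

Because each sample here happens to be exactly 1 million words, the normalised figures equal the raw counts; with samples of unequal sizes, the normalisation step is what makes them comparable.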
Ultimately, the size of a corpus is a matter of granularity. The larger the corpus, the more detailed our analysis can be. In practice, corpora range in size from tens of thousands of running words to billions of running words. We can visualise the corpus sampling process by imagining a large canvas of different shades of colours representing different varieties of language, different genres and registers, and different contexts in which language is used. This could be an informal conversation around the dinner table, a public lecture, a TV programme, a newspaper article, a fiction book, a tweet, a text message, and so on.
We then have a small container, our corpus-to-be, which moves into different corners of language and includes texts and speech that represent that particular use of language. When the exercise is finished, in the ideal case, the colour of the container would match the colour of what we call the population, that is, the language out there. This is what we call the representativeness of a corpus: the ability of the sample, the corpus, to reflect the population on a small scale in the crucial aspects relevant to our research questions. So how do we achieve this in practice? The first step in the process of building a corpus is corpus design.
We have to decide which types of texts we want to include in the sample and what proportions of these we need. This is called corpus balance. Balance is often determined by the tradition looking at similar corpora and making our corpus comparable with them, and also by our specific research questions and the broader needs of the research community we want to share the corpus with. I will show you an example of the British National Corpus 2014, a general corpus of spoken and written British English that was developed at Lancaster University. It is a relatively large corpus, containing 100 million words. This table shows the major genres and registers in the BNC 2014, as well as their proportions.
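A design table of this kind translates directly into concrete collection targets. The following sketch uses the 100-million-word total and the 90/10 written/spoken split mentioned in the lecture; the finer register proportions are invented for illustration, not the actual BNC 2014 figures:

```python
# Turning a corpus design (register proportions) into word-count targets.
# Register proportions below are illustrative placeholders.
TOTAL_WORDS = 100_000_000

design = {                # register -> proportion of the whole corpus
    "academic prose": 0.16,
    "fiction":        0.19,
    "newspapers":     0.18,
    "magazines":      0.17,
    "e-language":     0.10,
    "other written":  0.10,
    "spoken":         0.10,
}
assert abs(sum(design.values()) - 1.0) < 1e-9  # a balanced design sums to 1

targets = {reg: round(p * TOTAL_WORDS) for reg, p in design.items()}
for reg, words in targets.items():
    print(f"{reg:15} {words:>12,} words")
```

Fixing the proportions before collection begins is what keeps the balance stable as new texts come in: each register has an explicit quota rather than growing with whatever data happens to be easiest to obtain.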
We can see that 90% of the corpus is formed by writing across different registers, such as academic prose, fiction, newspapers, magazines, e-language, and other written registers, and 10% is spoken language. BNC 2014 is an example of a large general corpus representing British English, with a broad range of genres and registers included. However, sometimes we want to focus on a single genre or register, or on a particular context. Such a corpus is called a specialised corpus, and is typically smaller in size than a general corpus. An example of a specialised corpus is an academic language corpus, either written or spoken. Another type of corpora are L2 corpora, sometimes also called learner corpora.
These are corpora, such as the Trinity Lancaster Corpus, which capture the language of L2 speakers, that is, speakers of a second or additional language. Researchers interested in the history of language can use historical corpora, sometimes also called diachronic corpora. These sample language at different historical periods. An example of a historical corpus is The Times corpus. Finally, there are also parallel corpora, sometimes called translation corpora. These consist of translations of the same texts into different languages and help us understand languages from a contrastive perspective. This is useful for translation studies and many other types of research. Once we decide on the design, we need to go out and collect the data.
The method used for this is stratified random sampling, where possible, to avoid bias and to increase the representativeness of the corpus.
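Stratified random sampling means drawing texts at random *within* each stratum (here, each genre), so that every genre ends up represented in its designed proportion while the choice of individual texts remains unbiased. A minimal sketch, with an invented candidate pool and invented quotas:

```python
import random

# Stratified random sampling of texts: random choice happens inside each
# genre stratum, so the genre proportions are fixed by design.
# Pool contents and quotas below are hypothetical.
random.seed(42)  # reproducible draw

pool = {  # genre -> candidate text IDs available for inclusion
    "news":    [f"news_{i}" for i in range(500)],
    "fiction": [f"fic_{i}"  for i in range(300)],
    "spoken":  [f"spk_{i}"  for i in range(200)],
}
quotas = {"news": 40, "fiction": 30, "spoken": 10}  # texts per stratum

corpus = {genre: random.sample(texts, quotas[genre])
          for genre, texts in pool.items()}
print({genre: len(chosen) for genre, chosen in corpus.items()})
```

Compare this with simple random sampling over the whole pool, where a small stratum such as spoken language could easily end up under-represented purely by chance.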
There are four main sources of data in the BNC 2014: web crawling, data from publishers, data from authors, and public participation in scientific research. That is, volunteers who, for instance, record their conversations and share them with corpus builders. However, on a small scale, corpus data can be easily collected manually, especially if we want to focus on a single genre or register, or a particular context. So, we have come to the end of the lecture. The tide is coming in, reminding us of the sea metaphor which we started with. We now have time for a quick recap before we have to move out of the way. In the vast sea of language, corpora, the samples of language, make linguistic analysis manageable.
They are like beacons, lighthouses that help us navigate through the large range of genres and registers that our language includes. Corpora are representative of the language, or language variety, they were designed to sample. We also aim at balanced and comparable corpora.
There are different types of corpora: general and specific, academic language corpora, L2 corpora, historical corpora, and parallel corpora. We always need to select a corpus according to the particular aim of our research, guided by our research questions. The size of a corpus is dependent on how much evidence we need to answer our research question. The more detailed the research question is, the more evidence we need. Thank you for listening to this lecture.

This short lecture introduces corpora as samples of language. It is a lecture from the new programmes offered by Lancaster University: the MA and PGCert in Corpus Linguistics.

The programmes include a specialised module on ‘Corpus design and data collection’, which can be taken for credit either separately or as part of the MA or PGCert. The module develops key skills in corpus design and data collection. Building on a long tradition of corpus development at Lancaster University, and providing specific examples from recent projects such as the British National Corpus 2014, the Guangwai Lancaster Corpus of L2 Chinese, or the Trinity Lancaster Corpus, the module offers both theoretical knowledge and practical skills for students to be able to build their own corpus.

This article is from the free online course Corpus Linguistics: Method, Analysis, Interpretation.

Created by
FutureLearn - Learning For Life
