
Statistics meets corpus linguistics

A lecture on statistics
So in our first lecture, we will be looking at the basic concepts of statistics in corpus linguistics. So statistics meets corpus linguistics. So where do we start? Well, we can start with the definition of statistics and what statistics can do for corpus linguists. I’ll offer two definitions of statistics. One is by Diggle and Chetwynd, who say that statistics is a science of collecting and interpreting data. The second one, which is more informal, is my own definition: statistics is a discipline which helps us make sense of quantitative data. Whenever we have numbers, statistics can help us make sense of them. There are three basic things that we can do with statistics: generalise, find relationships, and build models.
And I will go through these one by one, giving you some examples from corpus data. So, generalising. In this example, I am looking at the use of adjectives by fiction writers. I’ve taken 11 randomly sampled writers from the British National Corpus, looked at the frequencies with which they use adjectives in their writing, and ordered these from the lowest to the highest frequency. Obviously, I could list all these numbers. But instead, I can give you one number as a representative of the whole group. This number is called the average. There are different types of averages in statistics. One of them is called the mean.
The mean is calculated by adding all these values together and dividing the total by the number of observations, in this case the number of fiction writers. Our mean for this dataset is 591.41. However, there are different types of averages. For instance, the median. The median is the value that sits right in the middle of the ordered dataset. In this case, it is the value of 567 that represents the whole dataset. By using an average, we can generalise over very large datasets. Another thing that we can do with statistics is to find relationships between different variables. In this case, I’m following the example of adjectives by British fiction writers. But in addition, I am also interested in the use of verbs.
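The two averages described above, the mean and the median, can be sketched in a few lines of Python. This is only an illustration: the transcript does not list the 11 frequencies from the British National Corpus, so the values in `adjective_freqs` are hypothetical and will not reproduce the figures of 591.41 and 567 quoted in the lecture.

```python
from statistics import mean, median

# Hypothetical adjective frequencies for 11 fiction writers, ordered from
# lowest to highest. These are NOT the BNC figures used in the lecture.
adjective_freqs = [410, 455, 480, 520, 540, 560, 600, 640, 690, 750, 820]


def summarise(values):
    """Return (mean, median) for a list of frequencies."""
    # Mean: sum of all values divided by the number of observations.
    # Median: the middle value once the data are sorted.
    return mean(values), median(values)


avg, mid = summarise(adjective_freqs)
print(f"mean = {avg:.2f}, median = {mid}")
```

With an odd number of observations, as here, the median is simply the sixth of the eleven ordered values.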
So in this case, the first line represents the frequencies of adjectives in British fiction writers, and the second one represents the frequencies of verbs. Again, we are now dealing with a fairly complex matrix. But instead of giving you all these numbers, I can draw a simple line like this one. In statistics, this is the so-called regression line, or the line of best fit, which sits in the middle of our data cloud. In the graph that I’m showing here, we have the number of adjectives on the x-axis and the number of verbs on the y-axis. The circles are the individual fiction writers, and the line represents the trend that we can observe in the data.
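A line of best fit like the one described above is commonly obtained by ordinary least squares. The following is a minimal sketch under that assumption; the lecture does not say how its line was fitted, and the `adjectives` and `verbs` values are hypothetical, chosen only to show a downward trend.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical per-writer frequencies (not the lecture's data).
adjectives = [400, 500, 600, 700, 800]      # x-axis
verbs = [1500, 1420, 1390, 1300, 1240]      # y-axis

slope, intercept = fit_line(adjectives, verbs)
print(f"verbs = {intercept:.1f} + ({slope:.2f}) * adjectives")
```

A negative slope means that writers who use more adjectives tend to use fewer verbs, which is the kind of trend the line in the graph summarises.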
What this line tells us is that there’s a negative (inverse) relationship between the use of adjectives and verbs. The more adjectives the fiction writers use, the fewer verbs they use, and vice versa. Finally, with statistics, we can build models. Models are complex equations that we can use to describe the reality that is out there. Let’s take this example. Suppose we are interested in the area of Great Britain and how to calculate it. Obviously, we can look this up very easily, but let’s say that we want to find a principled mathematical way of finding the area of Great Britain. In this case, we would use geometrical modelling.
So we can apply different geometrical models such as a rectangular model, which doesn’t fit that well. We can use a circle. Again, there are some problems with this model. Or the triangle, which seems to be the best model for this particular exercise. It’s not a perfect model, and models never are, but it captures the area of Great Britain best out of the three options. We know that there’s a simple equation for calculating the area of a triangle, which is base times height divided by 2. And when we take the measurements from Google Maps and plug them into the equation, we get 234,000 square kilometres.
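The triangle model above amounts to a one-line calculation. In this sketch the base and height are hypothetical round numbers, chosen only so that the result matches the 234,000 square kilometres quoted in the lecture; the actual Google Maps measurements are not given in the transcript.

```python
def triangle_area(base_km, height_km):
    """Area of a triangle: base times height divided by 2."""
    return base_km * height_km / 2


# Hypothetical measurements (assumptions, not the lecture's values).
BASE_KM = 520
HEIGHT_KM = 900

area = triangle_area(BASE_KM, HEIGHT_KM)
print(f"Triangle model: {area:,.0f} square kilometres")
```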
When compared with the reality, we can see that we actually got pretty close. And this is the whole point of mathematical modelling. We can take relatively simple equations to model the very complex reality that is out there; in corpus linguistics, the reality of language. So what can statistics actually do for us? There are two areas of statistics: descriptive and inferential statistics. So we can describe and we can infer. Let’s look at some of the key concepts within each of those areas. So for descriptive statistics, we will be talking about datasets, frequencies, dispersions, graphs, and collocations. Whereas for inferential statistics, we will be talking about statistical tests, p-values, 95% confidence intervals, null hypotheses, and so on.
Let’s have a look at statistical testing as an area of inferential statistics because it is probably the most visible area of statistics: you might come across it when you read different articles. So statistical testing is an area where we take a hypothesis and examine whether we can actually be confident about it. So the first step is to formulate a hypothesis. In our example, we are taking a sociolinguistic hypothesis that men and women use language differently. Our next step would be to formulate what is called a null hypothesis, a hypothesis that contradicts the hypothesis that we have just stated. It’s a reverse hypothesis: there is no difference between how men and women use language.
Then, we would go out and collect the data into corpora and investigate this hypothesis. So imagine we have two corpora, a male corpus and a female corpus, and we find out the frequencies of the variables in question. Let’s say that we come up with two numbers, 16 and 14. And then we ask the question, is there a real difference between these two corpora? Well, at face value, 16 is larger than 14, but they are fairly close to each other. So is there a real difference? And this is where inferential statistics can help us.
So the question that we are asking is, is the difference that we observe in our datasets in those two corpora due to chance, or is it statistically significant? What we do is we run a statistical test, and the test depends on the shape of the data and our research design, but each of those tests will produce what is called a p-value, a probability value. The p-value is the probability of seeing values at least as extreme as observed if the null hypothesis were true. Conventionally, our cutoff point for p-values is 0.05, or 5%.
So if the p-value is smaller than our conventional 5%, we can claim that the result is statistically significant, that there is a statistically significant difference between the male and female use of language in our example. On the other hand, if the p-value is larger than 0.05 or 5%, we have to conclude that there is not enough evidence in our data to reject the null hypothesis. So the null hypothesis still may be true. We never accept it. But at that stage, we need to go back to the data and perhaps collect more data to be able to see whether there is a real difference out there.
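To make the decision rule above concrete, here is a minimal sketch using the two numbers from the example, 16 and 14. The lecture does not name a test, so two assumptions are made here: the numbers are treated as raw counts from two corpora of equal size, and an exact two-sided binomial test is used, in which the null hypothesis says that each of the 30 occurrences is equally likely to come from either corpus.

```python
from math import comb

ALPHA = 0.05  # conventional cutoff mentioned in the lecture


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k out of n under a 50/50 null hypothesis.

    Sums the probabilities of all outcomes at least as far from n/2
    as the observed count k.
    """
    observed_distance = abs(k - n / 2)
    extreme = sum(
        comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= observed_distance
    )
    return extreme / 2 ** n


male_count, female_count = 16, 14   # figures from the lecture's example
p_value = binomial_two_sided_p(male_count, male_count + female_count)

if p_value < ALPHA:
    print(f"p = {p_value:.3f}: statistically significant")
else:
    print(f"p = {p_value:.3f}: not enough evidence to reject the null hypothesis")
```

Under these assumptions the p-value is far above 0.05, so a difference of 16 versus 14 gives no grounds for rejecting the null hypothesis. With corpora of unequal size, the counts would first need to be considered relative to corpus size.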
Now, let’s look at statistical testing in a larger picture, and let’s look at different levels of statistical analysis. Or we can think of these as different steps which we can take in order to explore and investigate data. The first dimension is data exploration. The key question to ask here is, what are the main tendencies in the data? This is the area of descriptive statistics where we look at graphs, means, and standard deviations. The second area that I have already mentioned is the area of inferential statistics, where we look at the amount of evidence that we have against the null hypothesis. The question that we are asking is, do we have enough evidence to reject the null hypothesis?
So is the effect that we see in the sample, in the corpus, due to chance, or does it reflect something about the population, about how language is used out there? For that, we carry out statistical tests, test statistical significance, and come up with p-values, confidence intervals, and so on and so forth. The third area is the area of effect sizes. In addition to inferential statistics, we traditionally report effect sizes. Effect sizes measure the amount of effect that we can see in our data in terms of a standardised measure, such as Cohen’s d or r. We are interested in what the effect is.
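As an illustration of a standardised effect size, Cohen’s d for two groups can be sketched as the difference between the two means divided by the pooled standard deviation. The frequencies below are hypothetical and stand in for per-speaker values in two corpora.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical per-speaker frequencies (assumptions, not lecture data).
corpus_a = [12, 14, 16, 18, 20]
corpus_b = [10, 12, 14, 16, 18]

d = cohens_d(corpus_a, corpus_b)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in standard deviation units, it can be compared across studies that measure different linguistic variables.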
An effect is something of interest to scientists that we can observe in the data, and based on this we can contextualise our findings. And finally, as linguists and social scientists, we are interested in the linguistic and social interpretation of our findings. So the question that we are asking here is, is the effect that we observe socially and linguistically meaningful? So we connect this with social and linguistic theory and build the overall picture based on our data. What is also important, and I’ll give you just some examples here, is having the tools of data exploration and data visualisation. One of the tools that is probably most effective is a graph that we call 95% confidence interval error bars.
This graph shows us the mean values for different corpora that we would want to compare and contrast and 95% confidence intervals as the red error bars in each case. What we can see is whether these error bars overlap or not. And based on that, we can make a judgement about statistical significance. If these error bars do not overlap at all, we can be certain that a statistical test would turn out significant. If these overlap to a large extent, then again, there wouldn’t be any statistical significance. If there is a tiny overlap, we would need to run a statistical test to confirm. However, with data visualisation, we can be fairly creative.
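The error-bar comparison described above can be sketched numerically: compute a 95% confidence interval for each mean and check whether the intervals overlap. This sketch assumes a normal approximation (mean plus or minus roughly 1.96 standard errors); for small samples a t-distribution would give somewhat wider intervals. The corpus values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(values, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for 95%
    centre = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical per-text frequencies in three corpora (not lecture data).
corpus_a = [12, 14, 16, 18, 20]
corpus_b = [30, 32, 34, 36, 38]
corpus_c = [13, 15, 17, 19, 21]

ci_a, ci_b, ci_c = (confidence_interval(c) for c in (corpus_a, corpus_b, corpus_c))
print("A vs B overlap:", intervals_overlap(ci_a, ci_b))
print("A vs C overlap:", intervals_overlap(ci_a, ci_c))
```

Here the intervals for A and B are far apart, while those for A and C overlap heavily, which corresponds to the two clear-cut cases described in the lecture.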
The second graph I’m showing here is an example of geo mapping. We take the corpus reality and map it onto a map of the world. In this case, I’m searching for places where people in the British National Corpus go or travel to. In terms of popularity, the more frequent the place, the larger the dot on the map. London would be the number one place, followed by many places across the UK. But then outside of the UK, it would be Paris, Rome, and New York.
And obviously, there are many different ways in which we can visualise data that you can learn about in the book. So finally, to summarise, corpus linguistics is a scientific method. And for that, it needs the tools of statistics. Successful application of statistical techniques in corpus linguistics depends on the use of well-constructed and unbiased corpora. Statistics uses mathematical expressions to help us make sense of quantitative data whenever we have numbers. Effective visualisation summarises patterns in the data without hiding important features. Most visible of all are the p-values, part of inferential statistics, but they form only a small part of statistics.
Statistical significance, practical importance, and linguistic meaningfulness are three separate dimensions, which should be considered and interpreted separately.

This short lecture introduces basic concepts of statistics in corpus linguistics. It is the first lecture of a lecture series on statistics from a new module on Statistics and data visualization offered by Lancaster University as part of an MA programme, PGCert and also individually for credit.

The module offers a practical introduction to the statistical procedures used for the analysis of linguistic data and language corpora. The module provides an overview of the main statistical procedures (e.g. correlation, cluster analysis and factor analysis, t-test, ANOVA, chi-squared test and regression models) used in the field of corpus linguistics together with examples of application of these methods.

This article is from the free online course Corpus Linguistics: Method, Analysis, Interpretation.
