In conversation with Paul Baker

Paul Baker discusses his work on corpus-based approaches to discourse analysis.
OK. I’m happy to have with me Paul Baker to talk about his research in corpus linguistics. Now what’s the focus of your research in corpus linguistics? And why did you choose that focus? Well, my focus is more socially-motivated research. It’s about linking things like discourse analysis and critical discourse analysis and then using corpus techniques in order to investigate social research. I got into this in kind of a strange way. I started off doing my PhD, but it wasn’t corpus-based at all. OK. It was looking at representations of gay men and also the language that gay men were using as well. Is that your work on Polari? Yeah. I didn’t use any corpus linguistics in it.
But later I wanted to extend that research and to focus more on how gay men are being represented. So I had very large amounts of text. I had lots of newspaper articles. I had legal debates. I had magazine articles and personal adverts and things like that. And I wanted to think of a way that I could do, not a quick analysis, but something which would cover all of that data. Comprehensive. Yeah. So I turned to corpus linguistics. So it’s the volume of data in some ways that drove you to that. Yeah. Right. Yeah. Well, what kinds of research questions, then, did you want to ask of data like that? I think it’s quite an interesting question.
You can start off with almost no questions at all or very general or vague ones– such as, how is this group or this concept represented in this corpus?– which is a very vague question that can go lots of different ways. I think it’s OK to have maybe more specific questions. So in one case, we looked at representations of Muslims. We started off by asking how they were represented in newspaper data. But then we also wanted to know, based upon the way the corpus was set up, whether there was any difference between newspapers and whether there was maybe a change over time. But those were kind of high-level questions.
And as we started to engage with the data more, questions emerged as we went on as well, which we didn’t really think of to begin with. For example, when we looked at collocates of Muslim, we found that men and women came up as very strong collocates. And that kind of led to a question about how gender is represented within this data, because men and women seemed to be such salient concepts. So you, in some ways, get close to the corpus-driven approach to start with, in that you don’t necessarily start with lots of hypotheses. But then they emerge from the text. Or sometimes you do start out with hypotheses, yet things can still emerge from the text. Yeah.
I think the nice thing is not to say there’s one set way we’re doing this. If you want to keep your options open, you can. But if you have a very specific question or hypothesis that you want to look at, that’s fine as well. But I think all the way through there should be this allowance to kind of add something in that you haven’t thought of at the start. It’s quite important. To be open to being surprised. Yes. Yes. So what type of corpora have you worked with or indeed built? Because I know you’ve built some corpora. I think we keep coming back to newspaper corpora. Part of the reason is we’re familiar with them.
And I feel even though people are maybe not buying newspapers as much, they’re still reading them a lot online. And also, in terms of availability and ease of access, it’s easier to get those corpora, I think. We have at Lancaster something called Nexis UK, which is a large database. So we can use that to, you know, type in search terms for various concepts and get a lot of data very quickly. That’s something I’ve used a lot. But I’ve also looked at other types of data. I do try to still stick to what’s available online. It’s quite difficult, I think, sometimes when you just have lots of print. You have to scan it in yourself.
And I have done that in the past, but it takes– That is very time-consuming. It is, yes. And you’ve got to go and check everything, because the OCR software doesn’t work everything out anyway. Yeah. Hansard is quite a good thing online as well. OK. It has House of Commons and House of Lords debates, and I’ve looked at debates on fox hunting, on equalisation for gay men and things like that as well. OK. And again, that’s quite useful to get data from. Right. Well, I’ll make sure that the URL is online for our viewers so that they can have a look at Hansard themselves.
But it strikes me that a lot of your research questions are very social-science-oriented, that they’re all socially engaged and motivated.
Does that present you with particular problems in corpus-building or in using theory? Not really so much. I don’t think that’s a problem at all, no. OK, good. So no particular problems, then, for socially-motivated research and building corpora? I guess you could say there are the ethical issues. But one corpus that I kind of built is open access, or it’s publicly available. We can’t even change people’s names. It’s available online and– Yeah. Yeah. I think if you did want to look at, say, someone’s letters or diaries and they were private, there would be obviously ethical and permissions issues around that. But I think those are things that I haven’t had to really deal with. OK.
So the data’s there, so you dodge lots of issues of ethics. Also, the theory, to some extent, is there that you can use on occasion. What other methods, though, might you use with corpora or have you used with corpora? In terms of methods, again, I like to keep my options open. And sometimes the data itself will drive certain methods. But I find things like keywords and frequency lists are a good place to start, particularly if you don’t have those big research questions at the outset. And they help to give you a focus, a way into the data. I think keywords are very good at acting as signposts for what’s salient within a particular corpus or a specific text.
If I’m looking at a particular group, say Muslims or gay men or something like that, actually I do searches on that word or that term or related terms. Yeah. So I look at collocates of that term, and then I look at concordances. I look at concordances of that term and its collocates. I look at concordances of the term anyway to look for things like discourse prosodies as well. So I think those are probably the main methods that I use, yeah. So if there’s a shriekingly obvious set of words which seem to link to your research question, you can start with those.
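The workflow just described, searching a term, listing its collocates, then reading concordance (KWIC) lines, can be sketched in a few lines of Python. The tiny corpus and the window size below are invented purely for illustration, not taken from any real study:

```python
# A minimal sketch of the search-term workflow: collect collocates of a
# node word within a window, then build KWIC concordance lines for it.
from collections import Counter

corpus = ("the debate about muslim communities continued while "
          "muslim men and muslim women were interviewed").split()

term, window = "muslim", 2  # collocation window: 2 words either side

# Collect collocates: every word within the window of each hit.
collocates = Counter()
for i, w in enumerate(corpus):
    if w == term:
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                collocates[corpus[j]] += 1

# Concordance (KWIC) lines: the term with its immediate context.
kwic = [" ".join(corpus[max(0, i - window): i + window + 1])
        for i, w in enumerate(corpus) if w == term]

print(collocates.most_common(3))
print(kwic)
```

Real concordancers use a wider window (often 4 or 5) and a significance statistic on top of the raw counts, but the mechanics are the same.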
But if you’re more interested in the text collection and getting it to talk to you, maybe keywords are a better way. Yeah. Though, again, I think it’s good to keep your options open. You can have a list of search terms to start, yes, I think. But keywords are quite good, and they tell you some terms maybe you hadn’t thought of. Right. That’s a good point. You mentioned keywords. You mentioned collocation, concordances, of course. Everybody watching should now be familiar with those, because they’re using AntConc. What packages have you used? Do you always use the same one? Or do you stay flexible there also? Yeah. I started out using WordSmith.
That was the one that was available at the time, and it was the only one that I knew about where you could actually load your own corpora into it. That was about 10 years ago. And since then, more different types of software have become available. I think WordSmith is a very good tool, although you have to pay a little bit to use it. And sometimes it can be a little bit slow and clunky, complex for a new user to use. Idiosyncratic, should we call it? There are lots of windows and things like that. AntConc, I think, is a sort of slimmed-down version.
Maybe it can’t handle as much data as quickly, and I don’t think it has the full range of functionality that WordSmith has yet. But if I’m doing something where there’s only a small amount of data, a reasonably small corpus, maybe 200,000 words or something, then I’ll probably just use AntConc. But I’ve also used things like COHA, Mark Davies’ Corpus of Historical American English, which has enormous amounts of data. It’s all in an online interface, which is amazing. You can’t get access to the full texts, and you are a bit limited in terms of what you can do with it. And I’ve also used Sketch Engine as well, which is another great tool which is online-based.
You have to load your own corpora into it, which can be a bit tricky at times, though. They do have people that can help with the interface; you can talk to them and email them if you have– A problem. That’s the best time. –and they’ll help you do that. And one of the nice things about Sketch Engine, particularly for my research, is that I’m sometimes quite interested in agency and the positioning of certain social groups. You give it the data, and then it sorts the collocates and lists them in tables, so it will tell you all the collocates which are verbs which position something as an agent. Then it will tell you the collocates which are verbs which position something as a patient.
So it tells you who’s doing what– To whom, via collocates. So that’s really quite helpful with that. Right. Are there strengths to using the corpus approach for socially-motivated research? Definitely. And that’s one of the reasons why I got into it in the first place. It does allow you to handle these enormous amounts of data, which would be really difficult, I think, if you were just doing traditional CDA-based research, which tends to be very qualitative. And unless you have a huge team and lots of time, you can only really look at a few texts in detail. And the issue about looking at a few texts in detail is that sometimes there can be questions about why those texts were chosen.
Were they cherry-picked in order to prove an ideological point? Or are the findings within them somehow not very representative of kind of a wider pattern within a certain text type? You kind of get round that by looking at as many texts as you can of that text type. And also, I think corpora are very good at showing this kind of cumulative effect of discourse, where you have quite subtle patterns which might not be very obvious.
But when you see them again and again, a kind of drip, drip, drip effect over time, a certain positioning, or certain collocates associating a certain person in a slightly negative or positive way, you can really get to see that, I think, with a corpus approach. Yeah. You mentioned CDA, though. That’s Critical Discourse Analysis. Yes. Yeah. Which has been very influential. I don’t want to disparage CDA. A lot of the techniques that I’m using in corpus linguistics have come from CDA. And I think also some of the techniques of CDA, in their own right, are really, really helpful. Yeah, OK. Well, nothing’s perfect. We’ve talked about some strengths there.
Are there any potential weaknesses of using corpora for the type of research you do? Yeah. I think it can be quite easy to do a bad corpus-based discourse analysis, in that you can maybe underinterpret or overinterpret patterns you see if you don’t pay enough attention to concordance lines in detail. Sometimes you’ve got to really expand the concordance lines to the whole paragraph or even go to the whole text. And it’s quite easy sometimes to miss patterns, to see something and just maybe focus on perhaps the immediate left-hand or right-hand collocates but not see– The much broader pattern there. Yeah. Right.
That maybe actually someone’s using that collocate in order to negate it, to say, actually, people say this phrase, but actually it’s wrong. If you only look at the first few words either side of the search term you may not get that so much. Like in a paragraph with da da da da da da, but people who say that are completely wrong. Yeah. You need to see it in that broader context. Yeah. And also, the corpus data is very much taken out of context. So if you’re looking at something which is highly visual, unless you’ve tagged all those visuals or you’re doing some sort of multimodal analysis, you’re going to miss those as well. So there’s an issue there.
I think also the analysis is only really as good as the tool and also the person who’s doing the analysis, which I suppose is something you can say for any type of analysis, really, as well. Yeah. But I do sometimes feel like the tools maybe force us down certain routes. With keywords, for example, I think when we start doing the keywords analysis we’re automatically in a kind of difference mindset. We’re thinking, what’s different from this to this? And sometimes maybe the similarities are more important than the differences. But because the tool makes us think about keywords, we may miss that kind of pattern, and maybe we can overstate the differences as a result of that.
Sometimes, with keywords in particular, I’ve tried to be aware of that and maybe bring in a third or even a fourth reference corpus to compare the two corpora against. And that will give you the similarities as well as the differences. So there are ways around it, I think, as long as you’re aware of the limitations in the method. I’m guessing from what you said, though, you probably advocate some type of hybrid between corpus work and some close reading of individual texts where you can, to get that broader context. Yeah. Definitely. As I was saying before about reading concordance lines, I think that’s a really important aspect.
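The keyword technique under discussion, comparing a study corpus against a reference corpus, rests on a keyness statistic; log-likelihood is the one commonly used by tools such as WordSmith and AntConc. A minimal sketch follows; the two tiny "corpora" are invented for illustration, and only words relatively more frequent in the study corpus are kept as (positive) keywords:

```python
# A hedged sketch of keyword extraction via log-likelihood against a
# reference corpus. Real tools also apply significance thresholds.
import math
from collections import Counter

study = "muslim muslim community faith faith faith news".split()
reference = "news news news sport weather community travel".split()

def log_likelihood(a, b, n1, n2):
    """LL for a word seen a times in corpus 1 (size n1), b times in corpus 2 (size n2)."""
    e1 = n1 * (a + b) / (n1 + n2)   # expected frequency in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)   # expected frequency in corpus 2
    ll = 0.0
    if a: ll += a * math.log(a / e1)
    if b: ll += b * math.log(b / e2)
    return 2 * ll

f1, f2 = Counter(study), Counter(reference)
n1, n2 = len(study), len(reference)

# Keep only words overused in the study corpus, ranked by keyness.
keywords = sorted((w for w in f1 if f1[w] / n1 > f2[w] / n2),
                  key=lambda w: log_likelihood(f1[w], f2[w], n1, n2),
                  reverse=True)
print(keywords)
```

Swapping in a different reference corpus changes which words surface as key, which is exactly the effect of the third-corpus comparison mentioned above.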
And thinking about maybe some wider issues around context that you can’t get from just a corpus analysis, so thinking about things like conditions of production and reception of different texts, which can be a bit difficult if you’re using a very large reference corpus where there are thousands and thousands of texts and they all have different conditions of production and reception. But things like that, I think, can be quite important. So conditions of production and reception, then: it’s about who is producing the text and the processes around producing it. And then reception, what happens when people read it? Is that the pattern? Yeah. Some aspects of production are things like, what can people say in that society?
What rules and regulations, say with a newspaper, what are the guidelines that are set up in that country for a newspaper, in terms of what they can and can’t say? And are they trying to push the boundaries? With reception, maybe, you know, have people complained about certain articles? With newspaper texts, for example, we can go to the PCC, the Press Complaints Commission’s, website and look at which articles in the corpus have been complained about, which is really quite interesting. That’s interesting. A bit further back in the conversation, you were talking about cherry-picking and things like that.
Do you think, then, that the corpus and the sort of principles of corpus linguistics, total accountability, no prior selection of texts, do you think that this can be used almost like an amulet or charm to say that we’re moving away from researcher bias to some degree in our analyses? I think we’re moving away, but we’re not taking it away altogether. I think it’s really important to make that point. We’re always biased. And I think when we engage with corpus analysis, the tools themselves are biased, because they’ve been created by humans, and we’re biased when we use those tools. So we make decisions about what techniques we’re going to look at and what our research questions are.
We’ve chosen often to look, I think, at a topic or a group when we do this type of research because we maybe suspect there’s something interesting going on there, in terms of people manipulating readers, that sort of thing. So I think it’s there, you know. You can say, I’m going to be an objective researcher. But I think maybe that’s a bit of a fallacy. I think even the idea of an objective researcher, in terms of poststructuralism, people would say that being objective is itself a position, which is a different position. Yeah. So I think you can’t really get away from that. And that’s not a problem, I think, as long as you’re upfront about it.
Our technology and the corpus techniques can reduce the bias, but they can’t remove it altogether. For example, another thing, we have to decide where we impose cutoffs, say, when we’re looking at keywords or collocates. And that can introduce a level of subjectivity into the analysis. Yeah, absolutely. Also, sometimes we get so many keywords and so many collocates we have to make decisions about which ones we’re able to look at, which ones we’re going to analyse. And our eyes might be drawn to certain things because we think, yeah, that’s interesting. Or I know why that keyword’s there. I’m going to talk about that. And maybe you can neglect some of the other keywords and not focus on them as much.
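The cutoff decision being described is easiest to see with Mutual Information, a collocation statistic (mentioned later in this conversation) that rewards rare, exclusive pairings, so a minimum co-occurrence cutoff is the usual guard against one-off pairs dominating the list. All the counts below are invented purely to show the effect:

```python
# A sketch of how a frequency cutoff interacts with Mutual Information.
import math

N = 1_000_000          # corpus size in tokens (assumed)
f_node = 500           # frequency of the node word (assumed)

# (collocate, collocate frequency, co-occurrence frequency) - invented
candidates = [("hapax", 2, 2), ("common", 20_000, 60), ("mid", 400, 40)]

def mi(f_x, f_y, f_xy, n):
    """Mutual Information: log2 of observed over expected co-occurrence."""
    return math.log2((f_xy * n) / (f_x * f_y))

MIN_COOCCURRENCE = 5   # the cutoff itself is a subjective analyst choice
kept = [(w, round(mi(f_node, f_c, f_co, N), 2))
        for w, f_c, f_co in candidates if f_co >= MIN_COOCCURRENCE]
print(kept)  # the rare "hapax" pair is excluded despite its high MI
```

Moving `MIN_COOCCURRENCE` up or down changes which collocates survive, which is precisely the subjectivity the speakers are pointing at.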
Simply because of where you’ve chosen to look. Yeah. Or maybe even something as mundane as word count limits on a journal article. You can’t look at every keyword you get. You have to be a bit cavalier and say only the top 10 or the top 20 or something like that. Should we say economical rather than cavalier? OK. Well, that’s good, then. So this sort of removal of bias is a matter of degree rather than an absolute. Yes. OK. So we’ve accepted that. We’ve accepted the approach. We’ve accepted it’s relatively hybrid. Can we now answer all research questions using this approach? Many of them I think we can. But again, I think we should be careful about speaking in absolutes.
So, I think, as I was saying earlier, it’s often useful to consider things outside the corpus and think about the social and historical and political and economic and all the other contexts– All of that on the outside which is informing it. Yeah. And try to bring it in. Because the corpus can’t really explain the results. It can only really show us things, including things we hadn’t thought of. But it doesn’t really explain why we’re seeing certain patterns or a certain word next to another word. We have to look outside to society. And that’s something I’ve tried to do more and more with my research. I didn’t do it so much when I started.
But I think the more I’ve tried to engage with CDA and bring CDA in– and that’s something that they are very strong on, thinking about the wider context. So you’re looking outside of the corpus into the society that created it in order to explain observations that you made within the corpus. Is that right? Yes. And trying to link to certain events. So there’s sort of a historical aspect to this, looking at changes in society. Because laws and things like that can be very useful. Household surveys? Things like that? Yeah. Panel surveys? Yeah. All types of social scientific information. Definitely. Things like attitude surveys, so attitudes towards, say, a certain social group. Has that changed over time as well?
And also demographic information as well, you know. Anything that links through to the collocates and things like that, so you might see if the collocates have kept in step with a change in attitude reported in some survey, for example. Yeah. I’m just trying to understand. Yeah. Or really enormous events which have happened. Something like 9/11, if you’re looking at representations of Muslims in the press, can help enormously to explain certain things. Yeah. Yeah. That’s really very interesting. Is there a problem, then, in getting people outside of academia interested in your work? Because a lot of what you’ve just been talking about sounds socially relevant.
It also sounds as though people in various halls of government should potentially be interested. They produce a lot of that extra data you’ve been talking about. Any challenges there or opportunities? Lots. I think as social scientists we should be trying to engage a great deal with people. We don’t want to just do our research and have it only be read by a few other academics. An academic vacuum. Yeah. And not to have any impact, I think that’s terrible when that happens. But it does happen an awful lot, I think. Yeah. I think one of the problems with this approach is that it’s actually still such a new approach, so people aren’t aware of it, even within academia.
And so we’re trying to get it out there and tell other people about it. I think people can find it possibly a bit too technical. And there is lots and lots of terminology within it, which can be a bit impenetrable, I think. So we need to think of ways, I think, of making our language not dumbed down, but just accessible to people who don’t know about terms like collocation or what a keyword is or mutual information or things like that. And I think we need to find ways of engaging with interested parties to do that. Because, I suppose, sometimes this might be members of the public.
Your work on Muslims and Islam, for example, many people in the UK, potentially, I suppose, would be interested in reading about that. They might have read the press and thought, that doesn’t sound quite right. But maybe your research has the capacity to, if you like, in some structured way explain what they experience. Yeah. And we have started working with a group called Engage who are looking at the media and representations.
But you know, some of the feedback we’ve been getting sometimes is that the public don’t often want very lengthy and detailed explanations of methodology and theory. They want results. And they want results to be sort of snappy and short and kind of visually interesting as well. I think using maybe a visual means to get a point across, using graphs and things like that or pie charts with pictures, in order to push a point across is quite important. Right. Yeah. So I think we need to learn new ways to engage with the outside world. And Engage, isn’t that an organisation which encourages Muslims to engage with civic society in the UK, British Muslims? Yes. That’s very positive.
And don’t they work with the All-Party group on Islamophobia in the Parliament– Yeah. –so you get some sort of connection through to policy there? Yeah. Yeah. And that would be great, if that could happen. Have you done that at all? Yeah. Yeah. We gave a presentation at Engage’s launch event at Parliament in November. Oh, right. So we’re working with that. Well, look, thanks very much. That was really very interesting. Absolutely fascinating, I’m sure, especially for social scientists watching. Thank you. Thank you, then.

This presentation is an optional activity.


This article is from the free online course Corpus Linguistics: Method, Analysis, Interpretation.
