In conversation with Adam Kilgarriff

Professor Tony McEnery speaks with Adam Kilgarriff about Corpus Linguistics
13.8
OK, so I’m delighted today to be talking to an old friend, Adam Kilgarriff, and thanks for having me to your home for this conversation. I think we first met maybe 20, 22 years ago– something ridiculously long, anyway. How did you first get interested in the study of language? Because I think, if I remember right, your first degree wasn’t linguistics. Is that right? Yeah, that’s right. I started with philosophy as an undergraduate, and I feel quite like a philosopher, but jobs in philosophy aren’t easy to come by. So after a few years in the wilderness, I went back to university to study artificial intelligence, thinking I’ll get on well with computers, which I did moderately.
56.6
But my course wandered across to that part of artificial intelligence which is how we represent the kind of knowledge we have in the human head on a computer. And part of that knowledge is lexical knowledge, knowledge about words. So my route to linguistics was really– that was my route to linguistics. My PhD thesis started being about meanings of words, and that took me to the heart of linguistics. So when you were doing philosophy, did you do philosophy of language– Wittgenstein, Austin, Searle, that type of stuff? A little bit, but I found it desperately hard. I just felt irredeemably confused about what meanings were, so I failed entirely to get my head around what meanings were.
103
But now, 30 years later, I kind of feel like I’ve got a fair idea now. I’m satisfied– satisfied for myself that I’ve got a fair grasp of what meanings are. Because meaning, in a way, the word meaning has sort of been quite a theme for your research. You did a lot of your early work on word sense disambiguation, if I remember well. What’s that about? Well, the basic problem that’s usually stated is that if a computer is going to understand human language, then it’s got to work out the structure of sentences. It has to work out the grammar.
141.8
It’s got to work out everything to do with the endings, so that it knows that finishing is part of the word finish. And one of the other things it needs to do is to work out which meaning of the word you’ve got, which sounds obvious when you’re looking at a word. When you think, well, what about bank? It can mean a river bank or a money bank, so the computer has to work out which. And then all of my thesis was about how that isn’t a constructive way of looking at it because, in fact, most of the distinctions in meanings between words are much subtler and much more a matter of opinion and a matter of careful judgement.
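The "which bank?" framing described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method described in the conversation: it uses a simplified gloss-overlap heuristic (in the spirit of the Lesk algorithm), and the two-sense inventory, the glosses, the stopword list and the function names are all made-up assumptions.

```python
import re

# Hypothetical two-sense inventory for "bank"; the glosses are invented for illustration.
SENSES = {
    "river_bank": "sloping land beside a river or stream, made of earth",
    "money_bank": "institution where people deposit, store and borrow money",
}

# A tiny, hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "where", "in", "to", "on", "is"}


def tokens(text):
    """Lower-case the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the sentence.

    Returns (best_sense, scores). On a tie, the sense listed first wins.
    """
    context = tokens(sentence)
    scores = {name: len(context & tokens(gloss)) for name, gloss in senses.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(disambiguate("She went to the bank to deposit money"))
    print(disambiguate("We sat on the bank beside the river"))
    # A case that is "sort of both": the overlap scores tie.
    print(disambiguate("The bullet bank is an earth mound where the boys store bullets"))
```

With only two senses listed, a sentence about an earth mound where things are stored scores equally on both, which is the kind of case where a fixed list of senses stops being a helpful way to frame the problem.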
175.6
Because I think bank is often used as a sort of ready-made example. It’s fairly clear that this thing has quite distinct meanings. But can you give us an example maybe of– Well, I like to stick with bank. Then you start thinking about banks of clouds and metaphorical banks, also things like blood banks. I remember being particularly struck when I went out with my little brother right about this stage, and there’s a place he went to in the woods which he called the bullet bank because there had been bullet practice, firing practice, around there in the Second World War, and there were lots of bullets that he and his friends would find in the bank.
217.1
And I was kind of thinking, well, is that the kind of bank you get beside a river because it’s a big earth like that, or is it the kind of bank, like a money bank, because it’s where you store things? It’s sort of both. It’s sort of both. So even bank isn’t really an easy case. OK, well, I’ll remember that. Hopefully a clever undergraduate will shoot me down with that next time. Shoot you down would be most appropriate. It’s all about bullets.
243.5
So, well, that’s a major task. Is it one that’s sort of cracked by the computer boffins yet? Oh, not at all. It’s one that runs on and on. It was one of the earliest problems that people had been trying to solve since the 1950s. One thing I did which was useful, I think, for the field was that in the late 1990s I organised an exercise called Senseval. Oh, yeah, I remember. And the idea there is that anyone who thinks they’ve got a good word sense disambiguation programme can compete. We’ll set up a competition where we’ve got people to say this is the right answer here. This is the right answer here.
285
And then we’ll give them the dictionary that we use and, as their computer programme is trying to work out what the right answers are, we score them all. So you sort of had a crib sheet somewhere of right answers, and they had to sort of approximate that with their system. And then we could score them all and tell them who had done better and who hadn’t, which was quite– which was a clever way of getting clearer about which technologies are good and which ones aren’t. If everyone does it independently then it’s hard to work out whether their results saying 93% are comparable to their results. Because the data is different. Is it indelicate to ask who won?
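The shared-task idea described above, where every system is scored against the same "crib sheet" of right answers, can be sketched as follows. This is a minimal sketch under stated assumptions, not the scoring actually used in the competition: the instance ids, sense labels, system names and the precision/recall definitions below are all hypothetical.

```python
# Hypothetical gold-standard answers: instance id -> sense chosen by human annotators.
GOLD = {"bank.1": "money", "bank.2": "river", "bank.3": "money", "bank.4": "river"}

# Hypothetical system outputs; a system may leave some instances unanswered.
SUBMISSIONS = {
    "system_a": {"bank.1": "money", "bank.2": "river", "bank.3": "river", "bank.4": "river"},
    "system_b": {"bank.1": "money", "bank.2": "money", "bank.3": "money"},
}


def score(answers, gold):
    """Score one system against the shared gold standard.

    precision = correct / instances the system attempted
    recall    = correct / all instances in the gold standard
    """
    attempted = [i for i in gold if i in answers]
    correct = sum(1 for i in attempted if answers[i] == gold[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall}


def rank(submissions, gold):
    """Score every system on the same data and sort by recall, best first."""
    results = {name: score(ans, gold) for name, ans in submissions.items()}
    return sorted(results.items(), key=lambda kv: kv[1]["recall"], reverse=True)


if __name__ == "__main__":
    for name, result in rank(SUBMISSIONS, GOLD):
        print(name, result)
```

Because both systems are scored on the same instances with the same sense labels, their numbers are directly comparable, which is not the case when each group reports a figure on its own data.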
322.9
I think that– it was a long time ago. It is indelicate, and I’ll answer it this way, because we did the first one in 1998, and then it’s been running every year since. So we clearly set up something that had legs, and so there have been lots of different winners in between time. A very diplomatic answer. One thing that was established was– the original winner, the winner of the first one– everything was debated. It’s academia; of course, everything’s debated. He was a PhD student at the University of Durham. Oh, good for him. But he’s not working in the field anymore, and some people would say that his method didn’t follow the rules.
361.7
But it wasn’t very clear what the rules were in that particular topic. I suppose the fact that it’s still going is evidence that this is an enduring problem, i.e. there hasn’t been a knockout winner, and there’s no cause to carry on, pack up and go home. No, it’s very much because you realise the more you look into it that the big problem is with the list of senses that you’re going to disambiguate between. Do you just have one money bank and one river bank? Or do you have lots of others, including banks of clouds and bullet banks? And what you tend to need– it’s quite likely you’ll need a different– we call that the list of senses, the sense inventory.
401.6
It’s quite likely you’ll need a different sense inventory for each task, so it becomes a very task-specific problem. I suppose also people approaching your competition with a system may have actually developed a different sense inventory, did you call it, themselves, so they would have the problem of sort of retooling that to work with your sense. And what might work with their sense– I can see it’s quite a complicated thing. I don’t think I’m ever going to enter the competition. One way that we attempt to deal with the sense inventory problem is that we did have one simple task back in 1998.
436.6
Now there’s a variety of tasks, and some of them ask questions like, which of these sentences are most similar? And another approach to it is, which of them have the same translation in another language? So these are all trying to get around the difficulty of agreeing on one sense inventory. And what would people use these systems for once they’re developed? Say they did produce the knock-out system. What type of applications are there, other than perhaps dictionary building? Yes, well, there are all sorts of big players who are interested.
469.8
Google would love to have word sense disambiguation so it could disambiguate your query to suit the– disambiguate all the documents on the web so that they can match them up more accurately. And– Big scale queries as well it can be helpful, so you could say what type of bank you wanted to visit. And then quite a lot of their effort applies to not so much the ambiguity of ordinary English words, but the ambiguity of names as well. I mean, have you got the same James Taylor here and there– things like that. So some of the same techniques have been used, but applied to a slightly different domain of ambiguity: names.
509.4
I wonder whether this is all linked then to another major interest I think that you’ve had over the years, which is in the raw data itself and the corpora, because my suspicion is that not much of that work would be possible without large bodies of examples from which to draw, so corpora I’d call them. And you’ve done a lot of really interesting work, say for example, in areas as diverse as web as corpus, I know you did a lot there. And also a lot of work on measures for comparing and contrasting corpora. So tell me a bit about that.

In this video Professor Tony McEnery speaks with Adam Kilgarriff about Corpus Linguistics. You can view the full conversation on YouTube

This article is from the free online

Corpus Linguistics: Method, Analysis, Interpretation

Created by
FutureLearn - Learning For Life