CQPweb: Creating subcorpora

Watch Andrew Hardie explain how to prepare CQPweb for the comparison of parts of the Shakespeare corpus (e.g. female versus male characters).
Hello, again. This time we’re going to look at how we can use CQPweb to compare different parts of the Shakespeare corpus. So comparing female characters to male characters, for instance, or comparing tragedies to comedies, or comedies to histories, or just taking one of the plays and comparing it to everything else to see what’s unique about it. For example, if we take King Lear and see what’s different in King Lear versus all the rest. This is a two-step procedure. I’m only going to be able to cover the first step in this video. First, we define the parts of the corpus that we want to compare.
And for this we use a function called Create/edit Subcorpora, which is here on the Main Menu. Once we’ve created our subcorpora (a subcorpus is a section that we’ve carved out of the corpus for a particular analysis), we can then compare them using the Keywords function. But let’s look at creating subcorpora, first of all. Now there are many different methods to create subcorpora. I’ve got no hope of going over all of them, so we’re just going to look at the most commonly used one, which is Select By Text Metadata. If I press Go, it gives me the new subcorpus screen.
And what this does is it gives you a copy of the restricted query list of text categories, and then lower down we have speaker categories. So what this allows us to do is pick out parts of the corpus according to the tick boxes that we select. So let’s create a subcorpus of just the tragedies. We tick the box we want, give it a name, “just tragedies”, and then we press Create Subcorpus. There it is. There’s my subcorpus now on the list down here. So it’s got 12 texts and it’s about a third of the whole corpus, because the whole corpus is one million words, roughly. So that’s easy. Let’s try female characters. Let’s call that women.
And then we’ll go down here to where the speaker data is. And we’ll go to sex, and we’ll select F. OK, let’s create that as a subcorpus. There we are. Now the size of that is measured in utterance units rather than texts, because the women are spread out across many, many texts. So we can also create one for men so that we can compare men to women. Let’s call this the men.
Down to the bottom. Men.
There we go. And I also said we would try comparing King Lear to everything else. So I’ll create a subcorpus that is just King Lear. So we go down to the Play Code section because we want to select just one play. We select them by code. This is listed in the corpus documentation. KL is the code for King Lear, so there it is. Find Create Subcorpus. There we are. So we’ve got just King Lear, just the tragedies, the men and the women.
Now before we can use the Keywords tool to compare these sections of the corpus, we need to make sure that we have frequency lists: statistical data on what is actually contained within these sections of the corpus that we’ve outlined. We need to have these available in order to do the comparisons, but they’re not available to start with, which is why we have this button that says Compile. We just click on it and wait, hopefully not too long. And Compile changes to Available. Super. Let me just make these available.
The ones that involve types of speaker typically take longer because their utterances have to be pulled out of lots of different texts. OK, that’s it. These are our subcorpora. We’ve got the sections of the corpus. We’ve got the data we’re going to use. Now we can go to this Keywords place and actually do the comparison, but that’s going to be in the next video.

We strongly advise you to listen to Andrew Hardie’s talk in one window of your computer, and open up his program, CQPweb, in another, so that you can practice what he is saying as he goes along. Obviously, you will need to pause his talk periodically.

In this talk Andrew Hardie explains how you can use CQPweb to compare different parts of a corpus (e.g. male characters versus female). The first step, covered in this video-talk, is to create “subcorpora” containing the parts of the corpus you are interested in. In the next video-talk, we will cover how to do the actual comparison (i.e. a keywords analysis).
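
For readers who find it easier to think in code, the two ideas in this talk (carving out a subcorpus by metadata, then compiling a frequency list for it) can be sketched in a few lines of Python. This is only an illustration of the concepts, not how CQPweb itself works: the utterances, the "OTHER" play code, and the function names make_subcorpus and frequency_list are all invented for the example. Only the KL play code and the F/M speaker categories come from the talk.

```python
from collections import Counter

# Toy utterances with invented metadata; this is NOT real corpus data.
# "KL" is the King Lear play code mentioned in the talk; "OTHER" is a placeholder.
UTTERANCES = [
    {"play": "KL", "genre": "tragedy", "sex": "F", "text": "nothing my lord"},
    {"play": "KL", "genre": "tragedy", "sex": "M", "text": "nothing will come of nothing"},
    {"play": "OTHER", "genre": "comedy", "sex": "F", "text": "love is merely a madness"},
    {"play": "OTHER", "genre": "comedy", "sex": "M", "text": "all the world is a stage"},
]


def make_subcorpus(utterances, **criteria):
    """Keep the utterances whose metadata matches every criterion (like ticking boxes)."""
    return [
        u for u in utterances
        if all(u.get(key) == value for key, value in criteria.items())
    ]


def frequency_list(subcorpus):
    """Count lower-cased, whitespace-separated word forms in a subcorpus."""
    counts = Counter()
    for u in subcorpus:
        counts.update(u["text"].lower().split())
    return counts


# Subcorpora roughly analogous to the ones created in the video.
women = make_subcorpus(UTTERANCES, sex="F")
men = make_subcorpus(UTTERANCES, sex="M")
king_lear = make_subcorpus(UTTERANCES, play="KL")

# "Compiling" a frequency list for one subcorpus.
lear_freqs = frequency_list(king_lear)
print(lear_freqs.most_common(3))
```

Notice that the women and men subcorpora cut across plays, whereas the King Lear subcorpus is a whole text. That mirrors the point in the talk that speaker-based subcorpora are measured in utterances rather than texts.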

The corpus documentation that Andrew refers to in his video can be found here:

As usual, put any issues or concerns or simply interesting observations in the comments.

