CQPweb: Searching for words (part 2)

Watch Andrew Hardie elaborate yet further on how to conduct increasingly sophisticated word searches.
We’re still on the topic of the kinds of additional queries that we can do using this query box. But now we’re going to start looking at how we can search for different spellings of a particular word. Now in the First Folio Plus corpus, all of the spelling is very nicely curated. The old-fashioned 17th century or late 16th century spellings are still there, behind the scenes, but we’re able to search it using standard modern spellings. It’s all very nicely curated. But the Shakespeare project included the creation of a corpus where the spelling is a little less well behaved.
And this is what we call the EEBO TCP segment, which is a selection of books from EEBO, which is a very large online collection. And this is a very, very large corpus. It’s 100 times as big as the Shakespeare corpus and it’s books and other written published documents from around the period of Shakespeare. The example here that I’m going to take is the word lock. Let’s search for the word lock.
Big corpus queries take a little longer. We’ve got 1,000, nearly 2,000 examples from 300 million words, which is the size of this corpus. It’s 300 times as big as the Shakespeare corpus. That’s nice, isn’t it? But hang on. We know that in the 1600s, spelling was not yet standardised. So there are various ways in which the word could vary. Let’s take a look at another possibility. Locke with E on the end. We know that typesetters would often add or leave off E’s at the ends of words in the early modern period, basically to get the line to the right length.
If we’re in luck, then all of the examples of locke with an E will have been linked to the standard spelling lock without an E. So let’s see. No. We’ve got 32 examples where the regularisation, the linking of non-standard spellings to standard spellings, just hasn’t worked. The reason it hasn’t worked is simply because it was done by computer, whereas the ones in the Shakespeare corpus were done by hand and that’s why they’re better. So we’ve got a problem then. What about all the things that might end up on the end of lock? Well, to do this, we start using what’s called a wildcard search.
And a wildcard search uses a special symbol to indicate something that can vary in the search term. We’re going to use the star. The star means anything. So if we search for lock star, we will find the string L-O-C-K. We’ll find that word, but we’ll also find L-O-C-K-E and then anything else that might be added onto the end of the word lock. So let’s take a look. Here we are. We indeed have plenty of examples of lock. Do we have examples of locke with an E? Yes. There’s one. Locke with an E there. But we’ve also got lots of other things as well, as of course you can see. We’ve got lots of locks, sometimes with an extra E.
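To make the behaviour of the star concrete, here is a minimal sketch in Python. It is not how CQPweb itself works internally: it uses Python's shell-style matcher `fnmatchcase` as a stand-in, and both the word list and the helper name `wildcard_matches` are invented for illustration.

```python
from fnmatch import fnmatchcase

# Toy word list, invented for illustration; this is not real corpus output.
WORDS = ["lock", "locke", "locks", "lockes", "locked", "clock", "look"]


def wildcard_matches(pattern, words):
    """Return the words matching a shell-style wildcard pattern.

    In this stand-in matcher, * means any run of characters, including none.
    """
    return [w for w in words if fnmatchcase(w, pattern)]


# "lock*" keeps every word that starts with l-o-c-k, whatever follows.
print(wildcard_matches("lock*", WORDS))
# ['lock', 'locke', 'locks', 'lockes', 'locked']
```

The sketch shows the same over-matching seen in the results: the star lets in locks and locked alongside the variant spelling locke.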
We’ve got locked spelled the right way, the standard way or the modern way. We’ve got locks spelled not the modern way. Normally, we would want to have all of the different spellings of the word lock, but none of these different things, if we were trying to analyse the word lock. Just that word, not the headword. Just the word itself. We would want to get lock and locke with an E and then other variant spellings, but not the other forms locks and locked. So let’s have another go at the query. How can we do this? Well, what we can do is we can use an “or” query.
If you put something inside square brackets in a CQPweb query, then it interprets it as find me this or this. And the two things need to be separated by a comma. What do we want to have at the end of our L-O-C-K? Well, we want to have either nothing or an E. So that says find me L-O-C-K with either nothing or an E after it. And here we go. And there we are. It’s getting the two different variant spellings. Just what we wanted.
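Going by the description above (square brackets, alternatives separated by a comma, either nothing or an E after L-O-C-K), the query would be typed as `lock[,e]`. The following Python sketch mimics that "or" behaviour on a toy word list. It is an assumption-laden illustration, not CQPweb's own implementation: the helper `alternatives_to_regex` and the word list are hypothetical, and the sketch handles only simple bracketed alternatives.

```python
import re

# Toy word list, invented for illustration; this is not real corpus output.
WORDS = ["lock", "locke", "locks", "lockes", "locked", "clock"]


def alternatives_to_regex(query):
    """Compile a query such as 'lock[,e]' into a regular expression.

    Each [a,b,...] group becomes a regex alternation; an empty
    alternative (as in '[,e]') means "nothing at all here".
    """
    # Split the query into literal text and bracketed groups.
    parts = re.split(r"(\[[^\]]*\])", query)
    pieces = []
    for part in parts:
        if part.startswith("[") and part.endswith("]"):
            options = part[1:-1].split(",")
            pieces.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


pattern = alternatives_to_regex("lock[,e]")
# fullmatch requires the whole word to fit the pattern, so "locks" is excluded.
print([w for w in WORDS if pattern.fullmatch(w)])
# ['lock', 'locke']
```

Unlike the star, the bracketed alternatives admit only the endings that are listed, which is why locks and locked drop out.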
And you can see here that by using a non-fixed search, it allows us to start to grapple with some of the spelling issues that would otherwise get in the way of us finding all the examples that we’re interested in. That’s it for now. Thanks very much.

This talk is a continuation of the previous one. Again, Andrew Hardie hones your skills in word searching, but this time he is dealing with what is probably the greatest impediment to searching for a word (if, for example, one wants to try to antedate it), namely, spelling variation.

As usual, put any issues or concerns or simply interesting observations in the comments.

We strongly advise you to listen to Andrew Hardie’s talk in one window of your computer, and open up his program, CQPweb, in another, so that you can practise what he is saying as he goes along. Obviously, you will need to pause his talk periodically.

This article is from the free online course Shakespeare's Language: Revealing Meanings and Exploring Myths

Created by
FutureLearn - Learning For Life
