Big data: Enhanced Shakespearean EEBO-TCP

Watch Jonathan Culpeper elaborating on big data - the Enhanced Shakespearean EEBO-TCP.
This video talk is the final one in our series on big data. This time we focus on a segment of Early English Books Online (EEBO) that has been adapted for Shakespearean research. The problem with using the whole of EEBO for Shakespeare-related studies is that it is too broad. For example, for most purposes, we don’t need language from the late 15th century to better understand Shakespeare’s language, when he wasn’t even writing until about 100 years later. So as part of the Encyclopaedia of Shakespeare’s Language project, we set about creating a specially tailored subset of EEBO-TCP, enhanced for the study of Shakespeare. This work was led by Sean Murphy. Clearly we needed to reduce the date range.
We selected texts between 1560 and 1640, in other words 80 years in total. This encompasses two 40-year periods, or what would have been considered two generations at that time. This period encompasses Shakespeare’s lifetime, 1564 to 1616, and within that the probable date of his first play, around 1589 to 1591, and his last, around 1613 to 1614. The 40-year periods fall either side of the turn of the century, and that date is important because it is close to 1603, when Elizabeth the First died and James the First became king. It is also said by some scholars to have been a turning point in Shakespeare’s career.
We decided to exclude Shakespeare’s works for the obvious reason that if they were there, we would be comparing Shakespeare’s works with Shakespeare’s works. We also excluded works by the playwrights already included in the Enhanced Shakespearean Corpus of comparative plays. If we wanted to compare results against the plays corpus with results against the EEBO-TCP corpus, we wouldn’t want those results skewed by the fact that the same plays appear in both. Regarding size, the Enhanced Shakespearean EEBO-TCP corpus amounts to 5,697 texts, or just over 300 million words. Importantly, every text is placed in a genre that is part of a dedicated genre categorisation scheme. We will look at that in a minute.
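The selection criteria just described (a date window of 1560 to 1640, no Shakespeare, and no playwrights already in the comparative plays corpus) can be pictured as a simple filter over catalogue metadata. What follows is only a minimal sketch in Python, not the project's actual procedure: the record fields, the function names, the inclusive treatment of the end dates, and the split at 1600 are all assumptions made for illustration, and the sample records are hypothetical.

```python
from dataclasses import dataclass

# Date window stated in the talk; treating both ends as inclusive is an assumption.
START_YEAR, END_YEAR = 1560, 1640


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record for one text."""
    title: str
    author: str
    year: int


def select_texts(records, excluded_authors):
    """Keep texts inside the date window whose author is not on the exclusion list."""
    excluded = {name.casefold() for name in excluded_authors}
    return [
        r for r in records
        if START_YEAR <= r.year <= END_YEAR and r.author.casefold() not in excluded
    ]


def period(year):
    """Assign a year to one of the two periods either side of 1600 (boundary is assumed)."""
    return "1560-1599" if year < 1600 else "1600-1640"


# Hypothetical records, for illustration only.
records = [
    TextRecord("Hypothetical sermon", "Author A", 1595),
    TextRecord("Hypothetical play", "William Shakespeare", 1600),
    TextRecord("Hypothetical early tract", "Author B", 1490),
    TextRecord("Hypothetical pamphlet", "Author C", 1625),
]
kept = select_texts(records, excluded_authors=["William Shakespeare"])
for r in kept:
    print(r.title, r.year, period(r.year))
```

In this sketch the exclusion list would hold Shakespeare plus the playwrights of the comparative plays corpus, so that the same plays cannot appear on both sides of a comparison.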
You may remember that in our opening talk in this series I spoke about the word bastard. I drew conclusions about whether the word was colloquial, informational, formal, instructional, and so on, according to the frequency with which it was used in colloquial contexts, in informational contexts, and so on. To understand the flavour of a word, or any linguistic structure, we need to know how it is used, and knowing in what genres it is used is a big part of that. You may be wondering what evidence we used to decide what genre a particular text should be placed in. One method, and one pioneered by Tony McEnery at Lancaster University, was to use the information in the book titles.
If the title begins “The Sermon on blah, blah”, that would be pretty good evidence that the book belongs to the genre of sermons. Similarly, if through the book’s dedication or title page we can gather information about the readership and author, we can begin to gain further clues. If the title page, for example, declares the book to be for the education of Huguenot French refugees, we know it is likely to be an instructional text. Let’s take a look at the genre categorisation scheme as devised by Sean Murphy. In the left-hand column, you can see the broader domains within which the genres fall.
These are very useful for understanding the stylistic and social flavour of particular words or other linguistic items. That is, whether they are literary, religious, administrative, instructional, or informational. The genres are mapped onto these domains. Note that the genres represent the surviving printed texts from that period, and, of those, the ones that the Short Title Catalogue decided to include. Do not assume that these five domains are equal in size. The most populous domain is almost certainly religious, followed by literary, although proper counts are lacking. Two final points. First, it goes without saying that sometimes texts belong to more than one genre simultaneously. For example, you could have a pamphlet that is also espousing a particular religious doctrine.
Second, there are, of course, many subgenres that belong within the labels that are given for particular genres. Let’s wrap this up. At the beginning of these video talks, I explained how we might benefit from a very large collection of language data when studying Shakespeare’s language. The two particular points I made are that it would enable us to place Shakespeare’s language in its fullest linguistic context, and that it would afford us insights into what his contemporaries thought the language meant. I then went on to describe the history of Early English Books Online (EEBO), from Short Title Catalogue to digital text. I described some of the characteristics of EEBO and EEBO-TCP, notably the size, content and availability.
And finally I described the Enhanced Shakespearean EEBO-TCP corpus, how it is tailored for comparative study of Shakespeare, its characteristics, and in particular its genre scheme.
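The title-based evidence mentioned in the talk (a title beginning with "The Sermon on", or a title page naming the education of a particular readership) could be pictured as a set of cue words mapped to a domain and a genre. The Python below is a rough sketch only: the cue list, the genre labels, and the pairing of sermons with the religious domain are assumptions for illustration, not the project's actual scheme. It returns a list because, as noted in the talk, a text can belong to more than one genre.

```python
import re

# Hypothetical cue words mapped to (domain, genre) labels.
# The real categorisation scheme and its evidence are richer than this.
TITLE_CUES = [
    (re.compile(r"\bsermons?\b", re.IGNORECASE), ("religious", "sermon")),
    (re.compile(r"\b(education|instruction)\b", re.IGNORECASE),
     ("instructional", "instructional text")),
]


def guess_genres(title):
    """Return every (domain, genre) pair whose cue word appears in the title."""
    return [label for pattern, label in TITLE_CUES if pattern.search(title)]


print(guess_genres("The Sermon on blah, blah"))
print(guess_genres("For the Education of Huguenot French Refugees"))
print(guess_genres("A title with no recognised cue"))
```

A title with no recognised cue yields an empty list, which in practice would mean turning to other evidence such as the dedication or title page.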

The problem with using the whole of EEBO for Shakespeare-related studies is that it is too broad. For example, for most purposes, we don’t need language from the late fifteenth century.

As part of the Encyclopedia of Shakespeare’s Language Project, we set about creating a specially tailored subset of EEBO-TCP enhanced for the study of Shakespeare. In designing this corpus, tricky decisions need to be made about what to include.

For example, what time period should it cover? From Shakespeare’s birth to his death? But then he wasn’t even producing plays in his early years. One of the particular enhancements we made to this corpus was to label every text as belonging to a particular genre. In fact, it was this that enabled us to diagnose the word “bastard” as a rather informational, technical term in early modern English. We could see the genres in which it tended to appear. Designing a genre classification system and actually applying it to a huge number of texts in a corpus presents its own challenges, of course.
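Once every text carries a label, seeing where a word "tends to appear" comes down to comparing its relative frequency across the labelled groups. The Python below is a minimal sketch of that idea, assuming texts are already tokenised and tagged with a domain. The function name and the toy data are hypothetical, and the sketch ignores early modern spelling variation, which any real study would need to handle.

```python
from collections import Counter


def per_million_by_domain(texts, word):
    """Return {domain: occurrences of `word` per million tokens}.

    `texts` is an iterable of (domain, list_of_tokens) pairs.
    """
    hits = Counter()
    totals = Counter()
    target = word.casefold()
    for domain, tokens in texts:
        totals[domain] += len(tokens)
        hits[domain] += sum(1 for t in tokens if t.casefold() == target)
    # Normalise by domain size, since the domains are not equal in size.
    return {d: hits[d] * 1_000_000 / totals[d] for d in totals if totals[d]}


# Toy data, for illustration only.
toy_texts = [
    ("informational", ["the", "bastard", "line", "of"]),
    ("literary", ["a", "sweet", "song", "sung"]),
]
print(per_million_by_domain(toy_texts, "bastard"))
```

Normalising per million words matters here because the domains differ greatly in size, so raw counts would mostly reflect how much text each domain contains.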

How do you think the notion of genre can be used to shed light on Shakespeare’s language? Put your thoughts in the comments.

This article is from the free online

Shakespeare's Language: Revealing Meanings and Exploring Myths

Created by
FutureLearn - Learning For Life
