Big data: EEBO and EEBO-TCP

Watch Jonathan Culpeper elaborate on the dataset, or corpus, of early modern printed works, namely Early English Books Online.
In the second of our series of talks on big data approaches we focus on a massive, diverse data set, or corpus, of early modern printed works, namely Early English Books Online, which is often referred to by its acronym EEBO. I’ll describe its history, some of its characteristics, and also recent developments. EEBO has its roots in the English short-title catalogues. These catalogues are essentially large bibliographies, not of all printed books but of very many. They include information about the title, various editions and publications, along with dates, the format of the book, and so on.
Thirty years ago I was using these catalogues to understand if I was looking at the earliest edition of a book or to get a clue about where a copy might be located. The catalogue is now available online. Try thinking of an author and exploring it sometime. One of the key short-title catalogues was Pollard and Redgrave’s A Short-Title Catalogue of Books Printed in England, Scotland, and Ireland, et cetera, et cetera, 1475 to 1640, which commenced in 1927 and had updates over decades. This catalogue is key because it provided a list of works for the EEBO compilers to target for inclusion. It is also key because it covers the years of Shakespeare’s life.
The first significant step towards creating EEBO was an initiative led by Eugene Power from 1938 onwards to microfilm the books and materials listed in the short-title catalogues. This initiative, which resulted in the microfilm collection Early English Books, was driven partly by concern for these sources, given that World War Two was about to commence, and partly by commercial interests. Power founded University Microfilms International, a private enterprise that could bring valuable materials to researchers and charge for them. Here is an image of some of the microfilming in progress. You can see the man standing on the left scanning historical works. On the right, a man is sitting in front of a microfilm reader. Those microfilm readers were commonplace in libraries, especially university libraries.
Readers could load in a microfilm for a book they were interested in and then scroll through it. The penultimate major step in the journey to EEBO was the digitization of the Early English Books microfilms. Digital facsimile images were created by scanning the microfilms. At this point University Microfilms International changed their name to ProQuest. They launched Early English Books Online in 1998. The final step was a collaboration between EEBO and the Text Creation Partnership, EEBO-TCP. Remember that I said that the digitization of Early English Books involved images, photographs of the pages. You can’t get a computer, well, not easily, to search a photograph for text.
So the Text Creation Partnership, TCP, was formed: a not-for-profit international collaboration among more than 150 libraries and ProQuest. They set about the task of producing fully searchable electronic texts from the EEBO images. These texts would be made freely available in batches. The home of EEBO can be seen here. To find it, just do a Google search for EEBO. If you follow the search link on this page, it will take you to this one.
Here you can fill in the boxes to construct what you want to search for: words, periods, and so on. You’re allowed access to both the digital images and the digital text, but there are two downsides. One is that you or your library has to pay for access. The other is that it is not designed for exactly what linguists would want. I have a solution to this, which I will outline in the next talk. Let’s consider some of the characteristics of EEBO and then EEBO-TCP.
Its material covers the years 1473 to 1700. It contains approximately 132,600 titles; that’s more than 17 million pages. The source material was taken from 220 libraries. As you can see, the dimensions are extraordinary. And when it comes to identifying language patterns, size matters. So this is all good news. One thing that has intrigued me is how many words are in EEBO. For reasons that I will discuss next week, counting words is very difficult. Spelling variation is one issue, but there are others too. Nevertheless, some of us here at Lancaster decided to hazard a guesstimate. We would say that it is at least 1.2 billion words in size.
For comparison, you might wish to consider that Shakespeare’s complete works amount to about a million words. What is in it? Well, they say that the books are drawn from over 300 genres or forms, for example periodicals, sermons, poetry, prayer books, legislation, humour, fiction, dictionaries, biography, and drama. However, not much can be said about this, as no coherent approach was taken to genre. We’ll talk more about genre in the next talk. The EEBO-TCP project, not surprisingly given the hard labour involved in transcribing all those digital images, has produced rather fewer texts, though there is still a huge number. In phase I, 25,000 texts were released in 2015. They can be freely searched via the web link you see.
Or you can even download the texts and do what you wish with them. In phase II, a further 44,000 titles were released in 2020. In this video talk I have described the history of Early English Books Online, EEBO, from the short-title catalogue to digital text. I described some of the characteristics of EEBO and EEBO-TCP, notably the size, contents, and availability. In the next talk, I will explain how a subsection of EEBO-TCP has been tailored for comparative study with Shakespeare.

