Part 5: Building your own corpus – tips

A few tips are provided for building your own corpora.
So all these examples were used to demonstrate different aspects of the process of corpus building. We looked at design, development, and corpus annotation. We would now like to look at some of the general principles and draw some conclusions for building your own corpora. Here are five tips that you might like to follow if you are thinking about building your own corpus.
First, start with corpus design. Think carefully about the type of language your corpus should represent. Should it be a general corpus, or should it be a specialised corpus?
Second, keep notes. Throughout the process of corpus design and corpus development, keep notes about your decisions: what you included, what you excluded, and why. You might remember today, you might remember tomorrow, but in a week’s time or in a year’s time, these notes will be really important.
Third, in terms of the practicalities of saving the data, it is advisable to save texts as separate files so that you can look at the distribution of linguistic features in different types of texts and in different components of the corpus.
Fourth, always check the accuracy of the data, be it spoken data or written data. If it is spoken, you might like to re-listen to the recordings to make sure that the transcriptions are accurate. If it is written data, you might like to look at the type of data. If you’ve downloaded the data from the internet, for instance, you might like to make sure that you don’t include any of the HTML code or boilerplate.
And finally, select a suitable tool that will be useful for the analysis of the corpus. In the practical sessions, we will be looking at #LancsBox and how to use #LancsBox to build and analyse your own corpus.
So these were the tips for building corpora. Thank you very much for listening to this lecture.
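
The tips on saving texts as separate files and on removing HTML code and boilerplate can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of #LancsBox: the function names (`html_to_text`, `save_texts`, `count_per_file`) and the list of tags treated as boilerplate (`SKIP_TAGS`) are hypothetical choices made for this example, and real web pages usually need more careful cleaning.

```python
from html.parser import HTMLParser
from pathlib import Path
import re
import tempfile

# Tags whose contents are treated as boilerplate.
# This list is an assumption for the sketch; adjust it for each website.
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collect text content, ignoring markup and anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML page with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def save_texts(texts, corpus_dir):
    """Write each text to its own UTF-8 file, named after its key."""
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    for name, text in texts.items():
        (corpus_dir / f"{name}.txt").write_text(text, encoding="utf-8")


def count_per_file(corpus_dir, word):
    """Count whole-word, case-insensitive occurrences of `word` in each file."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return {
        path.name: len(pattern.findall(path.read_text(encoding="utf-8")))
        for path in sorted(Path(corpus_dir).glob("*.txt"))
    }


if __name__ == "__main__":
    # Two invented pages standing in for downloaded data.
    pages = {
        "news_001": "<nav>Home</nav><p>The corpus is small.</p>",
        "blog_001": "<p>Our corpus grows.</p><script>track()</script>",
    }
    with tempfile.TemporaryDirectory() as tmp:
        save_texts({name: html_to_text(html) for name, html in pages.items()}, tmp)
        print(count_per_file(tmp, "corpus"))
```

Because each text lives in its own file, the counts can be compared across text types or corpus components, which is the reason the lecture gives for keeping files separate.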

Finally, Vaclav Brezina provides five tips for corpus building.

Update your journal

After the video, don’t forget to update your journal! Keep a record of what you are learning. You will find it really helps as the course proceeds if you keep clear, structured notes of what you have learnt.

This article is from the free online course Corpus Linguistics: Method, Analysis, Interpretation


Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education