Vaclav Brezina

Corpus linguist, lead developer of #LancsBox.

Location Lancaster University


  • @JorgeDisseldorp Hi Jorge - a great observation! It is always good to critically evaluate the methods in the field. If you are interested in the equation used for computing log ratio in #LancsBox or any other equation implemented in the tool, you can simply view/edit this when you click on the statistics button at the very bottom of the window.
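For readers who want to experiment outside the tool, here is a minimal sketch of the log ratio in its commonly cited keyword form: the binary logarithm of the ratio of two relative frequencies. It is not taken from #LancsBox, so the equation shown in the statistics panel remains the one to check. The function name, the argument names and the 0.5 substitution for zero frequencies are assumptions made for illustration.

```python
import math


def log_ratio(freq_a, size_a, freq_b, size_b, zero_adjust=0.5):
    """Binary log of the ratio of relative frequencies in corpus A vs corpus B.

    Assumption: a zero frequency is replaced by `zero_adjust` so that the
    ratio stays defined; the tool's own equation may handle this differently.
    """
    if freq_a == 0:
        freq_a = zero_adjust
    if freq_b == 0:
        freq_b = zero_adjust
    rel_a = freq_a / size_a  # relative frequency in corpus A
    rel_b = freq_b / size_b  # relative frequency in corpus B
    return math.log2(rel_a / rel_b)


# Hypothetical counts: 200 hits in 1M tokens vs 50 hits in 1M tokens.
example = log_ratio(200, 1_000_000, 50, 1_000_000)
```

Each additional point on this scale corresponds to a doubling of the relative frequency in the first corpus compared with the second.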


  • Hi Kristin - great points/questions! From my perspective, human production in natural contexts, as sampled in traditional corpora, will always be a benchmark for any AI production. However, because AI is increasingly being incorporated into the different media through which we communicate (think of various autocomplete options in email clients, text...

  • I'm really glad you like this functionality, Susana - we are very proud of it because it makes corpus research easy ;)

  • I'm glad you found this useful.

  • Yes, you can do that. In fact, you can load as many corpora and wordlists as you like and keep them for later as well.

  • That's great - do let us know how #LancsBox is handling Swahili - I am very curious ;)

  • That's wonderful - do let us know if we can help with any of the details.

  • Glad to hear that!

  • That's great, Carmen - let us know your thoughts about the software.

  • Wonderful - I hope you will find the tool useful for your research.

  • Yes, these are two different terms used for the same thing.

  • Hi Robert, it seems you have installed #LancsBox into a folder where #LancsBox doesn't have full read and write privileges (e.g. Program Files). You might like to re-install #LancsBox into a different folder (Users\yourName or Desktop) or grant the current folder full privileges.

  • Yes, we will let FutureLearn know about this issue, which seems to be related to the internet connectivity in your region. In the meantime, you should be able to download the video from the link above and play it locally on your computer. I hope this helps.

  • Hi Jennifer - thanks for your question - if you simply wish to search your data there is no need for pre-processing because #LancsBox will be able to deal with this automatically. If you need to treat any parts of the transcript differently then you need to decide if these need to be separated in some way...

  • Hi Yan Li - #LancsBox offers a split-screen view that allows you to compare and contrast concordances for parallel corpora. However, in the current version #LancsBox does not include an alignment feature that would allow extracting translations automatically. I hope this helps.

  • A very warm welcome everyone - I very much hope that you will enjoy this course and learn a variety of methods, which you can apply in your own research contexts! We have a team of mentors who are ready to guide you through this process - so let us know how you are getting on!

  • Hi Evelyne, the teacher resources are password protected because they include correct answers, which should not be available to students prior to doing the tasks for themselves. So the reason is entirely pedagogical. If you are an educator, simply email us to obtain the password.

  • Many thanks for your kind words. Absolutely, you can upload your own corpus and Wizard will do the rest of the analysis for you ;)

  • Hi all - some clarification of this point: The data the lecture is based on includes the original British National Corpus 1994. This dataset, although focusing on the variety of English spoken in the UK, includes Irish speakers from both sides of the border (i.e. Northern Ireland and the Republic of Ireland). It just shows that geopolitical boundaries...

  • Thanks for your kind words, Aleksandra and welcome!

  • Welcome Adriana - the social dimension of language use is very important, as you say. We also have dedicated sessions in the course for using corpora in the classroom context.

  • Welcome to this course, Martha! I hope you'll find it helpful for your research.

  • The broken link has been fixed.

  • Yes, the corpus is intended to be freely available for research purposes and comparable with the 1994 version. Currently, a balanced subset (BNC2014 Baby+) is accessible via #LancsBox

  • @MaryEllenKerans Hi Mary - very interesting questions:
    1) BNC2014 Baby+ (5M) is a mirror corpus to the original BNC Baby (4M) with the addition of 1M words of e-language. All major written and spoken genres/registers are represented (newspapers, fiction, academic writing, informal speech and e-language) - more info:...

  • These are already available thanks to Dr. Dana Gablasova

  • This issue has now been fixed and a fully working v. 5.1.2 is available for Mac.

  • @LawrenceLam Hi Laurence, there seems to be an issue with the tagger file in version 5.1.2 on Mac, which we are trying to sort out asap. V. 5.1.1, which is available from the website, should be fine.

  • You need to change the Unit to lemma, search again and then switch the view option in the top right corner.

  • Absolutely - You can load files in any format (txt, docx, pdf etc)

  • :)

  • Hi Sean - Which operating system are you using? Please make sure that you are installing #LancsBox in a location where you have read and write privileges such as the users folder on Windows.

  • You are welcome, Elisabeth. I hope these will be useful in your research.

  • A warm welcome to all who are joining us at this stage - with this course it is never too late to join. You can also invite your friends who might be interested.

    As you'll see in the discussions, on the corpus MOOC it is really true that the more the merrier!

  • @AmirHosseinMojiriForoushani Thank you very much and welcome to the course!

  • Thanks for your kind words, Gail. I'm glad you found the lecture useful.

  • Hi Antonio - that's great. Here's a link to Lancaster Stats Tools online, where you can explore the topic further:

  • New version 5.1.2 (just released) fixes the issue with CQL e.g. [word="visuali[sz]e"], which now works

  • Hi Saman, as for all university programmes in the UK, there is a language proficiency requirement for this programme. This is to ensure that the students can benefit from the modules and are able to write the dissertation successfully. There is still plenty of time to take one of the tests (IELTS academic, Trinity ISE etc.).

  • Hi Abbas - yes we do support right-to-left languages - please see page 5 of FAQ for more details

    Also, we offer the users full flexibility to localise #LancsBox for their own language

    I hope this helps.

  • This session on 31 October might be of interest to anyone considering the programmes:

  • In this situation, the use of the chi-squared test is not entirely appropriate (due to a violation of one of the basic assumptions of the test). I know that the chi-squared test has been used for collocations but I would recommend trying a different association measure such as log Dice.
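As a rough illustration of the suggested alternative, log Dice can be computed from three frequencies only: the frequency of the node, the frequency of the collocate, and their co-occurrence frequency. The sketch below uses the standard formula, 14 + log2(2 * f_xy / (f_x + f_y)); the function name, the argument names and the counts are hypothetical, and how co-occurrences are counted (for example, the collocation window) is left to the tool.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Log Dice association score: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of node and collocate
    f_x:  frequency of the node
    f_y:  frequency of the collocate
    """
    if f_xy <= 0:
        raise ValueError("log Dice is undefined without any co-occurrences")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: the two words co-occur 50 times; each occurs 100 times.
score = log_dice(50, 100, 100)
```

The theoretical maximum is 14, reached when the two words occur only together, and corpus size does not enter the calculation.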

  • Hi Fiona, this depends on whether your security settings allow you to install a new app. You don't need to downgrade them or switch the firewall off. An alternative would be to install #LancsBox on a virtual machine. I hope this helps.

  • A great point, Alison! Indeed, many corpora consist of text samples (parts of texts) rather than whole texts. These usually tend to be balanced for the beginnings, middles and ends of texts and, as you say, this has both theoretical and practical implications.

  • Welcome, Ana - Indeed, we can learn a lot from each other.

  • Many thanks for your kind words, Andrew: welcome to the course!

  • Thanks for joining the course, Adriana - I hope you will enjoy it!

  • Many thanks for your kind words, Halyna! Indeed, being exposed to a variety of practical examples of different data sets and the statistical techniques to analyse them is the best way to gain experience and confidence in this area.

  • Hi Halyna - You can use the application form for PgCert and indicate that you want to take only a specific module. will be happy to assist you in this process.

  • Hi Tassos - you can indeed take the individual modules separately for credit.

  • @MonikaSau Hi Monika, I think the problem is connected with the incorrect tagger file being activated. To fix this, go to the LancsBox folder resources/tagger/bin, delete the tree-tagger file and then delete the suffix in tree-tagger.lin.

  • :)

  • Hi Beatriz - Yes you can - you need to define these using specific words that define thesis statements.

  • Hi Andrew - this is a great example of applying CL in the classroom. The challenge is always to come up with tasks that capture students' imagination and you have come up with a very creative way to show how adjectives are ordered. Well done!

  • Thanks, Brian! We are here to help.

  • Hi Brian, you can access past searches by pressing down arrow on your keyboard. I hope this helps.

  • Hi Steve, you need to use the correct character (pipe) for the search to work, i.e. /research|study/

  • Hi Mary, you can compare collocation graphs in different corpora by splitting the window, but that will show a graph based on corpus 1 in the top panel and a graph based on corpus 2 in the bottom panel. If you want to amplify the evidence and base a single graph on multiple corpora, you simply need to load them as one combined corpus. I hope this helps.

  • A great reply - a few more details.
    1. JJ.* also includes comparatives (e.g. 'better') and superlatives (e.g. 'best')
    2. [word="visuali[sz]e"] is currently broken because the #LancsBox autocorrect function kicks in. We'll look to fix this in the next release.
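To see what these patterns are meant to match, they can be tried out with Python's `re` module as a stand-in. This is only a sketch: it does not use the #LancsBox search engine, whole-token matching is an assumption, and the token list and the Penn Treebank-style tags (JJ, JJR, JJS, NN) are hypothetical examples.

```python
import re

# Hypothetical tokens and part-of-speech tags used only for illustration.
tokens = ["visualise", "visualize", "visualised", "research", "study", "studies"]
tags = ["JJ", "JJR", "JJS", "NN"]

spelling = re.compile(r"visuali[sz]e")     # character class: s or z
alternation = re.compile(r"research|study")  # pipe: either word
adjective = re.compile(r"JJ.*")            # JJ followed by anything

# fullmatch requires the whole token to match (an assumption about the tool).
spelling_hits = [t for t in tokens if spelling.fullmatch(t)]
alternation_hits = [t for t in tokens if alternation.fullmatch(t)]
adjective_hits = [t for t in tags if adjective.fullmatch(t)]

print(spelling_hits)     # ['visualise', 'visualize']
print(alternation_hits)  # ['research', 'study']
print(adjective_hits)    # ['JJ', 'JJR', 'JJS']
```

The last line shows why JJ.* also picks up comparatives and superlatives: their tags start with the same two letters.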

  • Also, the installation instructions (pdf above) show in detail how to adjust the security settings on Mac.

  • Welcome to the course, Adi!

  • A very warm welcome, Elisabeth!

  • Hi Alberta - welcome and let us know in the discussions if #LancsBox works for the purposes of your research.

  • Hi Dragica - welcome to the course. I hope you'll find it useful for your research. Do let us know in the discussions ;)

  • In the new version 5.0 and above, all corpora are displayed in the same window. Restricted-access corpora have a small icon of a padlock next to them. For these corpora (e.g. BNC2014-Baby) the text feature is not available due to copyright.

  • @RobinGill It would be best to check residential requirements with

  • Great example of the application of 95% CIs - looking forward to reading your study.