Where did the BTK data come from?

Where did the BTK viewer plots and analyses come from?

The BTK web viewer is only a viewer. Behind the scenes, the BTK pipeline retrieves each genome assembly and its associated read sets from the public genomic databases at the National Center for Biotechnology Information (NCBI) and the European Nucleotide Archive (ENA), both of which are members of the International Nucleotide Sequence Database Collaboration (INSDC).

Only genome assemblies that are publicly available at https://www.ncbi.nlm.nih.gov/genome are run through the BTK pipeline. There might be other genome assemblies available on other data sharing sites such as zenodo.org or on websites belonging to individual laboratories, but these are not considered “public” as they are not registered with the INSDC, and so they might have restrictions on their use. If you want to use the BTK pipeline on your own genome assembly, or on a genome assembly that is not in the INSDC, then we will show you how to do that at the end of Week 3 of this course.

The BTK pipeline performs the following steps, so you as the BTK user don’t have to run them yourself when you want to look at a public genome assembly.

  1. Retrieves read datasets linked to the genome assembly and maps the reads back to the assembly to calculate the sequencing coverage (depth) of each contig in the assembly. Some genome assemblies don’t have any linked read datasets, in which case this step is skipped.
  2. Finds simple repetitive regions in the genome using a program called windowmasker.
  3. Calculates some basic statistics (like GC content, number of unknown bases) for the whole genome.
  4. Runs BUSCO, a software package that assesses the completeness of the genome assembly by looking for known single-copy genes.
  5. Searches for the best amino-acid matches of the genes that were found by BUSCO in the UniProt reference proteomes database, which stores high-quality protein-coding genes from over 1 million species. These results have the prefix buscogenes_ in the Filters menu.
  6. Searches all assembly contigs against the UniProt reference proteomes using translated nucleotide-to-protein searches (blastx). Contigs over 1 Mb are subsampled by retaining only the most BUSCO-dense 100 kb region from each chunk. Any assembly contigs with no matches in this database are then searched at the nucleotide level against the NCBI nt database using blastn. These results have the prefix buscoregions_ in the Filters menu, and are more sensitive (i.e. they have more hits) but less specific than the buscogenes_ hits from the previous step.
  7. Calculates the same kinds of numeric features (GC content, number of unknown bases, sequencing depth and repeat content) across multiple windows along the genome assembly.
  8. Collates all these results and analyses together into a single directory that we call the “BlobDir”, and makes it available on the BTK viewer website.
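
To make steps 3 and 7 more concrete, here is a minimal Python sketch of how GC content and the number of unknown bases (N) could be calculated for a whole sequence and then across fixed-size windows. This is an illustration only and is not the BTK pipeline's own code: the function names `base_stats` and `windowed_stats`, the window size, and the choice to compute GC content over A/C/G/T bases only are all assumptions made for this example.

```python
# Illustrative sketch only; not the BTK pipeline's actual code.

def base_stats(seq):
    """Return (gc_fraction, n_count) for one sequence.

    Assumption: GC fraction is computed over A/C/G/T bases only,
    so runs of unknown bases (N) do not dilute it.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    n_count = seq.count("N")
    called = gc + at
    gc_fraction = gc / called if called else 0.0
    return gc_fraction, n_count


def windowed_stats(seq, window=100_000):
    """Yield (start, end, gc_fraction, n_count) for consecutive windows.

    The default window size is a hypothetical value for illustration.
    """
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        gc_fraction, n_count = base_stats(seq[start:end])
        yield start, end, gc_fraction, n_count


# Toy 44-base contig: a GC-rich half, a short gap of Ns, an AT-rich half.
contig = "GGCC" * 5 + "NNNN" + "ATAT" * 5

whole = base_stats(contig)
windows = list(windowed_stats(contig, window=22))

print(whole)
for row in windows:
    print(row)
```

On the toy contig, the whole-sequence GC content is 50%, but the two windows show that one half is entirely GC and the other entirely AT. This is why per-window values can reveal variation that a single whole-contig number hides.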

Running the BTK pipeline can take several hours, and sometimes several days for very large datasets. As more and more genomes are sequenced and made publicly available, there will always be a few genome assemblies that don’t yet have BTK plots while the pipeline catches up. In the last week of this course, you will learn how to run the BTK pipeline and make plots for your own genome assemblies.

© Wellcome Connecting Science
This article is from the free online course Eukaryotic Genome Assembly: How to Use BlobToolKit for Quality Assessment.
