Reading a BTK plot

This article introduces the concepts you need to interpret a BlobToolKit (BTK) plot.

Let’s look at the parts of a blob plot

We’ll continue to use the Didymella arachidicola example.

1. Each point is a sequence in the genome

Each circle is a single sequence (a contig or scaffold) in the genome assembly. For our purposes, a genome assembly is made up of one or more sequences. A very high-quality long-read assembly will contain only a few chromosome-length sequences. A low-quality short-read assembly, however, is likely to have thousands or even hundreds of thousands of contigs.

2. Length

In BTK’s default settings, the size of a circle is proportional to the length of the contig. This means we can see at a glance what kind of assembly we have: does it have many short contigs (small circles) or is it highly contiguous with just a few large circles?

3. GC content

We can also quickly visualise the GC content of the assembled contigs. The X axis position of each circle represents the GC content of that contig. We get the GC content by counting all the Gs and Cs and dividing by the total length of the contig. For example, the tiny 10-nucleotide contig GCATGGCCAT contains three Gs and three Cs, so its GC content is 6/10 = 0.6, or 60%. In this plot the X axis runs from 0.3 (30% of the bases are Gs or Cs) to 0.7 (70% of the bases are Gs or Cs).
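The GC calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not BTK's own code; the function name `gc_content` is hypothetical, and the choice to count any ambiguous bases (such as N) towards the contig length is an assumption.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C.

    Assumption: every character counts towards the length, including
    ambiguous bases such as N. BTK may treat these differently.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    seq = sequence.upper()  # accept lower-case (soft-masked) bases too
    gc_bases = seq.count("G") + seq.count("C")
    return gc_bases / len(seq)


# The 10-nucleotide example from the text: 3 Gs + 3 Cs out of 10 bases.
print(gc_content("GCATGGCCAT"))  # 0.6
```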

4. Sequencing Coverage or Depth

The plot also shows the depth of sequencing that supports each contig. The Y axis position of each circle represents the sequencing depth (coverage) of that contig. The Y axis is plotted on a logarithmic scale because coverage can range over several orders of magnitude. To calculate sequencing depth, BTK needs both the final assembly and the data used to generate it. BTK takes the raw sequencing reads for this genome, maps them back to the genome assembly, and calculates the average coverage of each assembled contig. In the example figure below, the black lines represent the reads mapping back to the assembled green contig, giving an average sequencing coverage depth of just under 5.

Figure: reads (black lines) mapped back to an assembled contig (green), giving an average sequencing coverage depth of just under 5.
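To make the idea of average coverage concrete, here is a minimal sketch that assumes each mapped read is given as a half-open (start, end) interval on the contig. The function name `mean_coverage` and the read coordinates are made up for illustration; in practice BTK works from read alignment files rather than hand-written intervals, so treat this as the underlying arithmetic only.

```python
import math


def mean_coverage(contig_length: int, alignments: list[tuple[int, int]]) -> float:
    """Average depth: total aligned bases divided by contig length.

    `alignments` holds 0-based, half-open (start, end) intervals.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    aligned_bases = 0
    for start, end in alignments:
        # Clip each read to the contig so overhanging bases are not counted.
        start = max(start, 0)
        end = min(end, contig_length)
        if end > start:
            aligned_bases += end - start
    return aligned_bases / contig_length


# Hypothetical reads mapped to a 100 bp contig (450 aligned bases in total).
reads = [(0, 60), (10, 70), (20, 80), (30, 90), (40, 100), (0, 100), (50, 100)]
depth = mean_coverage(100, reads)
print(depth)  # 4.5

# The Y axis is logarithmic, so the circle is placed according to log10(depth).
y_position = math.log10(depth)
```

On a logarithmic axis, a contig at 10x coverage and one at 100x coverage are the same distance apart as contigs at 100x and 1000x, which is why a wide range of depths fits on one plot.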

You can look at the coordinates of the centre of each blob and read off the coverage and GC for that contig, as shown in the figure below:

5. Best matches

Lastly, the colours represent taxonomic assignments made using BLAST, an algorithm which compares sequences against databases of known biological sequences. By default, the taxonomic assignments shown in BTK are at the level of phylum, rather than species, genus or family (but you will learn how to change that later in the course).

The legend at the top right shows you which colour represents which taxon. For each taxon it also gives the number of sequences assigned to it (count), the total span of all the contigs of that taxon (sum_length), and the N50 of the contigs belonging to that taxonomic group. Here sum_length is the total number of bases in the assembly assigned to that taxon, and N50 is a measure of contiguity: it is the contig length at which contigs of that length or longer make up at least half of the total span.
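The three legend values can be reproduced with a short sketch. The contig lengths and their taxon labels below are invented for illustration, and the function names `n50` and `summarise_by_taxon` are hypothetical; this shows how count, sum_length and N50 relate to each other, not how BTK computes them internally.

```python
from collections import defaultdict


def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


def summarise_by_taxon(contigs: list[tuple[str, int]]) -> dict[str, dict[str, int]]:
    """Group (taxon, length) pairs and report count, sum_length and N50."""
    groups: dict[str, list[int]] = defaultdict(list)
    for taxon, length in contigs:
        groups[taxon].append(length)
    return {
        taxon: {"count": len(ls), "sum_length": sum(ls), "n50": n50(ls)}
        for taxon, ls in groups.items()
    }


# Made-up contigs: (assigned taxon, length in bases).
contigs = [
    ("Ascomycota", 500), ("Ascomycota", 300), ("Ascomycota", 200),
    ("Proteobacteria", 50),
    ("no-hit", 40), ("no-hit", 10),
]
summary = summarise_by_taxon(contigs)
for taxon, stats in summary.items():
    print(taxon, stats)
```

In this toy data the largest group has a span of 1000 bases, and its single longest contig (500 bases) already covers half of that span, so its N50 is 500.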

© Wellcome Connecting Science
This article is from the free online

Eukaryotic Genome Assembly: How to Use BlobToolKit for Quality Assessment
