
Advanced assembly cleaning

After removing the most obvious contaminants, Slimane made several more changes to make a complete genome assembly ready for publication.

The process outlined in the previous step improves the plot considerably, but it also removes some very high coverage and very low coverage contigs labelled Ascomycota (circled in yellow in the figure below).

Partially cleaned BTK plot for B cinerea with high and low coverage Ascomycota contigs highlighted

It is possible that the high coverage, light blue Ascomycota contig near the top of the plot (coverage ~7000, GC near 0.45) is a repeat region in the fungus that is present at very high coverage. Alternatively, it might be the mitochondrion of the fungus, as mitochondrial DNA is usually present in several hundred times more copies than the nuclear DNA in the sample. Similarly, the low coverage, light blue Ascomycota contigs at the bottom right (coverage <10, GC between 0.5 and 0.55) might be Botrytis cinerea, but they might also be a different fungus (i.e., a contaminant).

Therefore, before you try to remove them, you need to examine the contigs with coverage >2000 and the contigs with coverage <12 in more detail to see what they are. To do that, first set the coverage filter B_cinera_112_1.BCREADS_cov as before, with a minimum of 12 and a maximum of 2000.

Now, you can invert the coverage filter by clicking on the invert filter icon next to the coverage tab. Your filter and plot should now look like this:

BTK plot for B cinerea after inverting the coverage filter

Figure in BTK viewer

By inverting the coverage filter, you have selected only those contigs with very low and very high coverage.
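The logic of the inverted filter can be sketched outside the viewer. The following is a minimal Python illustration, not part of BlobToolKit: the contig names, coverage values and phylum labels are invented for the example, and only the thresholds (12 and 2000) come from the text.

```python
# Hypothetical contig records: (contig id, coverage, phylum of best hit).
# These values are made up for illustration; they are not from the dataset.
contigs = [
    ("contig_1", 7000.0, "Ascomycota"),
    ("contig_2", 150.0, "Ascomycota"),
    ("contig_3", 8.0, "Ascomycota"),
    ("contig_4", 300.0, "no-hit"),
    ("contig_5", 5.0, "Proteobacteria"),
]

MIN_COV, MAX_COV = 12, 2000  # thresholds used for the coverage filter


def in_range(coverage, low=MIN_COV, high=MAX_COV):
    """Return True if the coverage lies inside the filter range."""
    return low <= coverage <= high


# Normal filter: keep contigs inside the range.
kept = [c for c in contigs if in_range(c[1])]

# Inverted filter: keep only contigs outside the range,
# i.e. the very low and very high coverage contigs.
inverted = [c for c in contigs if not in_range(c[1])]

print([c[0] for c in inverted])
```

Inverting the filter simply negates the range test, which is why both the very high and the very low coverage contigs appear together in the inverted view.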

To examine these contigs in more detail, let’s switch on ONLY the Ascomycota contigs in this view. To do that, go to the bestsumorder_phylum category filter, and click on each phylum except Ascomycota to disable it (or, to be quicker, click on Ascomycota to disable it, and then click on the invert icon within that filter so that Ascomycota is enabled and all other phyla are disabled).

Blob plot for B cinerea after filtering to show only the high and low coverage Ascomycota contigs

Figure in BTK Viewer

You can now see that there is only one very high coverage Ascomycota contig (likely to be the mitochondrion), and several low coverage Ascomycota contigs in a separate blob.

To see whether the lower right blob is a separate organism or the same organism as Botrytis cinerea, you can check the BLAST hits at the family taxonomic level rather than the phylum taxonomic level by selecting the bestsumorder_family filter and clicking the colour icon next to it:

Blob plot for B cinerea showing the Ascomycota hits coloured according to family instead of phylum

Figure in BTK viewer

The lower right blob is now mostly purple (other) and dark blue (Nectriaceae), whereas the upper lone contig is Sclerotiniaceae (the family that Botrytis cinerea belongs to). This suggests that the low coverage contigs come from a different fungus, while the high coverage contig belongs to Botrytis cinerea. Tip: To see the scientific names of the class, order or family of any species, you can look up the Wikipedia article for that species. For example, in this case, https://en.wikipedia.org/wiki/Botrytis_cinerea shows that the family is Sclerotiniaceae.
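The same family-level comparison can be expressed as a simple tally. This is only a sketch: the rows below are invented, and the only facts taken from the text are the family names and that Botrytis cinerea belongs to Sclerotiniaceae.

```python
from collections import Counter

# Hypothetical rows for the inverted, Ascomycota-only view:
# (contig id, coverage, family of best hit). Values are illustrative only.
ascomycota_outliers = [
    ("contig_1", 7000.0, "Sclerotiniaceae"),
    ("contig_3", 8.0, "Nectriaceae"),
    ("contig_6", 6.5, "Nectriaceae"),
    ("contig_7", 9.0, "other"),
]

TARGET_FAMILY = "Sclerotiniaceae"  # family of Botrytis cinerea

# Count how many outlier contigs fall into each family.
family_counts = Counter(family for _, _, family in ascomycota_outliers)

# Contigs whose best hit is in the target family are candidates to keep.
likely_target = [cid for cid, _, family in ascomycota_outliers
                 if family == TARGET_FAMILY]

print(family_counts.most_common())
print(likely_target)
```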

Making a final list of clean contigs

All contigs with coverage <12 or >2000 should be removed, except the one Ascomycota contig at coverage ~7000, which should be kept.

To do this, start with the original BTK plot. First, filter out the low coverage (<12X) sequences (as before). The only cobionts that remain now are Arthropoda and Proteobacteria. Under bestsumorder_phylum, these two categories can be de-selected so that they disappear from the plot. The final table then contains only high coverage Ascomycota and no-hit sequences.
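As a cross-check on the viewer, the selection rule can be written down explicitly. The sketch below is one reading of the rule stated above (drop coverage <12, drop Arthropoda and Proteobacteria, and keep a contig above 2000 only if it is Ascomycota); the contig records and the function name `is_clean` are hypothetical.

```python
MIN_COV, MAX_COV = 12, 2000
COBIONT_PHYLA = {"Arthropoda", "Proteobacteria"}


def is_clean(coverage, phylum):
    """Apply the cleaning rule described in the text (one possible reading)."""
    if phylum in COBIONT_PHYLA:
        return False  # cobiont phyla are de-selected
    if coverage < MIN_COV:
        return False  # low coverage contigs are removed
    if coverage > MAX_COV:
        # Only the very high coverage Ascomycota contig is kept.
        return phylum == "Ascomycota"
    return True


# Hypothetical contig records: (contig id, coverage, phylum).
contigs = [
    ("contig_1", 7000.0, "Ascomycota"),
    ("contig_2", 150.0, "Ascomycota"),
    ("contig_3", 8.0, "Ascomycota"),
    ("contig_4", 300.0, "no-hit"),
    ("contig_5", 5.0, "Proteobacteria"),
    ("contig_8", 40.0, "Arthropoda"),
]

clean_ids = [cid for cid, cov, phylum in contigs if is_clean(cov, phylum)]
print(clean_ids)
```

Writing the rule out like this makes it easy to compare against the list produced in the viewer, in the spirit of trying more than one route to the same set of contigs.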

Remember, there is more than one way to achieve the same goal. Trying out different ways is beneficial to see whether they give you the same final list of clean contigs.

Filtering the data or reassembling the genome

At this stage, Slimane had a list of contigs that belonged to the fungus Botrytis cinerea. He could have proceeded in two ways:

  1. Select only the Botrytis cinerea contigs from the preliminary assembly, remove the rest, and declare this as the final genome assembly. In this option, you don’t change the assembled contigs: you simply remove (or filter out) the contaminant contigs, and report the rest.
  2. Reassemble the genome using only the reads that belong to the target organism from a single blob. Sometimes this gives a better assembly, because the assembly software tries to estimate the coverage and gets a consistent value.
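Option 1 amounts to copying only the wanted records out of the assembly FASTA file. A minimal sketch in plain Python follows; the function name `filter_fasta`, the contig identifiers and the sequences are all made up for illustration, and a real workflow would read from and write to files rather than an in-memory string.

```python
import io


def filter_fasta(handle, keep_ids):
    """Yield the lines of FASTA records whose identifier is in keep_ids.

    The identifier is taken to be the first word after '>' on a header line.
    """
    keep = False
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line


# Tiny in-memory FASTA standing in for the preliminary assembly.
assembly = io.StringIO(
    ">contig_1 mito\nACGT\nGGCC\n"
    ">contig_5\nTTTT\n"
    ">contig_2\nCCGG\n"
)

clean_ids = {"contig_1", "contig_2"}  # hypothetical list of clean contigs
clean_fasta = "".join(filter_fasta(assembly, clean_ids))
print(clean_fasta)
```

The contig sequences themselves are left unchanged, which is the defining feature of Option 1.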

Slimane actually chose Option 2, and you can see his final assembly at https://blobtoolkit.genomehubs.org/view/B_cinera_112_2/dataset/B_cinera_112_2/blob#Filters. Notice that the contig identifiers are now different: because the genome was reassembled, the contigs are no longer the same as before and have very slightly different GC and coverage values.

If you would like to know more about reassembly, there are two research papers linked in the ‘See also’ section showing how Option 2, reassembling with a restricted set of reads at the right coverage, was the most effective approach.

© Wellcome Connecting Science
This article is from the free online

Eukaryotic Genome Assembly: How to Use BlobToolKit for Quality Assessment

Created by
FutureLearn - Learning For Life
