Quality control in phylogenetics

Article about quality control tools for phylogenetics and how to address interpretative challenges
© COG-Train

Even when quality control (QC) steps are followed before inferring a phylogeny, several factors still influence SARS-CoV-2 phylogenies and can become a bottleneck when interpreting them.

Some of the challenges in interpreting the SARS-CoV-2 trees with a huge number of samples are:

  • Inferring a reliable phylogeny is difficult because the number of sequences is large while the number of mutations is low.
  • Rooting the inferred phylogeny with confidence is difficult, and a root obtained by applying novel computational methods to the ingroup phylogeny may not be credible.
  • Phylogenetic inference can be numerically challenging because the large number of highly similar sequences leads to a low phylogenetic signal.
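
One rough way to see how little phylogenetic signal an alignment carries is to count its variable and parsimony-informative columns. The following is a minimal sketch in Python, not part of the original article: the function name `site_summary` and the toy alignment are hypothetical, and it assumes the sequences are already aligned to the same length. Gaps and ambiguous bases such as N are ignored.

```python
from collections import Counter


def site_summary(alignment):
    """Count variable and parsimony-informative columns in an alignment.

    `alignment` is a list of equal-length strings. Only A, C, G and T are
    counted; gaps and ambiguous bases such as N are ignored.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")

    variable = 0
    informative = 0
    for column in zip(*alignment):
        counts = Counter(b for b in (c.upper() for c in column) if b in "ACGT")
        # Variable: more than one unambiguous base observed in the column.
        if len(counts) > 1:
            variable += 1
        # Parsimony-informative: at least two bases each seen at least twice.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return variable, informative


# Hypothetical toy alignment: five sequences, ten columns.
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAT",
    "ACGAACGTAT",
    "ACGTACNTAC",
]

variable, informative = site_summary(toy_alignment)
print(f"variable sites: {variable}, parsimony-informative sites: {informative}")
```

When thousands of near-identical genomes share only a handful of informative columns, many different tree topologies explain the data almost equally well, which is why the challenges above arise.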

To overcome these shortcomings, there are some QC steps that you can perform to generate high-quality phylogenies:

1) Midpoint-root the tree before using it for further analysis.
2) Remove sequences that are highly divergent relative to the rest of their cluster.
3) Remove sequences whose divergence is substantially greater or less than expected.
4) Re-examine the methods used, or repeat the bioinformatics steps, when a divergent sequence is detected.
5) Use tools such as IQ-TREE 2 and UShER, which have been developed to handle SARS-CoV-2 genomes and to build phylogenies that take homoplasies, convergent evolution, and potential recombination events into account.
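
Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any particular tool: it uses the number of differences from a reference sequence as a stand-in for divergence (in practice you would more likely use root-to-tip distances or terminal branch lengths from the inferred tree), and the function names, the sample names, and the `max_dev` threshold are all hypothetical.

```python
from statistics import median


def hamming(a, b):
    """Count positions where both sequences have A/C/G/T and differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in "ACGT" and y in "ACGT" and x != y
    )


def flag_divergent(sequences, reference, max_dev=3.0):
    """Return names whose divergence is unusually high or low.

    A sequence is flagged when its distance to the reference differs from
    the median distance by more than `max_dev` times the median absolute
    deviation (MAD). The threshold is an arbitrary illustrative choice.
    """
    distances = {name: hamming(seq, reference) for name, seq in sequences.items()}
    centre = median(distances.values())
    mad = median(abs(d - centre) for d in distances.values())
    scale = max(mad, 1.0)  # avoid a zero scale when most sequences are identical
    return sorted(
        name for name, d in distances.items() if abs(d - centre) > max_dev * scale
    )


# Hypothetical toy data: the last sequence is shifted by one base,
# mimicking a misaligned genome.
reference = "ACGTACGTACGTACGTACGT"
samples = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": "ACGTACGTACGTACGTACGA",
    "s3": "ACGTACTTACGTACGTACGT",
    "s4": "ACCTACGTACGTACGTAAGT",
    "s5_misaligned": "CGTACGTACGTACGTACGTA",
}

print(flag_divergent(samples, reference))
```

A flagged sequence is a candidate for step 4 rather than for automatic deletion: check the assembly and alignment first, because the divergence may be an artefact.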

Illustrative image of an unrooted phylogenetic tree showing a long-branch outlier that results from poor sequence alignment

Figure 1 – An unrooted maximum-likelihood phylogeny generated from an alignment of complete and incomplete SARS-CoV-2 genomes, showing a spurious long branch where one genome was misaligned relative to the others. This is a very extreme case, but even small misassemblies or short misaligned regions can produce a terminal branch that is longer than typical. Different node colours represent different lineages.

For more information and further QC steps for inferring phylogenies, please read these articles:

Phylogenetic Analysis of SARS-CoV-2 Data Is Difficult

Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health

Do you have any experience with QC for phylogeny? What tools and methods do you use and how well do they work for you? Please use the discussion area to talk about your experiences.

This article is from the free online course Making sense of genomic data: COVID-19 web-based bioinformatics
