Quality control in phylogenetics

Article about quality control tools for phylogenetics and how to address interpretative challenges

Even when quality control (QC) steps are followed before inferring a phylogeny, several factors still influence SARS-CoV-2 phylogenies and can make them difficult to interpret.

Some of the challenges in interpreting SARS-CoV-2 trees built from very large numbers of samples are:

  • It is difficult to infer a reliable phylogeny because the number of sequences is large while the number of mutations separating them is low.
  • The inferred phylogeny is hard to root with confidence: a root placed by applying novel computational methods to the ingroup phylogeny may not be credible.
  • Phylogenetic inference can be numerically challenging because the large number of highly similar sequences leads to a low phylogenetic signal.

To overcome these challenges, you can perform the following QC steps to generate high-quality phylogenies:

1) Mid-point root the tree before taking it forward for further analysis.
2) Remove sequences that are highly divergent relative to the rest of their cluster.
3) Remove sequences whose divergence is substantially greater or less than expected.
4) When a sequence appears divergent, examine the methods used or repeat the bioinformatics steps.
5) Use tools such as IQ-TREE 2 and UShER, which have been developed or adapted to handle SARS-CoV-2 genomes and to build phylogenies that take account of homoplasies, convergent evolution, and potential recombination events.
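
Step 3 can be sketched in code. The example below is a minimal illustration, not the method used by any particular tool. It assumes that "expected" divergence means the value predicted from a genome's sampling date under a roughly constant mutation rate. It estimates that rate with a Theil-Sen fit (the median of all pairwise slopes, which tolerates a few outliers) and flags genomes whose mutation counts sit far from the fitted line. The function name, the sample names and numbers, and the `threshold` and `min_tolerance` defaults are all hypothetical.

```python
from itertools import combinations
from statistics import median


def flag_clock_outliers(samples, threshold=4.0, min_tolerance=3.0):
    """Flag genomes whose divergence is far from the date-based expectation.

    samples: list of (name, days_since_reference, mutations) tuples.
    threshold and min_tolerance are illustrative defaults, not recommendations.
    """
    # Theil-Sen slope: the median of all pairwise slopes (mutations per day).
    slopes = [
        (m2 - m1) / (d2 - d1)
        for (_, d1, m1), (_, d2, m2) in combinations(samples, 2)
        if d2 != d1
    ]
    if not slopes:
        raise ValueError("need at least two distinct sampling dates")
    rate = median(slopes)
    intercept = median(m - rate * d for _, d, m in samples)

    # Residual = observed mutations minus the value expected for that date.
    residuals = [m - (rate * d + intercept) for _, d, m in samples]
    centre = median(residuals)
    # Median absolute deviation, scaled to be comparable with a standard deviation.
    mad = median(abs(r - centre) for r in residuals)
    # Mutation counts are integers, so keep a floor of a few mutations.
    tolerance = max(threshold * 1.4826 * mad, min_tolerance)

    return [
        name
        for (name, _, _), r in zip(samples, residuals)
        if abs(r - centre) > tolerance
    ]


# Hypothetical data: (name, days since the reference genome, mutations vs reference).
samples = [
    ("genome_1", 0, 0),
    ("genome_2", 15, 1),
    ("genome_3", 30, 2),
    ("genome_4", 45, 2),
    ("genome_5", 60, 4),
    ("genome_6", 75, 5),
    ("genome_7", 90, 6),
    ("too_divergent", 40, 14),
    ("too_conserved", 80, 0),
]
flagged = flag_clock_outliers(samples)
print(flagged)  # ['too_divergent', 'too_conserved']
```

Flagged genomes are better treated as candidates for the checks in step 4 than as automatic removals, since a genuine but unusual lineage can also sit off the line.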

Illustrative image of an unrooted phylogenetic tree showing a long-branch outlier that results from poor sequence alignment

Figure 1 – An unrooted maximum-likelihood phylogeny generated from an alignment of complete and incomplete SARS-CoV-2 genomes, showing a spurious long branch where one genome was misaligned relative to the others. This is an extreme case. Small misassemblies or short misaligned regions can also lead to a terminal branch that is longer than typical. Different node colours represent different lineages.
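
A long terminal branch like the one in Figure 1 can also be looked for programmatically. The sketch below is one possible approach under stated assumptions: it reads tip branch lengths from a simple Newick string with a regular expression (quoted labels and comments are not handled), then flags tips whose terminal branch exceeds the median by more than `k` scaled median absolute deviations. The function names, the cutoff `k`, and the example tree are hypothetical.

```python
import re
from statistics import median

# Matches "label:length" only where the label follows "(" or ",", i.e. a tip.
# Internal-node labels and support values follow ")" and are skipped.
TIP_PATTERN = re.compile(r"[(,]\s*([^():,;\s]+):([0-9.eE+-]+)")


def terminal_branch_lengths(newick):
    """Return {tip_name: terminal_branch_length} for a simple Newick string."""
    return {name: float(length) for name, length in TIP_PATTERN.findall(newick)}


def flag_long_tips(newick, k=5.0):
    """Return the sorted names of tips with unusually long terminal branches."""
    lengths = terminal_branch_lengths(newick)
    if not lengths:
        raise ValueError("no tips with branch lengths found")
    centre = median(lengths.values())
    # Median absolute deviation, scaled to be comparable with a standard deviation.
    mad = median(abs(v - centre) for v in lengths.values())
    cutoff = centre + k * 1.4826 * mad
    return sorted(name for name, v in lengths.items() if v > cutoff)


# Hypothetical tree in which tip E sits on a spuriously long branch.
tree = (
    "((A:0.0001,B:0.0002):0.0001,"
    "(C:0.0001,(D:0.0003,E:0.0450):0.0001):0.0002,F:0.0002);"
)
long_tips = flag_long_tips(tree)
print(long_tips)  # ['E']
```

If more than half of the terminal branches have the same length, the deviation is zero and every longer tip is flagged, so the output is a list to inspect, not a list to delete.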

For more information and further QC steps for inferring phylogenies, please read these articles:

Phylogenetic Analysis of SARS-CoV-2 Data Is Difficult

Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health

Do you have any experience with QC for phylogeny? What tools and methods do you use and how well do they work for you? Please use the discussion area to talk about your experiences.

© COG-Train
This article is from the free online

Making sense of genomic data: COVID-19 web-based bioinformatics
