Skip main navigation

Introduction to phylogenetics: tools and models

Tutorial describing how to use bioinformatic tools to build and read SARS-CoV2 phylogenetic trees
Decorative illustration of the phylogenetic tree of SARS-CoV-2 variants
© COG-Train

Within the first year of the coronavirus disease 2019 (COVID-19) pandemic, almost 400,000 whole or partial severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes were generated and shared openly, marking an unprecedented worldwide response in pathogen genome sequencing.

Phylogenetic tools have proven increasingly important in the public health management of a variety of viral epidemics, but the COVID-19 pandemic is the first global health emergency in which large-scale, real-time genomic sequencing and analysis have guided public health decisions. SARS-CoV-2 has been evolving at an estimated nucleotide substitution rate ranging between 10e−3 and 10e−4 substitutions per site and per year. The global epidemiological and virological situation changed constantly during the pandemic, and genomic sequence analysis was crucial in tracking the changing scenario (Figure 1).

Decorative illustration describing potential models for transmission studies where phylogenetics plays an important role. a) Phylogenetic approaches estimate the rate of international lineage introductions and distinguish introductions from community transmission. b) Genome sequences and phylogenetics support outbreak analyses by identifying or refuting links between local cases; this can lead to the identification of outbreak sources and drivers or assessment of nosocomial transmission. c) Phylodynamic techniques using epidemiological demographic models, such as the susceptible–exposed–infected–recovered (SEIR) model, allow us to compare transmission rates between lineages bearing different key genotypes (for example, variants of concern (VOCs) and pre-existing lineages). d) Relative timing of variant and lineage emergence from the global (or regional) phylogeny, and scattering of case genomes across clades can distinguish persistent from repeat infections in some scenarios

Click here to enlarge the image

Figure 1- Phylodynamic approaches to the investigation of SARS-CoV-2 transmission. Source: Nature Reviews Genetics

Phylogenetic tools like Pangolin and Nextclade can unlock information from sampled genomes combined with epidemiological data like:

  • Quantifying international virus spread
  • Identifying outbreaks and transmission chains in specific settings
  • Estimating growth rates and reproduction numbers
  • Identifying and tracking mutations of interest
  • Discovering and analysing variants of concern
  • Investigating intra-host virus evolution

Different approaches to building the phylogenetic tree:

Parsimony approach – Maximum Parsimony is a character-based approach that infers a phylogenetic tree by minimising the total number of evolutionary steps required to explain a given set of data assigned on the leaves. Tools that use parsimony include IQTREE though it should be noted this is a tool to generate ML trees and USHER.

Maximum likelihood approach- Maximum Likelihood (ML) is a method for the inference of phylogeny. It evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesised history would give rise to the observed data set. The supposition is that history with a higher probability of reaching the observed state is preferred to history with a lower probability. The method searches for the tree with the highest probability or likelihood. ML is implemented on FasTree, RAxML.

How to read a phylogenetic tree:

There are 2 parts to the phylogenetic tree. The nodes and the branches (Figure 2). The nodes on the tip of the tree represent the cases having some ancestry that existed as a putative virus in an individual. And the internal nodes between the branches make up the ancestor cases that are carrying the mutations from which further diversion was possible. The branches represent the transmission chains, which are a combination of numerous transmission events taking place and is represented by the branch length.

Three illustrative phylogenetic trees in black, purple and green, respectively. Detailed explanation in the body text

Click here to enlarge the image

Figure 2 – Different graphic representations of a phylogenetic tree.

Bootstrapping: Bootstrap values in a phylogenetic tree indicate that out of 100, how many times the same branch is observed when repeating the generation of a phylogenetic tree on a resampled set of data. If we get this observation 100 times out of 100, then this supports our result.

© COG-Train
This article is from the free online

Making sense of genomic data: COVID-19 web-based bioinformatics

Created by
FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education