Variant Annotations: predicted molecular consequence

Article illustrating tools for molecular consequence prediction

Introduction

The first stage in interpreting genomic variation is to predict a variant’s impact on functional elements in the genome. For your transcript of interest, for example, does the variant introduce a stop codon or change an amino acid?

Prediction tools identify any genes or regulatory elements that a variant lies within or overlaps, replace the base(s) seen in the reference genome with each alternative allele, and calculate whether the sequence change may be of consequence at the molecular level. Examples of predicted molecular consequences are shown in Figure 1. This information is important because particular diseases are often caused by variants with specific molecular consequences, so the predicted consequence is indicative of the disease mechanism.

Graphic representation of hypothetical gene exons (orange boxes) and introns (interconnecting lines) with predicted mutations and molecular consequences. List of hypothetical mutations and consequences: T/G = 5’ UTR variant; ATG/ATT = start loss variant; T/C = splice donor variant; T/C = splice polypyrimidine tract variant; G/C = splice acceptor variant; TCA/TAA = stop gained variant; CAG/CCG = missense variant; TT/T = frameshift variant; TGA/GGA = stop loss variant. At the bottom, two parallel purple lines represent, respectively, transcript ablation (deletion) and transcript amplification (duplication).

Figure 1. Examples of predicted molecular consequences for different variants located in different parts of a transcript. The sequence change is displayed above the transcript (e.g. T/G denotes that T is changed to G; ATG represents the start codon and TGA is the stop codon in the original sequence). The predicted molecular consequence of the sequence change (e.g. 5’UTR variant) is displayed below the transcript. The purple bars represent copy number changes that affect the whole transcript.
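
The substitute-and-compare step that prediction tools perform can be illustrated for single codons. The following is a minimal sketch, not how any particular tool is implemented: it assumes the standard genetic code, a change confined to one codon, and a hypothetical `is_start_codon` flag supplied by the caller. The function name and the simplified start-codon rule are assumptions made for illustration.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_consequence(ref_codon, alt_codon, is_start_codon=False):
    """Classify a change from ref_codon to alt_codon (illustrative only).

    is_start_codon is a hypothetical flag: the caller says whether this
    codon is the annotated start codon of the transcript.
    """
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    # Simplification: any change that no longer encodes methionine at the
    # start codon is treated as losing the start.
    if is_start_codon and alt_aa != "M":
        return "start_lost"
    if ref_aa == alt_aa:
        return "stop_retained_variant" if ref_aa == "*" else "synonymous_variant"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense_variant"


# The codon changes listed for Figure 1:
print(codon_consequence("ATG", "ATT", is_start_codon=True))
print(codon_consequence("TCA", "TAA"))
print(codon_consequence("CAG", "CCG"))
print(codon_consequence("TGA", "GGA"))
```

Real tools also have to handle strand, splicing, variants that span codon or exon boundaries, and non-coding features, none of which is covered here.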

Classification

Classification of predicted molecular consequences can be complex, so attempts have been made to create standardised descriptions. One example is the Sequence Ontology, which enables variant effects to be recorded using a tree of terms (for example, both missense variants and stop lost variants share the parent term ‘protein-altering variant’; see Figure 2). Applying such an ontology enables more powerful querying, as well as standardised interpretation, comparison and interoperability between different prediction packages.

Screenshot of the webpage https://www.ebi.ac.uk/ols4/ontologies/so/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FSO_0001583?lang=en depicting an ontology tree, i.e. a hierarchical classification where each class is subdivided into two or more classes. Details are in the legend of the image.

Figure 2. Sequence ontology tree view from the Ontology Lookup Service showing the term SO:0001583, ‘Missense_variant’, defined as ‘A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved.’
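
The querying benefit of a term hierarchy can be shown with a small example. The sketch below uses a hand-written toy fragment of child-to-parent links based on the relationship described above. It is not the real Sequence Ontology, which has intermediate terms and allows multiple parents, and the variant identifiers are hypothetical.

```python
# Toy fragment only: child term -> parent term. The real ontology is larger
# and richer; these links are simplified for illustration.
TOY_PARENTS = {
    "missense_variant": "protein_altering_variant",
    "stop_lost": "protein_altering_variant",
    "protein_altering_variant": "coding_sequence_variant",
    "synonymous_variant": "coding_sequence_variant",
}


def ancestors(term):
    """Return all terms above `term` in the toy tree, nearest first."""
    found = []
    while term in TOY_PARENTS:
        term = TOY_PARENTS[term]
        found.append(term)
    return found


def is_a(term, ancestor):
    """True if `term` is `ancestor` or sits anywhere below it."""
    return term == ancestor or ancestor in ancestors(term)


# Hypothetical annotated variants: (identifier, consequence term).
annotated = [
    ("var1", "missense_variant"),
    ("var2", "synonymous_variant"),
    ("var3", "stop_lost"),
]

# One query on the parent term retrieves every more specific child term.
protein_altering = [
    variant_id
    for variant_id, term in annotated
    if is_a(term, "protein_altering_variant")
]
print(protein_altering)
```

Querying on a parent term means the filter does not need updating when annotations use more specific child terms.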

A short variant can be predicted to cause the complete loss of the gene’s product by changing a start codon (‘start_lost’ variant), so no protein is produced. Alternatively, it can introduce a premature stop codon (‘stop_gained’ variant) that would truncate the protein; where this falls in a region subject to nonsense-mediated decay (NMD), a process which removes potentially deleterious mRNAs, the mRNA is likely to be degraded. A short variant in another location might instead change the protein produced, either by replacing one amino acid with another (‘missense_variant’) or by deleting or inserting a number of bases that is not a multiple of 3, thereby shifting the reading frame (‘frameshift_variant’). When structural variants are analysed in the clinic, the proportion of the functional element impacted should be reported along with the predicted molecular consequence, for example whether an extra copy of the transcript sequence is created (‘transcript_amplification’) or a transcription factor binding site is removed (‘TFBS_ablation’).
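
Two of the checks described above reduce to simple arithmetic: whether an insertion or deletion preserves the reading frame, and what proportion of a feature a structural variant overlaps. The following is a minimal sketch under stated assumptions: the indel lies wholly within coding sequence, coordinates are 1-based and inclusive, and the function names are hypothetical.

```python
def indel_consequence(ref_allele, alt_allele):
    """Classify a coding indel by its effect on the reading frame."""
    length_change = len(alt_allele) - len(ref_allele)
    if length_change == 0:
        raise ValueError("alleles have equal length; not an indel")
    # A length change that is not a multiple of 3 shifts the reading frame.
    if abs(length_change) % 3 != 0:
        return "frameshift_variant"
    return "inframe_insertion" if length_change > 0 else "inframe_deletion"


def fraction_overlapped(feature_start, feature_end, sv_start, sv_end):
    """Proportion of a feature covered by a structural variant.

    Coordinates are assumed to be 1-based and inclusive.
    """
    overlap = min(feature_end, sv_end) - max(feature_start, sv_start) + 1
    feature_length = feature_end - feature_start + 1
    return max(overlap, 0) / feature_length


# The TT/T change from Figure 1 removes one base.
print(indel_consequence("TT", "T"))
# A deletion covering positions 150-300 of a feature at 100-199.
print(fraction_overlapped(100, 199, 150, 300))
```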

Transcript sets

The choice of reference data has a major influence on variant annotation. It is generally recommended to use an up-to-date annotation set from an expert source, such as the Ensembl/GENCODE or NCBI RefSeq transcript sets. These groups have nominated a default transcript per gene to aid reporting. However, the default transcripts do not cover all exons, and they are not the only high-quality or highly expressed transcripts, so they should be prioritised for reporting only after the impact of a variant on all transcripts has been considered. When evaluating the predicted consequence of a variant across a set of transcripts for a gene, it is important to consider the level of evidence supporting each transcript as well as the severity of the predicted impact.
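
Weighing severity and transcript evidence across all transcripts before falling back on the default could be sketched as follows. The severity ordering, the `evidence_level` scale and the transcript identifiers are all assumptions made for illustration; they are not taken from any tool or transcript set.

```python
from dataclasses import dataclass

# Illustrative ordering only, most severe first.
SEVERITY_ORDER = [
    "stop_gained",
    "frameshift_variant",
    "start_lost",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]
SEVERITY_RANK = {term: rank for rank, term in enumerate(SEVERITY_ORDER)}


@dataclass
class TranscriptAnnotation:
    transcript_id: str   # hypothetical identifier
    consequence: str
    is_default: bool     # nominated default transcript for the gene
    evidence_level: int  # hypothetical scale: 1 = best supported


def rank_annotations(annotations):
    """Order by severity first, then by strength of transcript evidence."""
    return sorted(
        annotations,
        key=lambda a: (SEVERITY_RANK[a.consequence], a.evidence_level),
    )


def default_annotation(annotations):
    """Return the annotation on the default transcript, if there is one."""
    return next((a for a in annotations if a.is_default), None)


annotations = [
    TranscriptAnnotation("TX_A", "missense_variant", True, 1),
    TranscriptAnnotation("TX_B", "stop_gained", False, 3),
    TranscriptAnnotation("TX_C", "intron_variant", False, 1),
]

ranked = rank_annotations(annotations)
for annotation in ranked:
    print(annotation.transcript_id, annotation.consequence, annotation.evidence_level)
print("default:", default_annotation(annotations).transcript_id)
```

In this example the default transcript reports a missense variant, while a less well-supported transcript reports a stop gained. Listing both lets a reviewer judge whether the more severe consequence is credible instead of silently discarding it.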

For high quality, ease of use and access to detailed annotations, a reference gene set mapped to a reference genome assembly should be used. This approach does have limitations, which are expected to reduce in coming years, as seen in the step ‘What is reference human gene annotation?’. The reference genome assembly is a composite which does not represent the DNA sequence of any specific individual. The Human Pangenome Reference Consortium is creating haplotype-resolved assemblies from individuals of diverse ancestries and plans to create a reference pangenome, which will enable improved discovery and interpretation of genomic variation across global populations. Basic gene annotation is already available on the component assemblies, but it is not yet as comprehensive or as well supported by additional annotations as the reference gene sets. Reference gene sets cover transcripts seen in a wide range of tissues but do not capture expression specific to disease states. Long-read transcriptome sequencing can help fill these gaps, but care must be taken with quality control, and the supporting annotations present in reference transcript sets will not be available for the newly identified transcripts.

Tools

A variety of tools (Table 1) have been developed to predict variant molecular consequences and provide comprehensive annotation, including population frequency information, phenotype associations and scores from pathogenicity prediction algorithms. These tools come with downloadable reference data and with utilities for formatting access-controlled data for use in annotation. Flexible annotation options and filtering tools are available, and the listed examples support the use of novel gene sets.

Tool | Open source (licence)
ANNOVAR | No (free for non-commercial use)
Ensembl Variant Effect Predictor | Yes (Apache)
SnpEff | Yes (MIT)

Table 1. Examples of variant annotation tools.
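
Annotation tools typically write their predictions back into the variant file, where downstream scripts can read them. As an illustration, the Ensembl Variant Effect Predictor can add a CSQ entry to the VCF INFO column: the field names are listed after "Format:" in the header, entries for different transcripts are separated by commas, values by `|`, and multiple consequence terms by `&`. The sketch below parses such an entry. The header is abbreviated to five fields, and the gene and transcript names are hypothetical; real output contains more fields and depends on the options used.

```python
def csq_fields_from_header(header_line):
    """Extract CSQ field names from the VCF header line describing CSQ."""
    # Field names follow "Format: " and run to the closing quote.
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.strip().rstrip('">').split("|")


def parse_csq(csq_value, field_names):
    """Turn a CSQ value into one dict per annotated transcript."""
    records = []
    for entry in csq_value.split(","):
        record = dict(zip(field_names, entry.split("|")))
        # A single transcript can carry several consequence terms.
        record["Consequence"] = record["Consequence"].split("&")
        records.append(record)
    return records


# Abbreviated header and made-up annotation values, for illustration only.
header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Feature">'
)
csq_value = (
    "T|missense_variant|MODERATE|GENE1|TX_A,"
    "T|splice_region_variant&intron_variant|LOW|GENE1|TX_B"
)

fields = csq_fields_from_header(header)
records = parse_csq(csq_value, fields)
for record in records:
    print(record["Feature"], record["Consequence"], record["IMPACT"])
```

Reading the field order from the header, instead of hard-coding it, keeps the parser working when different annotation options add or reorder fields.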

© Wellcome Connecting Science
This article is from the free online course Interpreting Genomic Variation: Overcoming Challenges in Diverse Populations.
