What is reference human gene annotation?

Article introducing the concepts of reference sequences and gene annotation

The value of the human genome to society depends on our ability to interpret its sequence, and especially to elucidate its gene content.

Since the first human reference genome was published in 2001, there have been extensive efforts to describe, or ‘annotate’, human genes. The major reference annotation catalogues (often described as ‘genesets’ or ‘genebuilds’) are Ensembl-GENCODE, largely produced at EMBL-EBI in the UK, and RefSeq, produced by NCBI (part of the NIH) in the USA. Both catalogues consist of transcript ‘models’: computational representations of individual mRNAs found in the cell. The terms ‘GENCODE’ and ‘Ensembl’ tend to be used interchangeably to refer to what is actually a single resource.

A key goal of gene annotation is to interpret the likely functionality of each transcript model (i.e. to judge what a transcript does in the cell in physiological terms). A transcript might be translated into protein; the corresponding model would then be annotated with a coding sequence (CDS) and a protein sequence. A protein-coding gene is therefore a locus that contains at least one protein-coding transcript model. Not all genes are protein-coding, however; they can also be long non-coding RNAs (genes that are transcribed but not translated), pseudogenes (‘broken’ protein-coding genes that lack function) or small RNAs (tRNAs, etc.) (Figure 1).

Illustration representing a gene annotation and transcript models. Transcript Model 1 is built of green boxes (exons) connected by green lines (introns) under the label ‘Coding sequence’. At each end, there are red boxes representing the 5’ and 3’ untranslated regions, respectively. Below Model 1 there are four alternative models (2-5) made of green and red connected boxes of different sizes and in different positions.

Figure 1. An annotated gene may contain numerous transcript models, representing mRNAs found in the cell with distinct exon-intron combinations. In this locus, the first three models have distinct coding sequences as shown in green, with ‘untranslated regions’ in red. Models 4 and 5 are annotated as ‘non-coding’.
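The relationship between genes, transcript models and coding sequences described above can be sketched as a small data structure. This is a minimal illustration only: the class names, the toy coordinates and the choice of 1-based inclusive intervals are assumptions made for this example, not the internal format used by Ensembl-GENCODE or RefSeq.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: intervals are 1-based and inclusive, written as (start, end).
Interval = Tuple[int, int]


@dataclass
class TranscriptModel:
    name: str
    exons: List[Interval]
    # Genomic span from start codon to stop codon; None for a non-coding model.
    cds: Optional[Interval] = None

    def is_coding(self) -> bool:
        return self.cds is not None

    def introns(self) -> List[Interval]:
        """Gaps between consecutive exons."""
        ordered = sorted(self.exons)
        return [(a_end + 1, b_start - 1)
                for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]

    def coding_segments(self) -> List[Interval]:
        """Parts of each exon that lie inside the CDS span."""
        if self.cds is None:
            return []
        cds_start, cds_end = self.cds
        segments = []
        for start, end in sorted(self.exons):
            lo, hi = max(start, cds_start), min(end, cds_end)
            if lo <= hi:
                segments.append((lo, hi))
        return segments


@dataclass
class Gene:
    name: str
    transcripts: List[TranscriptModel]

    def is_protein_coding(self) -> bool:
        # A locus counts as protein-coding if any of its models has a CDS.
        return any(t.is_coding() for t in self.transcripts)


# Toy locus with invented coordinates: one coding and one non-coding model.
model_1 = TranscriptModel("Model 1", [(100, 200), (300, 400), (500, 650)], cds=(150, 600))
model_4 = TranscriptModel("Model 4", [(100, 200), (500, 650)])
toy_gene = Gene("TOY_GENE", [model_1, model_4])
```

In this toy locus, the exon sequence outside the coding segments of Model 1 (positions 100-149 and 601-650) corresponds to the untranslated regions shown in red in Figure 1.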

Transcriptome complexity

Gene annotation is a difficult task because the transcriptome (the full range of messenger RNA molecules expressed by humans) is highly complex. One example is alternative splicing, whereby a gene produces a number of distinct mRNAs that can have different functions and might be expressed in distinct tissues or developmental stages. Human gene annotation remains a work in progress, with both Ensembl-GENCODE and RefSeq regularly producing updated annotation releases.

However, as annotation projects strive to capture our ever-expanding knowledge of the nature of the transcriptome, the associated increase in the complexity of these resources can cause complications for downstream users. To compound matters, there is a parallel need to improve functional annotation. For example, it is often not certain whether a given transcript is protein-coding; annotation can be an interpretative process, and different projects can reach alternative conclusions. This lack of certainty in specific annotations is passed on to downstream users.

MANE Select: high-quality standardised models

Today, Ensembl-GENCODE and RefSeq are working together to solve such problems, especially via the MANE (Matched Annotation from NCBI and EMBL-EBI) project. The first goal is to achieve standardisation between these two catalogues, via the creation of a set of transcript models held in common. Over 99% of human protein-coding genes contain a ‘MANE Select’: a single high-confidence protein-coding model considered to be most representative of the biology of the locus. MANE was designed with clinical applications in mind, and these transcript models are now used as the default in interpretative projects, such as ClinVar. Each model is identical between Ensembl-GENCODE and RefSeq in terms of its coordinates and annotation, specifically on the GRCh38 reference genome. They are also ‘stable’ (i.e. these aspects will not change from one release to the next). The major advantage of using only MANE Selects is that you are specifically working with high-quality standardised models because substantial numbers of functionally ambiguous models are removed from the analysis. Nonetheless, variants of interest for a given gene might also be found in exons not incorporated into the MANE Select.
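As a rough illustration of restricting an analysis to MANE models, the sketch below scans GTF-formatted annotation lines and keeps only transcripts that carry a MANE tag. It assumes the GENCODE-style convention in which the ninth GTF column contains attributes such as tag "MANE_Select"; or tag "MANE_Plus_Clinical";. The function name, the example lines and the identifiers in them are invented for this example, so check the documentation of the release you are using before relying on the tag names.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumption: attributes look like  key "value";  as in GENCODE-style GTF files.
TAG_RE = re.compile(r'(?<!\w)tag "([^"]*)"')
TRANSCRIPT_ID_RE = re.compile(r'(?<!\w)transcript_id "([^"]*)"')


def mane_transcripts(gtf_lines: Iterable[str],
                     wanted: Tuple[str, ...] = ("MANE_Select",)) -> Dict[str, str]:
    """Map transcript_id to the MANE tag it carries, for 'transcript' rows only."""
    found: Dict[str, str] = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue  # skip malformed rows and gene/exon/CDS rows
        attributes = fields[8]
        match = TRANSCRIPT_ID_RE.search(attributes)
        if match is None:
            continue
        tags = set(TAG_RE.findall(attributes))
        for tag in wanted:
            if tag in tags:
                found[match.group(1)] = tag
    return found


# Invented example rows; the identifiers are placeholders, not real accessions.
EXAMPLE_GTF = [
    "#description: toy annotation",
    'chrN\tSRC\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "TX_A"; tag "basic"; tag "MANE_Select";',
    'chrN\tSRC\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "TX_B"; tag "basic";',
    'chrN\tSRC\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "TX_C"; tag "MANE_Plus_Clinical";',
    'chrN\tSRC\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "TX_A"; tag "MANE_Select";',
]

select_only = mane_transcripts(EXAMPLE_GTF)
select_and_clinical = mane_transcripts(EXAMPLE_GTF, wanted=("MANE_Select", "MANE_Plus_Clinical"))
```

Here select_only contains only TX_A, while select_and_clinical also includes TX_C. Filtering in this way trades completeness for consistency: models without a MANE tag, such as TX_B, are dropped from the analysis.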

The catalogue will be expanded to include transcripts containing additional sequences of known clinical significance; specifically, where it is known that the MANE Select does not include all known pathogenic or likely pathogenic variants. For example, the SCN5A locus includes a pair of mutually exclusive ‘cassette exons’ (i.e. one exon or the other is spliced into the mRNA, but never both). However, each exon is known to be clinically relevant, so in order to incorporate both exons, the catalogue includes a MANE Select model and a second model designated as ‘MANE Plus Clinical’. This is depicted in Figure 2.

Illustration representing the gene SCN5A composed of a double set of interconnected green boxes. A zoom-in box expands a gene region where MANE Plus Clinical and MANE Select indicate green boxes for Exon 6a and 6b respectively.

Figure 2. SCN5A has a MANE Plus Clinical model as well as a MANE Select model, which differ in their incorporation of one of two copies of exon 6. Copy 6a is known to be expressed in the neonatal heart; copy 6b is expressed in the adult heart.

In such cases, the question of whether to use either or both of these models for analysis should be considered carefully. This project is still ongoing, and the set of MANE transcript models does not currently provide a perfect intersection with the catalogue of variants of known or suspected clinical interest. In the meantime, users can choose to work with the full, unfiltered complement of Ensembl-GENCODE and RefSeq models.
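One way to see why the choice of model matters is to ask which models actually cover a given variant position. The sketch below does this for a toy locus loosely patterned on the SCN5A example, with one model carrying an ‘exon 6a’ and the other an ‘exon 6b’. All coordinates are invented for illustration and are not real SCN5A positions; the function name is likewise hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: exon intervals are 1-based and inclusive, written as (start, end).
Interval = Tuple[int, int]

# Invented coordinates: the two models share their flanking exons and differ
# only in which copy of the mutually exclusive exon they include.
TOY_MODELS: Dict[str, List[Interval]] = {
    "MANE Select": [(100, 200), (400, 500), (700, 800)],         # includes 'exon 6b'
    "MANE Plus Clinical": [(100, 200), (300, 380), (700, 800)],  # includes 'exon 6a'
}


def covering_models(position: int,
                    models: Dict[str, List[Interval]] = TOY_MODELS) -> List[str]:
    """Return the names of models with an exon overlapping the position."""
    return sorted(
        name
        for name, exons in models.items()
        if any(start <= position <= end for start, end in exons)
    )


shared_exon_hit = covering_models(150)   # falls in an exon common to both models
exon_6a_hit = covering_models(350)       # falls only in the 'exon 6a' interval
exon_6b_hit = covering_models(450)       # falls only in the 'exon 6b' interval
intronic_hit = covering_models(250)      # falls in no exon of either model
```

With these toy coordinates, a variant at position 350 is reported only against the MANE Plus Clinical model, so an analysis restricted to MANE Select would treat it as non-exonic.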

© Wellcome Connecting Science
This article is from the free online course Interpreting Genomic Variation: Overcoming Challenges in Diverse Populations

Created by
FutureLearn - Learning For Life
