What is a human reference genome?

An article introducing the Human Genome Project and related resources

The first ‘reference’ human genome sequence (i.e. an official assembly) was released over twenty years ago, produced as part of the Human Genome Project (HGP). This first assembly was a composite of sequences from a dozen or so anonymous individuals.

A reference genome is necessary so that researchers and clinicians can work on the same assembly, which ensures better coordination and standardisation in the interpretation of genome-based scientific data, including the interpretation of variants. This reference genome has been improved over time, fixing regions that were found to be incorrect and filling in incomplete ‘gap’ regions with new sequence data. Over the past ten years, there have been three updates: GRCh38 (an improvement over the early assemblies), T2T-CHM13 (which includes telomeres and centromeres) and, in 2023, the first draft of the human pangenome.

The majority of projects use GRCh38. However, GRCh37 (an earlier assembly) also remains in use because many large-scale projects, including variant interpretation workflows in the clinic, were established using this assembly (see step Genomic variant interpretation workflow).

The use of different human reference genomes complicates scientific and clinical workflows; therefore, it is important to be clear about which assembly you are using, and whether the data of interest have been built or interpreted on the same assembly. Special care needs to be taken with genome coordinates: for example, chr17:43,104,133 is in a coding exon of BRCA1 on the GRCh38 assembly, whereas it falls within an intron of DCAKD on the GRCh37 assembly. In practice, resources often provide both GRCh37 and GRCh38 coordinates (e.g. for the position of a variant in ClinVar), while tools such as Ensembl VEP work on either assembly. There are also tools available to convert coordinates between assemblies. Meanwhile, equivalent gene annotations (i.e. catalogues of in silico transcript ‘models’) are available on both GRCh37 and GRCh38. Positional information based on transcript models will be largely identical between these assemblies, e.g. NM_001197293.3:c.272del refers to the same change on GRCh37 and GRCh38. However, where the underlying sequence of a gene differs between the two assemblies, it is possible to get discordant variant interpretations. Finally, for copy number variants (CNVs), the choice of reference assembly will likely make an even greater difference.
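One way to make the "be clear about which assembly you are using" advice concrete is to store the assembly name alongside every coordinate, so that positions from different assemblies cannot be compared by accident. The following is a minimal Python sketch of that idea, not a standard library or an established tool: the names GenomicPosition, same_locus and KNOWN_ASSEMBLIES are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assemblies named in this article; this set is an illustrative assumption.
KNOWN_ASSEMBLIES = {"GRCh37", "GRCh38", "T2T-CHM13"}


@dataclass(frozen=True)
class GenomicPosition:
    """A 1-based position that always carries its reference assembly."""
    assembly: str
    chrom: str
    pos: int

    def __post_init__(self):
        if self.assembly not in KNOWN_ASSEMBLIES:
            raise ValueError(f"unknown assembly: {self.assembly!r}")
        if self.pos < 1:
            raise ValueError("positions are 1-based and must be >= 1")

    def __str__(self):
        return f"{self.assembly}:{self.chrom}:{self.pos:,}"


def same_locus(a: GenomicPosition, b: GenomicPosition) -> bool:
    """Compare two positions, refusing to guess across assemblies."""
    if a.assembly != b.assembly:
        raise ValueError(
            f"cannot compare {a} with {b}: convert to a common assembly first"
        )
    return (a.chrom, a.pos) == (b.chrom, b.pos)


# The same numeric coordinate on two assemblies (the example from the text).
on_38 = GenomicPosition("GRCh38", "chr17", 43104133)
on_37 = GenomicPosition("GRCh37", "chr17", 43104133)
print(on_38)
```

Here on_38 and on_37 share a chromosome and a number but are treated as different objects, and same_locus(on_38, on_37) raises an error rather than returning True. Converting between assemblies is deliberately left out of the sketch; that step would be handled by a dedicated coordinate conversion tool.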

These early reference genomes did not reflect the genetic diversity of human populations. However, the Human Pangenome Project, based on 47 human genomes, over half of which are derived from individuals from African populations, aims to change this. The pangenome is based on diploid genomes (i.e. assemblies were separately generated from both sets of parental chromosomes found in the cell). It has also improved the representation of CNVs. The increase in diversity in a reference genome is anticipated to make it easier to work with experimental data from different individuals and to aid the interpretation of DNA variation in clinical settings. The pangenome is anticipated to become the standard human reference genome in due course.

© Wellcome Connecting Science
This article is from the free online course Interpreting Genomic Variation: Overcoming Challenges in Diverse Populations.
