What is the reference genome?

The Human Genome Project was a step into the unknown – the first attempt to determine the complete sequence of the human genome. It was a monumental task: to identify, from scratch, every one of the roughly three billion bases in the human genome.

Nowadays, having sequenced many human genomes, we have a blueprint against which to compare. The reference genome is like the photograph on a jigsaw puzzle box – it can be used to figure out where the pieces should go. Reference genomes are therefore invaluable to scientists, as they act as a template against which newly sequenced DNA can be pieced together and compared.

The human reference genome doesn’t represent the genetic sequence of any one individual, but is made up of a combination of several people’s DNA. When we sequence a patient’s genome and compare it to the reference genome, we assume that the reference represents the ‘normal’ sequence, and therefore that any difference we identify could be responsible for the patient’s condition.
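The comparison described above can be sketched in a few lines of Python. This is a toy illustration only, assuming two short sequences of equal length that are already lined up base for base; real pipelines must first align millions of short reads to the reference. The function name `find_variants` and the example sequences are made up for this sketch.

```python
def find_variants(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length,
    which is a simplification made for this illustration.
    """
    if len(reference) != len(sample):
        raise ValueError("this toy comparison needs equal-length, pre-aligned sequences")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical sequences: the sample differs from the reference at one base.
reference = "ACGTTAGC"
sample = "ACGATAGC"
print(find_variants(reference, sample))  # [(4, 'T', 'A')]
```

Each difference found this way is only a candidate: as discussed later in this article, many differences from the reference have no adverse effect.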

The human reference genome is continuously evolving. Since the Human Genome Project it has been updated in line with new information and knowledge, thanks to the collaboration of scientists around the world. It is currently in its 38th major version, known as GRCh38.

There are still gaps in the human reference genome sequence, because some areas of the genome are particularly difficult to sequence. For example, only the most recent version of the reference genome included the centromere sequences – the portions of DNA that link sister chromatids together and play an important role in mitosis and meiosis, which we learned about in week 1. Their inclusion is indicative of the continuous progress being made by scientists in understanding and sequencing the genome.

Scientists are also particularly interested in sections of DNA that vary between people with no adverse effect. They have been working to ensure that the human reference genome, and the associated bioinformatic tools, contain information about these alternative sequences, to make identifying significant variants as easy as possible.

Although constantly evolving, the reference genome has been criticised for being based on a small number of individuals, and therefore not being representative of the global population. Efforts are underway to address this and to ensure that the true extent of genomic diversity across the world is represented.

Of course, it isn’t just the human genome that is of interest: scientists have sequenced many other organisms and built reference genomes for them over the years. This information is useful when comparing the sequences of related genes between species.

Solving one problem, creating another

It is important that the reference genome is continually kept up to date, but frequent changes can pose a challenge to those trying to identify disease-causing variants. Newer information can contradict and challenge older studies, and for this reason it is paramount that scientists and clinicians stay up to date with the latest advances. Equally, when information and findings are shared, it is important that the version of the reference genome used for comparison is shared too, so that any conclusions can be seen in context. There is now an effort, called the Locus Reference Genomic (LRG) project, to establish stable sequences for regions of the genome linked to disease, with the hope that scientists can work from the same, definitive reference sequences.
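To illustrate why the reference version needs to travel with a finding, here is a minimal Python sketch. It assumes a made-up record type, `VariantCall`, that stores the reference version (for example "GRCh37" or "GRCh38") next to each variant, and it refuses to compare two findings reported against different versions. The chromosome, position and bases shown are invented for illustration and do not describe a real variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantCall:
    """A hypothetical record of one reported variant."""
    build: str        # reference genome version the position refers to
    chromosome: str
    position: int     # 1-based position on that version of the reference
    ref: str          # base in the reference
    alt: str          # base observed in the patient


def same_variant(a: VariantCall, b: VariantCall) -> bool:
    """Compare two reports, but only if they use the same reference version."""
    if a.build != b.build:
        # A position on one version may not point to the same base on another,
        # so a direct comparison would be misleading.
        raise ValueError(f"cannot compare {a.build} with {b.build} directly")
    return (a.chromosome, a.position, a.ref, a.alt) == (
        b.chromosome, b.position, b.ref, b.alt
    )


# Invented example values, for illustration only.
lab_one = VariantCall("GRCh38", "7", 1000, "G", "A")
lab_two = VariantCall("GRCh38", "7", 1000, "G", "A")
print(same_variant(lab_one, lab_two))  # True
```

Storing the version in the record itself, rather than assuming it, is one simple way to keep conclusions in context when results are shared.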

This article is from the free online course:

Whole Genome Sequencing: Decoding the Language of Life and Health

Health Education England
