Looking at DNA: Sequencing technologies
DNA sequencing is the process of determining the order of the bases – adenine, guanine, cytosine and thymine – in a molecule of DNA.
In the mid-1970s, a scientist called Fred Sanger developed a DNA sequencing method, eponymously known as Sanger sequencing, which revolutionised molecular biology. Being able to read the sequence of DNA opened up a vast range of scientific applications, from basic science through to translational applications such as diagnostic testing and targeted drug therapy.
Improvements over the years to Sanger’s original method allowed scientists to sequence sections of DNA up to around 600 bases in length. Because scientists could only sequence one small section of DNA at a time, the time and cost required to sequence whole genomes remained prohibitive. Next generation sequencing (NGS) methods solved this problem by allowing hundreds of thousands of fragments of DNA to be sequenced at the same time. These are known as massively parallel sequencing approaches.
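To give a feel for the scale of the problem, the short calculation below estimates the minimum number of 600-base Sanger reads needed to cover a genome once. The genome size of roughly 3 billion bases is an assumption added here for illustration and is not stated in the text above. Real projects need overlapping reads and repeated coverage, so the true number would be considerably higher.

```python
import math

# Assumption (not from the text): approximate length of the human genome in bases.
GENOME_SIZE = 3_000_000_000
# From the text: improved Sanger sequencing reads up to around 600 bases.
SANGER_READ_LENGTH = 600

# Lower bound: reads laid end to end, with no overlap and no repeated coverage.
min_reads = math.ceil(GENOME_SIZE / SANGER_READ_LENGTH)

print(f"At least {min_reads:,} reads of {SANGER_READ_LENGTH} bases")
# prints: At least 5,000,000 reads of 600 bases
```

Sequencing millions of sections one at a time is what made the serial approach so slow and costly.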
The term “next generation sequencing” (NGS) refers to many different methods used to sequence DNA. However, all methods follow the same basic principles:
- Sample DNA is processed into smaller fragments for sequencing.
- The sequence of bases in many different fragments of DNA is read at the same time using NGS technology. The number of fragments sequenced at the same time ranges from hundreds to millions, depending on the type of sequencing being undertaken.
- A computer file is generated containing the base sequences derived from the DNA fragments. Each individual length of sequence, derived from one original DNA fragment, is known as a “read”. Reads are usually between 50 and 300 bases long, but can be longer depending on the NGS method used.
- Specialised software analyses the reads and matches them back to the specific place in the genome they arose from, using a reference genome sequence as a template. This is known as “alignment” or “mapping”.
- Differences between the sample DNA and the reference DNA are identified. This is known as “variant calling”.
- The likely effect that a genetic variant will have on a protein is identified. This is known as “variant annotation”.
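The steps above can be sketched in a few lines of Python. This is a toy illustration only, not how production software works: the reference sequence, the function names (`fragment`, `align`, `call_variants`) and the thresholds are all hypothetical, the "genome" is 31 bases long, and alignment is done by brute-force comparison. Real tools use indexed reference genomes, base quality scores, and can handle insertions and deletions. Variant annotation is left out.

```python
from collections import Counter, defaultdict


def fragment(sequence, read_length, step):
    """Cut a sequence into overlapping reads (stand-in for fragmenting and sequencing)."""
    last_start = len(sequence) - read_length
    return [sequence[i:i + read_length] for i in range(0, last_start + 1, step)]


def align(read, reference, max_mismatches=1):
    """Return the reference position with the fewest mismatches, or None if none is close enough."""
    best_pos, best_mismatches = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos


def call_variants(reads, reference, min_support=2):
    """Report positions where the most common base in the aligned reads differs from the reference."""
    pileup = defaultdict(Counter)  # reference position -> counts of bases observed there
    for read in reads:
        pos = align(read, reference)
        if pos is None:
            continue  # read could not be placed on the reference
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if base != reference[pos] and count >= min_support:
            variants.append((pos, reference[pos], base))
    return variants


# Hypothetical reference, and a sample that differs by one base (C to T at position 15, counting from 0).
reference = "ACGTTGCAAGGCTTACCGATAGCTAGGATCC"
sample = reference[:15] + "T" + reference[16:]

reads = fragment(sample, read_length=10, step=3)
variants = call_variants(reads, reference)
print(variants)  # prints: [(15, 'C', 'T')]
```

Requiring at least two supporting reads (`min_support`) reflects the general idea that a difference seen in several independent reads is more convincing than one seen only once.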
It is possible to sequence the whole human genome quickly, and relatively inexpensively, using these techniques. When we sequence the whole human genome, we identify the full extent of human variation: 5-10 million genetic variants per person, including 20,000 “coding” variants which fall within transcribed genes.
For many applications of NGS, we are trying to find a single genetic variant relevant to a specific disease or trait. Finding that one variant amongst this vast amount of genomic data is akin to finding a needle in a haystack. Methods have therefore been developed which allow us to sequence smaller regions of the genome, leaving less variation to analyse. For instance, we could look just at the 1-2% of the genome which codes for proteins (the “exome”), or only at the regions of the genome which harbour specific genes we are interested in (“gene panels”). This allows us to sequence just the portion of the genome we think is most likely to yield the relevant variation, and ignore the rest. Methods which allow us to select smaller regions of the genome for sequencing are known as “target enrichment”, “capture” or “pull down” techniques.
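Target enrichment itself is a laboratory technique, but the effect of restricting attention to chosen regions can be illustrated computationally. The sketch below keeps only those variants that fall inside a list of target regions. The region coordinates, the variants and the `in_panel` helper are invented for illustration and do not correspond to real genes.

```python
# Hypothetical gene panel: (chromosome, start, end), with end excluded from the region.
panel_regions = [
    ("chr1", 1000, 2000),
    ("chr2", 500, 800),
]

# Hypothetical variants: (chromosome, position, reference base, observed base).
all_variants = [
    ("chr1", 1500, "A", "G"),  # inside the first region
    ("chr1", 2500, "C", "T"),  # right chromosome, outside the region
    ("chr2", 799, "G", "A"),   # last position inside the second region
    ("chr3", 10, "T", "C"),    # chromosome not on the panel
]


def in_panel(variant, regions):
    """Return True if the variant's position lies within any target region."""
    chrom, pos = variant[0], variant[1]
    return any(chrom == c and start <= pos < end for c, start, end in regions)


on_target = [v for v in all_variants if in_panel(v, panel_regions)]
print(on_target)
# prints: [('chr1', 1500, 'A', 'G'), ('chr2', 799, 'G', 'A')]
```

The shorter the list of regions, the fewer variants remain to be interpreted, which is the practical benefit of exomes and gene panels described above.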
Now that we can analyse both chromosomes and DNA sequence at high resolution, do we have the tools to diagnose all genetic susceptibility to disease?
© St George’s, University of London