Genome reference sequences and resequencing
In the early days of genome sequencing, little was understood about the structure of genomes and it was a very expensive exercise. The first human genome sequence, finished around the year 2000, cost between $500 million and $1 billion. It would now cost only around $1000. The technology that was common at the time (Sanger sequencing) produced fairly long reads (~500 bases) and was highly accurate. The focus was on producing high quality assemblies of different species causing important diseases, such as Yersinia (plague), Mycobacterium tuberculosis (tuberculosis), Salmonella typhimurium (food poisoning), rather than multiple sequences of one species. Because it was laborious and expensive, there was much more to gain by sequencing entirely new species than sequencing genomes from different individuals of the same species.
When sequencing an entirely new species, there is no pre-existing, similar sequence to help us out in determining where genes are located. The goal is to produce what is known as a ‘reference genome sequence’. To produce a reference sequence, we use a process termed assembly. Assembly is the process of putting together individual sequencing reads, much like a puzzle, to make longer stretches of sequence called contigs. The best possible result is to end up with a single sequence that represents the entire bacterial chromosome. In reality, we cannot always complete the puzzle and there are pieces that we cannot put together. Contigs are assembled using computer programs that look for overlaps between sequencing reads. The image below shows how genome sequencing is used to produce either reference genome assemblies or to identify differences between closely related bacteria (resequencing).
Assembly and resequencing (Click to expand) © Adam Reid 2017
Once a reference assembly has been produced, the next step is to find the genes, a step known as genome annotation. There are computer programs designed to do this. They find parts of the genome sequence which look like they might encode protein sequences. This is important because it identifies the functional toolkit of the bacterium. It also helps us understand differences between bacterial genomes. When we look at changes in the genome that might be related to antimicrobial resistance, for example, we can identify which genes are involved.
Different sequencing technologies have different roles
Over time, more genome sequences have become available for different species. Developments in genome sequencing technology have also made sequencing easier and cheaper. Some new genome sequencing technologies allow us to improve reference sequences because they have much longer sequencing reads (e.g. PacBio and Oxford Nanopore can produce 10,000-50,000 bases), which help to join contigs together and produce more complete reference sequences. Other technologies produce relatively short reads (Illumina produces 75-150 bases) but generate a very large number of them at low cost. The ability to produce lots of short reads has allowed us to identify similarities and differences between bacteria of the same species cheaply and efficiently (resequencing). This has been crucial in helping us track the spread of bacterial disease.
Sequencing a new example of a species for which a reference genome already exists is known as resequencing. Discovering what makes the new bacterial genome different from the reference is much easier than assembling a new reference genome. We already know what the genome generally looks like and the location of the genes. Instead of assembling a new genome sequence, we use read mapping. For each sequencing read from the new bacterium, we look for the most similar part of the reference genome and place it there. We can then look for differences between the mapped reads and reference genome, which represent mutations in the genome of the new bacterium.
© Wellcome Genome Campus Advanced Courses and Scientific Conferences