The Process of Assembly

In this video Andy Brass will take you through assembling the genome.
So we’ve reached the stage in the process where we’ve taken a sample from the patient, DNA’s been extracted from that sample, the DNA’s been fragmented, and those fragments have been sent to the next generation sequencing machine. Now, what we get back from the sequencing machine is a set of files, each one containing the sequence of a single fragment. And here are some of the sequences that will be represented in those files. So what we now need to do is make sense of this data. We’ve got all the pieces of the jigsaw, we need to fit them together.
So if we push the analogy of the jigsaw a little further, if you try to do a jigsaw, then the task is made much easier if you’ve got a picture. It makes it much more straightforward to work out where the pieces go. Now, in our case we don’t actually have a picture, as such. But what we do have is a reference genome. Our patient’s genome won’t be exactly the same as the reference genome, but it’s going to be very similar. So what we can now do is try and map each of these different reads against the reference genome to find the place where they’ll fit best.
So what we can do is we can take one of these reads and run it along the reference genome to find the place where it best fits. So, for example, if we take this one here, actually it looks as if it fits best in this position here. So we now take the next read and repeat the process. So let’s say this one, run it along, and it looks like it’s going to fit. Yeah, that looks about right. So we take this one and look to see where this will fit. So we run it along the sequence trying to find a place that it might fit. That looks like it might go there. And continuing on.
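The "run it along" step can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not the method used by real mapping software: the reference and reads below are made up, `best_fit` is a hypothetical helper name, and the fit is scored simply by counting mismatched bases at each position.

```python
def best_fit(read, reference):
    """Slide the read along the reference and return (position, mismatches)
    for the placement with the fewest mismatched bases."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches

# Made-up toy data, for illustration only.
reference = "ACGTTGCAAGGCTTAACGGA"
reads = ["TTGCAAGG", "GGCTTAAC", "CAAGACTT"]

placements = {read: best_fit(read, reference) for read in reads}
for read, (pos, mismatches) in placements.items():
    print(read, "fits best at position", pos, "with", mismatches, "mismatches")
```

The third read differs from the reference at one base, so it still finds its place but with one mismatch, which is the kind of difference the later steps look at.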
And so we go through this process of slowly trying to map the reads, each of these different fragments, against the reference genome. And I think you can already see from this process here that actually doing this is quite a fiddly job and quite time consuming. And for real data, we would have hundreds of millions of these reads that we would be trying to fit to a reference genome that’s about three billion bases long, so we need to use some very smart algorithms if that process isn’t going to be impossibly slow. So what we do is we keep going until all the reads have been successfully added to the reference genome in the form of this multiple sequence alignment.
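The video does not say which algorithms are used. One general idea for avoiding a full scan per read is to index the reference once, then only check each read at the few positions its first bases point to. The sketch below assumes a simple k-mer lookup table; `build_index` and `map_read` are hypothetical names, and production aligners use far more compact indexes and more than one seed per read.

```python
from collections import defaultdict

def build_index(reference, k):
    """Record every position at which each k-base substring occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k):
    """Use the first k bases of the read as a seed, then count mismatches
    only at the candidate positions. Returns (position, mismatches) or None."""
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best

# Made-up toy data, for illustration only.
reference = "ACGTTGCAAGGCTTAACGGA"
k = 4
index = build_index(reference, k)

print(map_read("CAAGACTT", reference, index, k))
print(map_read("AAAATTTT", reference, index, k))
```

A read whose seed contains a sequencing error or a variant would be missed by this sketch, which is one reason real tools are more elaborate.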
So we’ve now completed the process. All the reads have been mapped. What’s this data telling us? Well, the first thing to notice is that we don’t have the same amount of information for each of the bases in the reference genome. For example, in this position here, we only have a single read that covers that base. We’ve only got a single read that covers this base. By the time we get to here, we’ve got maybe two reads that cover the base. And in here, we’ve got round about eight reads that cover the base. Now, the more reads we have at a base, the more confident we are in the quality of the data we can derive from the alignment.
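Counting how many reads cover each base can be sketched as follows. The alignments here are made-up (start position, read) pairs, and `coverage` is a hypothetical helper; it assumes each read has already been placed on the reference without gaps.

```python
def coverage(reference_length, alignments):
    """Return a list giving the number of reads covering each reference base."""
    depth = [0] * reference_length
    for start, read in alignments:
        for pos in range(start, min(start + len(read), reference_length)):
            depth[pos] += 1
    return depth

# Made-up placements: (start position on the reference, read sequence).
alignments = [(0, "ACGTTG"), (3, "TTGCAA"), (4, "TGCAAG")]
depth = coverage(12, alignments)
print(depth)  # [1, 1, 1, 2, 3, 3, 2, 2, 2, 1, 0, 0]
```

Positions with a depth of 1 rest on a single read, while the positions in the middle are supported by several, which is the difference in confidence described above.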
So, for example, in this position here, we’re going to be very confident that the patient’s genome contains an A in that position. So let’s look along the reference genome and compare it to what we’ve got in the multiple sequence alignment. So for most of this region as we go along, you can see that we’ve got very good agreement between the reference genome and what we have in the patient data. Until we get to this position here.
Here all the patient data is an A whilst the reference genome has a G. So we could safely decide that at this position we have a variant between the patient genome and the reference genome. The patient is an A, the reference genome is a G. But if we go now to the end of the alignment, we can see another position where we have a difference between what we see in the reference genome and what we see in the patient data. Could we confidently label this as a variant? Well, probably not.
We’ve got very few reads in this area, and the change is right at the end of this sequence here, and it’s at the end of the sequences that you often see the data quality is lower. So actually the most likely explanation for what’s going on here is that we have some sort of sequencing error towards the end of this read. Some areas are a little bit more complicated than this to work out. For example, in the position here we have a mixture of Gs and Cs. Is this evidence of a sequencing error? Should both of those Gs be Cs?
Or is it actually evidence of a heterozygous position where we have different bases on the genome strands, one base coming from each parent? So calling the variants from this form of multiple sequence alignment isn’t straightforward. It’s a function of quite a few different factors, including things like the read coverage. And we need some fairly sophisticated variant calling software if we’re going to do this well across an entire genome.
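To make the reasoning about a single position concrete, here is a deliberately simplified sketch. The function name `call_position` and all three thresholds are assumptions chosen for illustration, not values from the video or from any particular tool, and the sketch ignores base quality and position within the read, both of which matter in practice.

```python
from collections import Counter

def call_position(ref_base, observed_bases,
                  min_depth=4, het_fraction=0.3, hom_fraction=0.8):
    """Classify one reference position from the bases the reads show there.
    All thresholds are illustrative assumptions."""
    depth = len(observed_bases)
    if depth < min_depth:
        return "uncertain (low coverage)"
    counts = Counter(observed_bases)
    # Bases seen often enough that they are unlikely to be isolated errors.
    frequent = sorted(b for b, n in counts.items() if n / depth >= het_fraction)
    if len(frequent) == 1 and counts[frequent[0]] / depth >= hom_fraction:
        base = frequent[0]
        if base == ref_base:
            return "matches reference"
        return f"variant {ref_base}>{base}"
    if len(frequent) == 2:
        return "possible heterozygous " + "/".join(frequent)
    return "uncertain (mixed signal)"

print(call_position("G", list("AAAAAAAA")))  # every read shows A, reference is G
print(call_position("A", ["C"]))             # a single read: too little evidence
print(call_position("C", list("CCGCCGCC")))  # two Gs among Cs: error or real?
print(call_position("C", list("CGCGCGCG")))  # an even split of two bases
```

The third case mirrors the mixture of Gs and Cs in the video: with only two Gs the sketch declines to decide, which is where dedicated variant calling software brings in additional evidence.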

He will cover a small section of the genome and a selection of reads of up to 20 base pairs in length to demonstrate, albeit on a smaller scale, the process of aligning the fragmented data against the human genome reference data.
This article is from the free online course Clinical Bioinformatics: Unlocking Genomics in Healthcare.

Created by
FutureLearn - Learning For Life
