Designing the analysis pipeline

We asked Sanjeev about the steps taken to identify a specific variant within the sequencing data associated
The reference genome is an amalgamation of sequence from around 100 individuals; that's the reference we use. And we are not using the latest reference, which is GRCh38; we're using GRCh37. It was built as a scaffold to allow people to determine the most common alleles present within the population. A lot of that work was done on the Caucasian population, so bear in mind that certain alleles are sometimes more prevalent depending on your ethnic background. When we talk about reads, a read can be construed as a very short string of letters, consisting of four different types of bases: A, G, C and T.
What we tend to do with each read is try to see where it best fits across a larger string of letters, around 3 billion characters long. So we tend to look at where that read fits, and whether it falls into one of three categories: a perfect match, a best match, or no match. Of course, we're not interested in perfect matches, because what we're really interested in is a best match or no match. And in those regions where you have the best match, we tend to look for the differences from the original reference, the three-billion-character-long string.
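The three categories described above can be illustrated with a minimal Python sketch. It slides a read along a toy reference and counts mismatches, with no gaps allowed. The function name `classify_read`, the `max_mismatches` threshold and the toy sequences are assumptions made for illustration; they are not the pipeline discussed in the video, and real aligners use indexed, gap-aware algorithms rather than this brute-force scan.

```python
def classify_read(read, reference, max_mismatches=2):
    """Place a read at its best ungapped position on the reference.

    Returns (category, position, mismatches), where position is 0-based.
    max_mismatches is a hypothetical cut-off chosen for this sketch.
    """
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        # Count the bases that differ between the read and this window.
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if best_mm is None or mm < best_mm:
            best_pos, best_mm = pos, mm
    if best_mm is None or best_mm > max_mismatches:
        return ("no match", None, None)
    if best_mm == 0:
        return ("perfect match", best_pos, 0)
    return ("best match", best_pos, best_mm)


# Toy reference; a real reference is around 3 billion characters long.
reference = "ACGTTGCAAGGCTTACGATCGGATCCTA"

for read in ["TTGCAAGG", "TTGCTAGG", "GGGGGGGG"]:
    print(read, classify_read(read, reference))
```

Only the "best match" and "no match" reads would be carried forward, since those are the ones that may differ from the reference.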
We then try to see whether those best matches or no matches are within your 114 genes, within your panel. And we then look to see, based on the clinical referral, whether those best matches actually occur within the set of genes that we would expect. Once you determine which genes are relevant based on your clinical diagnosis, we look at the variants and where they lie in relation to a gene, and more importantly, where they lie in relation to a transcript.
Once we've determined that the variant is either a stop, which in other words truncates a protein, or a missense, which would change a protein, we then look to see if there's any prior evidence that would suggest that it's pathogenic. So we tend to use a suite of in silico tools. And we use existing repositories, like Ensembl, which does a lot of pre-calculation of pathogenicity for missense variants. That tends to aid interpretative decisions. We also like to use a suite of conservation tools and splice site prediction tools to see whether certain variants affect splicing.
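The panel and referral filtering described above can be sketched roughly as follows. Everything named here is hypothetical: the `Region` and `Variant` types, the gene names `GENE_A` to `GENE_C` and their coordinates are invented for illustration and do not come from the 114-gene panel mentioned in the video.

```python
from typing import NamedTuple


class Region(NamedTuple):
    gene: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


class Variant(NamedTuple):
    chrom: str
    pos: int    # 1-based
    ref: str
    alt: str


def genes_for_variant(variant, panel):
    """Return the panel genes whose interval contains the variant."""
    return [r.gene for r in panel
            if r.chrom == variant.chrom and r.start <= variant.pos <= r.end]


def filter_to_referral(variants, panel, referral_genes):
    """Keep variants that fall in panel genes relevant to the referral."""
    kept = []
    for v in variants:
        hits = [g for g in genes_for_variant(v, panel) if g in referral_genes]
        if hits:
            kept.append((v, hits))
    return kept


# Hypothetical panel and variants, for illustration only.
panel = [
    Region("GENE_A", "1", 100, 200),
    Region("GENE_B", "2", 500, 900),
    Region("GENE_C", "2", 850, 1200),
]
variants = [
    Variant("1", 150, "A", "G"),
    Variant("1", 250, "C", "T"),
    Variant("2", 870, "G", "A"),
]

for variant, genes in filter_to_referral(variants, panel, {"GENE_A", "GENE_C"}):
    print(variant, genes)
```

A gene-level interval is the simplest possible lookup. Placing a variant relative to a transcript, as the speaker emphasises, would additionally need exon and coding-sequence coordinates for each transcript.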
In the video, provided by the Manchester Centre for Genomic Medicine, Sandy describes the process of aligning the individual short reads back to the human reference genome and of calling variants, which means identifying where a nucleotide differs from the reference sequence.
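As a rough illustration of variant calling, the sketch below piles up already-aligned reads over a toy reference and reports positions where enough reads carry a base that differs from the reference. The function `call_variants` and the thresholds `min_depth` and `min_alt_fraction` are assumptions for this example; production variant callers also model base quality, mapping quality and genotype likelihoods, none of which is attempted here.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_depth=3, min_alt_fraction=0.3):
    """Call single-base differences from ungapped aligned reads.

    aligned_reads is a list of (start, sequence) pairs, with start 0-based.
    Returns a list of (position, ref_base, alt_base, alt_count, depth).
    """
    # One counter of observed bases per reference position.
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything at this position
        ref_base = reference[pos]
        for alt, n in counts.most_common():
            if alt != ref_base and n / depth >= min_alt_fraction:
                calls.append((pos, ref_base, alt, n, depth))
    return calls


# Toy data: three of the four reads covering position 5 carry T instead of C.
reference = "ACGTACGTAC"
reads = [(0, "ACGTATGT"), (2, "GTATGTAC"), (3, "TACGTAC"), (1, "CGTATG")]
print(call_variants(reference, reads))
```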
Finally, he discusses variant annotation, which is the process of discovering where the variant is located in terms of chromosome and gene, and also what kind of variant it is (for example, synonymous or non-synonymous).
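To make the synonymous versus non-synonymous distinction concrete, here is a small sketch that classifies a single-base substitution within a coding sequence using the standard genetic code. The function `classify_substitution`, its labels and the toy coding sequence are assumptions for illustration. It assumes the variant position is already expressed relative to the coding sequence of one transcript on the forward strand, which is a simplification of real annotation.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(cds, cds_pos, alt):
    """Classify a substitution at 0-based position cds_pos of a coding sequence.

    Returns (label, ref_amino_acid, alt_amino_acid); "*" denotes a stop codon.
    """
    codon_start = cds_pos - cds_pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = list(ref_codon)
    alt_codon[cds_pos % 3] = alt
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE["".join(alt_codon)]
    if alt_aa == ref_aa:
        label = "synonymous"
    elif alt_aa == "*":
        label = "stop_gained"   # truncates the protein
    elif ref_aa == "*":
        label = "stop_lost"
    else:
        label = "missense"      # changes one amino acid
    return (label, ref_aa, alt_aa)


# Toy coding sequence: ATG GAA TGG TAA encodes M, E, W, stop.
cds = "ATGGAATGGTAA"
print(classify_substitution(cds, 5, "G"))  # GAA -> GAG
print(classify_substitution(cds, 3, "A"))  # GAA -> AAA
print(classify_substitution(cds, 3, "T"))  # GAA -> TAA
```

Stop-gained and missense changes are both non-synonymous, which is why they are the ones taken forward to the pathogenicity checks.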
This article is from the free online

Clinical Bioinformatics: Unlocking Genomics in Healthcare

Created by
FutureLearn - Learning For Life
