Designing the analysis pipeline

We asked Sanjeev about the steps taken to identify a specific variant within the sequencing data associated
The reference genome is an amalgamation of the sequencing of around 100 individuals; that's the reference we use. And we are not using the latest reference, which is GRCh38; we're using GRCh37. It was built as a scaffold to allow people to determine what the most common alleles present within the population are. A lot of that work was done on the Caucasian population, so bear in mind that certain alleles are sometimes more prevalent depending on your ethnic background. When we talk about reads, a read can be construed as a very short string of letters, made up of four different types of bases: A, G, C and T.
What we tend to do with each read is try to see where it best fits across a much larger string of letters, around 3 billion characters long. So we look at where the read fits and which of three categories it falls into: a perfect match, a best match, or no match. Of course, we're not interested in the perfect matches; what we're really interested in is the best matches and the no matches. And in those regions where you have a best match, we look for the differences from the original reference, the 3 billion character long string.
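The three categories described above can be illustrated with a toy example. This is only a minimal sketch, not the aligner used in the pipeline: the function name `classify_read`, the `max_mismatches` threshold and the short reference string are all assumptions made for illustration, and the brute-force scan ignores insertions and deletions. Production aligners use indexed, gap-aware algorithms to cope with a 3 billion character reference.

```python
def classify_read(read: str, reference: str, max_mismatches: int = 2):
    """Place a read at its best ungapped position and label the result.

    Returns (category, position, mismatches); position is 0-based.
    max_mismatches is a hypothetical cut-off chosen for this sketch.
    """
    best_pos, best_mm = None, None
    # Slide the read along the reference and count mismatching bases.
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if best_mm is None or mismatches < best_mm:
            best_pos, best_mm = pos, mismatches

    if best_mm is None or best_mm > max_mismatches:
        return ("no match", None, None)
    if best_mm == 0:
        return ("perfect match", best_pos, 0)
    return ("best match", best_pos, best_mm)


# Made-up 16-base "reference" standing in for the genome.
reference = "ACGTTGCATGACCTGA"
for read in ("TGCATG", "TGCTTG", "GGGGGG"):
    print(read, classify_read(read, reference))
```

In this toy run the first read is a perfect match, the second is a best match with one differing base, and the third does not fit anywhere within the allowed number of mismatches.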
We then try to see whether those best matches or no matches are within your 114 genes, within your panel. And we then look to see, based on the clinical referral, whether those best matches actually occur within the set of genes that we expect for that referral. Once you've determined which genes are relevant based on your clinical diagnosis, we look at the variants and where they lie in relation to a gene and, more importantly, where they lie in relation to a transcript.
Once we've determined that the variant is either a stop, which in other words truncates a protein, or a missense, which changes a protein, we then look to see if there's any prior evidence that would suggest that it's pathogenic. So we tend to use a suite of in silico tools. And we use existing repositories, like Ensembl, which pre-calculates a lot of pathogenicity predictions for missense variants. That tends to aid interpretative decisions. We also like to use a suite of conservation tools and splice site prediction tools to see whether certain variants affect splicing.
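The step of restricting variants to the panel, and then to the genes expected for a referral, might be sketched as follows. Everything here is hypothetical: the gene names (`GENE_A` and so on), the coordinates, the `Region` and `Variant` types and the set of referral genes are invented for illustration and do not reflect the real 114-gene panel.

```python
from typing import NamedTuple


class Region(NamedTuple):
    gene: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical three-gene panel with made-up coordinates.
PANEL = [
    Region("GENE_A", "1", 1000, 2000),
    Region("GENE_B", "1", 5000, 6000),
    Region("GENE_C", "2", 300, 900),
]


def genes_for_variant(variant: Variant, panel: list) -> list:
    """Return the panel genes whose region contains the variant."""
    return [
        r.gene for r in panel
        if r.chrom == variant.chrom and r.start <= variant.pos <= r.end
    ]


def filter_variants(variants: list, panel: list, relevant_genes: set) -> list:
    """Keep variants that fall in a panel gene relevant to the referral."""
    kept = []
    for v in variants:
        hits = [g for g in genes_for_variant(v, panel) if g in relevant_genes]
        if hits:
            kept.append((v, hits))
    return kept


variants = [
    Variant("1", 1500, "A", "G"),  # inside GENE_A
    Variant("1", 5500, "C", "T"),  # inside GENE_B, not relevant to referral
    Variant("2", 100, "G", "A"),   # outside the panel
    Variant("2", 300, "T", "C"),   # on the first base of GENE_C
]
# Hypothetical referral for which only GENE_A and GENE_C are relevant.
for variant, genes in filter_variants(variants, PANEL, {"GENE_A", "GENE_C"}):
    print(variant, genes)
```

A linear scan over the panel is fine for a list this small; an interval index would be the usual choice for larger region sets.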

Within the video, provided by the Manchester Centre for Genomic Medicine, Sandy describes the process of aligning the individual short reads back to the human reference genome, and calling variants, which means identifying where there is a different nucleotide in comparison to the reference sequence.
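As a rough illustration of what "calling variants" means, the sketch below piles up already-aligned reads over a reference and reports positions where a non-reference base is seen often enough. The function name `call_variants`, the `min_depth` and `min_fraction` thresholds and the toy data are assumptions for this example only. Real variant callers use statistical models and base qualities rather than a simple fraction cut-off, and this sketch handles substitutions only.

```python
from collections import Counter, defaultdict


def call_variants(reference: str, aligned_reads: list,
                  min_depth: int = 2, min_fraction: float = 0.5) -> list:
    """Call single-base differences from ungapped aligned reads.

    aligned_reads is a list of (start, sequence) pairs with 0-based
    starts; reads are assumed to lie entirely within the reference.
    Returns tuples of (position, ref_base, alt_base, alt_count, depth).
    """
    pileup = defaultdict(Counter)
    # Count the bases observed at each reference position.
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        ref_base = reference[pos]
        # Report the most frequent non-reference base if it passes
        # the (hypothetical) depth and fraction thresholds.
        for base, n in counts.most_common():
            if base != ref_base and depth >= min_depth and n / depth >= min_fraction:
                calls.append((pos, ref_base, base, n, depth))
                break
    return calls


reference = "ACGTTGCATGACCTGA"
reads = [
    (2, "GTTGCA"),  # matches the reference
    (4, "TGCTTG"),  # T instead of A at position 7
    (5, "GCTTGA"),  # T instead of A at position 7
]
print(call_variants(reference, reads))
```

Here two of the three reads covering position 7 carry a T where the reference has an A, so that position is reported as a variant.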

Finally, he discusses variant annotation, which is the process of working out where the variant is located in terms of chromosome and gene, and what kind of variant it is (e.g. synonymous or non-synonymous).
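The "what kind of variant" part of annotation can be illustrated by translating the affected codon before and after the change. This is a minimal sketch under several assumptions: the variant is a single-base substitution, the input is a forward-strand coding sequence that starts in frame, and the standard genetic code applies. The function name `classify_substitution` and the example sequence are invented for illustration; annotation tools work against real transcript models.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(cds: str, index: int, alt: str) -> str:
    """Classify a single-base change at a 0-based index in a coding sequence."""
    codon_start = (index // 3) * 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) != 3:
        raise ValueError("index falls in an incomplete codon")
    if alt == cds[index]:
        raise ValueError("alt base is the same as the reference base")

    # Substitute the alternative base into the affected codon.
    offset = index - codon_start
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]

    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"  # truncates the protein
    if ref_aa == "*":
        return "stop_lost"
    return "missense"         # changes one amino acid


# Made-up coding sequence: ATG GAA AAA TGC TAA (M E K C stop).
cds = "ATGGAAAAATGCTAA"
print(classify_substitution(cds, 5, "G"))   # GAA -> GAG
print(classify_substitution(cds, 3, "A"))   # GAA -> AAA
print(classify_substitution(cds, 11, "A"))  # TGC -> TGA
```

The three calls cover the cases mentioned in the text: a synonymous change, a missense change, and a change that introduces a stop codon.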

This article is from the free online

Clinical Bioinformatics: Unlocking Genomics in Healthcare

Created by
FutureLearn - Learning For Life
