Skip to 0 minutes and 13 seconds The reference genome is an amalgamation of sequencing around 100 individuals; that's the reference we use. And we are not using the latest reference, which is GRCh38, we're using GRCh37. And that was built as a scaffold to allow people to determine what are the most common alleles present within the population. And a lot of that work was done on the Caucasian population. And bear in mind that sometimes certain alleles are more prevalent based on your ethnic background. When we talk about reads, reads can be construed as a very short string of letters, consisting of four different types of bases: A, G, C, T.
Skip to 0 minutes and 53 seconds What we tend to do with each read is try to see where it best fits across a larger string of letters, around 3 billion characters long. So we tend to look at where that read fits, and it fits into one of three different categories: a perfect match, a best match, or no match. Of course, we're not interested in the perfect matches, because what we're really interested in is a best match or no match. And in those regions where you have the best match, we tend to look for the differences from the original reference, or the 3 billion character long string.
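As an illustration of the three categories described here, the following is a minimal Python sketch that slides a read along a short reference string and reports a perfect match, a best match, or no match. It is a toy under stated assumptions, not the aligner used in the video: the function names, the `max_mismatches` threshold and the 20-base reference are all invented for the example, and a brute-force scan like this would not scale to a 3 billion character reference, where an indexed aligner would be needed.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def place_read(read: str, reference: str, max_mismatches: int = 2):
    """Slide the read along the reference and classify its best placement.

    Returns (category, position, mismatches). Position is 0-based.
    The max_mismatches cutoff is an arbitrary assumption for this sketch.
    """
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if best_mm is None or mm < best_mm:
            best_pos, best_mm = pos, mm

    if best_mm is None or best_mm > max_mismatches:
        return ("no match", None, None)
    if best_mm == 0:
        return ("perfect match", best_pos, 0)
    return ("best match", best_pos, best_mm)


# Toy reference, made up for illustration (not real genomic sequence).
reference = "ACGTTGCAAGGCTTACGATC"

for read in ("GCAAGGCT", "GCAAGTCT", "TTTTTTTT"):
    print(read, place_read(read, reference))
```

The second read differs from the reference at a single base, so it is the kind of best match in which a variant might later be found.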
Skip to 1 minute and 39 seconds We then try to see whether those best matches or no matches are within your 114 genes, within your panel. And we then look to see, based on the clinical referral, whether those best matches actually occur within the set of genes that we expect. Once you determine which genes are relevant based on your clinical diagnosis, we look at the variants and where they lie in relation to a gene, and more importantly, where they lie in relation to the transcript. We then determine whether the variant is either a stop (in other words, truncating a protein) or missense, which would be changing a protein.
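The panel step can be pictured as a simple interval lookup: keep only the variants that fall inside a panel gene, and then only the genes relevant to the referral. The sketch below assumes this simplified view. The gene names, coordinates and the `filter_to_referral` helper are hypothetical, made up purely for illustration, and a real panel would be defined per transcript rather than as one interval per gene.

```python
from typing import NamedTuple, Optional


class GeneRegion(NamedTuple):
    gene: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Toy panel: names and coordinates are invented, not real gene positions.
PANEL = [
    GeneRegion("GENE_A", "1", 1000, 5000),
    GeneRegion("GENE_B", "2", 20000, 26000),
    GeneRegion("GENE_C", "X", 300, 900),
]


def gene_for_variant(chrom: str, pos: int, panel=PANEL) -> Optional[str]:
    """Return the panel gene containing this position, or None."""
    for region in panel:
        if region.chrom == chrom and region.start <= pos <= region.end:
            return region.gene
    return None


def filter_to_referral(variants, referral_genes, panel=PANEL):
    """Keep variants lying in a panel gene that is relevant to the referral."""
    kept = []
    for chrom, pos, ref, alt in variants:
        gene = gene_for_variant(chrom, pos, panel)
        if gene is not None and gene in referral_genes:
            kept.append((gene, chrom, pos, ref, alt))
    return kept


# Variants as (chromosome, position, reference base, alternate base).
variants = [
    ("1", 1500, "A", "G"),   # inside GENE_A
    ("2", 19999, "C", "T"),  # just outside GENE_B
    ("X", 450, "G", "A"),    # inside GENE_C, but not in the referral set
    ("3", 10, "T", "C"),     # not in the panel at all
]

print(filter_to_referral(variants, {"GENE_A", "GENE_B"}))
```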
Skip to 2 minutes and 22 seconds We then look to see if there's any prior evidence that would suggest that it's pathogenic. So we tend to use a suite [INAUDIBLE] of tools. And we use existing repositories, like Ensembl, which does a lot of pre-calculations of pathogenicity, if it is a missense. So that tends to aid in specific types of decisions. We also like to use a suite of conservation tools and splice site prediction tools to see whether certain variants affect splicing.
Designing the analysis pipeline
Within the video, provided by the Manchester Centre for Genomic Medicine, Sandy describes the process of aligning the individual short reads back to the human reference genome, and calling variants, which means identifying where there is a different nucleotide in comparison to the reference sequence.
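To make the idea of variant calling concrete, here is a minimal Python sketch that piles aligned reads up over a reference and reports positions where enough reads carry a different nucleotide. It is only an illustration under simplifying assumptions: reads are already aligned without gaps, base quality is ignored, and the `min_alt_reads` and `min_fraction` thresholds are arbitrary values chosen for the example rather than settings from the pipeline described in the video.

```python
from collections import Counter, defaultdict


def call_variants(reference, aligned_reads, min_alt_reads=2, min_fraction=0.3):
    """Report single-base differences supported by the aligned reads.

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position of the first base of an ungapped read.
    Returns tuples of (position, ref_base, alt_base, alt_count, depth).
    """
    # Build a pileup: for every covered position, count each observed base.
    pileup = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        ref_base = reference[pos]
        for base, n in sorted(counts.items()):
            if base != ref_base and n >= min_alt_reads and n / depth >= min_fraction:
                calls.append((pos, ref_base, base, n, depth))
    return calls


# Toy reference and reads, made up for illustration.
reference = "ACGTTGCAAGGCTTACGATC"
aligned_reads = [
    (3, "TTGCAAGG"),  # matches the reference
    (5, "GCAAGTCT"),  # carries T instead of G at position 10
    (7, "AAGTCTTA"),  # carries T instead of G at position 10
    (8, "AGTCTTAC"),  # carries T instead of G at position 10
]

print(call_variants(reference, aligned_reads))
```

Three of the four reads covering position 10 carry a T where the reference has a G, so that position is reported as a variant.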
Finally, he discusses variant annotation, which is the process of discovering where the variant is located in terms of chromosome and gene, and also what kind of variant it is (e.g. synonymous or non-synonymous).
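The "what kind of variant" part of annotation can be sketched by translating the affected codon before and after the change. The example below uses the standard genetic code and classifies a single-base substitution as synonymous, missense or stop gained. It is a simplified sketch: the `classify_snv` function and the short coding sequence are invented for illustration, the sequence is assumed to be an in-frame, forward-strand coding sequence, and real annotation also has to handle introns, strand, multiple transcripts and splice sites.

```python
from itertools import product

# Standard genetic code, built in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(cds: str, index: int, alt: str) -> str:
    """Classify a single-base change at a 0-based index of a coding sequence.

    Assumes cds is in frame, on the forward strand, with a length that is a
    multiple of three (assumptions made for this sketch).
    """
    offset = index % 3
    codon_start = index - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"   # truncates the protein
    if ref_aa == "*":
        return "stop lost"
    return "missense"          # changes one amino acid


# Toy coding sequence: ATG GAA TGG CTG TAA (M E W L stop).
cds = "ATGGAATGGCTGTAA"

print(classify_snv(cds, 5, "G"))  # GAA -> GAG
print(classify_snv(cds, 4, "T"))  # GAA -> GTA
print(classify_snv(cds, 3, "T"))  # GAA -> TAA
```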
© University of Manchester