When genome sequencing just isn’t enough
Although next-generation sequencing (NGS) allows for a comprehensive analysis of DNA from individuals with suspected monogenic disease, it does not always identify the causative mutation. So where are these missing mutations hiding?
The first thing to remember is that, whilst ‘whole genome sequencing’ is a widely used term, it is in fact a misnomer. The most commonly used short-read technologies can only call mutations in 88-95% of the genome. The remaining 5-12% is either not sequenced to a high enough quality to allow mutation detection or is impossible to map because of repetitive DNA sequences. But technology is moving fast, and single-molecule real-time sequencing platforms that generate reads of around 10,000 bases are filling in some of these gaps and identifying new variants.
The other issue is that the number of times each base is ‘read’ (the read depth) varies according to the GC richness of the region, the size of the targeted regions (e.g. small/large gene panel, exome or genome) and the number of sequence reads obtained per patient. At least 20 reads are required to have a 99.9% chance of detecting a heterozygous base substitution.
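The link between read depth and detection can be illustrated with a simple binomial model. The sketch below is an illustration only, not the method behind the 99.9% figure above. It assumes that each read samples either allele of a heterozygous site with equal probability, that sequencing errors and mapping bias are ignored, and that a variant is called only when at least four reads carry it. The function name, the four-read threshold and the example depths are all assumptions made for this example.

```python
from math import comb


def detection_probability(depth, min_alt_reads, alt_fraction=0.5):
    """Chance of seeing at least `min_alt_reads` variant reads at one site.

    depth         -- total number of reads covering the base
    min_alt_reads -- reads that must carry the variant before it is called
                     (hypothetical calling threshold)
    alt_fraction  -- expected share of reads carrying the variant
                     (0.5 for a heterozygous germline variant)
    """
    # Probability that fewer than `min_alt_reads` reads carry the variant.
    p_too_few = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1 - p_too_few


# Compare a few example depths using the assumed four-read threshold.
for depth in (10, 20, 30):
    print(depth, detection_probability(depth, min_alt_reads=4))
```

Under these assumptions a depth of 20 gives a detection probability of roughly 99.9%, whereas a depth of 10 gives only about 83%. Lowering `alt_fraction` shows how quickly sensitivity falls when the variant is present in only a fraction of the cells sampled.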
Can we detect all mutations?
The sensitivity of mutation detection depends upon the mutation type as well as the read depth. Base substitutions (single nucleotide variants: SNVs) are most easily detected, but insertions and deletions (InDels) are more difficult because of capture bias (in targeted methods but not genome sequencing) and mapping issues. Bioinformatic tools to detect copy number variants (CNVs), chromosome rearrangements (structural variants: SVs) and insertions of elements such as LINEs (Long Interspersed Nuclear Elements) are still in their infancy and their sensitivity has not been established.
Are we looking in the right place?
Even if it were possible to accurately call all variants in the entire genome, it is still likely that some mutations would be missed because we might not be looking in the right place. For example, some mutations arise spontaneously after conception (so-called ‘post-zygotic’, ‘somatic’ or ‘acquired’ mutations), which results in varying levels of the mutation between tissues. An example of this is seen in some patients with hyperinsulinaemic hypoglycaemia, where the mutation that causes the unregulated secretion of insulin from the beta cell is confined to the pancreatic tissue. For these individuals, sequencing DNA extracted from blood would not detect the causative mutation. It is therefore important to consider the most appropriate source of DNA for sequencing studies, although in many cases blood is likely to be the only feasible option.
Looking for a needle in a haystack
Failing to identify a causative mutation may also result from difficulties with the interpretation of data rather than a failure in detection. We learnt earlier that a human genome contains around 3-4 million variants. Various strategies can be employed to reduce this number when searching for the causative mutation. However, sometimes these processes lead to the causative mutation being removed from the data set. For example, the control sequencing databases used to filter out common benign polymorphisms also list many pathogenic mutations. This is a particular problem for mutations that are recessively acting, as individuals who are heterozygous for the mutation will be clinically unaffected. It is therefore important to consider the likely carrier frequency of a given disease before setting a ‘cut-off’ for excluding variants present in control populations.
Even once you have correctly filtered your data for common polymorphisms, the number of variants for follow-up analysis is still likely to be large. For example, a patient born to consanguineous parents will have approximately 3,000 rare or novel homozygous variants. So how do you find the disease-causing mutation? Assigning causality becomes easier when you identify a mutation residing within a good biological candidate gene or within a known regulatory region. Co-segregation studies within large pedigrees, or replication studies in unrelated individuals with the same phenotype, are extremely valuable in helping with the interpretation. However, variants are often identified in non-annotated regions of the genome or within a gene of unknown function. Without strong replication studies and/or a plausible hypothesis for the mechanism of disease, it is extremely difficult to provide convincing evidence for disease causality.
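The advice above about considering carrier frequency before setting a cut-off can be made concrete with a Hardy-Weinberg calculation. The sketch below is a rough illustration rather than a recommended filtering pipeline. It assumes random mating, complete penetrance and that a single pathogenic allele accounts for all cases, which gives an upper bound on that allele's frequency. The function names, the `margin` parameter and the incidence of 1 in 40,000 are hypothetical values chosen for the example.

```python
from math import sqrt


def recessive_allele_frequency(disease_incidence):
    """Allele frequency q such that q squared equals the disease incidence."""
    return sqrt(disease_incidence)


def carrier_frequency(disease_incidence):
    """Expected share of unaffected heterozygous carriers, 2q(1 - q)."""
    q = recessive_allele_frequency(disease_incidence)
    return 2 * q * (1 - q)


def passes_frequency_filter(control_allele_frequency, disease_incidence, margin=1.0):
    """Keep a variant unless it is more common in controls than the disease allows.

    `margin` is a hypothetical safety factor; values above 1 loosen the
    cut-off to allow for sampling error in the control database.
    """
    cutoff = margin * recessive_allele_frequency(disease_incidence)
    return control_allele_frequency <= cutoff


incidence = 1 / 40_000  # hypothetical recessive disease incidence
print("maximum allele frequency:", recessive_allele_frequency(incidence))
print("carrier frequency:", carrier_frequency(incidence))
print("keep variant seen at 0.4%:", passes_frequency_filter(0.004, incidence))
print("keep variant seen at 2%:", passes_frequency_filter(0.02, incidence))
```

With this hypothetical incidence, about 1 in 100 people would be expected to carry the mutation, so a filter that removes every variant seen in controls would discard the causative allele. Where several different mutations cause the same disease, each one will be rarer than this upper bound.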
Searching for something that is not there to be found
It is important to remember that genetic disease does not always result from a change in the DNA sequence. A number of diseases are known to result from defects in the methylation status of DNA, an epigenetic mechanism used to control gene expression. These abnormalities in methylation cannot be detected by standard genome sequencing, so a different analysis approach is required. Finally, it is important to keep an open mind about the classification of disease. How certain are you that the disease is monogenic? Is it possible that environmental factors have been at play, or that the phenotype in the patient just represents the extreme tail of normal physiology or of common polygenic disease?
© University of Exeter