Predictive scores

Article introducing predictive scores concept and application

Gene-level predictive scores can be used to assess whether a gene is likely to be associated with disease and to help identify the molecular mechanism. These scores are often calculated from population data.

They are based on the concept that genes that are crucial for the function of an organism will be depleted of variants in these datasets, because detrimental variants are removed from the population by natural selection (i.e. they are not passed to subsequent generations). Non-essential genes will tolerate the accumulation of variants. Genes that have fewer variants than expected are known as constrained genes (Figure 1).

Graphical representation of a ‘Constrained gene’ vs an ‘Unconstrained gene’. Two sets of blocks represent individuals carrying constrained (red set) or unconstrained (green set) genes. In each block (both green and red) a star represents a mutation. Over time, the constrained set accumulates only one star (mutation), while the unconstrained set accumulates several.

Figure 1. The concept of constraint. Most mutations that arise in a constrained gene are detrimental and so are not passed to subsequent generations; over time, few mutations remain within constrained genes. Mutations in an unconstrained gene are tolerated and therefore accumulate over time. Stars represent mutations; lightning bolts indicate mutations that are removed by natural selection.

Multiple predictive scores

Sequence variant data from gnomAD have been used to calculate several different predictive scores, which relate to different molecular mechanisms. It is important to note that the scores are calculated for transcripts, not genes, so a gene with several transcripts can be associated with more than one score. These scores are based on sequence-context mutational models which predict the number of expected rare variants per transcript.

There are two scores which predict haploinsufficiency, i.e. intolerance to heterozygous loss-of-function variation: Probability of Loss-of-function Intolerance (pLI) and Loss-of-function Observed / Expected Upper bound Fraction (LOEUF).

pLI is based on modelling to assign transcripts to one of three categories:

  • Null, where heterozygous or homozygous loss-of-function variation is tolerated.
  • Recessive, where heterozygous variants are tolerated, but homozygous ones are not.
  • Haploinsufficient, where heterozygous loss-of-function variants are not tolerated.

Genes with high pLI scores (pLI ≥ 0.9) fall into the last of these categories and are extremely loss-of-function intolerant. Note that pLI should not be interpreted as a continuous scale.
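The pLI ≥ 0.9 cut-off can be applied with a simple filter. The sketch below is only an illustration: the transcript identifiers and scores are invented, and the function name is hypothetical rather than part of any gnomAD tool.

```python
PLI_THRESHOLD = 0.9  # cut-off quoted in this article


def is_lof_intolerant(pli: float) -> bool:
    """Return True when a transcript's pLI meets the pLI >= 0.9 cut-off."""
    if not 0.0 <= pli <= 1.0:
        raise ValueError("pLI is a probability and must lie between 0 and 1")
    return pli >= PLI_THRESHOLD


# Hypothetical transcripts and scores, invented for illustration only.
example_pli = {"TRANSCRIPT_A": 0.99, "TRANSCRIPT_B": 0.42, "TRANSCRIPT_C": 0.90}

intolerant = sorted(t for t, score in example_pli.items() if is_lof_intolerant(score))
print(intolerant)
```

Because scores are reported per transcript, a filter like this would be run on the transcript of interest rather than on the gene as a whole.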

LOEUF is a conservative estimate of the observed/expected ratio for predicted loss-of-function variants: it is the upper bound of the confidence interval around that ratio. LOEUF scores are continuous. Low scores (closer to zero) indicate strong selection against predicted loss-of-function variation (constrained), and high scores suggest a relatively higher tolerance to inactivation. Genes with a LOEUF score < 0.35 can be considered constrained. LOEUF was introduced when the gnomAD dataset became sufficiently large to support this type of analysis, and gnomAD recommends using this score.
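The idea of a conservative upper bound on the observed/expected ratio can be sketched as follows. This is a minimal illustration that assumes loss-of-function counts follow a Poisson distribution and takes the upper end of a 90% interval over a grid of candidate ratios. It is not presented as gnomAD's exact procedure, and the function names, grid settings and example counts are all hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of k events given mean lam (computed in log space)."""
    if lam <= 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def oe_upper_bound(observed: int, expected: float, alpha: float = 0.05,
                   max_ratio: float = 2.0, steps: int = 2000) -> float:
    """Approximate upper bound of a (1 - 2*alpha) interval for observed/expected.

    Assumption: each candidate ratio r on a grid from 0 to max_ratio is weighted
    by the Poisson probability of the observed count given mean expected * r.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    ratios = [max_ratio * i / steps for i in range(steps + 1)]
    weights = [poisson_pmf(observed, expected * r) for r in ratios]
    total = sum(weights)
    cumulative = 0.0
    for ratio, weight in zip(ratios, weights):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return ratio
    return max_ratio


# Invented counts: few observed variants versus many observed variants.
depleted = oe_upper_bound(observed=2, expected=40.0)
tolerant = oe_upper_bound(observed=30, expected=40.0)
print(depleted < 0.35, tolerant < 0.35)
```

Using the upper bound rather than the raw ratio means that a transcript with few expected variants is not called constrained on the basis of a small, noisy count.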

Predictive scores are also available for missense variants. These are provided as Z-scores, which represent the deviation of the observed variant count from the expected number. Higher Z-scores indicate intolerance to variation (increased constraint), whilst negative scores indicate transcripts that have more variants than expected. A missense constraint Z-score of 3.09 corresponds to a p-value of 10⁻³, so transcripts with a Z-score > 3.09 are considered to be significantly constrained. Z-scores for synonymous variants are also provided by gnomAD.
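The link between the 3.09 threshold and the p-value can be checked numerically. The sketch below assumes the p-value is the one-sided upper tail of a standard normal distribution; the function names are hypothetical.

```python
import math

MISSENSE_Z_THRESHOLD = 3.09  # cut-off quoted in this article


def one_sided_p_value(z: float) -> float:
    """Probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def is_missense_constrained(z: float) -> bool:
    """Return True when a missense Z-score exceeds the 3.09 cut-off."""
    return z > MISSENSE_Z_THRESHOLD


p_at_threshold = one_sided_p_value(MISSENSE_Z_THRESHOLD)
print(f"{p_at_threshold:.4f}")
```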

Copy-number data ascertained by microarrays have also been used to generate the following predictive scores:

  • Probability of haploinsufficiency (pHaplo; i.e., deletion intolerance)
  • Probability of triplosensitivity (pTriplo; i.e., duplication intolerance). This score is most relevant for whole-gene duplications. Where a duplication breakpoint falls within a gene, the function of that gene should be considered disrupted.

These scores were generated by a machine-learning model trained on 145 gene-level features to predict the likelihood that whole-copy loss (i.e. complete deletion) or whole-copy gain (i.e. complete duplication) of each gene would be enriched in a cohort of individuals affected by severe, early-onset diseases as compared to the general population. pHaplo scores ≥0.55 indicate an odds ratio ≥2. pTriplo scores ≥0.68 indicate an odds ratio ≥2.
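The two dosage-sensitivity thresholds can be applied side by side. In the sketch below the gene names and scores are invented for illustration, and the function name is hypothetical.

```python
PHAPLO_THRESHOLD = 0.55   # pHaplo cut-off quoted in this article (odds ratio >= 2)
PTRIPLO_THRESHOLD = 0.68  # pTriplo cut-off quoted in this article (odds ratio >= 2)


def dosage_flags(phaplo: float, ptriplo: float) -> dict:
    """Flag whether each score meets its threshold."""
    return {
        "deletion_intolerant": phaplo >= PHAPLO_THRESHOLD,
        "duplication_intolerant": ptriplo >= PTRIPLO_THRESHOLD,
    }


# Hypothetical genes and scores, invented for illustration only.
example_scores = {"GENE_X": (0.91, 0.30), "GENE_Y": (0.20, 0.75)}
flags = {gene: dosage_flags(*scores) for gene, scores in example_scores.items()}
print(flags)
```

A gene can pass one threshold and not the other, so the two flags are reported separately rather than combined into a single call.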

Alternative predictive scores are available and new models for calculating predictive scores are being developed. Regional constraint metrics, whereby regions of a gene have different scores, are also available.

© Wellcome Connecting Science
This article is from the free online

Interpreting Genomic Variation: Overcoming Challenges in Diverse Populations
