Specific databases for specific countries


What is a genomic database of variants?

A genomic database of variants is a structured collection of genetic variants held in a computer database. It can be accessed in several ways, most often through a web interface.

What are the uses of genomic databases/references?

We will limit our discussion here to the diagnosis and discovery of monogenic (Mendelian) disorders. Allele frequency is the most clinically useful information derived from genomic databases. Variants that cause rare monogenic disorders are themselves rare, and hence are not expected to be seen in healthy populations. This is particularly true if the disease is severe, has high penetrance and has an early age of onset (for example, presenting in childhood). Mode of inheritance is also taken into consideration: variants that result in an autosomal dominant condition are expected to be absent from healthy individuals, whereas variants that underlie autosomal recessive disorders may be seen, very rarely, in healthy carriers (heterozygotes) but are unlikely to be seen in a homozygous state.

The carrier frequencies for several genetic diseases can also be estimated from these data. The ability to diagnose patients and determine carrier status has a multitude of direct benefits for patients and society: treatment, genetic counselling, family planning, precision medicine and prenatal diagnosis.
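As a rough illustration of how a carrier frequency can be estimated from an allele frequency, the sketch below assumes Hardy-Weinberg equilibrium (random mating, no selection). The article does not state this assumption, so it is an addition here, and the function names and the example allele frequency are hypothetical.

```python
def carrier_frequency(allele_frequency: float) -> float:
    """Expected heterozygote (carrier) frequency, 2pq, under Hardy-Weinberg equilibrium."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = allele_frequency      # frequency of the disease allele
    p = 1.0 - q               # frequency of all other alleles
    return 2.0 * p * q


def homozygote_frequency(allele_frequency: float) -> float:
    """Expected frequency of homozygotes, q^2, under the same assumption."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return allele_frequency ** 2


if __name__ == "__main__":
    q = 0.01  # hypothetical allele frequency: 1 in 100 chromosomes
    print(f"carrier frequency:    {carrier_frequency(q):.4f}")
    print(f"homozygote frequency: {homozygote_frequency(q):.6f}")
```

In populations with consanguinity or founder effects the equilibrium assumption may not hold, so figures obtained this way are best treated as approximations.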

Why do we need genomic information for every ethnicity?

Genome databases are often skewed towards the populations that contributed most of the data. gnomAD (v4) is the largest publicly available aggregation of allele frequencies, and 77% of the individuals represented in gnomAD v4 are of European ancestry. This means that most populations are underrepresented, including those that make up the majority of the world’s population. Disease-causing variants are themselves known to be population-specific, for both common and rare diseases. This inequality in genetic data hampers the diagnosis of rare diseases across all populations.

How do databases of diverse populations help in determining the variant pathogenicity for monogenic diseases?

When assessing the pathogenicity of a variant, we generally assume that a disease-causing variant would not occur in unaffected individuals, especially for severe conditions with high penetrance. In practice, this means checking the allele and genotype frequencies of the variant in the population.

  • A variant that is more frequent in a local population than expected for the disease is likely to be benign. The absence of a variant from a database, however, does not by itself favour disease causation.
  • In gnomAD, an average individual carries about 200 rare coding variants (<0.1%) in his/her exome. The number of novel coding variants is higher in non-Europeans, in line with their poor representation, and more evidence is needed to rule out pathogenicity for these variants. This underscores the need to aggregate variant data from non-European populations to improve the diagnosis of rare monogenic disorders in these populations.
  • Alleles causing Mendelian diseases should be rare in all ethnicities (they do not discriminate). If a population is underrepresented, more variants in that population are likely to be labelled ‘possibly disease-causing’. Also, some variants that are rare in the most represented (European) population might be common in other populations and are likely to be incorrectly assigned a pathogenic score. Hence it is crucial to have a wider representation of all populations in reference databases.
  • Variants with an allele frequency of less than 1% in gnomAD would usually be prioritised for interpretation. Filtering out variants that occur at a high frequency within the same population as the patient can reduce the number of variants considered from 200 to 50.
  • Several non-benign variants in ClinVar can be reclassified as benign if they are seen in a healthy local population. This is most effective for autosomal dominant conditions with high penetrance, as the occurrence of the variant in even a single healthy individual would be an argument against pathogenicity.
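The frequency-based filtering described in the list above can be sketched as follows. This is a minimal illustration, not a clinical pipeline: the `Variant` record, its field names and the example frequencies are hypothetical, and the only value taken from the text is the 1% cut-off. A variant missing from a database is stored as `None` and passes the filter, which reflects the point that absence is not evidence of pathogenicity; it only means the variant cannot be excluded on frequency.

```python
from dataclasses import dataclass
from typing import List, Optional

RARE_THRESHOLD = 0.01  # the 1% allele-frequency cut-off mentioned in the text


@dataclass(frozen=True)
class Variant:
    name: str                   # hypothetical label, not a real variant
    gnomad_af: Optional[float]  # allele frequency in gnomAD, None if absent
    local_af: Optional[float]   # allele frequency in the patient's own population


def is_rare(af: Optional[float], threshold: float = RARE_THRESHOLD) -> bool:
    """Treat a variant that is absent from a database as rare."""
    return af is None or af < threshold


def prioritise(variants: List[Variant]) -> List[Variant]:
    """Keep variants that are rare in gnomAD and in the local population."""
    return [v for v in variants if is_rare(v.gnomad_af) and is_rare(v.local_af)]


variants = [
    Variant("var_A", gnomad_af=0.0002, local_af=0.04),    # rare globally, common locally
    Variant("var_B", gnomad_af=None, local_af=None),      # absent from both databases
    Variant("var_C", gnomad_af=0.15, local_af=0.12),      # common everywhere
    Variant("var_D", gnomad_af=0.0005, local_af=0.0003),  # rare everywhere
]

print([v.name for v in prioritise(variants)])  # prints ['var_B', 'var_D']
```

Here `var_A` would have been kept if only gnomAD had been consulted, and it is the local database that removes it from consideration.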

Here are some important databases:

Database | Individuals | Comments | Reference
The Genome Aggregation Database (gnomAD) | 807,162 (730,947 exomes and 76,215 genomes) | The largest and the most popular database | A genomic mutational constraint map using variation in 76,156 human genomes
The 1000 Genomes Project Consortium | 2504 | Healthy individuals from 26 populations | A global reference for human genetic variation
Greater Middle East Variome | 1111 | No comments | Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery
Singapore Genome Project | 4810 | Singapore Chinese, Malays, and Indians | Large-Scale Whole-Genome Sequencing of Three Diverse Asian Populations in Singapore
GenomeAsia 100K Project | 1739 | 219 population groups and 64 countries across Asia | The GenomeAsia 100K Project enables genetic discoveries across Asia
Exomes of the Indians with rare diseases | 1455 | Data compiled from patients with rare diseases and their family members | A data set of variants derived from 1455 clinical and research exomes is efficient in variant prioritization for early-onset monogenic disorders in Indians
Genome of the Netherlands | 769 individuals from 250 families | No comments | The Genome of the Netherlands: design, and project goals
Japanese population reference panel (1KJPN) | 1070 | No comments | Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals
Collaborative Spanish Variant Server | 267 | No comments | 267 Spanish Exomes Reveal Population-Specific Differences in Disease-Related Genetic Variation
Human genetic variation database | 1208 exomes | Additional 3248 genotype arrays | Human genetic variation database, a reference database of genetic variations in the Japanese population
Korean Variant Archive (KOVA) | 1055 | No comments | Korean Variant Archive (KOVA): a reference database of genetic variations in the Korean population
ABraOM | 609 | Brazilian database | Exomic variants of an elderly cohort of Brazilians in the ABraOM database
SweGen | 942 | Swedish Genomes | SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish population
Kuwaiti exome variants | 291 | No comments | Assessment of coding region variants in Kuwaiti population: implications for medical genetics and population genomics
Vietnamese human genetic variation database | 105 genomes and 200 exomes | No comments | A Vietnamese human genetic variation database
Iranome | 800 | No comments | Iranome: A catalogue of genomic variations in the Iranian population
Italian genomic variation | 926 | No comments | A bird’s-eye view of Italian genomic variation through whole-genome sequencing
Finnish isolates | 19292 | No comments | Exome sequencing of Finnish isolates enhances rare-variant association power
Italian exomes | 1686 | No comments | Functional and clinical implications of genetic structure in 1686 Italian exomes

Caution while using these databases

Do not assume that population databases include only data on healthy individuals; they are known to contain several pathogenic variants. The penetrance and age of onset of the disease need to be considered when assessing the allele frequency. Population databases can also contain more than one member of the same family, which skews the allele frequencies. Finally, do not forget to check the quality of the variants in such resources, to avoid considering poor-quality variants and variants in pseudogenes.
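To make the point about family members concrete, the sketch below computes an allele frequency for a small invented cohort, first with all samples and then with one sample per family. The cohort, the sample and family identifiers, and the "keep the first listed member" rule are all hypothetical. Real projects usually infer relatedness from the genotype data rather than from family labels, so this rule is only a stand-in.

```python
from typing import Dict, List, Tuple

# (sample_id, family_id, alternate_allele_count) for one autosomal variant
Sample = Tuple[str, str, int]


def allele_frequency(samples: List[Sample]) -> float:
    """Alternate alleles divided by total alleles (two per diploid sample)."""
    if not samples:
        raise ValueError("at least one sample is required")
    alt_alleles = sum(count for _, _, count in samples)
    return alt_alleles / (2 * len(samples))


def one_per_family(samples: List[Sample]) -> List[Sample]:
    """Keep only the first listed member of each family."""
    first_seen: Dict[str, Sample] = {}
    for sample in samples:
        first_seen.setdefault(sample[1], sample)
    return list(first_seen.values())


cohort: List[Sample] = [
    ("s1", "famA", 1),  # parent, heterozygous
    ("s2", "famA", 1),  # child, inherited the same allele
    ("s3", "famA", 1),  # child, inherited the same allele
    ("s4", "famB", 0),
    ("s5", "famC", 0),
    ("s6", "famD", 0),
]

print(allele_frequency(cohort))                  # 3 / 12 = 0.25
print(allele_frequency(one_per_family(cohort)))  # 1 / 8  = 0.125
```

In this toy cohort, counting all three members of one family doubles the apparent frequency of the variant, which is the kind of skew the caution above refers to.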

Why is equity in human genomics important to all ethnicities?

Mendelian diseases are caused by pathogenic variants irrespective of the ethnicity in which they occur. Similarly, a variant that occurs frequently in a small ethnic group will be benign in all populations. If databases are not inclusive, a rare variant might be mistakenly classified as pathogenic, even in large populations. Moreover, different mutations in the same gene might be responsible for the same disease in different populations (exemplified by founder mutations in consanguineous populations). For instance, cystic fibrosis is commonly caused by one mutation in patients of European descent (deltaF508, c.1521_1523delCTT) and by a different one in patients of African descent (3120+1G>A, c.2988+1G>A). Under-representation of diverse populations in genomic databases thus limits our ability to fully understand the genetic architecture of rare and complex human diseases, and also exacerbates health inequalities.

Have you collaborated on similar projects or do you know other projects that are not on the table? Share with us in the comments.

© Wellcome Connecting Science
This article is from the free online course Interpreting Genomic Variation: Overcoming Challenges in Diverse Populations

