Figure: Primary databases diagram, with INSDC in the centre and DDBJ, ENA/EBI and NCBI in an outer ring
Primary Databases

Introduction to primary databases

In this article, you will learn about primary databases and their importance in storing and making sequence data available.

Primary databases (also known as data repositories) are highly organised, user-friendly gateways to the huge amount of biological data produced by researchers around the world. They were first developed in the 1980s and 1990s to store experimentally determined DNA and protein sequences. At that time, proteins were sequenced one amino acid at a time and DNA sequencing was in its infancy, so the repositories held a limited number of sequences. With the arrival of automated DNA sequencing, however, these data banks started to grow exponentially. Nowadays, sequences are submitted by individual laboratories as well as “in bulk” by sequencing centres around the world, and DNA submissions greatly outnumber protein sequence submissions. Most protein sequences found in databases are the product of conceptual translation of genes and genomes determined by DNA sequencing, rather than of direct protein sequencing.
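To make “conceptual translation” concrete, here is a minimal Python sketch that reads a DNA coding sequence three bases at a time and looks up each codon in the standard genetic code. The function name `translate` and the example sequence are illustrative assumptions rather than part of any database's tooling, and real annotation pipelines also handle alternative start codons, both strands and all reading frames.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# Bacterial translation table 11 assigns the same amino acids to these codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Conceptually translate a coding DNA sequence into a protein sequence.

    Translation starts at the first base, stops at the first stop codon,
    and ignores any incomplete codon at the end. Codons containing
    characters other than A, C, G or T are reported as 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical toy sequence: ATG GCC AAA TAA
    print(translate("ATGGCCAAATAA"))
```

A protein entry derived this way has never been observed directly at the protein level; it is a prediction based on the deposited DNA sequence.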

There are three nucleotide repositories or primary databases for the submission of nucleotide and genome sequences:

  • GenBank, hosted by the National Center for Biotechnology Information (NCBI).

  • The European Nucleotide Archive (ENA), hosted by the European Bioinformatics Institute (EMBL-EBI), part of the European Molecular Biology Laboratory (EMBL).

  • The DNA Data Bank of Japan (DDBJ), hosted by the National Institute of Genetics (NIG).

Together they form the International Nucleotide Sequence Database Collaboration (INSDC) and, luckily for users, they all “mirror” each other. This means that, irrespective of where a sequence is submitted, the entry will appear in all three databases.

Once data are deposited in primary databases, they can be accessed freely by anyone around the world. For example, suppose researchers are working on a Staphylococcus aureus strain that was isolated from a patient. After some investigation, they suspect that this strain might be genetically different from previously identified strains. They decide to sequence it and, after comparing its sequence with the DNA sequences already deposited in the public repository (the “known” strains), they conclude that their strain is indeed different. The research community benefits from having this new sequence in the public repository: the next time a researcher finds a similar strain, they will be able to tell whether their isolate is novel or related to strains that have already been sequenced.
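The comparison step in this example can be illustrated with a deliberately simplified Python sketch. It assumes the sequences have already been aligned and have equal length, and it scores similarity as percent identity. The strain names and sequences are made up for illustration, and the helper names `percent_identity` and `closest_known` are hypothetical. In practice, researchers would use dedicated alignment and search tools against the full public databases rather than a position-by-position comparison like this.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Return the percentage of positions at which two aligned,
    equal-length sequences carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


def closest_known(query: str, known: dict) -> tuple:
    """Return (name, percent identity) of the known sequence most similar
    to the query. `known` maps strain names to aligned sequences."""
    scored = ((name, percent_identity(query, seq)) for name, seq in known.items())
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up toy data standing in for sequences from a public repository.
    known_strains = {
        "strain_A": "ACGTACGA",
        "strain_B": "TTTTACGT",
    }
    new_isolate = "ACGTACGT"
    name, identity = closest_known(new_isolate, known_strains)
    print(f"Closest known strain: {name} ({identity:.1f}% identity)")
```

An isolate that matches no deposited sequence closely is a candidate novel strain, which is exactly the case in which depositing it helps later researchers.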

The accumulation of collective knowledge in public databases gives individuals and institutions rapid and efficient access to data. Rapid identification of a virulent strain of a microbial pathogen from its sequence, together with the sharing of results and experience among researchers and clinicians, could help put restrictions in place to prevent the pathogen from spreading in the community. In other situations, correct identification of the disease-causing pathogen can guide the choice of antibiotics, enabling a better and quicker resolution of the disease.

This article is from the free online course:

Bacterial Genomes: From DNA to Protein Function Using Bioinformatics

Wellcome Genome Campus Advanced Courses and Scientific Conferences