
No database is perfect: beware errors in sequence databases

The impact of poor data sharing

Public and private initiatives to implement and extend sequencing capability globally, particularly in developing countries, are helping to fill blind spots in SARS-CoV-2 genetic information in Latin America, the Caribbean, and Africa. For instance, Brazil has increased its sequencing capacity through both public and private initiatives and is now among the 20 countries that contribute the most SARS-CoV-2 sequences to the GISAID (Global Data Science Initiative) database.

However, an effective genomic surveillance system requires not only wide sampling but also a real effort to share the generated data truthfully and accurately in public databases, so that researchers can track variants and their mutations in real time and carry out epidemiological and biological studies.

As the COVID-19 pandemic evolves, public databases have been inundated with a massive amount of data. SARS-CoV-2 genomic information is mostly available in three main databases: NCBI GenBank, GISAID's EpiCoV database, and the EMBL-EBI COVID-19 Data Portal. The raw sequencing reads can also be deposited in dedicated archives, such as those of the International Nucleotide Sequence Database Collaboration (INSDC). Automated pipelines for rapid deposition and processing of new genomic information have become available and help researchers submit very large files more quickly and easily than before.

Before a new sequence is released, all submitted data are reviewed by a team of curators. Some basic information about each sequence is also mandatory in all databases, such as the name of the virus, the sampling location, and the name and contact details of the submitter. While raw data can be submitted to some databases, the GISAID platform, the most accessed database for COVID-19 genomic information, accepts only the final assembly (not the raw reads). Automated assembly of reads always involves an interpretation of the inevitable errors generated during the sequencing process. The sequencing and assembly processes can also be biased by technical artefacts.

In most cases, the errors are relatively minor, such as a couple of nucleotide substitutions ("sequencing artefacts") or uncovered regions that are replaced by a run of "N" characters (for example, "NNN"). In other cases, however, the errors are more significant, and using such data can lead to erroneous interpretations and even inconsistent conclusions. For example, frameshift mutations (where one or more nucleotides are inserted or deleted in a way that shifts the triplet reading frame) are easily detected in an assembled genome, whereas erroneous missense substitutions or in-frame indels (which involve a multiple of 3 nucleotides and do not change the reading frame) are almost impossible to recognise. Last, but not least, there are also "annotation" errors. There is no easy way to detect misclassifications such as an incorrect geographic location of sampling, an incorrect sampling date, or even incorrect clinical information associated with the sequences.
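As a rough illustration of the two checks that are easy to automate, the Python sketch below summarises runs of "N" in an assembled sequence and classifies an indel by whether its length is a multiple of 3. The function names and the returned fields are hypothetical and chosen only for this example; a real pipeline would work from an alignment against a reference genome and would only apply the indel rule inside coding regions.

```python
import re


def n_summary(seq: str) -> dict:
    """Summarise uncovered positions (runs of 'N') in an assembled genome."""
    seq = seq.upper()
    # Each run is reported as (0-based start position, run length).
    runs = [(m.start(), m.end() - m.start()) for m in re.finditer(r"N+", seq)]
    n_count = sum(length for _, length in runs)
    return {
        "length": len(seq),
        "n_count": n_count,
        "n_fraction": n_count / len(seq) if seq else 0.0,
        "n_runs": runs,
    }


def classify_indel(length: int) -> str:
    """Classify an insertion or deletion in a coding region by its length.

    A length that is a multiple of 3 keeps the reading frame ("in-frame");
    any other length shifts it ("frameshift").
    """
    if length <= 0:
        raise ValueError("indel length must be a positive integer")
    return "in-frame" if length % 3 == 0 else "frameshift"


if __name__ == "__main__":
    print(n_summary("ACGTNNNACGN"))
    print(classify_indel(1), classify_indel(3))
```

A frameshift flagged this way is a prompt to re-inspect the reads, not proof of an error. An in-frame indel or a missense substitution passes both checks unnoticed, which is why such errors are so hard to catch.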

We are in the era of automation, of next-generation sequencing and bioinformatics advances, but it is still our responsibility to ensure the quality of the data we generate. Robust quality control processes are required before uploading to data repositories, and caution should be taken when using data obtained from such sources. For any surprising or impactful finding, the possibility of inaccurate sequence data or metadata should always be considered before publication.
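One small part of such quality control can be sketched in Python: checking before upload that the basic metadata is present and that the sampling date is plausible. The field names below are hypothetical and do not correspond to the submission template of any particular database. The check only catches missing or impossible values; it cannot tell whether a plausible-looking location or date is actually correct.

```python
from datetime import date
from typing import Optional

# Hypothetical field names, chosen for illustration only.
REQUIRED_FIELDS = ("virus_name", "location", "submitter", "submitter_contact")


def check_metadata(record: dict, today: Optional[date] = None) -> list:
    """Return a list of problems found in one metadata record."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")

    raw_date = record.get("collection_date")
    try:
        collected = date.fromisoformat(str(raw_date).strip())
    except ValueError:
        problems.append("collection_date is not a valid ISO date")
    else:
        # A sample cannot have been collected after the day of the check.
        if collected > (today or date.today()):
            problems.append("collection_date is in the future")
    return problems


if __name__ == "__main__":
    record = {
        "virus_name": "example-virus-name",
        "location": "",
        "submitter": "A. Researcher",
        "submitter_contact": "a.researcher@example.org",
        "collection_date": "2021-03-05",
    }
    print(check_metadata(record))
```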

© COG-Train
This article is from the free online course From Swab to Server: Testing, Sequencing, and Sharing During a Pandemic

Created by
FutureLearn - Learning For Life
