Data quality metrics

Article discussing the importance of data quality

Large-scale sequencing – a limitation to the quality

Since the pandemic began, multiple consortia and public-private collaborations have been established to sequence and publish SARS-CoV-2 genomes in real time. As of 19 May 2022, 10,900,329 genome sequences were available on GISAID, the largest number of whole-genome sequences generated for any organism to date. Although whole-genome sequencing (WGS) is being carried out at an ever-increasing scale, there is no universal agreement among consortia on the quality criteria to apply when sequencing SARS-CoV-2 or when using publicly available genomes for analysis.

Multiple laboratories, protocols and strategies are impacting the data generated

Currently, several sequencing strategies and multiple protocols are in use, each typically specific to a consortium or to a country as a whole. Public health laboratories follow their own sample selection criteria, library preparation methods, sequencing platforms, bioinformatics workflows and data interpretation practices. The result is inconsistent data quality standards and biases among the SARS-CoV-2 sequences held in public databases.

Illustration with a central circle with “data quality” printed in it. In the periphery, 4 other circles are equidistantly positioned from the central one. The words “accurate”, “complete”, “consistent” and “timely” are indicated in the peripheral circles

Figure 1 – Data quality should have the following attributes: accuracy, completeness, consistency and timeliness. Source: Response Source

While these issues are not unique to public health laboratories, providing basic guidance on SARS-CoV-2 sequencing and data sharing practices will improve coordination among laboratories that have been conducting sequencing and set expectations for public health laboratories that are currently expanding.

Following WHO standards is key

To provide clear expectations and basic standardisation across public health laboratories, WHO recommendations and protocols should be followed worldwide. The WHO recommendations on data quality and data sharing are discussed below.

Submission to multiple databases

Submitting to multiple public databases ensures that the public health community and the broader research community have access to SARS-CoV-2 sequencing data. To support this effort, public health laboratories are encouraged to submit SARS-CoV-2 consensus assemblies to the GISAID EpiCoV™ repository and to GenBank, which is hosted by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and to deposit the underlying raw sequencing reads in the NCBI Sequence Read Archive (SRA).

Decorative illustration of the sequencing workflow from generating sequencing data to uploading it into public databases such as GISAID and NCBI

Figure 2 – Workflow from generating sequencing data to uploading it into public databases.

Best practices for quality control

Quality control (QC) is the essential first step in the analysis of sequencing data and should be completed before the data are used in any study. Several tools can evaluate assemblies against basic quality metrics and help detect ambiguous bases, indels and frameshifts, including the Nextclade QC metric feature, CoV-GLUE and Pangolin. QC can be carried out at several stages to identify characteristics that may be linked to low-quality sequences.

  • Remove sequences with ambiguous bases, indels or frameshifts, as identified from the unaligned or aligned sequences.
  • As a first pass, it may be appropriate to remove sequences with more than 10% Ns in the regions of interest.
  • Investigate, and usually remove, sequences with suspected underlying sequencing errors (for example, those induced by misassemblies).
  • Sequencing errors can manifest as high divergence from other sequences, or as high numbers of substitutions in short regions, which may indicate local misassemblies.
  • High numbers of non-ACGTN bases may indicate mixed viral populations resulting from contamination.
  • Sequences should have at least 90% genome coverage.
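
The checks listed above can be roughly sketched in Python. This is a minimal illustration under stated assumptions, not a replacement for Nextclade, CoV-GLUE or Pangolin. The 10% N and 90% coverage thresholds come from the list above. Everything else is a hypothetical choice made for this example: the function names, the definition of coverage as the fraction of reference positions with an unambiguous A, C, G or T call, the ambiguity limit and the window size.

```python
from collections import Counter

# IUPAC ambiguity codes other than N (i.e. non-ACGTN bases).
AMBIGUITY_CODES = "RYSWKMBDHV"

# Thresholds taken from the checklist above.
MAX_N_FRACTION = 0.10
MIN_COVERAGE = 0.90


def qc_metrics(seq, reference_length):
    """Return simple QC metrics for one consensus sequence.

    Assumption: coverage is approximated as the number of called
    A/C/G/T bases divided by the reference genome length.
    """
    seq = seq.upper().replace("-", "")  # drop alignment gaps, if any
    counts = Counter(seq)
    called = sum(counts[base] for base in "ACGT")
    return {
        "n_fraction": counts["N"] / len(seq) if seq else 1.0,
        "ambiguous_bases": sum(counts[base] for base in AMBIGUITY_CODES),
        "coverage": called / reference_length,
    }


def passes_qc(metrics, max_ambiguous=10):
    """Apply the thresholds; max_ambiguous is an arbitrary placeholder."""
    return (
        metrics["n_fraction"] <= MAX_N_FRACTION
        and metrics["coverage"] >= MIN_COVERAGE
        and metrics["ambiguous_bases"] <= max_ambiguous
    )


def max_substitutions_in_window(aligned_seq, aligned_ref, window=100):
    """Largest number of substitutions found in any window of the alignment.

    Only positions where both sequences carry A, C, G or T are compared,
    so Ns, gaps and ambiguity codes are not counted as substitutions.
    """
    if len(aligned_seq) != len(aligned_ref):
        raise ValueError("sequences must be aligned to the same length")
    mismatch = [
        int(q != r and q in "ACGT" and r in "ACGT")
        for q, r in zip(aligned_seq.upper(), aligned_ref.upper())
    ]
    current = sum(mismatch[:window])
    best = current
    # Slide the window one position at a time, updating the running count.
    for i in range(window, len(mismatch)):
        current += mismatch[i] - mismatch[i - window]
        best = max(best, current)
    return best
```

A sequence that fails passes_qc, or that shows an unusually high count from max_substitutions_in_window, would be a candidate for investigation rather than automatic removal, in line with the guidance above. An appropriate window size and substitution limit would need to be chosen for the dataset in question.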

Further information:
WHO Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health

© COG-Train
This article is from the free online course Making sense of genomic data: COVID-19 web-based bioinformatics.

Created by
FutureLearn - Learning For Life
