# Why high-quality data is important

Article discussing data quality control

High-throughput sequencing methods have made a growing number of biological questions tractable. These technologies continue to advance and, most recently, long-read sequencing has expanded its applications in genomics, transcriptomics, and metagenomics by overcoming earlier accuracy constraints. Because these platforms generate huge amounts of data, that data must be curated before any downstream analysis. The most common raw data format in next-generation sequencing is FASTQ, identified by the FASTQ file extension. Files in this format are produced by the sequencing platform and contain the sequence data together with a quality score for each base.

## FASTQ Format

Each read in a FASTQ file is described by four lines:

1) A header line that begins with @ and, for Illumina data, carries the unique instrument name, run ID, flow cell ID, flow cell lane, tile number, X and Y coordinates of the cluster in the tile, the member of the pair (1 or 2, for paired-end or mate-pair reads), whether the read is filtered (Y – filtered, N – not filtered), the control status (0 when none), and the index sequence
2) The bases of the read (A, T, G and C, with N for an uncalled base)
3) A separator line that begins with a + sign
4) The quality score of each base, encoded as ASCII characters
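The four-line layout can be illustrated with a minimal Python sketch. It assumes an uncompressed file with exactly four lines per record and an Illumina-style header of the form described in item 1; the names `FastqRecord`, `read_fastq` and `parse_illumina_header`, and the example read itself, are made up for illustration.

```python
import io
from typing import Dict, Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1, without the leading "@"
    sequence: str  # line 2, the bases of the read
    quality: str   # line 4, ASCII-encoded quality string


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def parse_illumina_header(header: str) -> Dict[str, object]:
    """Split an Illumina-style header into the fields listed above."""
    location, _, flags = header.partition(" ")
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = flags.split(":")
    return {
        "instrument": instrument,
        "run_id": run_id,
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),              # 1 or 2: member of the pair
        "is_filtered": filtered == "Y",
        "control": int(control),
        "index": index,
    }


# Hypothetical single-read example, not taken from a real run.
example = io.StringIO(
    "@INSTR1:42:FLOWCELL1:1:1101:1000:2000 1:N:0:ACGTAC\n"
    "GATTNACA\n"
    "+\n"
    "IIII#III\n"
)
records = list(read_fastq(example))
fields = parse_illumina_header(records[0].header)
print(records[0].sequence, fields["lane"], fields["is_filtered"])
```

In practice a dedicated parser (and gzip handling) would usually replace this, since real files may deviate from the strict four-line layout.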

## Quality control

Quality control generally involves counting the reads, checking GC content, identifying overrepresented sequences, discarding reads with Ns (uncalled bases), removing adaptor sequences, and removing or trimming low-quality bases. These steps are performed before any downstream process to reduce the likelihood of bias in variant calling and/or de novo assembly.
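As a rough illustration of the first few checks, the sketch below counts reads, discards those containing Ns, and reports the mean GC content of the remainder. The function names and the toy reads are assumptions made for this example; adaptor removal and quality trimming are left to dedicated tools.

```python
from typing import Dict, Iterable, List, Tuple


def gc_content(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def summarise_reads(sequences: Iterable[str]) -> Tuple[Dict[str, float], List[str]]:
    """Count reads, drop those containing N, and average GC over the rest."""
    total = 0
    kept: List[str] = []
    for seq in sequences:
        total += 1
        if "N" in seq.upper():
            continue  # read contains an uncalled base: discard
        kept.append(seq)
    mean_gc = sum(gc_content(s) for s in kept) / len(kept) if kept else 0.0
    stats = {"total_reads": total, "reads_without_n": len(kept), "mean_gc": mean_gc}
    return stats, kept


# Toy reads, for illustration only.
stats, kept = summarise_reads(["ACGT", "GGCC", "ACNT", "AATT"])
print(stats)
```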

Several tools perform quality checks on reads, such as FastQC, Scythe and Sickle, but FastQC is the most commonly used. It runs a series of tests and produces basic statistics, flagging each test as pass (green tick), warning (yellow) or fail (red). Per Tile Sequence Quality shows a grid of cells whose colour indicates the read quality on each tile. Per Base Sequence Content gives the percentage of A, T, G and C at each position in the reads. Per Base N Content gives the percentage of Ns at each position; reads with Ns are normally discarded. Sequence Duplication Levels shows the level of duplication; a high level suggests bias in the enrichment step. Adapter Content shows the presence or absence of adapters.
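To make the per-base content and N content statistics concrete, here is a small sketch that tallies the percentage of each base at every read position. It is not FastQC's implementation, only an illustration of the idea; `per_base_content` and the toy reads are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def per_base_content(sequences: Iterable[str]) -> List[Dict[str, float]]:
    """Return one dict per read position mapping base -> percentage of reads."""
    counts: List[Counter] = []
    for seq in sequences:
        for pos, base in enumerate(seq.upper()):
            if pos == len(counts):
                counts.append(Counter())  # first read to reach this position
            counts[pos][base] += 1
    content = []
    for counter in counts:
        depth = sum(counter.values())  # reads that cover this position
        content.append({base: 100.0 * counter[base] / depth for base in "ACGTN"})
    return content


# Toy reads of unequal length, for illustration only.
content = per_base_content(["ACGT", "ACGA", "ANGT", "AC"])
for pos, row in enumerate(content, start=1):
    print(pos, row)
```

The `N` entry of each row corresponds to the per-base N content at that position.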

[Figure: Read length/GC content check > Removing reads with “N” > Cross-species contamination check > Adapter/primer removal. The output is high-quality reads for downstream analysis.]

Figure 1 – Flowchart showing the steps involved in the pre-processing Quality Control step.

## Quality Scores

Sequencing quality scores measure the probability that a base is called incorrectly. With sequencing by synthesis (SBS) technology, each base in a read is assigned a quality score by a phred-like algorithm, similar to that originally developed for Sanger sequencing experiments.

### Q Score Definition

The sequencing quality score of a given base, Q, is defined by the following equation:

$$Q = -10 \log_{10}(e)$$

where e is the estimated probability of the base call being wrong.

Higher Q scores indicate a smaller probability of error. Lower Q scores can result in a significant portion of the reads being unusable.
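The relationship between Q and e can be checked numerically with a short sketch. The conversion functions follow the definition above directly; the ASCII decoding assumes the Phred+33 offset, which is common in current FASTQ files but is not stated in this article, so treat the `offset` default as an assumption.

```python
import math
from typing import List


def error_probability(q: float) -> float:
    """Invert Q = -10 * log10(e) to get the error probability e."""
    return 10 ** (-q / 10)


def phred_score(e: float) -> float:
    """Compute Q from the estimated error probability e."""
    return -10 * math.log10(e)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to integer Q scores (offset assumed)."""
    return [ord(ch) - offset for ch in quality]


for q in (10, 20, 30):
    e = error_probability(q)
    print(f"Q{q}: error probability {e:.4f}, accuracy {100 * (1 - e):.1f}%")

scores = decode_quality("I#5")
print(scores)
```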

All quality scores are expressed on the Phred scale (Q), which makes it possible to identify low-quality bases. Each base call has a corresponding quality score that encodes the estimated likelihood that the call is incorrect.

Table 1 – Relationship between sequencing quality score and base call accuracy

| Quality score | Probability of incorrect base call | Inferred base call accuracy |
|---|---|---|
| 10 (Q10) | 1 in 10 | 90% |
| 20 (Q20) | 1 in 100 | 99% |
| 30 (Q30) | 1 in 1000 | 99.9% |

Quality scores may become unreliable in regions with extreme GC bias, specific sequence patterns, or homopolymers. Accurate base qualities are a crucial requirement for accurate assembly or variant calls. As a general rule for Illumina short reads, data below Q20 is not considered valuable and should be removed. For long reads from PacBio or Nanopore, the threshold is around Q10, as the error rate of these technologies is higher.
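One simple way to apply such thresholds is to filter reads on their mean quality, as in the sketch below. This is only one possible criterion (tools may instead trim low-quality ends or average error probabilities rather than Q scores), and the Phred+33 offset and the function names are assumptions made for this example.

```python
def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Q scores in an ASCII quality string."""
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores)


def passes_quality(quality: str, min_q: float = 20.0, offset: int = 33) -> bool:
    """True if the read's mean Q score reaches the chosen threshold."""
    return bool(quality) and mean_quality(quality, offset) >= min_q


# "I" decodes to Q40, "+" to Q10 and "#" to Q2 under the assumed offset.
print(passes_quality("IIII"))            # short-read threshold (Q20)
print(passes_quality("++++"))            # fails at Q20
print(passes_quality("++++", min_q=10))  # passes a long-read threshold (Q10)
```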

## MultiQC

When analysing multiple samples, checking QC for each one individually is time-consuming. MultiQC addresses this by aggregating reports from many samples and tools, including FastQC, into a single output.

## Contamination with other species

Errors during sampling and DNA extraction may introduce contaminants into the reads. Several tools can now swiftly search reads and assign them to specific species or taxonomic groups; Kraken, Kaiju and Centrifuge are commonly used to identify such contaminants. A Kraken analysis, for example, reports the top species for each isolate, along with the percentages attributed to any other species.

Figure 2 – Per base sequence quality, which plots the Q-scores of the raw sequence reads as a box plot for each cycle. Higher is always better, and a characteristic quality decay is seen in most runs.

Figure 3 – Per base sequence content, which plots the proportion of each base at each cycle. In a random fragment library from a “normal” genome you would expect to see all four bases equally represented. Deviation from normal base content can indicate issues with library quality but, equally, some genomes are very GC biased and some NGS applications also introduce a strong GC bias, e.g. Bis-seq.