
Data cleaning and quality control

Article discussing data quality control and its tools
Cartoon of a robot at a desk kicking out a human
© COG-Train

Machines may be less error-prone than humans, but even machines can make mistakes. At the very least, they are only as good as the person or team who programmed them.

The "garbage in, garbage out" (GIGO) concept therefore also applies to bioinformatics, or in our case, to sequencing data.

What is sequenced biological data?

After RNA and/or DNA has been extracted from your biological sample, it is sent for sequencing, which returns a FASTQ file that is both human- and machine-readable. From this file, you can obtain the lengths of the reads (the stretches of nucleotides that were sequenced), the actual nucleotide bases within each read, and the quality score assigned to each base.
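To make the structure of a FASTQ file concrete, here is a minimal Python sketch that reads records from one. It assumes the common four-lines-per-record layout (header, sequence, a "+" separator, quality string) with no line wrapping or blank lines; the two records in the example are made up for illustration, and the function name read_fastq is our own.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) for each FASTQ record.

    Assumes exactly four lines per record and no blank lines.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


# Two invented reads, standing in for a real FASTQ file.
example = StringIO("@read1\nACGTN\n+\nIIII#\n@read2\nGGC\n+\n5II\n")
records = list(read_fastq(example))

for read_id, sequence, quality in records:
    print(read_id, "length:", len(sequence), "bases:", sequence, "quality:", quality)
```

Each base has exactly one quality character, so the sequence and quality strings of a read always have the same length.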

The quality control and preprocessing of these FASTQ files are essential because they impact the accuracy of all downstream analyses. For example, you may want to know which variants are present in your SARS-CoV-2 sample. After running a variant calling tool, how would you know if the variant you are seeing is a true variant and not a result of adapter contamination, biases in the nucleotide bases, overrepresented sequences or errors caused by the library preparation?

Fortunately, you don’t have to write any code to carry out the necessary quality control and data cleaning yourself. FastQC, for example, is a tool that outputs a summary report describing the quality of your bases as tables and graphs (Andrews, n.d.). It also tells you whether your reads contain adapter sequences that may need to be trimmed, and reports various additional metrics related to your reads.

Illustrative graphs - example of a FastQC output showing good quality data


Figure 1 – Two sample graphs from a FastQC output. The top image shows the average quality of the bases at each position in the reads. Ideally, the yellow bars should sit above a quality score of 20. The bottom image shows that, except for the first few bases, each base/nucleotide makes up about 25% of the composition.

Figure 1 is an example of good quality data, where the yellow bars representing the average base qualities are above 20 (called Q20) (top image). Q20 is a PHRED quality score: a logarithmic measure of how confident we are that a base was called correctly. A score of Q20 means that the probability that the base was called incorrectly is 1 in 100, or that the base call is 99% accurate. You may read more about PHRED scores on this resource website. Figure 1 also shows that each nucleotide makes up about 25% of the bases at each position in the reads (bottom image). Except for certain genomes which may have a higher GC or AT content, such as Mycobacterium tuberculosis and many insects, respectively, you would ideally want each nucleotide to have an equal representation (25% each).
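The relationship between a PHRED score Q and the error probability P is Q = -10 × log10(P). The short Python sketch below shows the conversion in both directions and how the quality characters in a FASTQ file turn into scores. It assumes the Phred+33 encoding (each character's ASCII code minus 33), which is what current Illumina output uses; older files may use a different offset. The function names are our own.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base with PHRED score q was called incorrectly."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """PHRED score corresponding to an error probability p."""
    return -10 * math.log10(p)


def decode_quality_string(quality, offset=33):
    """Convert FASTQ quality characters to PHRED scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")

# 'I' decodes to 40, '5' to 20 and '#' to 2 under Phred+33.
print(decode_quality_string("II5#"))
```

Because the scale is logarithmic, every increase of 10 in the score means a tenfold drop in the chance of an error: Q30 is 1 in 1,000, Q40 is 1 in 10,000.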

Similarly, after aligning the reads in your FASTQ files to your reference genome, you will produce an alignment file called a BAM file. To assess the quality of the alignment (also called mapping), a tool called QualiMap BAMQC was developed. It reports the percentage of reads that mapped to the reference genome, along with additional quality metrics that help us make informed decisions. The aligned reads in Figure 2 show an average mapping quality of about 60, which is a very good score.

Illustrative graph showing a QualiMap BAMQC output indicating a quality score of ~ 60


Figure 2 – The average quality of the aligned reads is ~60, which is representative of a good score.
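To show where a number like this comes from, here is a rough Python sketch that computes the mean mapping quality from alignments in SAM format, the plain-text counterpart of BAM. It relies on the SAM layout, where the second tab-separated column is the FLAG (bit 0x4 marks an unmapped read) and the fifth column is the mapping quality (MAPQ, with 255 meaning "not available"). The three alignment records and the reference name ref1 are invented for illustration, and this is not how QualiMap itself is implemented.

```python
from io import StringIO


def mean_mapping_quality(sam_handle):
    """Mean MAPQ over mapped reads in SAM text, or None if there are none."""
    total = 0
    mapped = 0
    for line in sam_handle:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: mapping quality
        if flag & 0x4 or mapq == 255:
            continue  # unmapped read, or MAPQ not available
        total += mapq
        mapped += 1
    return total / mapped if mapped else None


# Invented example: two mapped reads and one unmapped read.
example_sam = StringIO(
    "@HD\tVN:1.6\tSO:coordinate\n"
    "read1\t0\tref1\t1\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n"
    "read2\t16\tref1\t10\t40\t5M\t*\t0\t0\tACGTA\tIIIII\n"
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n"
)
average = mean_mapping_quality(example_sam)
print("Mean mapping quality:", average)
```

Like base qualities, mapping quality is PHRED-scaled, so a higher value means the aligner is more confident that the read was placed at the right position. The highest value an aligner reports varies from tool to tool.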

You will learn more about this later. You can learn more about the FastQC report in this video, and QualiMap BAMQC resources are available on this website.

Using the FastQC and BAMQC reports, one can determine which low-quality bases should be removed, or whether the data should be used at all.
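As an illustration of what removing low-quality bases can look like, the Python sketch below trims bases from the end of a read until it reaches one with a quality of at least Q20, and also computes the read's mean quality. This is a deliberately simple approach for teaching purposes, not the algorithm used by fastp or any other specific tool. It assumes Phred+33 quality encoding, and the threshold of 20 and the example read are our own choices.

```python
def trim_low_quality_tail(sequence, quality, min_q=20, offset=33):
    """Drop bases from the 3' end until one with quality >= min_q is reached."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def mean_quality(quality, offset=33):
    """Average PHRED score of a quality string (0.0 for an empty string)."""
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores) if scores else 0.0


# Invented read whose last two bases have very low quality ('#' = 2, '!' = 0).
seq, qual = trim_low_quality_tail("ACGTACGT", "IIIII5#!")
print(seq, qual, round(mean_quality(qual), 1))
```

In practice, you would let a dedicated preprocessing tool do this and then rerun FastQC to confirm that the trimmed data looks better.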

Finally, whether your final step is variant calling or generating a phylogenetic tree, you will be doing quality control and data cleaning along the way. Sticking to best practices for doing these analyses is always advised to prevent us from missing important steps that may produce false positives or other spurious results.


FastQC A Quality Control tool for High Throughput Sequence Data

Fastp: an ultra-fast all-in-one FASTQ preprocessor

Qualimap: evaluating next-generation sequencing alignment data

MinIONQC: fast and simple quality control for MinION sequencing data

Humans vs machines: Who’s winning?

This article is from the free online course Making sense of genomic data: COVID-19 web-based bioinformatics
