
The viralrecon MultiQC report

Looking at the MultiQC report produced by viralrecon

In this section, we’re going to focus on the MultiQC report produced by viralrecon and, in particular, on some of the plots and tables that are useful when determining the quality of our sequence data.

MultiQC is a tool that aggregates several outputs from the pipeline into a user-friendly interactive report, usually in HTML format, which means that it can be opened in any web browser. Let’s go ahead and open the multiqc_report.html file, which can be found in the results/viralrecon/multiqc directory, in your favourite web browser. You should see something like this:
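If you would rather open the report from a script than from a file manager, the following is a minimal Python sketch. It assumes the pipeline output was written to a results directory relative to where the script is run. The function names report_path and open_report are illustrative and are not part of viralrecon or MultiQC.

```python
from pathlib import Path
import webbrowser


def report_path(results_dir="results"):
    """Build the path to the MultiQC report described in the text.

    Assumption: results_dir is the top-level output directory of the run.
    """
    return Path(results_dir) / "viralrecon" / "multiqc" / "multiqc_report.html"


def open_report(results_dir="results"):
    """Open the MultiQC report in the default web browser."""
    path = report_path(results_dir).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"MultiQC report not found: {path}")
    # as_uri() needs an absolute path, hence the resolve() above.
    return webbrowser.open(path.as_uri())
```

Calling open_report() from the directory that contains results should bring up the same page as double-clicking the file.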

Screenshot showing the top of the MultiQC HTML file

On the left, there is a menu that contains clickable links to sections pertaining to each of the outputs from the tools aggregated by MultiQC. The first section of any MultiQC report is usually a summary table containing results from multiple outputs. Here, it’s called ‘Variant calling metrics’ and contains results pertaining to the mapping and variant calling:

Screenshot showing the MultiQC Variant calling metrics summary table

There are several sections in the MultiQC report produced by viralrecon. Many of these will look similar to the outputs you examined in the Quality Control section of Week 1, in particular the outputs from FastQC. Click on the ‘PREPROCESS: FastQC (raw reads)’ link and this will take you to the FastQC section. The plots will look a little different to what we showed you in Week 1 because MultiQC takes the FastQC results for each sample and puts them together in single interactive plots (but it’s the same data). We can see an example below in the Sequence Quality Histogram:

Screenshot of the FastQC Sequence Quality Histogram for all five samples

By combining the data in this way, we can quickly identify any samples that are of lower quality and should be excluded from any downstream analysis.

We ran the viralrecon pipeline on five SARS-CoV-2 samples, so we can use the MultiQC output to assess the quality of these samples specifically. Firstly, we should consider whether any of these samples failed to map sufficiently to the reference genome. Poor mapping may suggest either that something went wrong with the sequencing or that the samples are not what we think they are. This is not uncommon when working with data from public databases, so it’s good practice to always QC the data even if it was included in a publication. The statistic we’re interested in here is the # Ns per 100kb consensus column in the ‘Variant calling metrics’ table.

Screenshot showing the MultiQC Variant calling metrics summary table and zoomed in to show the number of Ns per 100kb consensus column

If we divide the numbers in this column by 1000, we get the percentage of missing bases in each sample (the count is per 100,000 bases, so dividing by 100,000 and multiplying by 100 is the same as dividing by 1000). Ideally we don’t want this number to be more than 15%, as this would mean that less than 85% of the reference genome was covered by reads. This may have implications for accurately calling variants or assigning samples to lineages. Fortunately, all of our samples have fewer than 3% Ns, which means we’re covering more than 97% of the reference genome.

Another useful metric to consider is the median depth of coverage across the reference genome (Coverage median): the greater the depth of reads mapping to a given position in the reference genome, the more confidence we have in the quality of any variants we’ve identified. Ideally we want the coverage to be at least 20X, and we can see that all of our samples have at least 600X.
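The two checks described above can be sketched in a few lines of Python. The thresholds (15% Ns, 20X median coverage) come from the text. The function names and the sample values are hypothetical and only illustrate the arithmetic; they are not taken from the report.

```python
MAX_PERCENT_N = 15.0      # from the text: no more than 15% missing bases
MIN_MEDIAN_DEPTH = 20.0   # from the text: at least 20X median coverage


def percent_n(ns_per_100kb):
    """Convert '# Ns per 100kb consensus' into a percentage of missing bases."""
    return ns_per_100kb / 1000


def passes_qc(ns_per_100kb, coverage_median):
    """Return True if a sample meets both guidelines from the text."""
    return (percent_n(ns_per_100kb) <= MAX_PERCENT_N
            and coverage_median >= MIN_MEDIAN_DEPTH)


# Hypothetical values for illustration only (Ns per 100kb, median coverage).
samples = {
    "sample_A": (2500.0, 650.0),
    "sample_B": (18000.0, 12.0),
}

for name, (ns, depth) in samples.items():
    status = "pass" if passes_qc(ns, depth) else "fail"
    print(f"{name}: {percent_n(ns):.1f}% Ns, {depth:.0f}X median depth -> {status}")
```

Here the hypothetical sample_A has 2.5% Ns and passes, while sample_B has 18% Ns and a median depth below 20X, so it fails on both counts.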

As we used samples generated using amplicon sequencing, we can look at a heatmap of amplicon coverage across the reference genome:

Screenshot showing the amplicon coverage heatmap section of the MultiQC report

This plot shows the sequencing depth for each amplicon in the data and is useful for identifying systematic drops in the sequencing of particular amplicons (represented by dark blue in the plot). Such drops are sometimes caused by problems during PCR amplification, and it’s good to know whether our samples are going to have missing data around a particular region of the reference genome. As we’ve included samples from different studies, sequenced in different places at different times, we don’t expect to see systematic amplicon dropouts in our dataset, but it is something to watch out for.
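To make the idea of a systematic dropout concrete, here is a minimal Python sketch that flags amplicons with low depth in every sample. The depth values, sample names and amplicon names are hypothetical, and reusing the 20X guideline as the low-depth cutoff is an assumption made for illustration. This is not how viralrecon or MultiQC builds the heatmap.

```python
LOW_DEPTH = 20  # assumption: reuse the 20X guideline as the low-depth cutoff


def systematic_dropouts(depths, low_depth=LOW_DEPTH, min_fraction=1.0):
    """Return amplicons whose depth is below low_depth in at least
    min_fraction of the samples (1.0 means in every sample)."""
    samples = list(depths)
    amplicons = list(depths[samples[0]])
    flagged = []
    for amplicon in amplicons:
        n_low = sum(1 for s in samples if depths[s][amplicon] < low_depth)
        if n_low / len(samples) >= min_fraction:
            flagged.append(amplicon)
    return flagged


# Hypothetical per-amplicon depths for illustration only.
depths = {
    "sample_A": {"amplicon_1": 800, "amplicon_2": 5, "amplicon_3": 640},
    "sample_B": {"amplicon_1": 15, "amplicon_2": 0, "amplicon_3": 910},
    "sample_C": {"amplicon_1": 720, "amplicon_2": 12, "amplicon_3": 700},
}

print(systematic_dropouts(depths))
```

In this made-up table only amplicon_2 is low in all three samples, so it is the one reported. amplicon_1 is low in a single sample, which looks more like a sample-specific problem than a systematic one.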

© Wellcome Connecting Science
This article is from the free online

Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life
