Processing Resistome count data in ResistoXplorer – Part II

Now that the data are filtered, we need to normalise them. But why do we need to normalise data? Metagenomic data have some unique characteristics, such as large differences in library sizes, sparsity, skewed distributions, over-dispersion, and compositionality. It is therefore critical to normalise the data to obtain more comparable and meaningful results.

To perform this step, ResistoXplorer supports three kinds of normalisation approaches:

Rarefaction: this method deals with uneven sequencing depths by randomly removing reads from each sample until every sample has the same depth as the smallest (non-defective) library. Rarefaction has been criticised because it may entail the loss of valuable information. However, it is still useful to rarefy the data when library sizes vary widely between samples (for example, >10X) or are very low (<1000 reads/sample), and it is important for comparisons between samples or communities (ordination or clustering analysis).
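The subsampling idea can be sketched in a few lines of Python. This is only an illustration, not ResistoXplorer's implementation: the function names, the toy table, and the choice to treat every sample as non-defective are assumptions made for the example.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample one sample's feature counts to `depth` reads, without replacement."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the library size of this sample")
    rng = random.Random(seed)
    # Expand the counts into one entry per read, labelled by feature index.
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for feature in kept:
        rarefied[feature] += 1
    return rarefied


def rarefy_table(table, seed=0):
    """Rarefy every sample (row) to the smallest library size in the table."""
    depth = min(sum(row) for row in table)
    return [rarefy(row, depth, seed + k) for k, row in enumerate(table)]


# Hypothetical toy table: rows are samples, columns are features (genes).
toy = [[50, 30, 20], [500, 300, 200], [5, 0, 5]]
rarefied = rarefy_table(toy)
```

After this step every row sums to 10, the size of the smallest library. The two larger samples have discarded 90 and 990 reads respectively, which is the loss of information that critics of rarefaction point to.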

Scaling-based: these methods account for uneven sequencing depths by deriving a sample-specific scaling factor that brings samples to the same scale for comparison. The most commonly used approach is to divide each feature count by the total number of reads in the sample to yield relative abundances (proportions). In Total Sum Scaling (TSS) or Counts Per Million (CPM) normalisation, the resulting proportions are additionally multiplied by 1 million to give the number of reads for that feature per million reads, which is easier to interpret. This approach has been criticised, however, because the resulting relative proportions are biased when samples are dominated by a few highly abundant features, and because of heteroskedasticity.
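As a rough illustration of the arithmetic, the sketch below computes proportions and per-million values for a single sample. The function names and the toy counts are assumptions for this example and are not taken from ResistoXplorer.

```python
def relative_abundance(counts):
    """Divide each feature count by the sample's total number of reads."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def per_million(counts):
    """Scale the proportions to reads per million reads."""
    return [p * 1_000_000 for p in relative_abundance(counts)]


# Hypothetical samples: the second differs only in its first feature.
sample_a = [50, 30, 20]
sample_b = [5000, 30, 20]

props_a = relative_abundance(sample_a)
props_b = relative_abundance(sample_b)
cpm_a = per_million(sample_a)
```

The second and third features have identical raw counts in both samples, yet their proportions in `sample_b` are far smaller because one dominant feature inflates the total. This is the bias described above.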

Other scaling factors, such as cumulative sum scaling (CSS) and upper quantile (UQ), have been proposed to account for such biases. For example, CSS calculates the scaling factor as the cumulative sum of feature (gene) abundances up to a data-derived threshold, which removes the variability caused by highly abundant genes. For differential abundance testing, this method is recommended for controlling the FDR in data with larger group sizes. However, for accurate community-level comparisons (such as beta diversity or ordination analysis), TSS works better than CSS and is recommended for capturing the entire composition. Be cautious when using these scaling-based methods, as they may over- or underestimate the fraction of zero counts depending on the library size of each sample.
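The following is a deliberately simplified sketch of the CSS idea, not the published algorithm. In real CSS the threshold is derived from the data across all samples; here it is replaced by a fixed quantile of each sample's non-zero counts. The `quantile` and `constant` parameters, their default values, and the function name are all assumptions for illustration.

```python
import math


def css_like_scale(counts, quantile=0.5, constant=1000):
    """Scale counts by the sum of non-zero counts at or below a quantile threshold.

    Simplification (assumption): the quantile is fixed by the caller rather
    than derived from the data, and `constant` is an arbitrary multiplier.
    """
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        raise ValueError("sample has no non-zero counts")
    # Nearest-rank quantile of the non-zero counts.
    rank = max(1, math.ceil(quantile * len(nonzero)))
    threshold = nonzero[rank - 1]
    # Cumulative sum of counts up to and including the threshold.
    scale = sum(c for c in counts if 0 < c <= threshold)
    return [c / scale * constant for c in counts]


# Hypothetical samples that differ only in one highly abundant feature.
sample_a = [1, 2, 3, 4, 1000]
sample_b = [1, 2, 3, 4, 10]

css_a = css_like_scale(sample_a)
css_b = css_like_scale(sample_b)
```

Because the scaling factor ignores counts above the threshold, the four low-abundance features receive the same normalised values in both samples. Dividing by the total read count would instead shrink them in `sample_a`.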

Transformation-based: these approaches deal with sparsity, large variations within the count data, and compositionality. Log-ratio transformations such as the centered log-ratio (CLR) are recommended because of the compositional nature of sequencing data. Other methods borrowed from RNA-Seq, such as relative log expression (RLE) and trimmed mean of M-values (TMM), are very commonly used and have shown superior performance in identifying differentially abundant genes with smaller sample sizes.
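A minimal sketch of the centered log-ratio transformation is shown below. Each value is the log of a count minus the mean of the logs in that sample, which is the same as the log of the count divided by the sample's geometric mean. Adding a pseudocount to handle zeros is one common workaround and is an assumption here, as are the function name and the default value of 1.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log of each count minus the mean log of the sample."""
    # The pseudocount (assumption) avoids taking log(0) for absent features.
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# Hypothetical sample with one absent feature.
sample = [120, 0, 30, 850]
transformed = clr(sample)

# Without zeros and without a pseudocount, CLR does not depend on depth:
shallow = clr([1, 2, 4], pseudocount=0)
deep = clr([10, 20, 40], pseudocount=0)
```

The transformed values of a sample always sum to zero, and `shallow` and `deep` are equal up to floating-point error, which shows why the transformation suits compositional data, where only ratios between features carry information.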

Now we know the different types of methods for normalising our resistome count data. But the main question remains: which normalisation method should you choose?

The answer is not straightforward, as there is no consensus guideline on which normalisation method performs best and should be used for all types of datasets. The choice of method depends on the type of analysis to be performed.

For more details on normalisation methods and their performance in different types of analyses, we encourage you to read the referenced articles thoroughly.

Do it yourself:

Explore different methods and visually investigate the clustering patterns (i.e., through ordination plots, dendrograms, and heatmaps) to determine the effects of different normalisation methods with regard to the experimental factor of interest. Which methods did you try? Share with your peers in the comments below.


Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, Lozupone C, Zaneveld JR, Vázquez-Baeza Y, Birmingham A, Hyde ER, Knight R. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017 Mar 3;5(1):27. doi: 10.1186/s40168-017-0237-y. PMID: 28253908; PMCID: PMC5335496.

McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol. 2014 Apr 3;10(4):e1003531. doi: 10.1371/journal.pcbi.1003531. PMID: 24699258; PMCID: PMC3974642.

McKnight DT, Huerlimann R, Bower DS, Schwarzkopf L, Alford RA, Zenger KR. Methods for normalizing microbiome data: an ecological perspective. Methods Ecol Evol. 2019;10(3):389-400.

This article is from the free online course Exploring the Landscape of Antibiotic Resistance in Microbiomes.

Created by
FutureLearn - Learning For Life
