
Normalization and differential analysis


Now we have annotated the ARGs present in the samples. Great! These data can then be analyzed for abundance and diversity, and used to draw comparisons across samples. As we have seen in the previous steps, we have tried to work with standardized methods to minimize the differences between samples in terms of design, collection, extraction, and analysis. However, regardless of how well these steps were performed, there will always be inherent differences across samples. Thus, to be able to draw comparisons and parallels, a normalization step is necessary.

Imagine that you are comparing two samples that were treated the same way throughout the project. For whatever reason, the read quality in one of them was much lower, and the low-quality reads had to be removed from the analysis. As a result, you have one sample with 1 000 000 high-quality reads and another with 400 000. Comparing them directly would be extremely challenging and prone to bias. So, we need to make them comparable by normalizing both to a common standard. One very frequently used method is normalization to the total number of reads, where each count is expressed relative to the number of reads in its own sample. This ensures that there is a common standard and that samples with more reads do not dominate the analysis.
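As a rough illustration of normalization to the total number of reads, the sketch below scales raw ARG read counts to counts per million reads. The function name, the gene names, and the counts are hypothetical and chosen only for this example; the read depths are the ones from the scenario above.

```python
def normalize_to_total_reads(arg_counts, total_reads, scale=1_000_000):
    """Express raw ARG read counts as counts per `scale` reads in the sample."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {gene: count * scale / total_reads for gene, count in arg_counts.items()}


# Hypothetical raw counts for the two samples described above.
sample_a = normalize_to_total_reads({"tetM": 200, "blaTEM": 50}, total_reads=1_000_000)
sample_b = normalize_to_total_reads({"tetM": 80, "blaTEM": 20}, total_reads=400_000)

print(sample_a)
print(sample_b)
```

In this made-up case the raw counts differ between the samples, but after scaling both samples show the same relative ARG abundance, which is the kind of difference in sequencing depth that normalization is meant to remove.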

Alternatively, instead of the total number of reads, a specific gene or locus in the bacterial genome can be used as the reference. This is the case for essential housekeeping genes such as gyrB, rpoB, and others. The 16S rRNA gene is also very often used here. However, caution is advised: the number of 16S rRNA gene copies varies significantly between bacterial species, so this variation needs to be incorporated into the normalization step.
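One possible way to account for 16S rRNA copy number is sketched below. It is only an illustration under stated assumptions: reads are first divided by gene length so that long and short genes are comparable, and the sample is assumed to have a known average number of 16S rRNA gene copies per genome. The function name and every number in the example are hypothetical.

```python
def arg_copies_per_genome(arg_reads, arg_length_bp, rrna_reads, rrna_length_bp,
                          mean_16s_copies):
    """Estimate ARG copies per genome using the 16S rRNA gene as reference.

    Assumption: `mean_16s_copies` is the average number of 16S rRNA gene
    copies per genome in the community, estimated elsewhere.
    """
    if min(arg_length_bp, rrna_reads, rrna_length_bp, mean_16s_copies) <= 0:
        raise ValueError("lengths, 16S reads, and copy number must be positive")
    arg_coverage = arg_reads / arg_length_bp      # ARG reads per base
    rrna_coverage = rrna_reads / rrna_length_bp   # 16S reads per base
    copies_per_16s = arg_coverage / rrna_coverage
    # Multiplying by the mean 16S copy number converts "per 16S copy"
    # into "per genome".
    return copies_per_16s * mean_16s_copies


# Hypothetical values for a single ARG in a single sample.
estimate = arg_copies_per_genome(
    arg_reads=120, arg_length_bp=1200,
    rrna_reads=3000, rrna_length_bp=1500,
    mean_16s_copies=4,
)
print(estimate)
```

Without the final multiplication, a community dominated by species carrying many 16S rRNA copies would appear to have fewer ARGs per cell than it really does.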

Once the data are normalized, differential analysis is the next step. There is a multitude of software packages available for this, and which one to choose will depend on the data. Two widely used and reliable tools are DESeq2 and edgeR. You can find the original publications for each in the links below:

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

Following these steps will provide you with a count table that will be utilized for downstream analyses.


Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID: 25516281; PMCID: PMC4302049.

Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40. doi: 10.1093/bioinformatics/btp616. Epub 2009 Nov 11. PMID: 19910308; PMCID: PMC2796818.

This article is from the free online

Exploring the Landscape of Antibiotic Resistance in Microbiomes

Created by
FutureLearn - Learning For Life
