An expert’s view: Why should microbiome data be treated as compositional?

In this video Dr. Jonathon Baker from the J. Craig Venter Institute talks about the importance of treating microbiome data as compositional.
Hi, I’m Jonathon Baker, a microbiologist at the J. Craig Venter Institute in La Jolla, California, in the United States. Today I’m going to talk briefly about compositional data: what is it, and why does it matter?
Compositional data is defined as a quantitative description of the parts of a whole, conveying relative information. Although many types of data can be compositional, today we will focus on DNA or RNA sequencing from a microbial community. Imagine a microbial community, for example, from the human microbiome. When you do sequencing analysis, you’re only observing a small fraction, or sample, of what is there. In this example, we have 12 orange bacteria and 12 blue bacteria in our community. When we sequence, we may only see six of each species. Furthermore, since sequencing data is inherently compositional, what we will actually observe is a relative abundance of 50% for each species, or a ratio of 1:1.
So why does compositional data matter? Well, I’m going to give a few examples of situations where it may turn out to be very important. Starting with the community that we just discussed, we’re now going to imagine two different scenarios, which could, for example, be treatments in an experiment. In the first scenario, the blue species decreases in abundance while the orange species stays the same, so the community has 12 orange and six blue bacteria. In our sample we might observe eight orange bacteria and four blue, making the new observed ratio of orange to blue 2 to 1. In the second scenario, both species increase in abundance, but the orange species increases at a faster rate. Here we now have 32 orange bacteria and 16 blue. But because the data we are observing is relative and we only observe a proportion, we will still see a 2 to 1 ratio of orange to blue. Therefore, because our data is compositional and only gives ratios or proportions, we cannot actually tell these two scenarios, which are very different biologically, apart.
Here are two more examples illustrating the problem and showing how you might get false positive or false negative abundance data. Consider the absolute abundance of two microbes over time.
In the first example, one microbe decreases in abundance and the other stays at constant abundance. Over time, a smaller proportion of the sequencing reads will come from the microbe that is decreasing in abundance, meaning that the proportion of reads coming from the other microbe will increase. This will make it appear that the taxon shown in dark blue is increasing in number, which isn’t true, and therefore represents a false positive. Conversely, suppose one microbe is increasing in abundance and the other is at essentially zero abundance. At all time points, nearly all of the sequencing reads will come from the microbe with increasing abundance. Therefore, it will appear that the relative abundance of both microbes is staying the same, but this is an example of a false negative.
In addition to the interpretation of compositional data, there are many steps in the sequencing protocol that can detach the original microbial load from the data that you observe, and many of these steps can add bias to your results. Samples are collected from a much larger population, and a subsample is used for DNA extraction. Here, differing extraction methods can bias data based on the ability of different protocols to lyse different organisms and isolate their nucleic acid. Furthermore, many protocols use purification columns to isolate DNA, which can become saturated, further complicating direct correlations between DNA yield and starting microbial load.
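Returning to the abundance examples above, the arithmetic can be checked with a short sketch. It converts absolute counts to proportions for the two treatment scenarios, then for a false positive and a false negative time series. The orange and blue counts come from the talk; the time-series numbers, the taxon labels, and the `relative_abundance` helper are hypothetical values chosen only for illustration.

```python
def relative_abundance(counts):
    """Convert absolute counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# The two treatment scenarios from the talk (absolute cell counts).
scenario_1 = {"orange": 12, "blue": 6}    # blue decreases, orange unchanged
scenario_2 = {"orange": 32, "blue": 16}   # both increase, orange faster

rel_1 = relative_abundance(scenario_1)
rel_2 = relative_abundance(scenario_2)
indistinguishable = all(abs(rel_1[t] - rel_2[t]) < 1e-12 for t in rel_1)
print("Scenarios look identical after closure:", indistinguishable)

# False positive (hypothetical numbers): one taxon declines, the other is constant.
false_positive_series = [
    {"declining": 100, "constant": 50},
    {"declining": 50, "constant": 50},
    {"declining": 25, "constant": 50},
]
constant_share = [relative_abundance(t)["constant"] for t in false_positive_series]
print("Share of the constant taxon over time:", [round(x, 2) for x in constant_share])

# False negative (hypothetical numbers): one taxon grows, the other is absent.
false_negative_series = [
    {"growing": 10, "absent": 0},
    {"growing": 100, "absent": 0},
    {"growing": 1000, "absent": 0},
]
growing_share = [relative_abundance(t)["growing"] for t in false_negative_series]
print("Share of the growing taxon over time:", growing_share)
```

In the first series the constant taxon's share rises even though its absolute count never changes; in the second, the growing taxon's share stays flat even though its count increases a hundredfold.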
In the case of 16S rRNA sequencing, a subsample is used for PCR amplification, adding further biases due to differing primer efficiency across taxa, and a subset of the resulting amplicons is pooled for the library preparation, which may include additional PCR steps and add further bias. By the time quality-filtered sequences have been obtained, the sequences only reflect a small subset of the population and are not an accurate representation of the microbial load in the original sample.
So, what should you do when you analyse compositional data? Well, the first thing is to be sure that you’re using appropriate analysis tools. Using tools not designed to handle compositional data can yield up to 100% false discovery rates.
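The talk does not name a specific method at this point. As one illustration of what a compositionally aware tool may do internally, the sketch below applies the centred log-ratio (CLR) transform, a standard transform in compositional data analysis, to a hypothetical read-count table. The taxon names, the counts, and the `clr` helper are assumptions made for this example. Real count tables containing zeros need a zero-handling step first, which is not shown.

```python
import math

def clr(counts):
    """Centred log-ratio: log of each count minus the mean of all the logs.

    Assumes every count is strictly positive.
    """
    logs = {taxon: math.log(n) for taxon, n in counts.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {taxon: value - mean_log for taxon, value in logs.items()}

# Hypothetical read counts for one sample, and the same sample sequenced 10x deeper.
shallow = {"taxon_a": 120, "taxon_b": 30, "taxon_c": 6}
deep = {taxon: 10 * n for taxon, n in shallow.items()}

clr_shallow = clr(shallow)
clr_deep = clr(deep)

for taxon in shallow:
    print(taxon, round(clr_shallow[taxon], 3), round(clr_deep[taxon], 3))
```

Because CLR values depend only on ratios between taxa, sequencing the same community ten times deeper leaves them unchanged, up to floating-point rounding.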
To get started, this 2017 review is a great resource on compositional data. And it even has a table listing tools to handle compositional data for each step in a microbiome analysis pipeline. In addition, this 2019 paper gives a great overview of the problem of compositional data and presents a novel way to handle it in the context of microbial communities. Finally, the most important thing you can do is understand and acknowledge the shortcomings of compositional data when you are designing experiments, interpreting results, and generating hypotheses. You are only ever going to be able to say how much the abundance of a particular microbe is changing when compared to the abundance of another microbe.
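That last point can be made concrete with a log-ratio. The sketch below uses the orange and blue counts from the earlier scenarios and compares the orange-to-blue ratio before and after treatment. The `log2_ratio` helper is a hypothetical name introduced for this example, not a function from any particular package.

```python
import math

def log2_ratio(counts, numerator, denominator):
    """log2 of one taxon's count relative to another's (both must be > 0)."""
    return math.log2(counts[numerator] / counts[denominator])

baseline = {"orange": 6, "blue": 6}   # reads observed before treatment (1:1)
treated = {"orange": 8, "blue": 4}    # reads observed after treatment (2:1)

change = log2_ratio(treated, "orange", "blue") - log2_ratio(baseline, "orange", "blue")
print("log2 change in the orange:blue ratio:", change)

# Both underlying communities from the talk give the same ratio,
# so the same statement holds whichever scenario actually occurred.
absolute_scenarios = [{"orange": 12, "blue": 6}, {"orange": 32, "blue": 16}]
scenario_ratios = [log2_ratio(c, "orange", "blue") for c in absolute_scenarios]
print("log2 orange:blue ratio in each scenario:", scenario_ratios)
```

A change of 1 on the log2 scale means orange doubled relative to blue. It says nothing about whether either species grew or shrank in absolute terms.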
And just keeping these facts in mind will go a long way in preventing you from drawing spurious conclusions from your experiments. Thank you very much for your time.

The field of metagenomics evolves very rapidly, and resources we have available for analyses allow us to reassess the methods used in the past.

To finish our series ‘An expert’s view’, we invite Dr. Jonathon Baker from the J. Craig Venter Institute to talk to us about the importance of treating microbiome data as compositional. Dr. Baker is an expert microbiologist who bridges the fields of microbiology and bioinformatics.

This article is from the free online course Exploring the Landscape of Antibiotic Resistance in Microbiomes.

