High-dimensionality: a challenge in metabolomics data analysis

Watch Dr Jasper Engel discuss the challenge of analyzing high-dimensionality metabolomics data.
5.4
In the previous lectures we discussed some of the analytical platforms we use to analyse the metabolome. It was shown that several processing steps have to be applied to remove systematic bias from the data, while preserving biological information. The next step in the metabolomics pipeline is to use statistical techniques to extract the relevant and meaningful information from the processed data. Typically, the goal is to identify the metabolites that are significantly changing between classes of biological samples. This is not an easy step due to the wealth of information that needs to be inspected. Usually it is much easier to measure the data than it is to process, analyse and interpret it. Let's consider a typical metabolomics experiment.
50.6
Most data analyses start with the production of the data table, which will then be analysed. In this table each row corresponds to an individual spectrum and each column contains the ion intensity values of a specific peak in the spectra. In statistical terms, the columns of the data table are referred to as dimensions, and a data table with many columns is called a high-dimensional data set. Metabolomics data is high-dimensional. Additionally, most metabolomics data tables are wide, meaning that we detect a large number of peaks in a relatively low number of samples. Traditional statistical methods, however, were developed to deal with long data tables that contain many samples and a few, well-chosen, variables.
96.4
In other words, they were developed to cope with low-dimensional data instead of high-dimensional metabolomics data. For this reason, many traditional approaches do not scale well to the analysis of metabolomics data, or they are not applicable at all. This can, for example, be seen when we try to visualise the data. Visualization of the data in easily interpretable plots is usually the first step in data analysis. In metabolomics, we want to use such visualization to assess the data reproducibility, to detect possible outliers, and perhaps already detect interesting metabolic differences between groups of samples. Visualization is relatively straightforward when we are dealing with a low-dimensional data table with, let's say, 2 columns or dimensions.
146.9
These columns might be two peaks from our metabolomics experiment that we are particularly interested in. A two-dimensional scatter plot can be used to display the data. In such a plot each dot corresponds to a sample, and the position of the dot is determined by the ion intensities of this sample for the two peaks of interest. In this particular example, the scatter plot highlights a difference between the control samples and the treated samples. Additionally, an outlying sample seems to be visible, which we might want to exclude from further analysis. Typically, in an untargeted metabolomics experiment we don't know beforehand which peaks we need to inspect to observe such interesting patterns as in this example.
189.6
Additionally, sometimes the interesting patterns can only be observed when more than two peaks are considered at the same time. Therefore, data visualization approaches are required that take all the information in the data table into account at the same time. One could imagine constructing a three-dimensional scatter plot to display the ion intensities of three specific peaks. But how should we deal with a fourth peak? Or a fifth? We cannot visualize such four- or five-dimensional data in a single plot. Let alone metabolomics data, which contains hundreds to thousands of peaks or dimensions. To resolve these issues, we often make use of so-called data reduction or dimension reduction approaches.
231.9
Essentially, these methods greatly reduce the number of columns in our data table such that a long, low-dimensional data table is obtained. Subsequently, traditional data visualization and statistical analysis techniques can be applied to the reduced data table. In the next step, I will discuss principal component analysis, which is one of the more popular data reduction methods in metabolomics.
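To make the idea of reducing a wide table to a few columns concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method used in the course: the data are synthetic (12 spectra, 200 peaks, and an invented treatment effect on the first 20 peaks), the function names are hypothetical, and the reduction shown is a small principal component analysis computed by power iteration, chosen only because PCA is the method named above.

```python
import random

random.seed(0)  # fixed seed so the synthetic table is reproducible

# Hypothetical wide data table: 12 spectra (rows) x 200 peaks (columns).
N_PEAKS = 200
labels = ["control"] * 6 + ["treated"] * 6
table = []
for label in labels:
    row = [random.gauss(100.0, 5.0) for _ in range(N_PEAKS)]
    if label == "treated":
        for j in range(20):  # assumed treatment effect on the first 20 peaks
            row[j] += 30.0
    table.append(row)


def mean_center(rows):
    """Subtract each column's mean so every peak is centred on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    n = len(matrix)
    v = [random.random() for _ in range(n)]
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return v, eigenvalue


def reduce_columns(rows, n_components=2):
    """Replace the many peak columns by a few principal component scores."""
    centered = mean_center(rows)
    n = len(centered)
    # Sample-by-sample Gram matrix (n x n): small when the table is wide.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    scores = [[] for _ in range(n)]
    for _ in range(n_components):
        v, eigenvalue = leading_eigenpair(gram)
        for i in range(n):
            scores[i].append(v[i] * eigenvalue ** 0.5)
        # Deflation: remove the component just found before finding the next.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return scores


reduced = reduce_columns(table)
print(f"{len(table)} x {len(table[0])} -> {len(reduced)} x {len(reduced[0])}")
```

Each row of `reduced` still corresponds to one sample, but it is now described by two numbers instead of 200, so it could be drawn in an ordinary two-dimensional scatter plot. In this synthetic example the first column alone is enough to separate the control rows from the treated rows.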

Metabolomics experiments produce “big data” datasets that contain a wealth of information that needs to be inspected.

Many of the traditional data analysis approaches are not ideally suited to analyse metabolomics data. Dr Jasper Engel explains the reasons for this and introduces the challenges of analyzing metabolomics data.

This article is from the free online

Metabolomics: Understanding Metabolism in the 21st Century

Created by
FutureLearn - Learning For Life
