In the previous lectures we discussed some of the analytical platforms we use to analyse the metabolome. It was shown that several processing steps have to be applied to remove systematic bias from the data while preserving biological information. The next step in the metabolomics pipeline is to use statistical techniques to extract the relevant and meaningful information from the processed data. Typically the goal is to identify the metabolites that change significantly between classes of biological samples. This is not an easy step because of the wealth of information that needs to be inspected. Usually it is much easier to measure the data than it is to process, analyse and interpret it. Let's consider a typical metabolomics experiment.
Most data analyses start with the production of a data table, which will then be analysed. In this table each row corresponds to an individual spectrum and each column contains the ion intensity values of a specific peak in the spectra. In statistical terms, the columns of the data table are referred to as dimensions, and a data table with many columns is called a high-dimensional data set. Metabolomics data is high-dimensional. Additionally, most metabolomics data tables are wide, meaning that we detect a large number of peaks in a relatively low number of samples. Traditional statistical methods, however, were developed to deal with long data tables that contain many samples and a few well-chosen variables.
In other words, they were developed to cope with low-dimensional data instead of high-dimensional metabolomics data. For this reason, many traditional approaches do not scale well to the analysis of metabolomics data, or they are not applicable at all. This can, for example, be seen when we try to visualise the data. Visualization of the data in easily interpretable plots is usually the first step in data analysis. In metabolomics, we want to use such visualization to assess the data reproducibility, to detect possible outliers, and perhaps already detect interesting metabolic differences between groups of samples. Visualization is relatively straightforward when we are dealing with a low-dimensional data table with, let's say, two columns or dimensions.
These columns might be two peaks from our metabolomics experiment that we are particularly interested in. A two-dimensional scatter plot can be used to display the data. In such a plot each dot corresponds to a sample, and the position of the dot is determined by the ion intensities of this sample for the two peaks of interest. In this particular example, the scatter plot highlights a difference between the control samples and the treated samples. Additionally, an outlying sample seems to be visible, which we might want to exclude from further analysis. Typically, in an untargeted metabolomics experiment we don't know beforehand which peaks we need to inspect to observe interesting patterns such as those in this example.
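The data table and the two-peak view described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the intensities are random toy numbers, the sample and peak counts are made up, and the names `data_table`, `describe_shape` and `two_peak_view` are hypothetical rather than taken from the lecture.

```python
import random

random.seed(0)  # fixed seed so the toy numbers are reproducible

N_SAMPLES = 6    # assumed: few samples (rows)
N_PEAKS = 500    # assumed: many detected peaks (columns)

# Each row holds the ion intensities of one spectrum across all peaks.
data_table = [[random.uniform(0.0, 1e6) for _ in range(N_PEAKS)]
              for _ in range(N_SAMPLES)]
labels = ["control"] * 3 + ["treated"] * 3  # hypothetical class labels


def describe_shape(table):
    """Return (rows, columns, layout); 'wide' means more columns than rows."""
    rows = len(table)
    cols = len(table[0]) if rows else 0
    return rows, cols, "wide" if cols > rows else "long"


def two_peak_view(table, peak_a, peak_b):
    """Keep only two peak columns: one (x, y) point per sample."""
    return [(row[peak_a], row[peak_b]) for row in table]


print(describe_shape(data_table))  # (6, 500, 'wide')

# Arbitrary peak indices; in an untargeted study we would not know which to pick.
points = two_peak_view(data_table, 10, 42)
for label, (x, y) in zip(labels, points):
    print(f"{label}: peak 10 = {x:.0f}, peak 42 = {y:.0f}")
```

The `points` list is what a two-dimensional scatter plot would display, one dot per sample. The 498 columns it ignores are exactly the information such a plot cannot show.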
Additionally, sometimes the interesting patterns can only be observed when more than two peaks are considered at the same time. Therefore, data visualization approaches are required that take all the information in the data table into account at the same time. One could imagine constructing a three-dimensional scatter plot to display the ion intensities of three specific peaks. But how should we deal with a fourth peak? Or a fifth? We cannot visualize such four- or five-dimensional data in a single plot, let alone metabolomics data, which contains hundreds to thousands of peaks or dimensions. To resolve these issues, we often make use of so-called data reduction or dimension reduction approaches.
Essentially these methods greatly reduce the number of columns in our data table such that a long, low-dimensional data table is obtained. Subsequently, traditional data visualization and statistical analysis techniques can be applied to the reduced data table. In the next step I will discuss principal component analysis, which is one of the more popular data reduction methods in metabolomics.
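To make the idea of dimension reduction concrete, here is a minimal sketch that collapses a wide toy table into a single column of scores. It estimates the first principal component by power iteration using only the Python standard library. This is one possible illustration, not the method used in the course; in practice a numerical library would normally be used. The data are simulated and all function names are hypothetical.

```python
import math
import random


def center_columns(table):
    """Subtract each column's mean so every peak is centred on zero."""
    n, p = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in table]


def project(centered, v):
    """Score of each sample along direction v (one number per row)."""
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


def first_component(centered, n_iter=100):
    """Power iteration: repeatedly apply v <- X^T (X v) and normalise."""
    p = len(centered[0])
    rng = random.Random(0)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(n_iter):
        scores = project(centered, v)
        v = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(p)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


# Simulated wide table: 6 samples x 50 peaks (assumed sizes).
# The last 3 samples ("treated") have the first 10 peaks shifted upwards.
rng = random.Random(1)
table = []
for i in range(6):
    shift = 5.0 if i >= 3 else 0.0
    table.append([rng.gauss(0.0, 0.5) + (shift if j < 10 else 0.0)
                  for j in range(50)])

centered = center_columns(table)
v = first_component(centered)
scores = project(centered, v)  # reduced table: 6 rows, 1 column

for i, s in enumerate(scores):
    group = "treated" if i >= 3 else "control"
    print(f"sample {i} ({group}): score = {s:.2f}")
```

In this simulated case the 50 columns reduce to one score per sample, and the control and treated samples fall on opposite sides of zero. The overall sign of the scores is arbitrary, so only the separation between the groups is meaningful.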
High-dimensionality: a challenge in metabolomics data analysis
Metabolomics experiments produce “big data” datasets that contain a wealth of information that needs to be inspected.
Many of the traditional data analysis approaches are not ideally suited to analysing metabolomics data. Dr Jasper Engel explains the reasons for this and introduces the challenges of analysing metabolomics data.
© University of Birmingham and Birmingham Metabolomics Training Centre