JASPER ENGEL: Multivariate statistical models take more than one variable into account at the same time. Possible relationships between these variables can be included in the model. Because of this, subtle metabolic effects can be observed that may be missed by univariate analyses. For example, consider a problem with two metabolites and two groups of samples. The goal is to determine whether the groups are significantly different and which metabolites are related to this difference. From a univariate perspective, a small difference is observed in the first metabolite, and no difference in the second. However, since the two metabolites are related to each other via a metabolic pathway, their concentrations are most likely correlated. This can indeed be observed when both variables are inspected together.
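The two-metabolite scenario can be sketched with simulated data (a hypothetical NumPy illustration, not data from the talk): a shared latent factor correlates the two metabolites, the group shift in each metabolite alone is modest, but a simple bivariate combination separates the groups much more clearly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# a shared biological source of variation makes the two metabolites correlated
latent = rng.normal(0, 1, 2 * n)
x1 = latent + rng.normal(0, 0.2, 2 * n)
x2 = latent + rng.normal(0, 0.2, 2 * n)
# group B (second half) is shifted in opposite directions on the two metabolites
x1[n:] += 0.4
x2[n:] -= 0.4
group = np.array([0] * n + [1] * n)

def effect_size(v):
    """Standardised mean difference between the two groups for one variable."""
    a, b = v[group == 0], v[group == 1]
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled

# univariate view: modest effect per metabolite
# bivariate view: the contrast x1 - x2 cancels the shared variation
print(effect_size(x1), effect_size(x2), effect_size(x1 - x2))
```

The contrast `x1 - x2` removes the correlated latent variation, so the group difference that each univariate test sees only faintly becomes obvious when both variables are considered together.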
Additionally, a clear separation between the groups of samples can be observed in this case. This shows that multivariate methods that take correlations between metabolites into account are a powerful tool for data analysis. However, at the same time, these models can be extremely sensitive to factors such as noise in the data, chance correlations between variables or groups of variables, and the choice of preprocessing. Univariate approaches are more robust in this respect. Therefore, univariate and multivariate methods both have an important role in metabolomics. There are two main groups of multivariate models, namely unsupervised and supervised techniques. Supervised techniques explicitly use information about the group structure in the data.
In contrast, unsupervised models do not use such information. Often, unsupervised models are used to explore the data and study natural clustering of the samples, while supervised methods can be used to test for significant differences between specific groups of samples. Principal component analysis, or PCA, is one of the most popular unsupervised multivariate methods in metabolomics. PCA is a so-called dimension reduction technique. In essence, it transforms the high-dimensional data space (for instance, 1,000 metabolites correspond to 1,000 dimensions) into a small number of dimensions, usually 2 or 3. For illustration, consider my hand, which is a complex three-dimensional object.
By illuminating my hand with a flashlight, I can inspect properties of my hand in a lower dimensional space by studying the shadows that appear on the wall behind my hand. This is exactly what PCA does to view complex data in only a few dimensions. Different properties of the data can be observed by changing the direction from which we shine the light. For example, from this direction the shadows will clearly show my thumb, my fingers, and the rest of my hand. On the other hand, if I illuminate my hand in this way, the shadows will be quite meaningless. PCA chooses to study directions along which the shadows are the largest.
This corresponds to directions that show or explain the greatest variation in the data. A PCA model returns three outputs, namely a set of scores representing the samples in a few dimensions, the loadings representing how the metabolites contribute to the model, and the percentage of total variance in the data that’s explained by the model. A plot of the scores essentially corresponds to the shadows that the light made on the wall. Each dot corresponds to a sample, and dots that are close together in the plot indicate samples whose measured values are very similar. In this case it can be seen that there are a number of clusters in the data.
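The three outputs just described can be computed in a few lines. This is a minimal PCA sketch in NumPy, assuming mean-centred data and using the singular value decomposition; the dataset here is random and purely illustrative.

```python
import numpy as np

def pca(X, n_components=2):
    """Minimal PCA via SVD: returns scores, loadings, and the
    percentage of total variance explained by each component."""
    Xc = X - X.mean(axis=0)                 # mean-centre each metabolite
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates
    loadings = Vt[:n_components].T                    # metabolite contributions
    explained = 100 * s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, loadings, explained

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 1000))             # 30 samples x 1000 "metabolites"
scores, loadings, explained = pca(X, n_components=2)
print(scores.shape, loadings.shape, explained)
```

Plotting the first two columns of `scores` against each other gives the scores plot discussed above, and the corresponding columns of `loadings` show which metabolites drive each direction.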
The metabolites that are mainly related to the difference between these clusters can be found by inspecting the loading plot. A scores plot is often also used to detect outliers in the data, and to determine whether the quality control samples cluster together. The latter approach assists with the assessment of the analytical reproducibility of the data. Although PCA is a very powerful technique in metabolomics, the scores plots do not always show interesting results. In such cases other unsupervised techniques, such as multidimensional scaling or self-organizing maps, can be used to visualise the data in a few dimensions. To study whether there are clear clusters of samples present in the data, approaches such as hierarchical clustering or k-means clustering can be used.
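As an illustration of the clustering approaches just mentioned, here is a plain k-means sketch in NumPy on synthetic data; in practice a library implementation would normally be used, and the cluster structure here is artificial.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: alternately assign samples to the nearest centroid
    and recompute centroids, until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# two well-separated synthetic "sample" clusters in 5 dimensions
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (20, 5)), rng.normal(3, 0.3, (20, 5))])
labels, centroids = kmeans(X, k=2)
print(labels)
```

On well-separated data like this, k-means recovers the two groups; on real metabolomics data the number of clusters k is unknown and usually has to be chosen or validated separately.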
However, often the goal of data analysis is not to test if there is a clustering in the data per se, but rather to test for differences between specific groups of samples, such as healthy controls and patients with a disease. If this difference is not detected by unsupervised techniques, specific information relating to the group structure can be included in the model. In essence, the model is told that the first sample is a control, the second sample a patient, et cetera. Because of this, such supervised techniques are able to detect very subtle differences between the groups. One of the most popular methods in metabolomics is Partial Least Squares Discriminant Analysis, or PLS-DA.
The output of this model is similar to that of PCA, but now the few dimensions correspond to directions along which the groups are separated while, at the same time, a large part of the variance in the data is explained. Validation is an important aspect of supervised modelling: it assesses whether the observed group separation reflects genuinely relevant differences, rather than arising simply because the model was told to which group each sample belonged. Techniques such as cross-validation, or the use of an independent dataset that was not used to construct the model, can be used for this purpose.
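The cross-validation idea can be sketched as follows. This is a k-fold scheme in NumPy on synthetic data; for brevity a simple nearest-mean classifier stands in for PLS-DA here, since the validation logic is the same for any supervised model.

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, X_test):
    """Toy classifier standing in for PLS-DA: assign each test sample
    to the class whose mean training profile is closest."""
    means = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(X_test[:, None] - means[None], axis=2)
    return d.argmin(axis=1)

def cross_val_accuracy(X, y, n_folds=5, seed=0):
    """k-fold cross-validation: every sample is predicted by a model
    that never saw that sample during training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    correct = 0
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        pred = nearest_mean_predict(X[train], y[train], X[fold])
        correct += np.sum(pred == y[fold])
    return correct / len(X)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (25, 50)), rng.normal(1, 1, (25, 50))])
y = np.array([0] * 25 + [1] * 25)
print(cross_val_accuracy(X, y))
```

A cross-validated accuracy well above what group sizes alone would give is evidence that the separation is real; an accuracy near chance level suggests the model mainly fitted noise.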
If the validation step confirms the observed group separation, the so-called variable importance in projection scores, or VIP statistic, can be used to identify biologically important metabolites that explain the group separation. Typically, all variables with a VIP score close to or greater than one are considered important in the model. Although PLS has been successfully applied in many cases, the model is not suited to every problem. Keep in mind that for each new dataset, factors such as the experimental design, the properties of the data, and the goal of the experiment should be considered to determine which multivariate model should be used.
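How VIP scores follow from a fitted PLS model can be sketched compactly. This is a minimal PLS1/NIPALS fit on a dummy-coded group label with synthetic data; it is illustrative only, and a real analysis would use an established PLS-DA implementation.

```python
import numpy as np

def pls1_vip(X, y, n_components=2):
    """Minimal PLS1 (NIPALS with deflation) on a dummy-coded 0/1 label,
    followed by VIP scores: one importance value per variable."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    W, T, Q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)           # unit-norm weight vector
        t = X @ w                        # scores for this component
        q = (y @ t) / (t @ t)
        pload = (X.T @ t) / (t @ t)
        X = X - np.outer(t, pload)       # deflate X
        y = y - q * t                    # deflate y
        W.append(w); T.append(t); Q.append(q)
    # y-variance explained by each component weights its contribution to VIP
    ssy = np.array([q ** 2 * (t @ t) for q, t in zip(Q, T)])
    W = np.array(W)                      # components x variables
    return np.sqrt(p * (ssy @ W ** 2) / ssy.sum())

rng = np.random.default_rng(4)
n = 40
y = np.array([0.0] * (n // 2) + [1.0] * (n // 2))
X = rng.normal(size=(n, 20))
X[:, 0] += 2 * y                         # only metabolite 0 differs between groups
vip = pls1_vip(X, y)
print(vip[0], vip[1:].max())
```

By construction the mean of the squared VIP scores is one, which is why one is the usual importance threshold: the single informative metabolite stands out well above it, while the noise variables hover around or below it.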
For example, techniques based on multivariate analysis of variance are very powerful when the data has an underlying experimental design. Techniques such as support vector machines or kernel PLS can be very useful when the groups are separated in a complex and nonlinear manner.