This content is taken from the University of Birmingham's online course, Metabolomics: Understanding Metabolism in the 21st Century.

## University of Birmingham

Skip to 0 minutes and 5 seconds In a typical metabolomics experiment a wealth of information is measured for each sample, resulting in a high-dimensional, wide data table. On the one hand this is a blessing, because a large amount of biological information can potentially be observed in a single experiment. On the other hand this can also be a curse. For example, as mentioned in the previous lecture, many traditional approaches for data visualization and statistical analysis cannot cope with such data. A common approach to resolve these issues is to reduce the dimension of the data before data visualization and statistical analysis take place. This essentially means that we somehow want to end up with a data table with only a few columns, a small data table.

Skip to 0 minutes and 50 seconds In metabolomics, a dimension reduction strategy is often employed for this purpose. The aim is to capture the main properties of the data in a new table that is low-dimensional. Principal component analysis, or PCA, is one of the most popular dimension reduction methods. The goal is to reduce the data table to only a few columns or dimensions, while retaining as much information as we can about the main differences between the samples. For illustration, imagine that my left hand is a complex data set. The data is high-dimensional, three dimensions in this case, and very difficult to analyse.

Skip to 1 minute and 30 seconds However, by illuminating my hand with a flashlight in my right hand, I can inspect the properties of my hand by studying the shadows that are projected on the wall behind my hand. This way, I don't have to inspect a complicated 3-dimensional shape, but I can study something flat which only has two dimensions, namely the shadows on the wall. PCA uses mathematical projection to reduce high-dimensional data tables with, let's say, a thousand columns to low-dimensional data tables with only a few columns. Depending on how I position my hand in front of the light, different properties will become visible in the shadows.

Skip to 2 minutes and 14 seconds For example, if I hold my hand like this, you will be able to see in the shadows that I have a thumb and different fingers, and you will see quite a bit about my hand. On the other hand, if I rotate my hand in this direction, the shadows will be quite meaningless and it will be very difficult to even observe that we are looking at a hand. This shows that we need to think carefully about how we rotate our data before projecting it to a lower-dimensional table. PCA aims to show the largest differences between the samples. This corresponds to rotating the data in such a way that the shadows on the wall become as large as possible.
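The hand-and-shadow picture can be sketched numerically. The example below is only an illustration on synthetic data, not the lecturer's code: it builds a flat 3-dimensional point cloud (an assumed stand-in for the hand), then compares a badly chosen 2-dimensional shadow (simply dropping one coordinate) with the shadow chosen by PCA, computed here with NumPy's singular value decomposition. All names and numbers in it are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-dimensional "hand": wide in two directions, thin in the third.
n_samples = 200
flat = rng.normal(size=(n_samples, 3)) * np.array([5.0, 2.0, 0.3])

# Rotate the cloud by 45 degrees so it is not aligned with the coordinate axes.
angle = np.pi / 4
rotation = np.array([
    [np.cos(angle), 0.0, np.sin(angle)],
    [0.0, 1.0, 0.0],
    [-np.sin(angle), 0.0, np.cos(angle)],
])
data = flat @ rotation.T

# PCA works on mean-centred data.
centered = data - data.mean(axis=0)
total_variance = centered.var(axis=0, ddof=1).sum()

# A poorly chosen shadow: drop the first coordinate, keep the other two.
naive_shadow = centered[:, [1, 2]]
naive_fraction = naive_shadow.var(axis=0, ddof=1).sum() / total_variance

# The PCA shadow: project onto the two directions of largest variance.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pca_shadow = centered @ vt[:2].T
pca_fraction = pca_shadow.var(axis=0, ddof=1).sum() / total_variance

print(f"variance kept by dropping a coordinate: {naive_fraction:.2f}")
print(f"variance kept by the PCA projection:    {pca_fraction:.2f}")
```

Both shadows are 2-dimensional, but the PCA projection keeps at least as much of the spread between the points as any other flat projection, which is the sense in which its shadow is the largest.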

Skip to 2 minutes and 54 seconds So how does this work in practice? As an example, I am going to consider a study where metabolomics was used to determine patterns in metabolite profiles associated with particulate organic matter at two locations in the western English Channel, and also at two depths for each location. For this purpose, 64 water samples were analysed by liquid chromatography-mass spectrometry. Subsequently, PCA was applied to reduce the initial data table with 173 columns to a table with only 6 columns. The PCA model captured 83% of the differences between the samples. To visualize these differences, 2-dimensional scatter plots of pairwise combinations of the six columns in the data table were constructed.
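As a rough sketch of what such a reduction looks like in code, the example below applies PCA to a synthetic table with the same shape as the one in the study (64 samples, 173 columns) and keeps 6 columns. The data are random numbers, not the study's measurements, so the printed percentage will not match the 83% quoted above. The helper `pca_reduce` is a hypothetical name, and only mean-centring is applied; any scaling or other pre-processing used in the actual study is not covered here.

```python
import numpy as np


def pca_reduce(table, n_components):
    """Reduce a samples-by-variables table to n_components columns with PCA.

    Returns the scores (the small data table), the loadings (how the data
    was rotated) and the fraction of variance captured by each component.
    """
    centered = table - table.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, vt[:n_components], explained[:n_components]


# Synthetic stand-in for a metabolomics table: 64 samples, 173 peaks.
# A few hidden factors drive most of the variation, plus random noise.
rng = np.random.default_rng(42)
n_samples, n_peaks, n_latent = 64, 173, 6
latent = rng.normal(size=(n_samples, n_latent)) * np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.5])
mixing = rng.normal(size=(n_latent, n_peaks))
table = latent @ mixing + rng.normal(scale=2.0, size=(n_samples, n_peaks))

scores, loadings, explained = pca_reduce(table, n_components=6)

print(scores.shape)    # the small data table: one row per sample, 6 columns
print(loadings.shape)  # one row per component, one column per original peak
print(f"captured by 6 components: {explained.sum():.0%}")
```

The `scores` array plays the role of the small data table: any two of its columns, for example the first and the fourth, can be plotted against each other in a scatter plot.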

Skip to 3 minutes and 42 seconds Here, a plot of the first column (labelled as principal component, or PC one) against the fourth column (labelled as PC four) is shown. A clear separation between the two locations (L and E) could be observed along PC one. Additionally, a clear difference in depth at location E could be observed along PC four. Traditional statistical methods were subsequently applied to the small data table that is produced by PCA. Here, we used a method called multivariate analysis of variance. This way it was confirmed that the observed differences in location and depth along PC one and PC four were indeed significant. Finally, I would like to remark that the PCA model also stores information regarding how the data was rotated before dimension reduction.

Skip to 4 minutes and 33 seconds Because of this, the differences in location and depth could be related back to specific peaks in the original data table. This way, a wide range of relevant metabolites, including amino-acid derivatives, oxidised fatty acids, and glycosylated compounds, were associated with the observed differences. The original publication of this study discusses these findings in relation to the overall composition of organisms at the investigated locations and depths. To summarize, the high-dimensional and wide data tables that are encountered in metabolomics can be difficult to analyse, but the metabolomics community is now routinely applying many techniques to interrogate these large data sets and increase our understanding of the changes in metabolism.
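The step of relating a principal component back to peaks can be illustrated with the stored rotation, usually called the loadings. The sketch below is a minimal example on synthetic data, not the analysis from the publication: two groups of samples are made to differ in only two peaks (an assumption built into the fake data), and the peaks are then ranked by the absolute size of their loading on the first component. The peak names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples, n_peaks = 40, 12
peak_names = [f"peak_{i}" for i in range(n_peaks)]

# Synthetic table: two groups of 20 samples that differ only in peaks 2 and 9.
group = np.repeat([0, 1], n_samples // 2)
table = rng.normal(size=(n_samples, n_peaks))
table[group == 1, 2] += 8.0
table[group == 1, 9] -= 8.0

# PCA via SVD of the mean-centred table; rows of vt are the loadings.
centered = table - table.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1_loadings = vt[0]

# Rank the original peaks by how strongly they contribute to PC one.
order = np.argsort(np.abs(pc1_loadings))[::-1]
top_peaks = [peak_names[i] for i in order[:2]]

for i in order[:4]:
    print(f"{peak_names[i]}: loading {pc1_loadings[i]:+.2f}")
```

In this synthetic table the two shifted peaks should dominate the first component, with loadings of opposite sign because one peak goes up where the other goes down. In real data the ranking is a starting point for interpretation rather than proof that a metabolite is involved.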

Skip to 5 minutes and 19 seconds Principal component analysis is one example: a useful dimension reduction technique to assist in the analysis of high-dimensional and wide data tables. The output of PCA is a small data table that can be visualized in scatter plots to display the biggest differences between the samples. Because the PCA model retains information about how the dimension reduction was carried out, these differences can also be related back to peaks in the original data table. In the next few steps I will look at some of the approaches that we apply to the data after PCA.

# Data reduction and visualisation

Dr Jasper Engel discusses the data reduction and visualisation techniques that are applied in the analysis of metabolomics data.