Deconvolution: current challenges and the best tools
Data processing is the computational conversion of raw LC-MS data into biological knowledge. It involves multiple steps, including raw data deconvolution and the chemical identification of metabolites.
Data deconvolution, sometimes called peak picking, is itself complex because of the complexity of the data and the variation in mass-to-charge ratio, retention time and chromatographic peak area that is introduced during data acquisition.
An LC-MS dataset is three-dimensional: each metabolite feature has three components, (1) the retention time, (2) the mass-to-charge ratio and (3) the chromatographic peak area, which is related to the concentration of the metabolite. We use all of these components to construct a data matrix that reports the peak area of each metabolite in each sample.
For each metabolite, multiple metabolite features are normally detected in the data, but to simplify the explanation let us assume that only one metabolite feature is present for each metabolite. This feature will be detected as a chromatographic peak, with an extracted ion chromatogram defining the peak shape. The extracted ion chromatogram is a plot of the signal for a single mass-to-charge ratio, or a small range of mass-to-charge ratios, across the retention time range. Each extracted ion chromatogram could be plotted manually, but there are thousands of them and this would take a long time. Instead, software packages can be applied to perform this automatically. One of the most commonly used software packages in metabolomics is XCMS, which was developed in Gary Siuzdak’s group in San Diego. Many other software packages are freely available or can be purchased from mass spectrometry or software companies. Each program performs peak picking in a different and complex way. We will not discuss the specific details here, but they all report a metabolite feature with an associated mass-to-charge ratio (m/z), retention time (RT) and peak area. Within a single sample this step is relatively straightforward, because no variation in m/z or RT is expected within one analysis.
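To make the idea of an extracted ion chromatogram concrete, here is a minimal sketch in Python. It assumes a toy representation of the raw data (a list of scans, each with a retention time and a list of m/z and intensity pairs) and a simple trapezoidal peak area. The function names, the tolerance and the data values are all hypothetical, and this is not how XCMS or any other package implements peak picking.

```python
from typing import List, Tuple

# Assumed toy format: one scan = (retention time in seconds, [(m/z, intensity), ...])
Scan = Tuple[float, List[Tuple[float, float]]]


def extract_ion_chromatogram(scans: List[Scan], target_mz: float, tol: float):
    """Sum the intensity within target_mz +/- tol in every scan."""
    eic = []
    for rt, peaks in scans:
        total = sum(i for mz, i in peaks if abs(mz - target_mz) <= tol)
        eic.append((rt, total))
    return eic


def peak_area(eic):
    """Trapezoidal area under the chromatogram (intensity x seconds)."""
    area = 0.0
    for (rt0, i0), (rt1, i1) in zip(eic, eic[1:]):
        area += 0.5 * (i0 + i1) * (rt1 - rt0)
    return area


def apex(eic):
    """Return the (retention time, intensity) point with the highest intensity."""
    return max(eic, key=lambda point: point[1])


# Illustrative values only, not real measurements.
scans = [
    (60.0, [(180.063, 0.0), (255.232, 50.0)]),
    (61.0, [(180.064, 100.0), (255.232, 40.0)]),
    (62.0, [(180.063, 400.0), (255.233, 45.0)]),
    (63.0, [(180.062, 100.0), (255.232, 55.0)]),
    (64.0, [(180.063, 0.0), (255.231, 50.0)]),
]

eic = extract_ion_chromatogram(scans, target_mz=180.063, tol=0.005)
feature = {"mz": 180.063, "rt": apex(eic)[0], "area": peak_area(eic)}
print(feature)
```

The resulting dictionary holds the three components described above for one metabolite feature: mass-to-charge ratio, retention time and peak area.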
In different samples, the same metabolite may be detected at a slightly different mass-to-charge ratio or retention time. The next step is to integrate the data acquired from each separate sample into a single data matrix. To do this we can reduce the variation by aligning the mass-to-charge ratio or retention time across all samples, though this is never a perfect process. The data are then placed into ‘bins’, where each bin covers a small range of mass-to-charge ratio and retention time that is equal to or greater than the variation observed in the dataset. Features from all samples that fall into the same bin are grouped together, so that the same metabolite in each sample is reported as the same metabolite in the single integrated dataset. This is critical for the subsequent data analysis, because it gives us a single data matrix built from all of the samples.
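The binning step can be sketched as follows. This is a simplified illustration that assumes each sample has already been reduced to a list of (m/z, retention time, peak area) features, and that the first feature seen in a bin serves as the reference for that bin. The function name, tolerances and values are hypothetical, and real software packages use more sophisticated alignment and grouping.

```python
def build_data_matrix(samples, mz_tol, rt_tol):
    """Group features from all samples into shared m/z and RT bins.

    samples: dict of sample name -> list of (mz, rt, area)
    Returns (bins, matrix): bins[i] is the (mz, rt) reference of bin i and
    matrix[i] maps sample name -> peak area for that bin.
    """
    bins = []    # reference (mz, rt) of each bin
    matrix = []  # one dict per bin: sample name -> peak area
    for name, features in samples.items():
        for mz, rt, area in features:
            for idx, (ref_mz, ref_rt) in enumerate(bins):
                if abs(mz - ref_mz) <= mz_tol and abs(rt - ref_rt) <= rt_tol:
                    # Same bin: treat as the same metabolite feature.
                    matrix[idx][name] = matrix[idx].get(name, 0.0) + area
                    break
            else:
                # No existing bin matches, so open a new one.
                bins.append((mz, rt))
                matrix.append({name: area})
    return bins, matrix


# Illustrative values only: the same two features drift slightly between samples.
samples = {
    "sample_A": [(180.0630, 62.0, 600.0), (255.2320, 310.0, 1200.0)],
    "sample_B": [(180.0634, 63.5, 450.0), (255.2316, 308.0, 900.0)],
    "sample_C": [(180.0628, 61.0, 700.0)],
}

bins, matrix = build_data_matrix(samples, mz_tol=0.002, rt_tol=5.0)
for (mz, rt), row in zip(bins, matrix):
    # A sample with no peak in a bin is reported as a missing value (None).
    print(mz, rt, [row.get(name) for name in samples])
```

The tolerances play the role of the bin width: they need to be at least as large as the variation observed between samples, otherwise the same metabolite would be split across several rows of the matrix.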
For untargeted studies this process is complex, as the metabolites present are not known and so peak picking operates with no prior information. For targeted assays the metabolites, their mass-to-charge ratios and their retention times are known, and software can be programmed to search only for these metabolites. This makes the process much simpler.
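The targeted case can be illustrated with a short sketch. It assumes a hypothetical target list that gives the expected m/z and retention time of each metabolite, and a sample that has already been reduced to (m/z, retention time, peak area) features. The names, tolerances and values are invented for illustration.

```python
def find_targets(features, targets, mz_tol, rt_tol):
    """Look up each known target in one sample's feature list.

    features: list of (mz, rt, area)
    targets: dict of metabolite name -> (expected mz, expected rt)
    Returns a dict of name -> peak area, or None when nothing matches.
    """
    results = {}
    for name, (t_mz, t_rt) in targets.items():
        matches = [
            area
            for mz, rt, area in features
            if abs(mz - t_mz) <= mz_tol and abs(rt - t_rt) <= rt_tol
        ]
        # If several features match, keep the largest peak (a simplifying assumption).
        results[name] = max(matches) if matches else None
    return results


# Illustrative values only.
features = [(180.0634, 63.5, 450.0), (255.2316, 308.0, 900.0), (132.1019, 95.0, 300.0)]
targets = {
    "metabolite_1": (180.0630, 62.0),   # hypothetical target present in the sample
    "metabolite_2": (300.1000, 200.0),  # hypothetical target absent from the sample
}

results = find_targets(features, targets, mz_tol=0.002, rt_tol=5.0)
print(results)
```

Because the search is restricted to known m/z and retention time windows, features that are not on the target list are simply ignored.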
The data matrix is now constructed and represents the metabolite composition of all the samples that were analyzed. The next steps are to chemically identify metabolites and perform data analysis to define the biologically important metabolites.
© University of Birmingham and Birmingham Metabolomics Training Centre.