
# Clustering and correlation analysis

In the previous step, we saw how to identify and visualise patterns or groups (if present) in our high-dimensional resistome data using ordination analysis.

Another widely used approach for finding hidden clusters or groups in data is hierarchical clustering analysis. As the name suggests, it organises the samples into a hierarchy. One common variant, known as agglomerative clustering, starts by treating each sample as its own cluster and then repeatedly merges the two closest clusters until all samples are grouped into one big cluster.

This approach requires two things: a way of measuring the distance between two samples (points), and a rule for measuring the distance between clusters once points have been merged. Firstly, we can use the same distance or dissimilarity metrics as in beta diversity analysis. Other commonly used choices include Euclidean distance (the straight-line distance between two points) and correlation-based measures such as Pearson. Secondly, there are several clustering algorithms (linkage methods) that specify how the distance from one point or cluster to a merged cluster is measured. For example, ‘Complete linkage’ measures the distance between two clusters as the distance between their two furthest points, whereas ‘Single linkage’ uses the distance between their two nearest points. Unlike these methods, ‘Ward linkage’ does not measure the distance directly; instead, it analyses the variance of the clusters, merging the pair that gives the smallest increase in within-cluster variance. In ResistoXplorer, the results of clustering analysis can be visualised as a heatmap and a dendrogram.
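To make the difference between single and complete linkage concrete, here is a minimal sketch of agglomerative clustering in plain Python. It is only an illustration and not how ResistoXplorer implements clustering: the function names (`euclidean`, `cluster_distance`, `agglomerate`) and the four toy samples are hypothetical, Euclidean distance is assumed, and Ward linkage is left out for brevity.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, given as tuples of sample indices."""
    pair_dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":    # nearest pair of points
        return min(pair_dists)
    if linkage == "complete":  # furthest pair of points
        return max(pair_dists)
    raise ValueError(f"unsupported linkage: {linkage}")


def agglomerate(points, linkage="complete"):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram is drawn from.
    """
    clusters = [(i,) for i in range(len(points))]  # every sample starts alone
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        dist = cluster_distance(a, b, points, linkage)
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, dist))
    return merges


# Hypothetical toy data: four samples described by two features each.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]

merges_single = agglomerate(samples, linkage="single")
merges_complete = agglomerate(samples, linkage="complete")

for name, merges in (("single", merges_single), ("complete", merges_complete)):
    print(name)
    for a, b, dist in merges:
        print(f"  merge {a} + {b} at distance {dist:.3f}")
```

With this toy data, both linkage methods merge the same pairs in the same order, but the final merge happens at a larger distance under complete linkage than under single linkage, because it is measured between the furthest rather than the nearest points of the two clusters.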

Besides finding similarities or dissimilarities between samples, we can also look for associations or co-occurrence patterns between ARGs or other features. ResistoXplorer currently supports three methods for calculating pairwise correlations between features: Pearson’s correlation, Spearman’s rank correlation, and Kendall’s tau correlation. Though still commonly used, these methods do not address the issue of spurious correlations that arise from the compositional nature of sequencing data. More advanced methods such as SparCC and SPIEC-EASI have been designed to address this. However, these methods are computationally intensive and their superior performance over others has not yet been fully established, so they are not supported in ResistoXplorer as of now.
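The three correlation measures named above can be sketched in a few lines of plain Python to show how they differ. This is an illustrative sketch only, not ResistoXplorer's implementation: the function names and the two abundance vectors (`arg_a`, `arg_b`) are hypothetical, and the Kendall function is the simple tau-a variant, which makes no correction for ties.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs / all pairs."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            score += 1   # pair ordered the same way in x and y
        elif product < 0:
            score -= 1   # pair ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


# Hypothetical normalised abundances of two ARGs across five samples.
# arg_b always rises with arg_a, but not in a straight line.
arg_a = [1.0, 2.0, 3.0, 4.0, 5.0]
arg_b = [1.0, 4.0, 9.0, 16.0, 25.0]

r_pearson = pearson(arg_a, arg_b)
r_spearman = spearman(arg_a, arg_b)
r_kendall = kendall_tau_a(arg_a, arg_b)

print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
print(f"Kendall:  {r_kendall:.3f}")
```

Because the relationship in this toy example is monotonic but not linear, the two rank-based measures reach their maximum while Pearson's correlation stays slightly below 1. None of the three accounts for compositionality, which is the limitation described above.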

Both clustering and correlation analyses are performed on normalised data and can be carried out at different functional levels in ResistoXplorer.

Now, let’s proceed to the next step, where you will explore and perform clustering analysis on resistome data in a hands-on exercise.