Database harmonisation

The most widely used and regularly updated AMR databases (e.g., NCBI, ResFinder, CARD) directly incorporate new AMR data from peer-reviewed literature as well as their own genomic analyses.

However, differences in their priorities, inclusion criteria, and curatorial expertise can lead to inconsistencies in how this is done, as we briefly touched on in the previous step. These inconsistencies can be as minor as a different spelling or capitalisation of the same AMR gene, such as “ANT(2’’)-Ia” vs “ant(2’’)-Ia”, but they can also be more serious: it is not uncommon for the same gene to have a different name or allele number in different databases, or for two distinct genes to share the same name (Figure 1). Although there are efforts to standardise naming across databases (e.g., https://github.com/arpcard/amr_curation/issues), this can be a slow process and is not always possible. Existing attempts to solve this problem in an automated fashion take two general approaches: merging databases or cross-linking them.
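The minor inconsistencies mentioned above (capitalisation and apostrophe or prime characters) can be illustrated with a small comparison key. This is only a sketch: the function name `normalise_gene_name` and the list of apostrophe-like characters are assumptions made for illustration, not part of any database's tooling.

```python
import unicodedata

# Characters sometimes typed in place of a plain apostrophe.
# Assumption: this list is illustrative, not exhaustive.
APOSTROPHE_VARIANTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2032": "'",   # prime
    "\u201d": "''",  # right double quotation mark
    '"': "''",       # straight double quote
}

def normalise_gene_name(name: str) -> str:
    """Return a comparison key that ignores case and apostrophe variants."""
    # NFKC also turns a double prime (U+2033) into two primes (U+2032).
    key = unicodedata.normalize("NFKC", name.strip())
    for variant, plain in APOSTROPHE_VARIANTS.items():
        key = key.replace(variant, plain)
    return key.casefold()

# The two spellings from the text collapse to the same key.
print(normalise_gene_name("ANT(2\u2019\u2019)-Ia") == normalise_gene_name("ant(2'')-Ia"))
```

A key like this is only suitable for flagging candidate matches. A matching key does not show that two entries are the same gene, and it does nothing for the more serious cases where names and sequences genuinely disagree.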

Figure 1: The potential scenarios that can occur when trying to reconcile AMR gene names between two databases. The green ticks highlight scenarios that are consistent (or can be easily made consistent) and orange crosses indicate potentially highly misleading incompatibilities between the databases.
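One way to think about these scenarios is as a comparison of both names and sequences for a pair of entries. The sketch below is a simplification: the four labels are illustrative and may not correspond exactly to the panels in Figure 1, the gene names are made up, and exact sequence identity stands in for a proper similarity search.

```python
def classify_pair(name_a: str, seq_a: str, name_b: str, seq_b: str) -> str:
    """Label how two database entries relate, using names and exact sequences."""
    same_name = name_a.casefold() == name_b.casefold()
    same_seq = seq_a.upper() == seq_b.upper()
    if same_name and same_seq:
        return "consistent"
    if same_seq:
        return "same sequence, different name"
    if same_name:
        return "same name, different sequence"
    return "no match"

# Hypothetical entries: (name, sequence) from Database 1 and Database 2.
pairs = [
    (("geneA", "ATGGCT"), ("genea", "ATGGCT")),
    (("geneB-1", "ATGCCC"), ("geneB-2", "ATGCCC")),
    (("geneC", "ATGAAA"), ("geneC", "ATGTTT")),
]
for (name_a, seq_a), (name_b, seq_b) in pairs:
    print(name_a, name_b, "->", classify_pair(name_a, seq_a, name_b, seq_b))
```

The first case can easily be made consistent, whereas the other two are the kind of incompatibility that needs a curator's attention.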

Current Strategies

Several meta-databases (and associated tools), such as MEGARes (1) and DeepARG-DB (2), attempt to merge two or more AMR databases. This typically involves sequence clustering to remove redundancy, followed by sequence similarity searches to help structure the annotations. However, because only a single representative of each unique sequence is retained, differences in nomenclature (naming schemes) may not be resolved, and contextual information from the individual databases (e.g., AMR mechanism or AMR gene family) can be lost. Similarly, any curation intended to optimise accurate detection of AMR genes by the database-associated tool (such as specific sequence similarity score cut-offs) is typically lost.
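The loss of information when merging can be seen in a toy example. The sketch below is not how MEGARes or DeepARG-DB are built: it collapses only exactly identical sequences (real pipelines cluster at a sequence identity threshold), and the database contents, gene names, and the `merge_databases` function are all hypothetical.

```python
from collections import defaultdict

def merge_databases(*databases):
    """Collapse identical sequences across databases into one record each.

    Each database is a (db_name, {gene_name: sequence}) pair.
    """
    by_sequence = defaultdict(list)
    for db_name, entries in databases:
        for gene_name, sequence in entries.items():
            by_sequence[sequence.upper()].append((db_name, gene_name))

    merged = {}
    for sequence, sources in by_sequence.items():
        # The first name seen becomes the representative; the rest are set aside.
        merged[sequence] = {"name": sources[0][1], "dropped": sources[1:]}
    return merged

# Hypothetical databases that name the same sequence differently.
db1 = ("DB1", {"geneX-1": "ATGGCTAAAGGT", "geneY": "ATGCCCTTTGAC"})
db2 = ("DB2", {"geneX_variant_1": "ATGGCTAAAGGT"})

merged = merge_databases(db1, db2)
for record in merged.values():
    print(record["name"], "dropped:", record["dropped"])
```

Here the DB2 name for the shared sequence survives only in the `dropped` list. A merge that discards that list keeps one naming scheme and loses the other, along with any annotations attached to it.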

Alternatively, a couple of recent tools attempt to directly infer sequence similarity mappings between AMR databases (i.e., Gene X in Database 1 is most similar to Gene Y in Database 2). This has been done either by simply running one database’s tool on the other database’s sequences (e.g., ArgNorm) or by using graph-based approaches built on reciprocal best hits (e.g., chAMReDb). Although this is a useful approach for curating and comparing data across databases, it can lead to false mappings and misleading results when certain sequences are present in one database but absent from another (Figure 2).

Figure 2: Missing sequences can create incorrect cross-database mappings. The closest cross-database sequence to Database 2’s orange gene is Database 1’s yellow gene, despite these being different genes. This is because Database 1 is missing an orange gene and Database 2 is missing a yellow gene. Incorrect mappings like these often require manual curation to resolve and can be hard to detect, especially if the yellow and orange genes have relatively similar associated contextual data (e.g., potential resistance phenotypes).
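The situation in Figure 2 can be reproduced with a minimal reciprocal best-hit sketch. This is not the chAMReDb or ArgNorm implementation: the sequences are invented, `identity` is a simple position-by-position comparison that assumes equal-length sequences (a real analysis would use an alignment tool), and the `min_identity` cut-off is a hypothetical parameter.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_hits(queries: dict, targets: dict) -> dict:
    """Map each query name to (most similar target name, identity)."""
    hits = {}
    for q_name, q_seq in queries.items():
        t_name = max(targets, key=lambda t: identity(q_seq, targets[t]))
        hits[q_name] = (t_name, identity(q_seq, targets[t_name]))
    return hits

def reciprocal_best_hits(db1: dict, db2: dict, min_identity: float = 0.0) -> dict:
    """Keep pairs that are each other's best hit and meet the cut-off."""
    forward = best_hits(db1, db2)
    backward = best_hits(db2, db1)
    return {
        a: (b, score)
        for a, (b, score) in forward.items()
        if backward[b][0] == a and score >= min_identity
    }

# Toy data: Database 1 lacks "orange" and Database 2 lacks "yellow".
db1 = {"yellow": "ATGGCTAAAGGT", "blue": "ATGCCCTTTGAC"}
db2 = {"orange": "ATGGCTAAACGA", "blue": "ATGCCCTTTGAC"}

mapping = reciprocal_best_hits(db1, db2)
print({a: b for a, (b, _) in mapping.items()})
# {'yellow': 'orange', 'blue': 'blue'}

strict = reciprocal_best_hits(db1, db2, min_identity=0.9)
print({a: b for a, (b, _) in strict.items()})
# {'blue': 'blue'}
```

Without a cut-off, yellow and orange are paired simply because each is the other's closest available sequence. A cut-off removes the false pair in this toy case, but the value of 0.9 is arbitrary: when the two genes are very similar, or a suitable threshold differs between gene families, a threshold alone will not catch the error, which is why manual curation is still needed.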

Unfortunately, both types of approach have drawbacks and are typically limited to AMR determined by gene presence/absence (e.g., beta-lactamases). Other AMR determinants, such as sequence variants, multicopy genes, and/or multi-component systems, present as-yet-unresolved challenges to database harmonisation. Ultimately, database harmonisation is still an open problem that requires careful (and often manual) consideration to avoid its many potential pitfalls.

For further reading, please see the list of references attached below.

© Wellcome Connecting Science
This article is from the free online course Antimicrobial Databases and Genotype Prediction: Data Sharing and Analysis

Created by
FutureLearn - Learning For Life
