
Why you should collect metadata

© COG-Train

To ensure that SARS-CoV-2 genomic data are as useful as possible, they should be accompanied by appropriate metadata. Curating metadata and sharing them locally or publicly can be time-consuming, but both are an integral part of any sequencing pipeline. The required resources should be allocated when the study is being designed.

Metadata should include, as an absolute minimum, the date and location of sample collection. However, the release of additional metadata greatly increases the potential applications of a genomic sequence. Where possible, therefore, information on specimen type and how the sequence was obtained in the laboratory should be included (Table 2). Duplicate samples from the same individual or duplicate sequences from the same sample should be clearly identified. Demographic and clinical information, such as age, sex, presence of co-morbidities, disease severity and outcome, and links to other sequences in the database, are encouraged where such information does not risk identifying the patient.

A global consensus on specific formats for metadata (such as dates) would reduce ambiguity and allow genomic sequence data from many different laboratories to be rapidly compiled into larger data sets. Care should also be taken when using Microsoft Excel to ensure that dates are not automatically reformatted. Some consensus genome repositories, including GISAID, already place format restrictions on certain fields. If data repositories do not already impose formats, the format restrictions for SARS-CoV-2 shown in Tables 1 and 2 are suggested. Tables 1 and 2 also highlight examples of analyses that require the provision of specific metadata.
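As an illustration of enforcing a date format before submission, the following is a minimal Python sketch that accepts only strict YYYY-MM-DD strings. The function name `parse_collection_date` is hypothetical and is not part of any repository's tooling; it simply shows one way to catch values that a spreadsheet may have silently reformatted.

```python
import re
from datetime import date

# Strict YYYY-MM-DD: four digits, two digits, two digits, hyphen separated.
_ISO_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_collection_date(value: str) -> date:
    """Return a date for a strict YYYY-MM-DD string, else raise ValueError."""
    match = _ISO_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"not in YYYY-MM-DD format: {value!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() itself rejects impossible dates such as 2020-02-30.
    return date(year, month, day)


if __name__ == "__main__":
    # Typical spreadsheet damage: reordered dates, dropped zeros, serial numbers.
    for raw in ["2020-03-07", "07/03/2020", "2020-3-7", "43897"]:
        try:
            print(raw, "->", parse_collection_date(raw))
        except ValueError as err:
            print(raw, "-> rejected:", err)
```

Running a check like this on every row before upload makes reformatted dates visible while they can still be corrected against the original records.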

WHO strongly encourages rapid public sharing of sequences and metadata (section 4). However, it is vital to protect patient anonymity. Laboratories should carefully consider whether patients could be identified if all available metadata are shared together. Where few COVID-19 cases have been observed, there is a greater risk of patient anonymity being compromised and therefore fewer data can typically be shared. Where it is judged inappropriate to share detailed metadata via publicly available repositories, it may nevertheless be appropriate to grant access to a small number of users via secure locally developed platforms.

Where it is not possible to share all metadata without risking patient confidentiality, the data that are most useful for global studies should be preferentially shared. For example, sampling location, date and travel history are more useful for phylodynamic studies than patient age or sex (Table 1).
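One way to apply this prioritisation is to derive a reduced public record from the full local record. The sketch below is only an illustration: the field names (`collection_date`, `location`, `travel_history`), the slash-separated location string and the choice to coarsen location to country level are all assumptions, and the fields that can actually be released should be decided by local ethical review.

```python
# Hypothetical field names; adapt them to the local metadata sheet.
PUBLIC_FIELDS = ("collection_date", "location", "travel_history")


def public_subset(record: dict, public_fields=PUBLIC_FIELDS) -> dict:
    """Keep only the fields chosen for public release.

    Location is assumed to be "Continent/country/region/city" and is
    truncated to "Continent/country" to lower its resolution.
    """
    shared = {key: record[key] for key in public_fields if key in record}
    if "location" in shared:
        shared["location"] = "/".join(shared["location"].split("/")[:2])
    return shared


if __name__ == "__main__":
    full_record = {
        "collection_date": "2020-03-07",
        "location": "Europe/United Kingdom/England/Cambridge",
        "patient_age": "65",
        "sex": "female",
    }
    print(public_subset(full_record))
    # {'collection_date': '2020-03-07', 'location': 'Europe/United Kingdom'}
```

The full record stays in the local database, so detailed metadata can still be shared with a small number of users through a secure platform if that is judged appropriate.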

Some laboratories choose to add jitter (noise) to reported dates to decrease the chance that patients can be identified. This can be achieved by a number of methods, for example, by choosing a false date within 5 days on either side of the date of sample collection, or by using the sequencing date as the sample date. Such practices negatively affect molecular clock-based phylogenetic inference and should ideally be avoided. If this practice is nevertheless followed, information on exactly how the new date was selected should be provided as a note.
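If a laboratory does decide to jitter dates despite this caution, the method should travel with the data. The following Python sketch shows one possible way to do that, assuming a uniform random offset of up to 5 days on either side; the function name `jitter_date` and the wording of the note are hypothetical.

```python
import random
from datetime import date, timedelta


def jitter_date(true_date: date, max_days: int = 5, rng=None):
    """Return (jittered_date, note) for a true collection date.

    The note describes the method but not the offset that was drawn,
    so the true date cannot be recovered from it.
    """
    rng = rng or random.Random()
    offset = rng.randint(-max_days, max_days)  # inclusive on both ends
    note = (
        "Collection date jittered: uniform random offset "
        f"within {max_days} days on either side of the true date"
    )
    return true_date + timedelta(days=offset), note


if __name__ == "__main__":
    shifted, note = jitter_date(date(2020, 3, 7))
    print(shifted.isoformat(), "|", note)
```

The note can be placed in the comments field of the submission so that analysts know how much uncertainty the date carries.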

Table 1 – Sample-specific metadata format and use

| Metadata type | Recommended format if applicable | Analyses for which the metadata are required |
| --- | --- | --- |
| Date of sample collection | YYYY-MM-DD. If the date of sampling is unavailable, date received by testing laboratory could be adopted as an alternative, but this should be clearly indicated | Molecular clock phylogenies (including any models implemented in BEAST or BEAST2). These can provide estimates of dates of introduction, changes in outbreak size over time and evolutionary rate |
| Location | Continent/country/region/city. For discrete phylogeographical analyses (section 5.4.3), location resolution can be low (e.g. country level information for consideration of movement between countries) but higher resolution data is preferable to allow finer-scale analyses. Continuous phylogeographical approaches typically require relatively high-resolution data (e.g. city or municipality) | Any phylogenetic interpretation of global or regional virus spread (including models in BEAST or BEAST2) |
| Host | For example, human or Mustela lutreola | Host range and virus evolution |
| Patient age | For humans, give age in years (e.g. 65) or age with unit if under 1 year (e.g. 1 month, 7 weeks). For non-human animals, juvenile or adult | Descriptive epidemiology or as a possible trait for discrete phylodynamic inference |
| Sex | Male, female or unknown | Descriptive epidemiology |
| Additional host information | No standard format. For animals, this may include context, such as “domestic – farm”, “domestic – household”, “wild”, etc. | Disease surveillance in human or animal hosts |
| Travel history | No standard format. Travel history in the 14 days preceding symptom onset should be obtained from patients where possible. Deliberate release of travel history only to a low resolution (e.g. country) may be important to protect patient confidentiality | Phylogeographical or phylodynamic analyses directed at estimating transmission rates or routes between regions |
| Cluster or isolate name | No standard format. Appropriate formats may include “Same epidemiological cluster as sample X”, “Same patient as sample X”, or “Sample from patient XYZ” (where XYZ is an anonymized identifier that cannot be traced back to the patient or used to access other patient data that might compromise confidentiality) | Phylogenetic downsampling to ensure the appropriateness of phylodynamic models. Cluster investigation |
| Date of symptom onset | YYYY-MM-DD | Specialist phylodynamic applications that investigate transmission clusters |
| Symptoms | No standard format. Appropriate degree of symptoms; may include “severe”, “mild” and “out of norm” | Descriptive epidemiology |
| Clinical outcome if known | No standard format. Appropriate formats may include “recovered”, “death” and “unknown” | Descriptive epidemiology |
| Comments | No standard format. Appropriate comments may include how samples were selected (e.g. “cluster investigation”, “randomly”), or the storage location of other data files, such as raw read data | Interpretation of data quality or utility |
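The patient age format in Table 1 can be checked mechanically. The sketch below is one possible interpretation in Python: the function name `is_valid_age` is hypothetical, and the accepted units for infants (days, weeks, months) are an assumption based on the examples given, not an official list.

```python
import re

# Whole years for ages of 1 year or more, e.g. "65".
_YEARS = re.compile(r"[0-9]{1,3}")
# Number plus unit for infants, e.g. "1 month" or "7 weeks" (assumed units).
_UNDER_ONE_YEAR = re.compile(r"[0-9]+ (day|week|month)s?")


def is_valid_age(value: str, human: bool = True) -> bool:
    """Check an age string against the recommended format."""
    if not human:
        # Non-human animals are recorded as a life stage, not a number.
        return value in {"juvenile", "adult"}
    return bool(_YEARS.fullmatch(value) or _UNDER_ONE_YEAR.fullmatch(value))


if __name__ == "__main__":
    for raw in ["65", "1 month", "7 weeks", "sixty-five", "65 years"]:
        print(repr(raw), is_valid_age(raw))
```

Values that fail the check are best referred back to whoever entered them rather than corrected automatically, since the intended age cannot always be inferred.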

Table 2 – Sequence-specific metadata format and use

| Metadata type | Recommended format if applicable | Analyses for which the metadata are required |
| --- | --- | --- |
| Specimen source, sample type | | Effect of cell tropism |
| Passage details, history | No standard format. It is important to indicate that cell culture was conducted (e.g. “Cultured”); ideally, this information should include the type of cells used and the number of passages | Removal of cell-cultured viruses (which may have induced genetic changes) |
| Sequencing technology | No standard format. Ideally, this should include the laboratory approach and sequencing platform (e.g. “Metagenomics on Illumina HiSeq 2500” or “ARTIC PCR primer scheme on ONT MinION”) | Sequencing artefacts |
| Assembly method, consensus generation method | No standard format | Sequencing artefacts |
| Minimum sequencing depth required to call sites during consensus sequence generation | e.g. 20x | Sequencing artefacts |
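To show why the minimum sequencing depth matters to anyone reusing a sequence, here is a small Python sketch of one common convention: replacing sites below the threshold with `N` when the consensus is generated. The function name `mask_low_depth` and the masking convention are assumptions for illustration; pipelines differ in how they handle low-depth sites, which is why the threshold should be reported.

```python
from typing import Sequence


def mask_low_depth(consensus: str, depths: Sequence[int], min_depth: int = 20) -> str:
    """Replace bases covered by fewer than `min_depth` reads with 'N'."""
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(consensus, depths)
    )


if __name__ == "__main__":
    # Per-site read depths for a four-base toy sequence.
    print(mask_low_depth("ACGT", [30, 5, 20, 19]))  # ANGN
```

With a threshold of 20x, the second and fourth sites are masked; with a lower threshold they would have been called, so the same reads can give different consensus sequences.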

Extensive sequence-specific metadata (Table 2) should be shared, as patient anonymity is typically unaffected. However, sharing all of the sample-specific information listed in Table 1 might compromise patient anonymity. Therefore, an ethical review should be conducted to determine which metadata can be safely shared. It may be appropriate to share less data on public databases than on databases that are held and analysed locally.

Think about other examples of metadata and let us know in the comments.

This article is from the free online course A Practical Guide for SARS-CoV-2 Whole Genome Sequencing.

Created by
FutureLearn - Learning For Life
