
What is metadata?

Basic concepts of metadata

Metadata is any additional information collected about a sample that is being sequenced. For clinical samples, it is mostly information about the patient, the testing, and the sample type. This information helps us find links within the sequencing data that would not otherwise be possible. For example, in a set of suspected hospital outbreak samples, metadata such as the time of sample collection and the ward where the patient was located can help confirm or refute outbreak clusters found in the sequencing data.
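The outbreak example above can be sketched in a few lines of Python. This is only an illustration: the sample IDs, wards, dates, cluster labels, and the 14-day window are all invented assumptions, not COG-UK data or rules.

```python
from datetime import date
from itertools import combinations

# Hypothetical samples: every value here is invented for illustration.
# "cluster" stands in for a grouping derived from the sequencing data.
samples = [
    {"sample_id": "S1", "ward": "Ward A", "collected": date(2021, 1, 4), "cluster": "C1"},
    {"sample_id": "S2", "ward": "Ward A", "collected": date(2021, 1, 6), "cluster": "C1"},
    {"sample_id": "S3", "ward": "Ward B", "collected": date(2021, 3, 20), "cluster": "C1"},
]


def epi_linked(a, b, max_days=14):
    """True if two samples share a ward and were collected within max_days."""
    same_ward = a["ward"] == b["ward"]
    close_in_time = abs((a["collected"] - b["collected"]).days) <= max_days
    return same_ward and close_in_time


def supported_pairs(samples, max_days=14):
    """Pairs in the same genomic cluster that the metadata also supports."""
    return [
        (a["sample_id"], b["sample_id"])
        for a, b in combinations(samples, 2)
        if a["cluster"] == b["cluster"] and epi_linked(a, b, max_days)
    ]


print(supported_pairs(samples))  # [('S1', 'S2')]
```

Here S3 sits in the same genomic cluster but on a different ward and months later, so the metadata does not support linking it to the other two samples.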

At COG-UK we collected a large number of fields relating to each sample. The metadata collected could be broadly divided into the following categories.

  • Biosamples – Metadata fields in this category mostly related to the patient the sample was collected from, e.g. sample collection date, location, age, sex, admitting hospital.
  • Metrics – Metadata fields in this category describe the quality of the samples. These consisted mostly of the Ct (cycle threshold) value, the target gene, and the platform used to measure the Ct.
  • Library – This category mostly captured information on the library preparation methods for the genomic material to be sequenced, such as primer types, library kits, and library sources.
  • Sequencing – This category captured information on the final sequencing and data analysis, such as the sequencing technology and machine used, along with the bioinformatics pipelines used for downstream analysis.
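One way to picture the four categories is as a nested record for a single sample. The sketch below is an assumption for illustration only: the class names, field names, and example values are hypothetical and are not the actual COG-UK schema.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional


@dataclass
class Biosample:
    collection_date: date
    location: str
    age: Optional[int] = None
    sex: Optional[str] = None


@dataclass
class Metrics:
    ct_value: float
    target_gene: str
    platform: str


@dataclass
class Library:
    primer_type: str
    library_kit: str
    library_source: str


@dataclass
class Sequencing:
    technology: str
    machine: str
    pipeline: str


@dataclass
class SampleMetadata:
    sample_id: str
    biosample: Biosample
    metrics: Metrics
    library: Library
    sequencing: Sequencing


# All values below are invented placeholders.
record = SampleMetadata(
    sample_id="EXAMPLE-0001",
    biosample=Biosample(collection_date=date(2021, 1, 4), location="Example region", age=54, sex="F"),
    metrics=Metrics(ct_value=22.5, target_gene="example-gene", platform="example-platform"),
    library=Library(primer_type="example-primers", library_kit="example-kit", library_source="example-source"),
    sequencing=Sequencing(technology="example-technology", machine="example-machine", pipeline="example-pipeline"),
)

# asdict() converts the nested dataclasses into plain dictionaries.
print(asdict(record)["metrics"])
```

Grouping fields by category like this keeps patient-related fields separate from laboratory and sequencing details, which makes it easier to see which parts of a record need the most careful handling.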

During the COG-UK sequencing project, the metadata used was anonymised data about a single sample. The sequencing centre uploaded this information to a database known as MAJORA, which linked it to the sequencing data anonymously. Open data, such as ‘date of the sample collection’, could then be displayed, while restricted data, such as ‘laboratory ID’, could be uploaded at the same time.
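The idea of uploading open and restricted fields together, but exposing only the open ones, can be sketched as follows. This is not the MAJORA interface: the function, the field names, and the example values are hypothetical, and the only classification taken from the text is that the collection date is open and the laboratory ID is restricted.

```python
# Assumption: which fields count as restricted. Only 'laboratory ID' is named
# as restricted in the text; the field names themselves are hypothetical.
RESTRICTED_FIELDS = {"laboratory_id"}


def split_by_access(record, restricted=RESTRICTED_FIELDS):
    """Split one flat metadata record into (open, restricted) dictionaries."""
    open_part = {k: v for k, v in record.items() if k not in restricted}
    restricted_part = {k: v for k, v in record.items() if k in restricted}
    return open_part, restricted_part


record = {
    "sample_id": "EXAMPLE-0001",      # invented identifier
    "collection_date": "2021-01-04",  # open field
    "laboratory_id": "LAB-42",        # invented value, restricted field
}

open_data, restricted_data = split_by_access(record)
print(open_data)        # fields that could be displayed
print(restricted_data)  # fields kept under restricted access
```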

Metadata lets us identify whether particular settings, such as care homes or hospital environments, show different patterns. It helps us understand epidemiological patterns, such as the emergence of certain variants, and determine whether samples associated with the same outbreak show similar sequencing results. Metadata is crucial for shaping policy and for knowing where increased surveillance or testing is required.

Collecting such a large number of metadata fields per sample is challenging and time-consuming. One key challenge is maintaining standards for each input field, as different locations around the world record the same data in slightly different ways. There is a push in the international community to standardise fields, especially now for SARS-CoV-2. Global sequence databases such as GISAID have a standardised format in which data is recorded. Work is also under way at PHA4GE (the Public Health Alliance for Genomic Epidemiology), and you can read more about that on the GitHub resources page. The other main issue in collecting patient data is the storage of sensitive data, as some of the fields collected are considered PII (Personally Identifiable Information). When collecting the data, be aware of which fields are considered PII and secure the data in accordance with the data protection rules of the country concerned.
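Both challenges can be illustrated with a minimal sketch: converting dates recorded in different ways into one standard form, and flagging fields that may be PII before a record is shared. The accepted date formats, the day-first reading of ambiguous dates, and the list of PII field names are all assumptions made for this example. They are not taken from GISAID, PHA4GE, or any data protection rule.

```python
from datetime import datetime

# Assumption: the input formats a submitting site might use (day-first).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

# Assumption: hypothetical field names treated as PII in this example.
PII_FIELDS = {"patient_name", "date_of_birth", "postcode"}


def normalise_date(raw):
    """Return the date as an ISO 8601 string (YYYY-MM-DD), or raise ValueError."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognised date format: {raw!r}")


def pii_fields_present(record):
    """Return the sorted names of fields in the record that are flagged as PII."""
    return sorted(set(record) & PII_FIELDS)


record = {
    "sample_id": "EXAMPLE-0001",      # invented identifier
    "collection_date": "04/01/2021",  # recorded day-first by this site
    "postcode": "EXAMPLE",            # placeholder value
}

record["collection_date"] = normalise_date(record["collection_date"])
print(record["collection_date"])   # 2021-01-04
print(pii_fields_present(record))  # ['postcode']
```

Rejecting unrecognised formats, rather than guessing, means an ambiguous entry is sent back to the submitter instead of silently entering the database with the wrong date.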

© COG-Train
This article is from the free online course From Swab to Server: Testing, Sequencing, and Sharing During a Pandemic

Created by
FutureLearn - Learning For Life