Bioproject vs Biosample – what is the relationship?

Data within the repositories of the INSDC (International Nucleotide Sequence Database Collaboration) are structured in an interoperable way, with each repository sharing a general model for how data are captured and stored.

In general the structure is based on studies, samples and sequence types.

Study / Bioproject

  • A project which contains all of the samples and sequence data.
  • The samples in a project all have some feature(s) in common.
  • NLM-NCBI and DDBJ call these bioprojects, whilst ENA calls them Studies.

Samples / Biosample

  • In general this is the biological material that sequences are generated from. However, this can be confusing at times, depending on the context.
  • NLM-NCBI and DDBJ call these biosamples, whilst ENA calls them Samples.

Sequence types

  • Can be raw sequencing reads or assembled genomes.
  • Raw sequencing reads are stored in a Sequence Read Archive (SRA) in NLM-NCBI and DDBJ (equivalent to Run in ENA).
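The naming differences listed above can be captured in a small lookup table. The following is only an illustrative sketch: the terms come from the lists above, while the dictionary layout and the function name `term_for` are assumptions made for this example and are not part of any repository's tooling.

```python
# Terms as given in the lists above; the structure of this table is illustrative only.
TERMS = {
    "project": {"NCBI": "BioProject", "DDBJ": "BioProject", "ENA": "Study"},
    "sample": {"NCBI": "BioSample", "DDBJ": "BioSample", "ENA": "Sample"},
    "reads": {"NCBI": "Sequence Read Archive", "DDBJ": "Sequence Read Archive", "ENA": "Run"},
}


def term_for(concept, repository):
    """Return the name a repository uses for a general concept (hypothetical helper)."""
    return TERMS[concept][repository]


print(term_for("project", "ENA"))
print(term_for("sample", "NCBI"))
```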

As stated above, the following sections focus on using the NLM-NCBI resources, but other repositories also have equivalent useful resources.

A bioproject contains one or more biosamples (Figure 1). Each biosample can contain:

  • Metadata related to the strain or sample
  • Fastq files (raw sequence data)
    • There may be more than one set of fastq files per biosample, for example when the same sample is resequenced or when multiple sequencing technologies are used.
    • These are uploaded to the SRA (Sequence Read Archive)
  • Assembled genomes
    • These may be complete (closed genomes) or draft assemblies
    • These are uploaded to GenBank

Figure 1. A bioproject is a collection of biosamples grouped by some shared feature, such as a study for publication, an institution or an organisation. Biosamples refer to the biological material from which sequence data is generated and can carry metadata associated with the sample, including date of collection, host, and location of collection and/or sequencing. The sequences within each sample also have their own metadata, typically information about the sequencing technology used.
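The hierarchy in Figure 1 can be sketched as a small data model. This is a minimal illustration only: the class names, field names and example values below are hypothetical and do not reflect the actual NCBI submission schema.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One set of sequence data with its own metadata (hypothetical model)."""
    kind: str        # "reads" (fastq, held in SRA) or "assembly" (held in GenBank)
    technology: str  # sequencing technology used


@dataclass
class BioSample:
    """The biological material, plus everything generated from it."""
    name: str
    metadata: dict = field(default_factory=dict)   # e.g. collection date, host, location
    sequences: list = field(default_factory=list)  # zero or more SequenceRecord objects


@dataclass
class BioProject:
    """A collection of biosamples that share some feature."""
    title: str
    samples: list = field(default_factory=list)

    def count_by_kind(self):
        """Count sequence records of each kind across all samples in the project."""
        counts = {}
        for sample in self.samples:
            for seq in sample.sequences:
                counts[seq.kind] = counts.get(seq.kind, 0) + 1
        return counts


# Made-up example: one biosample sequenced on two technologies, then assembled.
isolate = BioSample(
    name="isolate-01",
    metadata={"host": "Homo sapiens", "collection_date": "2020-01-01"},
    sequences=[
        SequenceRecord(kind="reads", technology="Illumina"),
        SequenceRecord(kind="reads", technology="Oxford Nanopore"),
        SequenceRecord(kind="assembly", technology="hybrid"),
    ],
)
project = BioProject(title="Example study", samples=[isolate])
print(project.count_by_kind())
```

Keeping the sample metadata on the biosample and the technology on each sequence record mirrors the two levels of metadata described in the caption.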

In addition to biosample and SRA metadata, phenotypic antimicrobial susceptibility testing (AST) data can now also be applied to uploaded data. This will be discussed in step 3.11, but it is important to know that this phenotypic data is applied at the biosample level and NOT at the sequence level. This is a subtle but important distinction. It may be difficult to define what a ‘sample’ is, as this can have different meanings in different contexts.

For example, in one context a sample may be the physical specimen received by a laboratory (e.g. a food item), while in another it may be a colony picked from a plate. It is important to provide as much information as possible so that the data can be used correctly.
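To illustrate why the level at which phenotypic data is recorded matters, the sketch below stores a phenotype once on a biosample and shows that every sequencing run of that sample inherits the same result. All names and values here (`sample_type`, `ast`, the run labels, the drug results) are invented for the example and are not a real submission format.

```python
# Hypothetical record: one biosample with two sequencing runs.
biosample = {
    "name": "isolate-01",
    # Stating what the 'sample' actually is removes ambiguity for later users.
    "sample_type": "colony pick",
    # The phenotype is stored once, on the biosample, not on each run.
    "ast": {"ciprofloxacin": "resistant", "meropenem": "susceptible"},
    "runs": [
        {"run": "run-A", "technology": "Illumina"},
        {"run": "run-B", "technology": "Oxford Nanopore"},
    ],
}


def phenotype_per_run(sample):
    """Pair every run with the AST phenotype of its parent biosample."""
    return {r["run"]: sample["ast"] for r in sample["runs"]}


table = phenotype_per_run(biosample)
for run_name, phenotype in table.items():
    print(run_name, phenotype)
```

Because both runs point to the same phenotype, a run generated from a different colony or a different portion of a specimen would need its own biosample for the phenotype to remain meaningful.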

© Wellcome Connecting Science
This article is from the free online course Antimicrobial Databases and Genotype Prediction: Data Sharing and Analysis
