Introduction to the Course Dataset

Dataset for this course

The dataset for this course is a benchmark dataset for whole-genome sequencing (WGS) analysis of SARS-CoV-2. The samples come from the Variants of Interest/Variants of Concern (VOI/VOC) lineages study, which validated a lineage-calling pipeline using 16 samples from CDC-defined lineages. The dataset is described in detail here: https://github.com/CDCgov/datasets-sars-cov-2

In week 1, we will analyse one sample from the VOI/VOC dataset. You can download the sequence data for this sample with the following command, which uses fastq-dump from the SRA Toolkit. The sample identifier, ERR5743893, is the accession number for the data in the European Nucleotide Archive (ENA):

fastq-dump --split-files ERR5743893
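
Once the download finishes, a quick sanity check is to count the reads in each output file. The sketch below assumes the run is paired-end, so that --split-files wrote two uncompressed files named ERR5743893_1.fastq and ERR5743893_2.fastq in the current directory. Both file names are assumptions, and count_reads is a hypothetical helper, not part of the SRA Toolkit.

```shell
# Hypothetical helper: count records in an uncompressed FASTQ file.
# Assumes the standard layout of four lines per read.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Assumed output names from `fastq-dump --split-files` for a paired-end run.
for fq in ERR5743893_1.fastq ERR5743893_2.fastq; do
    if [ -f "$fq" ]; then
        echo "$fq: $(count_reads "$fq") reads"
    else
        echo "$fq: not found" >&2
    fi
done
```

For paired-end data, the two files would normally report the same number of reads; a mismatch usually points to an interrupted download.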

The Wuhan-1 reference sequence, which we’ll use when we map the sequence data for ERR5743893, is attached in a file named “MN908947.fa”.
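
Before mapping, it can help to confirm that the reference file is intact. The following is a minimal sketch that prints the FASTA header and the total sequence length, assuming MN908947.fa has been saved in the current directory. fasta_length is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: total number of sequence characters in a FASTA file,
# ignoring header lines (those starting with ">") and line breaks.
fasta_length() {
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

ref="MN908947.fa"   # file name as given in the course text
if [ -f "$ref" ]; then
    grep '^>' "$ref"                          # show the sequence header(s)
    echo "length: $(fasta_length "$ref") bases"
else
    echo "$ref: not found" >&2
fi
```

A single header line and a length close to 30,000 bases would be consistent with a complete SARS-CoV-2 genome.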

© Wellcome Connecting Science
This article is from the free online course Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life
