Creating a samplesheet for viralrecon

A step-by-step guide to creating a samplesheet for viralrecon

Preparing the data

We’re going to use the same benchmarked SARS-CoV-2 WGS dataset we used in Week 1, but rather than analysing just one sample, we’re going to analyse five. We’ll download the data with SRA-toolkit as we did in Week 1, but because we’re downloading five samples’ worth of data, we’ll use a for loop. Don’t worry if you don’t remember what a for loop is; for now, just know that it is a very useful way of performing the same task multiple times.

First let’s activate the MOOC environment from Week 1:

conda activate MOOC

Now create a text file called samples.txt in your working directory. Copy and paste the ENA accessions below into this file:

ERR5556343
SRR13500958
ERR5743893
ERR5181310
ERR5405022
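If you would rather stay on the command line than open a text editor, one possible way to create the file is a here-document. This is only a sketch of an alternative to copying and pasting; it writes the same five accessions, one per line, and then counts the lines as a quick check.

```shell
# Write the five accessions to samples.txt, one per line.
# Note: this overwrites any existing samples.txt in the current directory.
cat > samples.txt <<'EOF'
ERR5556343
SRR13500958
ERR5743893
ERR5181310
ERR5405022
EOF

# Count the lines as a quick check that all five accessions were written
wc -l < samples.txt
```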

Now we can use a for loop to run fastq-dump on each of the accessions in samples.txt:

for i in $(cat samples.txt); do fastq-dump --split-files "$i"; done
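If you want to see what the loop does before downloading anything, a dry run can help. The sketch below prints the fastq-dump command for each accession instead of running it. The function name preview_downloads is made up for this illustration and is not part of SRA-toolkit.

```shell
# preview_downloads is a hypothetical helper (not part of SRA-toolkit).
# It prints, rather than runs, one fastq-dump command per accession.
preview_downloads() {
    local list="$1"
    local acc
    # Read one accession per line; the extra test also picks up a last
    # line that has no trailing newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines
        [ -z "$acc" ] && continue
        echo "fastq-dump --split-files $acc"
    done < "$list"
}

# Preview the commands if samples.txt is present in the current directory
if [ -f samples.txt ]; then
    preview_downloads samples.txt
fi
```

Each printed line corresponds to one pass through the loop, which is all a for loop is doing: repeating the same command with a different accession each time.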

It’s always good practice to save disk space where we can, so let’s compress the fastq files we just downloaded with gzip:

gzip *.fastq

Now create a directory called data and move the fastq files there:

mkdir data
mv *.fastq.gz data
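Before moving on, it can be worth confirming that every accession has both of its read files in the data directory. The sketch below assumes the files are named accession_1.fastq.gz and accession_2.fastq.gz, which is the naming used later in this step. The function name check_pairs is hypothetical.

```shell
# check_pairs is a hypothetical helper. For each accession in the list it
# reports any read file that is missing or empty in the given directory.
# Assumed naming: <accession>_1.fastq.gz and <accession>_2.fastq.gz
check_pairs() {
    local list="$1" dir="$2"
    local acc read_no missing=0
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines
        [ -z "$acc" ] && continue
        for read_no in 1 2; do
            if [ ! -s "$dir/${acc}_${read_no}.fastq.gz" ]; then
                echo "missing or empty: $dir/${acc}_${read_no}.fastq.gz"
                missing=1
            fi
        done
    done < "$list"
    # Return 0 when every file was found, 1 otherwise
    return "$missing"
}

# Run the check if both the accession list and the data directory exist
if [ -f samples.txt ] && [ -d data ]; then
    if check_pairs samples.txt data; then
        echo "all paired files present"
    fi
fi
```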

Creating the samplesheet.csv file

Like most nf-core pipelines, viralrecon requires a CSV file containing the sample names and the locations of the input fastq files. The format of the samplesheet we’re going to use should look like this:

sample,fastq_1,fastq_2
SAMPLE_1,fastq_1_location,fastq_2_location
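To make the format concrete, here is a minimal sketch that writes a file with this layout from a list of accessions. It assumes the read files are named accession_1.fastq.gz and accession_2.fastq.gz, and it writes the paths relative to the current directory. The function name write_samplesheet and the output name manual_samplesheet.csv are hypothetical and chosen only for this illustration.

```shell
# write_samplesheet is a hypothetical helper that writes a three-column CSV:
# a header line, then one row per accession in the list.
# Assumed naming: <dir>/<accession>_1.fastq.gz and <dir>/<accession>_2.fastq.gz
write_samplesheet() {
    local list="$1" dir="$2" out="$3"
    local acc
    # Header row
    echo "sample,fastq_1,fastq_2" > "$out"
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines
        [ -z "$acc" ] && continue
        echo "${acc},${dir}/${acc}_1.fastq.gz,${dir}/${acc}_2.fastq.gz" >> "$out"
    done < "$list"
}

# Build an illustrative samplesheet if samples.txt is present
if [ -f samples.txt ]; then
    write_samplesheet samples.txt data manual_samplesheet.csv
    cat manual_samplesheet.csv
fi
```

Each row pairs a sample name with the locations of its two read files, which is exactly the information the pipeline needs to find the input data.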

You can prepare this file in Excel or another spreadsheet program, but remember to save it as a CSV file. Alternatively, nf-core provide a useful Python script that can generate the samplesheet automatically. First download the Python script to your working directory:

wget -L https://raw.githubusercontent.com/nf-core/viralrecon/master/bin/fastq_dir_to_samplesheet.py

Now run the python script:

python3 fastq_dir_to_samplesheet.py data samplesheet.csv -r1 _1.fastq.gz -r2 _2.fastq.gz

The name of the directory you moved the fastq files to (data) comes first, followed by the name of the file we want to create (samplesheet.csv). We then specify the file suffixes of our fastq files (_1.fastq.gz and _2.fastq.gz) using -r1 and -r2 respectively.

If you didn’t get an error, you should find the samplesheet.csv file in your working directory. To check what the script did, either print the file with cat on the command line:

cat samplesheet.csv

Or else open samplesheet.csv in a text editor or Excel to check. It should look like this:

(Image: listing of the samplesheet in the format explained earlier in this step.)
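As well as checking by eye, you could run a quick automated check. The sketch below confirms that the header matches the format described earlier and that every fastq file listed in the samplesheet exists. It assumes a plain three-column file with no quoted fields, and the function name validate_samplesheet is hypothetical. If the paths in the file are relative, run it from the directory where you created the samplesheet.

```shell
# validate_samplesheet is a hypothetical helper. It checks the header line
# and that each file named in the fastq_1 and fastq_2 columns exists.
validate_samplesheet() {
    local sheet="$1"
    local header sample fq1 fq2 fq status=0
    {
        # The first line should be the header
        IFS= read -r header
        if [ "$header" != "sample,fastq_1,fastq_2" ]; then
            echo "unexpected header: $header"
            status=1
        fi
        # The remaining lines are split on commas into three fields
        while IFS=, read -r sample fq1 fq2 || [ -n "$sample" ]; do
            # Skip blank lines
            [ -z "$sample" ] && continue
            for fq in "$fq1" "$fq2"; do
                # Empty fields are skipped rather than reported
                [ -z "$fq" ] && continue
                if [ ! -f "$fq" ]; then
                    echo "$sample: file not found: $fq"
                    status=1
                fi
            done
        done
    } < "$sheet"
    # Return 0 when no problems were found, 1 otherwise
    return "$status"
}

# Check samplesheet.csv if it is present in the current directory
if [ -f samplesheet.csv ]; then
    if validate_samplesheet samplesheet.csv; then
        echo "samplesheet looks consistent"
    fi
fi
```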

Now we’re ready to run viralrecon!

© Wellcome Connecting Science
This article is from the free online course Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life
