
Key steps of the analysis pipeline

DNA-Seq Analysis Pipeline
[Figure: drawing caricaturing DNA sequencing]

The DNA-Seq Analysis Pipeline is a series of steps for processing and analyzing DNA sequencing data.

It starts with the generation of raw sequencing data on platforms such as Illumina or PacBio. The data is then preprocessed: quality control and trimming remove low-quality or adapter-contaminated reads. The preprocessed reads are then either aligned to a reference genome or assembled de novo into a genome assembly. Variant calling identifies genetic variations such as SNPs and indels. Downstream analysis involves functional annotation, where the impact of variants on genes and proteins is determined. File formats used in the pipeline include text-based formats such as FASTQ and FASTA for raw reads and sequences, SAM for aligned reads, BED for genomic intervals, and VCF for variants, as well as the binary formats BAM and BCF, which store alignment and variant data in compressed form.
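The trimming step mentioned above can be illustrated with a minimal Python sketch. It assumes FASTQ quality strings in the common Phred+33 encoding and simply removes low-quality bases from the 3' end of a read. The function names and the default threshold of 20 are illustrative choices, not part of the course material, and real pipelines would normally rely on a dedicated trimming tool rather than code like this.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_low_quality_tail(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until the last base meets min_quality.

    min_quality=20 is an illustrative default, not a recommendation.
    Returns the trimmed sequence and its matching quality string.
    """
    scores = phred33_scores(quality_string)
    end = len(scores)
    # Walk backwards from the 3' end while the base quality is too low.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


if __name__ == "__main__":
    # 'I' encodes quality 40 and '#' encodes quality 2 in Phred+33.
    seq, qual = trim_low_quality_tail("ACGTACGT", "IIIIII##")
    print(seq, qual)  # prints: ACGTAC IIIIII
```

Here the last two bases fall below the threshold and are removed, while the six high-quality bases are kept together with their quality characters.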

Take-home messages

In biological research, it is essential to begin by stating the precise problem or goal of the study. The experimental design, the library preparation, and the analysis pipeline all follow from this clearly defined biological question, which keeps every element in line with the study objective. Before choosing a sequencing platform for their project, researchers must also understand the characteristics of each platform. Each technology has particular benefits and drawbacks: Illumina delivers short reads and can suffer from PCR bias in GC-rich regions, whereas PacBio produces long reads. For projects such as de novo sequencing, researchers may also consider combining platforms to improve outcomes.

Another important aspect is the input and output files required for data processing and analysis. Depending on the type of data, companion index files are needed, such as .fai for .fa, .bai for .bam, or .vcf.idx for .vcf files. File formats are either text-based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG) or binary (BAM, BCF, SFF). It is also essential to know whether a format uses 1-based coordinates (e.g., GFF/GTF, SAM, WIG) or 0-based coordinates (e.g., BED) so that positions are interpreted and manipulated accurately. Note that BAM positions appear 1-based when viewed as SAM, although the binary encoding stores them 0-based. Keeping these factors in mind helps researchers design effective experiments, choose appropriate sequencing platforms, and handle data efficiently.
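The difference between 0-based and 1-based coordinates is easiest to see in a small worked example. The Python sketch below converts between BED-style intervals (0-based start, end not included) and GFF/GTF-style intervals (1-based start, end included). The function names are hypothetical and chosen only for illustration.

```python
def bed_to_one_based(chrom_start, chrom_end):
    """Convert a BED interval (0-based, end exclusive) to a
    1-based, end-inclusive interval as used in GFF/GTF."""
    return chrom_start + 1, chrom_end


def one_based_to_bed(start, end):
    """Convert a 1-based, end-inclusive interval (GFF/GTF style)
    to a BED interval (0-based, end exclusive)."""
    return start - 1, end


def bed_interval_length(chrom_start, chrom_end):
    """Length of a BED interval; no +1 is needed because the end is exclusive."""
    return chrom_end - chrom_start


if __name__ == "__main__":
    # The first 100 bases of a chromosome in both conventions.
    print(bed_to_one_based(0, 100))     # prints: (1, 100)
    print(one_based_to_bed(1, 100))     # prints: (0, 100)
    print(bed_interval_length(0, 100))  # prints: 100
```

Only the start coordinate shifts by one between the two conventions. The end value stays the same because BED excludes it while GFF/GTF includes it.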


What type of biological problem(s) are you dealing with? Leave your comments in the discussion section below and see if any other learner has a similar type of question so that you can exchange your experience.

© Wellcome Connecting Science
This article is from the free online

Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life
