Mapping of the sample sequence

© Wellcome Connecting Science

Sample sequences are aligned against the reference genome to determine the most likely genomic origin of each observed sequencing read.

The reference genome is a well-annotated representation of a species’ genetic makeup. It serves as a standard template for mapping because it provides a detailed blueprint of genes, regulatory elements, and other genomic features.

There are several critical steps in the read mapping process. First, the reference genome is prepared for alignment by genome indexing. This generates an index, similar to the index of a book, that lets mapping algorithms search the genome quickly and find matches with sequencing reads. Indexing only needs to be done once per reference genome for each mapping software, because each tool builds its own index format. Once the genome index has been created and the reads’ FASTQ files have been supplied, read mapping is carried out using specialised software such as bwa or bowtie2. This produces alignments in the Sequence Alignment Map (SAM) or Binary Alignment Map (BAM) format, the latter of which is a compressed binary format. The aligned reads are then sorted by genomic coordinate, which keeps the data organised and speeds up downstream processing. Finally, the sorted BAM file is indexed. The BAM index allows fast access to any region of the alignment file and is required by many downstream analyses and by visualisation software such as the Integrative Genomics Viewer (IGV).
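The steps above can be sketched as a short Python script that assembles the corresponding command lines. This is a minimal illustration, not the course's own pipeline: it assumes bwa and samtools are installed, that the reads are paired-end, and all file names (`reference.fasta`, `sample_R1.fastq`, `sample_R2.fastq`, `sample`) are hypothetical placeholders. The script only prints the commands; the `run_pipeline` helper would execute them.

```python
import subprocess


def mapping_commands(reference, reads_1, reads_2, prefix):
    """Return the pipeline as (argv, stdout_path) pairs, in execution order."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        # 1. Index the reference (needed once per reference and per mapper).
        (["bwa", "index", reference], None),
        # 2. Map the reads; bwa mem writes SAM to standard output.
        (["bwa", "mem", reference, reads_1, reads_2], sam),
        # 3. Sort alignments by genomic coordinate and write BAM.
        (["samtools", "sort", "-o", bam, sam], None),
        # 4. Index the sorted BAM for fast regional access (e.g. in IGV).
        (["samtools", "index", bam], None),
    ]


def run_pipeline(steps):
    """Run each step, redirecting stdout to a file where one is given."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


# Hypothetical file names, used for illustration only.
steps = mapping_commands(
    "reference.fasta", "sample_R1.fastq", "sample_R2.fastq", "sample"
)
for argv, stdout_path in steps:
    redirect = f" > {stdout_path}" if stdout_path else ""
    print(" ".join(argv) + redirect)
```

Calling `run_pipeline(steps)` would run the four commands in order and stop at the first one that fails. In practice the intermediate SAM file is often avoided by piping the mapper's output straight into the sorting step.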

The mapping result is a file in SAM format, a unified, tab-delimited text format for storing read alignments to a reference genome. Each alignment line has 11 mandatory fields, followed by any number of optional fields written as TAG:TYPE:VALUE. The BAM format holds the same information as SAM, but in a binary form designed for fast processing and indexing. It stores every read base and base quality, and it compresses all data types with a single general-purpose method (BGZF, a block-based variant of gzip) (Fig 1, Fig 2).

Fig 1. Diagrammatic representation of fields found in the SAM file.

Fig 2. Table explaining the different fields in a SAM file.
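To make the field layout concrete, the following sketch splits one SAM alignment line into its 11 mandatory fields and its optional tags. The field names follow the SAM specification; the function name `parse_sam_line` and the example record are invented for illustration and do not come from real data.

```python
# The 11 mandatory SAM fields, in column order.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one tab-delimited SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")

    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value

    # Optional fields have the form TAG:TYPE:VALUE, e.g. NM:i:1.
    tags = {}
    for item in columns[len(SAM_FIELDS):]:
        tag, tag_type, value = item.split(":", 2)
        tags[tag] = int(value) if tag_type == "i" else value
    record["tags"] = tags
    return record


# Made-up record: an 8-base read aligned at position 101 of "ref".
example = "\t".join([
    "read1", "0", "ref", "101", "60", "8M",
    "*", "0", "0", "ACGTACGT", "IIIIIIII", "NM:i:1",
])
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"], record["tags"])
```

Header lines, which start with `@`, are not handled here. For real analyses, an established library such as pysam is a safer choice than hand-written parsing.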

CIGAR string

A Compact Idiosyncratic Gapped Alignment Report (CIGAR) string gives a succinct summary of an alignment’s structure, so alignments can be represented compactly and with little storage space. It is especially helpful for describing complicated alignments that contain insertions, deletions, and clipped bases. Note that the commonly used M operation covers both matching and mismatching bases; mismatches are only distinguished when the = and X operations are used instead. By decoding the CIGAR string, bioinformaticians can reconstruct the alignment and see how the read relates to the reference sequence (Fig 3).

Fig 3. Explanation of the meaning of the different CIGAR strings, with examples of how to interpret a CIGAR string.
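Decoding a CIGAR string can be sketched in a few lines of Python. The example below splits a CIGAR string into (length, operation) pairs and works out how many read bases and how many reference bases the alignment covers. The operation letters and their behaviour follow the SAM specification; the function names and the example string are illustrative assumptions.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance along the read (query) and along the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    operations = [(int(n), op) for n, op in CIGAR_PATTERN.findall(cigar)]
    # Reject strings with unexpected characters (including the "*" placeholder).
    if not operations or "".join(f"{n}{op}" for n, op in operations) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return operations


def cigar_lengths(cigar):
    """Return (read bases covered, reference bases covered)."""
    operations = parse_cigar(cigar)
    query = sum(n for n, op in operations if op in CONSUMES_QUERY)
    reference = sum(n for n, op in operations if op in CONSUMES_REFERENCE)
    return query, reference


# 3 soft-clipped bases, 10 aligned, 2 inserted, 5 aligned,
# 1 deleted from the read, then 4 aligned.
print(parse_cigar("3S10M2I5M1D4M"))
print(cigar_lengths("3S10M2I5M1D4M"))  # -> (24, 20)
```

Here the read is 24 bases long (3 + 10 + 2 + 5 + 4), while the alignment spans 20 bases of the reference (10 + 5 + 1 + 4). Soft-clipped and inserted bases count only towards the read, and the deletion counts only towards the reference.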

In the next step, let’s see how to align the sequence data from the target sample to the Wuhan-1 reference.

This article is from the free online course Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets.
