
Identifying and characterizing genetic variations in the sequence


A variant is any difference observed when comparing two sequences. Several types of variant exist, each reflecting a different kind of alteration in the sequence. The process of identifying and characterising these variants is known as variant calling.

In this step we describe the tools used for variant calling and the other factors that need to be considered to characterise variants accurately.

Variant types

The following are the most commonly observed types of variants:

  1. Insertion (Ins): An insertion variant occurs when one or more nucleotides are added to the DNA sequence, leading to a longer sequence compared to the reference.
  2. Deletion (Del): Deletion variants involve the removal of one or more nucleotides from the DNA sequence, resulting in a shorter sequence compared to the reference.
  3. Indel: A collective term for insertions and deletions. It is also used when nucleotides are both added and removed at the same site.
  4. Single Nucleotide Polymorphism (SNP): SNPs are the most prevalent type of genetic variation and involve a single nucleotide substitution at a specific position in the DNA sequence. For example, a cytosine (C) may be replaced by a thymine (T) at a particular location.
  5. Translocation: Translocation occurs when a segment of DNA moves from one chromosomal location to another, either within the same chromosome or to a different one.
  6. Copy Number Variation (CNV): CNVs involve the duplication or deletion of a large segment of DNA, typically ranging from about one kilobase to several megabases. CNVs can affect gene dosage and may have significant implications in various diseases.
  7. Inversion: In an inversion variant, a segment of DNA is reversed or flipped in orientation compared to the reference genome.

To call variants accurately and interpret genetic data, researchers and clinicians need a thorough understanding of these variant types. This allows them to pinpoint the genetic causes of diseases and conduct population genetics research. Fig 1 shows some of the variants described above and how they may occur.

Fig 1: A diagrammatic representation of some variant types
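As a rough illustration of how the small variant types above differ, the following Python sketch labels a variant by comparing the lengths of its reference and alternate alleles. The function name `classify_variant` and the label strings are made up for this example, and the sketch does not handle larger events such as translocations, CNVs or inversions, which are not described by a simple pair of short alleles.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a small variant by comparing reference and alternate alleles.

    Hypothetical helper for illustration only; real variant callers use
    far more evidence than allele length.
    """
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"                      # one base substituted for another
    if len(ref) == len(alt):
        return "multi-base substitution"  # same length, several bases differ
    if len(ref) < len(alt):
        return "insertion"                # alternate allele is longer
    return "deletion"                     # alternate allele is shorter


for ref, alt in [("C", "T"), ("A", "ATG"), ("ATG", "A")]:
    print(ref, alt, classify_variant(ref, alt))
```

The three example pairs correspond to a C-to-T substitution, a two-base insertion and a two-base deletion.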

Factors to consider in variant calling

In the process of variant calling, it is crucial to acknowledge that many of the calls made might not represent genuine genetic variations. Therefore, a filtering step becomes essential to ensure the accuracy of the identified variants. False calls can arise due to various factors, including contamination during sample preparation, errors introduced during PCR amplification, sequencing errors, challenges in handling homopolymer runs, issues with mapping to repetitive sequences, and alignment errors. Additionally, false SNPs may be detected in the vicinity of indels, and ambiguous alignment of indels can further contribute to the presence of false variants. Understanding and addressing these potential sources of false calls are critical to refining the variant calling process and obtaining reliable genetic information from genomic data.
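A filtering step of this kind can be sketched in a few lines of Python. The example below keeps only calls whose quality score and read depth exceed fixed cut-offs. The thresholds, the dictionary keys and the three toy calls are all assumptions chosen for illustration; appropriate filters depend on the sequencing technology, the caller and the study design.

```python
MIN_QUAL = 30.0   # hypothetical minimum call quality
MIN_DEPTH = 10    # hypothetical minimum read depth


def passes_filters(call, min_qual=MIN_QUAL, min_depth=MIN_DEPTH):
    """Return True if a call meets both the quality and the depth threshold."""
    return call["qual"] >= min_qual and call["depth"] >= min_depth


# Toy calls invented for this example (not real data).
calls = [
    {"pos": 101, "qual": 55.0, "depth": 32},  # passes both thresholds
    {"pos": 250, "qual": 12.5, "depth": 40},  # fails on quality
    {"pos": 377, "qual": 48.0, "depth": 3},   # fails on depth
]

kept = [call for call in calls if passes_filters(call)]
print([call["pos"] for call in kept])  # [101]
```

Simple thresholds like these are only a starting point; they do not address every source of false calls listed above, such as misalignment around indels.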

The tools used to call variants

Several tools can be used for variant calling, including GATK, Samtools and FreeBayes. One of the most common outputs of variant calling is a file in the variant call format (VCF). A VCF file is tab-delimited text that can be parsed with standard UNIX commands, compressed with BGZF (bgzip) and indexed in TBI or CSI format (tabix).

The key fields found in a VCF file (Fig 2) are:

  1. CHROM: Chromosome or contig name where the variant is located.
  2. POS: Position of the variant on the chromosome or contig.
  3. ID: A unique identifier for the variant, often used to link to external databases.
  4. REF: The reference allele, representing the genomic base at the specified position in the reference genome.
  5. ALT: The alternate allele(s), indicating the variant base(s) observed in the sample(s). Multiple alternate alleles can be represented.
  6. QUAL: Phred-scaled quality score representing the confidence in the variant call.
  7. FILTER: The filtering status of the call: PASS if the variant passed all filters, otherwise the names of the filters it failed.
  8. INFO: A field containing additional information about the variant. It includes various subfields denoted by key-value pairs, such as allele frequency, functional annotations, and predicted effects.
  9. FORMAT: Defines the structure of sample-specific genotype and variant information. Specifies which fields are present for each sample, such as genotype quality (GQ), allele depth (AD), and more.
  10. SAMPLE Genotype Fields: These fields provide genotype information for each sample at the variant position. Genotypes indicate the combination of alleles carried by each sample, often using numerical codes (e.g., 0/0 for homozygous reference, 0/1 for heterozygous, 1/1 for homozygous alternate).

Fig 2: A snapshot of the different fields in a VCF file (https://davetang.github.io/learning_vcf_file/)
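Because VCF is tab-delimited text, a single data line can be split into the fields listed above with plain Python. The sketch below is a minimal illustration, not a full parser: the function names, the sample name `sample1` and the example line are invented here, and header lines, missing values and type conversion of INFO entries are ignored. For real analyses an established library is usually a better choice.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_info(info):
    """Turn 'DP=30;AF=0.5;DB' into a dict; keys without '=' become flags."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_vcf_line(line, sample_names):
    """Split one VCF data line into a dictionary of named fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")   # several ALT alleles possible
    record["INFO"] = parse_info(record["INFO"])
    keys = record["FORMAT"].split(":")         # e.g. ['GT', 'GQ']
    record["SAMPLES"] = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return record


# Invented example line with one sample column.
line = "chr1\t12345\t.\tC\tT\t50\tPASS\tDP=30;AF=0.5\tGT:GQ\t0/1:99"
record = parse_vcf_line(line, ["sample1"])
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(record["SAMPLES"]["sample1"]["GT"])
```

The FORMAT column acts as a key for each sample column, which is why the two are zipped together.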

A diploid organism has two copies of each chromosome, so a site with one reference allele (R) and one alternate allele (A) has three possible genotypes (Fig 3):

RR .. homozygous reference genotype

RA .. heterozygous

AA .. homozygous alternate

Fig 3: An example of how to determine possible genotypes.
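The numerical genotype codes used in a VCF file map directly onto these genotypes. As a small sketch, the hypothetical function `describe_genotype` below translates a diploid GT string into words, where `0` is the reference allele and any other number is an alternate allele. It accepts both the unphased (`/`) and phased (`|`) separators.

```python
import re


def describe_genotype(gt: str) -> str:
    """Translate a diploid VCF GT string such as '0/1' into a description."""
    alleles = re.split(r"[/|]", gt)    # '/' is unphased, '|' is phased
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    if "." in alleles:
        return "missing"               # '.' marks an uncalled allele
    first, second = alleles
    if first != second:
        return "heterozygous"
    if first == "0":
        return "homozygous reference"
    return "homozygous alternate"


for gt in ["0/0", "0/1", "1/1"]:
    print(gt, describe_genotype(gt))
```

In the notation above, `0/0` corresponds to RR, `0/1` to RA and `1/1` to AA.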

Future of Variant Calling

Current approaches for variant calling in genomics rely heavily on the alignments produced by sequencing read aligners. However, these aligners process one read at a time, which leads to site-based variant calling methods that do not consider the local haplotype structure and linked sites. To overcome this limitation, local de novo assembly-based variant callers have emerged. These variant callers can simultaneously detect single nucleotide polymorphisms (SNPs), insertions and deletions (indels), multi-nucleotide polymorphisms (MNPs), and small structural variations (SVs) by performing de novo assembly of the local genomic region. By doing so, they can effectively remove alignment artifacts and increase the accuracy of variant calling. Some examples of these variant callers include GATK HaplotypeCaller, Scalpel, and Octopus.

Another innovative approach involves the use of variation graphs, where sequences are aligned to a graph representation of the genome instead of a traditional linear reference sequence. This allows for a more comprehensive representation of genetic variation, including complex variants and haplotype structures. Variation graphs are gaining attention as they can enhance the accuracy and sensitivity of variant calling by capturing the complexity of genomic variations more effectively than linear reference-based methods.
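To make the idea of a variation graph more concrete, the toy Python sketch below stores short sequence segments as nodes and lists which nodes may follow each other. Each path through the graph spells out one possible haplotype. The node sequences, node numbering and function name are invented for illustration and do not reflect the data model of any particular graph genome tool.

```python
# Toy variation graph: a SNP (T or C) between two shared segments.
nodes = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(node, prefix=""):
    """Return every sequence spelled by a path from `node` to an end node."""
    sequence = prefix + nodes[node]
    if not edges[node]:                # no successors: the path ends here
        return [sequence]
    paths = []
    for successor in edges[node]:
        paths.extend(spell_paths(successor, sequence))
    return paths


print(spell_paths(1))  # ['ACGTGGA', 'ACGCGGA']
```

A linear reference would hold only one of these two sequences, whereas the graph represents both alleles, so a read carrying either one can align without a mismatch.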

© Wellcome Connecting Science
This article is from the free online course Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life
