Glossary of terms used in this course

Glossary of terms & acronyms used in this course

  • 0-Based Indexing: A system of numbering where the count starts from 0, used in file formats like BED.
  • 1-Based Indexing: A system of numbering where the count starts from 1, used in file formats like GTF/GFF, SAM, and WIG. (BAM, the binary form of SAM, stores positions as 0-based internally.)
  • Adapters: Short DNA sequences used to attach DNA fragments to the flow cell and facilitate sequencing.
  • Assembly: The process of piecing together a large number of short or long sequencing reads to reconstruct a representation of the original chromosomes.
  • AWS: Amazon Web Services
  • Azure: Microsoft Azure
  • Base: A nucleotide building block of DNA, represented by adenine (A), thymine (T), cytosine (C), or guanine (G).
  • Basecall File Format (bas.h5/bax.h5): A file format generated by PacBio sequencing platforms, containing base call information that can be visualized using HDFView.
  • Binary Base Call (BCL) Format: Raw data files produced by Illumina sequencing platforms (e.g., NextSeq, HiSeq, NovaSeq 6000), which can be converted to FASTQ format using the bcl2fastq Conversion tool.
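  • Bridge PCR: An amplification method in which DNA templates bound to the flow cell bend over and hybridize to neighbouring surface-bound primers, forming "bridges" that are copied repeatedly to amplify and immobilize the templates for sequencing.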
  • Checkpointing: The process of periodically saving (or writing) the execution state of an application.
  • Cluster: A localized group of DNA fragments generated through bridge PCR amplification from a single template DNA strand on a flow cell during sequencing.
  • Coverage: The number of times a specific DNA sequence has been read in a sequencing experiment.
  • CPU: Central processing unit
  • Data frame: A common two-dimensional data structure that organizes data in columns and rows. For NGS data, for instance, it usually stores experiment variables in columns and their individual entries in rows.
  • De Novo Sequencing: Sequencing method used to assemble a genome without a reference genome.
  • Dependency: A software dependency is a relationship between software components where one component relies on the other to work properly.
  • DNA Sequencing: The process of reading the nucleotides present in DNA to determine the precise order of nucleotides within a DNA molecule.
  • Docker and Singularity: Docker and Singularity are containerisation tools. A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
  • Enrichment-based library preparation: Library preparation readies samples for sequencing. In targeted approaches, primers are designed against the areas flanking the genomic region of interest and used to amplify it, which enriches that region for sequencing. Multiple primer pairs can be combined in a standardized multiplexed PCR reaction.
  • Faceting: The faceting approach subdivides a plot into a matrix of plots representing smaller data subsets.
  • FASTQ File: A text-based format used to store raw sequence data from NGS platforms. It comprises four lines for each read: a sequence identifier (beginning with @), the DNA/RNA sequence, a separator line (beginning with +), and a quality score for each base.
  • FASTQ ORA Format: A binary and compressed version of the standard FASTQ file used for efficient data storage and management on Illumina platforms.
  • FastQC: A tool for quality control checks on FASTQ sequence data. It assesses various parameters like sequence quality, over-represented sequences, GC content, and more.
  • File Formats: Various text-based (FASTQ, FASTA, SAM, BED, GTF/GFF, VCF, WIG) and binary (BAM, BCF, SFF) formats used to store raw and processed DNA sequencing data.
  • Function: A set of statements that together perform a task in R.
  • GC-Rich Regions: DNA regions with a high content of guanine (G) and cytosine (C) nucleotides.
  • Genomics: The study of an organism’s complete set of DNA, including all of its genes.
  • GUI: Graphical User Interface
  • Illumina Sequencing: A well-known NGS platform that produces short reads using reversible dye-terminators.
  • Ion Torrent Sequencing: NGS technologies that use semiconductor-based technology to detect hydrogen ions generated during DNA synthesis, providing quick and affordable sequencing.
  • Kmer Content Analysis: A platform-dependent analysis of sequence patterns in NGS data, used to identify biases and artifacts.
  • Library: A collection of DNA fragments used for sequencing, often created by specific methods like shotgun library preparation.
  • Metagenomics: The study of the structure and function of entire nucleotide sequences isolated and analyzed from all the organisms (typically microbes) in a bulk sample.
  • MultiQC: A tool that aggregates results from multiple FastQC runs (and from many other bioinformatics tools) into a single HTML report.
  • Nanopore Sequencing: A sequencing method where DNA passes through a nanopore in a membrane, causing disruptions in ionic current that correspond to individual bases.
  • Next-Generation Sequencing (NGS): A set of advanced sequencing techniques that enable the rapid analysis of DNA and RNA sequences.
  • NGS Analysis Pipeline: A series of steps for processing and analyzing DNA sequencing data, including preprocessing, alignment, variant calling, and functional annotation.
  • Operators: Operators are used in R to perform basic or more advanced operations. R classifies them into arithmetic, assignment, comparison, logical, or miscellaneous operators.
  • Over-represented Sequences: Sequences that are present in a higher proportion than expected, often observed in small RNA libraries.
  • Package: A bundle of R functions developed for specific purposes (analysis, visualization, etc), distributed as pre-developed code, often together with demo data.
  • Paired end: DNA sequencing method that involves sequencing both ends of a DNA fragment separately.
  • PCR: Polymerase chain reaction
  • PCR Bias: Preferential amplification of certain DNA sequences during PCR.
  • Quality Control (QC): A crucial step in Next-Generation Sequencing (NGS) data analysis that involves assessing and ensuring the accuracy and reliability of the generated sequence data. Quality control checks include evaluating sequence length distribution, GC content, per-base sequence quality scores, and detecting potential issues like adapter contamination.
  • Quantitation: The act of measuring or determining an amount precisely.
  • RAM: Random access memory
  • RNA: Ribonucleic acid
  • Roche 454 Sequencing: A sequencing platform known for generating lengthy reads and facilitating de novo genome assembly.
  • Sanger Sequencing: A DNA sequencing technology that involves the chain termination method to determine the sequence of nucleotides in DNA.
  • SARS-CoV-2: Severe acute respiratory syndrome coronavirus 2, the virus that causes COVID-19.
  • Sequence By Synthesis (SBS): A principle of DNA sequencing involving the incorporation of labeled nucleotides and generation of light signals as bases are added.
  • Sequence Duplication Levels: The occurrence of duplicated sequences in NGS data, which can vary depending on the type of study (genomics vs. transcriptomics).
  • Sequence Trimming: The removal of low-quality bases from read sequences to improve data integrity and accuracy.
  • Shotgun Library: A collection of DNA fragments generated by randomly breaking a genome into smaller pieces for NGS analysis.
  • Shotgun sequencing: A method in which DNA is randomly broken into many small fragments that are sequenced and then reassembled computationally.
  • Single end: DNA sequencing method where only one end of a DNA fragment is sequenced.
  • Single Molecule Real-Time (SMRT) DNA Sequencing: A technology that uses a DNA polymerase attached to the bottom of a nanowell and fluorescently labeled nucleotides to sequence single DNA molecules in real time.
  • Tibble: An updated data frame format that simplifies the displayed output (only as many rows and columns as fit on the screen) and differs from standard data frames in some aspects of data manipulation. It is also the name of the R package used to create and manipulate this type of data.
  • Variable: An object in R that stores data, and can be used to easily manipulate it.
  • Variant calling: The process of identifying and characterizing genetic variations or variants, such as single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels), in DNA or RNA sequences.
  • Variant Call Format (VCF): A standardized file format that stores variants, their positions, and related information (read depth, genotype, etc).
  • Version control: The practice of tracking and managing changes to software code.
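
Several of the terms above (0-based and 1-based indexing, FASTQ, coverage) may be easier to grasp with a small worked example. The R sketch below is illustrative only. The function names bed_to_one_based, parse_fastq_record and mean_coverage are hypothetical helpers, not part of any package, and the read is made up. The sketch assumes that BED intervals are half-open and that quality scores use Phred+33 encoding. The coverage formula (number of reads × read length ÷ genome size) is a common estimate of average depth rather than something defined in this glossary.

```r
# Hypothetical helper: convert a 0-based, half-open BED interval
# into the 1-based, closed interval used by formats such as GTF/GFF.
bed_to_one_based <- function(start, end) {
  # Only the start shifts by one; the half-open BED end already
  # equals the closed 1-based end.
  list(start = start + 1, end = end)
}

# Hypothetical helper: split the four lines of a single FASTQ record.
parse_fastq_record <- function(lines) {
  stopifnot(
    length(lines) == 4,
    startsWith(lines[1], "@"),  # line 1: sequence identifier
    startsWith(lines[3], "+")   # line 3: separator
  )
  list(
    id       = sub("^@", "", lines[1]),
    sequence = lines[2],
    # Assumption: Phred+33 encoding, so quality = ASCII code - 33
    quality  = utf8ToInt(lines[4]) - 33
  )
}

# Hypothetical helper: rough average coverage (depth) estimate.
mean_coverage <- function(n_reads, read_length, genome_size) {
  n_reads * read_length / genome_size
}

# The first 100 bases of a chromosome: BED 0-100 becomes 1-100.
interval <- bed_to_one_based(0, 100)

# Made-up read, for illustration only.
record <- parse_fastq_record(c("@read1", "ACGT", "+", "II5!"))
record$quality

# One million 150-base reads over a 5 Mb genome.
depth <- mean_coverage(1e6, 150, 5e6)
```

Under these assumptions, the first 100 bases of a chromosome are written as 0 to 100 in BED and as 1 to 100 in a 1-based format, and each quality character decodes to one score per base of the read.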

Glossary of operators used in this course

  • <- or = These assignment operators assign a value to a variable. The value can be in any form (vector, list, table, etc), including data read from external files.
  • ? An essential operator that acts as a help button: placed before a function name, it opens that function's help page.
  • “ ” Quotation marks enclose a character string, for example an exact pattern that R will look for.
  • $ You can use this operator to extract, subset or simply query a specific part of a data frame. A common usage is to query one column (corresponding to one variable) of a data frame.
  • # You will see this operator often in programming languages, including R. It tells the program and the reader that the text following it on that line should not be run. This makes it possible to add line-by-line comments without interrupting the execution flow of a script.
  • + Within a long script, this operator helps make script components span multiple lines. If this operator appears spontaneously in the console when you want to execute a line of code, it means there is a missing part in your code that needs to be completed.
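  • %>% With the “dplyr” package, this operator can be used as the | would be in Unix: it passes the output of one function on as the input of the next, making it easy to combine two or more functions.
  • %in% This operator tests whether each value is found in a specified set of values, and is commonly used to filter a data subset down to those values.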
  • == Comparison operator, used to search for elements having a value equal to a specified value.
  • >= Comparison operator, used to search for elements having a value greater than or equal to a specified value.
  • & Logical operator, equivalent to an AND, used to combine conditions that must all be true.
  • | Another logical operator in R, equivalent to an OR, used to combine conditions of which at least one must be true.
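
The operators above can be seen working together in a short R sketch. The data frame, its column names and its values are made up for illustration. The last step needs the dplyr package and is skipped if dplyr is not installed.

```r
# Made-up data frame; column names and values are hypothetical.
samples <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  platform  = c("Illumina", "Nanopore", "Illumina", "PacBio"),
  coverage  = c(30, 12, 45, 20)
)

# $ queries a single column (variable) of the data frame.
samples$coverage

# == and >= each return one TRUE/FALSE per row; & keeps rows where both hold.
deep_illumina <- samples[samples$platform == "Illumina" & samples$coverage >= 40, ]

# | keeps rows where at least one condition holds.
nanopore_or_shallow <- samples[samples$platform == "Nanopore" | samples$coverage >= 40, ]

# %in% tests membership in a set of values.
long_read <- samples[samples$platform %in% c("Nanopore", "PacBio"), ]

# ? opens a help page; commented out here so the script runs unattended.
# ?mean

# %>% passes the result on its left to the function on its right.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  piped <- samples %>%
    filter(coverage >= 20) %>%
    arrange(desc(coverage))
}
```

With this made-up data, deep_illumina keeps only sample S3, and long_read keeps the Nanopore and PacBio samples.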
© Wellcome Connecting Science
This article is from the free online

Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life

Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now