
Comparative Genomics: Useful Terms and Concepts

This article works as a glossary and gives an overview of some of the most important terms in comparative genomics.

We have summarised here some of the key terms that you will come across when studying comparative genomics. The definitions are given in the context of bioinformatics.

Accession number

An accession number is a unique identifier, often a combination of letters and numbers, that is assigned permanently to an entry in a database. The entry could be a DNA or protein sequence, or another type of molecule. Accession numbers can also be assigned to experiments in databases. Accession numbers are stable over time.


ACT

The Artemis Comparison Tool. This software is used to perform comparative genomic analysis, and is a development of the Artemis tool.


Algorithm

A set of rules or a plan (a description of steps) to be followed to solve a problem, for example when writing a computer program.

Annotation file

This is a type of text file that summarises the different features in a genome, such as genes, proteins, and regulatory regions. Annotation files can be generated for single sequences as well as for entire genomes. They have strict formatting rules.

Conceptual translation

Conceptual translation converts a DNA/mRNA sequence into a protein sequence. It is the process of predicting the amino acid sequence of a polypeptide from the nucleotide sequence of its mRNA/DNA. The prediction is guided by the genetic code.
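Conceptual translation can be illustrated with a few lines of Python. This is a minimal sketch, assuming the standard genetic code, a coding strand read in the first reading frame, and translation that ends at the first stop codon. The function name `translate` and the example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, stop_at_stop=True):
    """Predict a protein sequence from a coding sequence (reading frame 1)."""
    dna = dna.upper().replace("U", "T")  # accept mRNA input as well as DNA
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # "X" marks codons that are not in the table (e.g. ambiguous bases).
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and stop_at_stop:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))  # codons: ATG GCC AAA TGA -> MAK
```

Some organisms use a different genetic code, so the table itself is an assumption that should be checked for the genome being studied.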

Conserved domain

A conserved domain is a part of a protein that, by assuming a defined three-dimensional structure, confers a given function on that protein. A protein can have more than one conserved domain and, at the same time, a given conserved domain can appear in different proteins. The amino acid sequences of conserved domains are less likely to change during evolution (i.e. they are more conserved) than sequences outside the domains, so the structure of these domains is better maintained.


Contig

From the word “contiguous”. A contig is a consensus sequence built from more than one read, where the reads overlap to form a longer sequence of DNA. Overlapping reads provide strong evidence for what constitutes the real DNA sequence. When several reads overlap and agree with each other, a consensus is called and the stretch of DNA is named a contig. Note that a contig has no gaps: the nucleotide sequence is known for the whole length of the contig.
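The idea of joining overlapping reads can be sketched as follows. This is a deliberately simplified illustration, not how a real assembler works: it assumes error-free reads that are already ordered from left to right and all come from the same strand. The function names and the example reads are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads, min_overlap=3):
    """Chain ordered, overlapping reads into a single gap-free contig."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_overlap)
        if size == 0:
            raise ValueError("reads do not overlap, so they cannot form one contig")
        contig += read[size:]  # append only the part that extends the contig
    return contig


reads = ["ATGGCGTAC", "GTACGTTAG", "TTAGCCA"]
print(merge_reads(reads))  # ATGGCGTACGTTAGCCA
```

Real assembly software also has to work out the order and orientation of the reads and to resolve disagreements between them when calling the consensus.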


EMBL

The European Molecular Biology Laboratory is a scientific institution. EMBL has research laboratories and outposts in Germany, the UK, France, Italy, and Spain.

Enzyme Commission number

An Enzyme Commission number (EC number) is a numerical classification for an enzyme based on the chemical reaction that it catalyses. An EC number consists of the letters “EC”, followed by four numbers separated by full stops. For example EC = Glycogen phosphorylase.
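The format described above can be checked programmatically. The sketch below is an illustration only: it accepts fully specified EC numbers (four integer fields) and rejects anything else, and both the function name `parse_ec_number` and the example string are made up to show the format rather than to refer to a particular enzyme.

```python
import re

# "EC", optional whitespace, then four integer fields separated by full stops.
EC_PATTERN = re.compile(r"^EC\s*(\d+)\.(\d+)\.(\d+)\.(\d+)$")


def parse_ec_number(text):
    """Return the four numeric fields of an EC number as a tuple of integers."""
    match = EC_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a fully specified EC number: {text!r}")
    return tuple(int(field) for field in match.groups())


print(parse_ec_number("EC 1.2.3.4"))  # (1, 2, 3, 4)
```

The four fields run from the most general level of the classification (the enzyme class) to the most specific (the serial number of the enzyme within its sub-subclass).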

Expected Value (E-value)

In sequence similarity searches, this parameter describes the number of hits that would be expected by chance, given the length of the query sequence and the size of the database. The lower the E-value, the less likely it is that the observed alignment arose by chance, and the more likely it is that it reflects homology. Learn more about E-values in this BLAST help page and in this tutorial.
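The dependence of the E-value on sequence length and database size can be made concrete with the standard relationship between a BLAST bit score S' and the E-value, E = m × n × 2^(−S'), where m is the query length and n is the database length. The sketch below is a simplification: it uses raw lengths, whereas BLAST applies corrected (effective) lengths, and the function name and the numbers are invented for illustration.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# The same alignment becomes less significant in a larger database.
small_db = expected_hits(bit_score=40, query_length=300, database_length=1_000_000)
large_db = expected_hits(bit_score=40, query_length=300, database_length=2_000_000)
print(small_db, large_db)
print(large_db / small_db)  # doubling the database size doubles the E-value
```

Each additional bit of score halves the E-value, which is why high-scoring alignments quickly reach E-values very close to zero.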

Flat file

A plain text file containing records with no structured interrelationship. The records themselves can have an internal structure. A collection of data stored in this way is known as a flat file database.

GC content

In genomics and genetics, the GC content is the proportion of Guanines (G) and Cytosines (C) present in a given stretch of DNA sequence. The calculation involves counting the number of Gs and Cs in a given stretch of DNA and dividing it by the total number of nucleotides/bases in that DNA stretch. The GC content can be calculated for a whole genome, as well as presented as a value for a given length of DNA. In this way, stretches of DNA with different GC content can be identified.
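The calculation described above can be sketched in Python, both for a whole sequence and for consecutive windows along it. This is a minimal sketch: it assumes the sequence contains only unambiguous bases (any other character, such as N, is simply counted in the total length), and the function names, window size, and example sequence are made up for illustration.

```python
def gc_content(sequence):
    """Fraction of bases in `sequence` that are G or C (0.0 to 1.0)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("cannot compute GC content of an empty sequence")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc_windows(sequence, window=4, step=4):
    """GC content of each window, returned as (start position, GC fraction)."""
    return [
        (start, gc_content(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]


dna = "GGCCGCGCATATTAAT"
print(gc_content(dna))  # 0.5
print(gc_windows(dna))  # [(0, 1.0), (4, 1.0), (8, 0.0), (12, 0.0)]
```

In this toy example the overall GC content is 50%, but the windowed values show that the first half is GC-rich and the second half is AT-rich. Real analyses use much larger windows, typically hundreds or thousands of bases.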


GenBank

GenBank is a nucleotide sequence database hosted by the National Center for Biotechnology Information (NCBI). Genome sequences can be downloaded easily from this database.

Genome assembly (or sequence assembly)

Genome assembly (or sequence assembly) is the collection of scaffolds or sequenced DNA from a given organism. A genome assembly can be different from a reference genome (see below). For example, the genome assemblies of different bacterial isolates from around the world can be compared to the laboratory strain that was used to build the reference genome.


A file format that usually contains genome annotation information.

Homology annotation

In bioinformatics, this term refers to the use of evolutionary conservation as a basis for extrapolating functional characteristics from one gene or protein to another.

Percentage identity

In BLAST results, this value represents the number of residues (amino acids or nucleotides) that match exactly at the same position between the query and the subject, expressed as a percentage of the length of the alignment.
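As an illustration, percentage identity can be computed from a pairwise alignment as follows. This sketch assumes that the two sequences are already aligned (equal length, with "-" marking gaps) and that identity is reported relative to the alignment length, including gap columns, which is how BLAST reports it. The function name and the example alignment are hypothetical.

```python
def percent_identity(aligned_query, aligned_subject, gap="-"):
    """Percentage of alignment columns in which both residues are identical."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        q == s and q != gap
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
    )
    return 100.0 * matches / len(aligned_query)


query = "ATGC-TA"
subject = "ATGCGTT"
print(round(percent_identity(query, subject), 1))  # 5 matches in 7 columns -> 71.4
```

A high percentage identity over a very short alignment can still arise by chance, so this value is best read together with the alignment length and the E-value.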


Pseudogene

A region of the genome that contains a degraded gene. Pseudogenes can originate, for example, by gene duplication followed by loss of function due to accumulated mutations. Pseudogenes do not encode functional proteins.

Read (in sequencing)

In the context of DNA sequencing, a read is a stretch of inferred sequence coming from the sequencing of a DNA fragment. Read length varies depending on the sequencing technology used and could be short (20-30 nucleotides) or very long (several thousand nucleotides).

Reference genome

A reference genome is the representative genome of a given species. Because individuals of a given species can differ in their exact sequence, a reference genome is used as a standard against which all other sequences from the same species are compared. In genome analysis, reference genomes are essential for the reproducibility of results obtained by different research groups.


Scaffold

In the context of genomes, a scaffold is a non-contiguous stretch of DNA sequence. A scaffold is formed by linked contigs that are known to be close to each other, based on sequencing information, but which are separated by a gap or stretch of unknown sequence. Although the length of the gap is often known, the sequence is not. The unknown bases are often represented with Ns.
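The relationship between contigs, gaps, and scaffolds can be sketched in Python. This is an illustration under simple assumptions: the contigs are given in the correct order and orientation, each gap length is known, and the contigs themselves contain no Ns. The function names and sequences are invented for the example.

```python
import re


def build_scaffold(contigs, gap_lengths):
    """Join ordered contigs, filling each gap with the given number of Ns."""
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between neighbouring contigs
        parts.append(contig)
    return "".join(parts)


def split_scaffold(scaffold):
    """Recover the contigs from a scaffold by splitting at runs of Ns."""
    return [contig for contig in re.split(r"N+", scaffold.upper()) if contig]


scaffold = build_scaffold(["ATGCGT", "GGTACC", "TTAG"], gap_lengths=[5, 3])
print(scaffold)                  # ATGCGTNNNNNGGTACCNNNTTAG
print(split_scaffold(scaffold))  # ['ATGCGT', 'GGTACC', 'TTAG']
```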

Score (in BLAST)

This parameter describes how good the alignment between the query and the subject is. It depends on the number of “good” matches (identical or similar residues) and “bad” matches (mismatches and gaps). The higher the score, the better the alignment is.

© Wellcome Genome Campus Advanced Courses and Scientific Conferences
This article is from the free online course Bacterial Genomes III: Comparative Genomics using Artemis Comparison Tool (ACT)

Created by
FutureLearn - Learning For Life
