Glossary of terms used
We start by summarising some of the key terms that you will come across during the course. The definitions are given in the context of bioinformatics.
The glossary below provides definitions of the key terms used in the course. Whenever one of these terms appears, we will link you back to its definition. The Glossary is here for your reference and you do not need to read it all now. There is also a PDF version of this glossary in the downloads section at the bottom of this step.
An accession number is a unique identifier, often a combination of letters and numbers, that is assigned permanently to an entry in a database. The entry could be a DNA or protein sequence or another type of molecule. Accession numbers can also be assigned to experiments in databases. Accession numbers are stable through time.
ACT is the Artemis Comparison Tool. This software allows comparative genomic analysis to be performed, and is a development of the Artemis tool.
An algorithm is a set of rules or a plan (a description of steps) to be followed when making a computer program.
An annotation file is a type of text file that summarizes the different features in a genome, such as genes, proteins, and regulatory regions. Annotation files can be generated for single sequences as well as for entire genomes. They have strict formatting rules.
Translation of a DNA/mRNA sequence into a protein sequence: this is the process of predicting the amino acid sequence of a polypeptide based on the sequence of nucleotides of its mRNA/DNA. The prediction is guided by the genetic code.
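As an illustration of how the genetic code guides this prediction, here is a minimal Python sketch. It assumes the standard genetic code, reads the sequence in the first reading frame only, and stops at the first stop codon. The function name `translate` and the use of "X" for unrecognised codons are choices made for this example, not part of the course material.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Predict a protein sequence from a DNA or mRNA sequence (frame 1 only)."""
    seq = sequence.upper().replace("U", "T")  # treat mRNA like DNA
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" = unknown codon
        if amino_acid == "*":  # stop codon: end of the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Real sequences may need a different reading frame, the reverse strand, or an alternative genetic code, none of which this sketch handles.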
A conserved domain of a protein is a part of the protein that, by assuming a defined three-dimensional structure, confers a given function to that protein. Proteins can have more than one conserved domain and, at the same time, one given conserved domain can appear in different proteins. The amino acid sequence of a conserved domain is less likely to change (i.e. it is more conserved) than sequence outside the domain, so the structure of the domain is better maintained throughout evolution. https://en.wikipedia.org/wiki/Protein_domain
From the word “contiguous”. A contig is a consensus sequence formed by more than one read; the reads overlap to form a longer sequence of DNA. Overlapping reads provide strong evidence for what constitutes the real DNA sequence. When various reads overlap and are in agreement with each other, a consensus is called and the stretch of DNA is named a contig. Note that a contig has no gaps and the nucleotide sequence is known for the whole length of the contig.
The European Molecular Biology Laboratory is a scientific institution. EMBL has research laboratories and outposts in Germany, the UK, France, Italy, and Spain.
Enzyme Commission number
An Enzyme Commission number (EC number) is a numerical classification for an enzyme based on the chemical reactions that it catalyses. An EC number consists of the letters “EC”, followed by four numbers, separated by full stops. For example, EC 2.4.1.1 = Glycogen phosphorylase.
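The format described above (the letters “EC” followed by four numbers separated by full stops) can be checked with a short Python sketch. The function name `parse_ec_number` is hypothetical, and the sketch assumes fully specified EC numbers only.

```python
import re

# "EC", optional whitespace, then four dot-separated numbers.
EC_PATTERN = re.compile(r"^EC\s*(\d+)\.(\d+)\.(\d+)\.(\d+)$")


def parse_ec_number(text: str) -> tuple:
    """Split an EC number such as 'EC 2.4.1.1' into its four numeric levels."""
    match = EC_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a complete EC number: {text!r}")
    return tuple(int(level) for level in match.groups())


# Glycogen phosphorylase
print(parse_ec_number("EC 2.4.1.1"))
```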
Expected Value (E-value)
In sequence similarity searches, this parameter describes the number of hits expected to be found by chance given the length of the sequence and the size of the database. The lower the E-value, the higher the chance that the observed alignment is due to homology. The BLAST help pages and tutorials explain E-values in more detail.
A flat file is a plain text file containing records with no structured interrelationship. The records themselves can have an internal structure. A database stored in this way is known as a flat file database.
In genomics and genetics, the GC content is the proportion of Guanines (G) and Cytosines (C) present in a given stretch of DNA sequence. The calculation involves counting the number of Gs and Cs in a given stretch of DNA and dividing it by the total number of nucleotides/bases in that DNA stretch. The GC content can be calculated for a whole genome, as well as presented as a value for a given length of DNA. In this way, stretches of DNA with different GC content can be identified.
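The calculation described above can be sketched in a few lines of Python. The function names `gc_content` and `windowed_gc` are chosen for this example, and the use of non-overlapping windows is an assumption; overlapping (sliding) windows are equally common.

```python
def gc_content(sequence: str) -> float:
    """Return the proportion of G and C bases in a DNA sequence (0.0 to 1.0)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    gc_count = seq.count("G") + seq.count("C")
    return gc_count / len(seq)


def windowed_gc(sequence: str, window: int) -> list:
    """GC content of consecutive, non-overlapping windows of a given length."""
    return [
        gc_content(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, window)
    ]


print(gc_content("ATGC"))
print(windowed_gc("GGCCATAT", 4))
```

Comparing the values from window to window is one simple way to spot stretches of DNA whose GC content differs from the rest of the sequence.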
GenBank is a nucleotide sequence database hosted by the National Center for Biotechnology Information (NCBI). Genome sequences can be downloaded easily from this database.
Genome assembly (or sequence assembly)
Genome assembly (or sequence assembly) is the collection of scaffolds or sequenced DNA from a given organism. A genome assembly can be different from a reference genome (see below). For example, the genome assemblies of different bacterial isolates from around the world can be compared to the laboratory strain that was used to build the reference genome.
A file format that usually contains genome annotation information.
In bioinformatics, this term refers to the use of evolutionary conservation as a basis for extrapolating functional characteristics from one gene or protein to another.
In BLAST results, this value represents the number of residues (amino acids or nucleotides) that match exactly at the same position between the query and the subject, expressed as a percentage of the aligned region.
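A minimal Python sketch of this calculation follows. It assumes the query and subject have already been aligned (so both strings have the same length, with "-" marking gaps) and that the percentage is taken over the length of the alignment. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percentage of alignment columns where both sequences have the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
        if q == s and q != "-"  # a gap never counts as a match
    )
    return 100.0 * matches / len(aligned_query)


# 4 matching columns out of 6
print(percent_identity("ACGT-A", "ACCTTA"))
```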
A pseudogene is a region of the genome that contains a degraded gene. Pseudogenes originate by gene duplication and subsequent loss of function, due to accumulated mutations. Pseudogenes do not code for functional proteins.
Read (in sequencing)
In the context of DNA sequencing, a read is a stretch of inferred sequence coming from the sequencing of a DNA fragment. Read length varies depending on the sequencing technology used and could be short (20-30 nucleotides) or very long (several thousand nucleotides).
A reference genome is the representative genome of a given species. Because individuals in a given species can differ in their exact sequence, a reference genome is used as a standard against which all other sequencing of the same species is compared. In genome analysis, reference genomes are essential to guarantee reproducibility of results obtained by different research groups.
In the context of genomes, a scaffold is a non-contiguous stretch of DNA sequence. A scaffold is formed by linked contigs that are known to be close to each other, based on sequencing information, but which are separated by a gap or stretch of unknown sequence. Although the length of the gap is often known, the sequence is not. The unknown bases are often represented with Ns.
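To make the relationship between contigs, gaps, and Ns concrete, here is a small Python sketch that joins contigs into a scaffold. It assumes the contigs are already ordered and oriented and that each gap length is known. The function name `build_scaffold` is hypothetical.

```python
def build_scaffold(contigs: list, gap_lengths: list) -> str:
    """Join ordered contigs, filling each gap of known length with Ns."""
    if not contigs:
        raise ValueError("at least one contig is required")
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between neighbouring contigs
        parts.append(contig)
    return "".join(parts)


print(build_scaffold(["ACGT", "GGCC"], [3]))
```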
Score (in BLAST)
This parameter describes how good the alignment between the query and the subject is. It depends on the number of “good” and “bad” matches. The higher the score, the better the alignment is.
© Wellcome Genome Campus Advanced Courses and Scientific Conferences