This content is taken from the Wellcome Genome Campus Advanced Courses and Scientific Conferences online course, Bacterial Genomes: Accessing and Analysing Microbial Genome Data Using Artemis.
Annotation files

In this article we learn how information about genes and proteins found in genomes can be stored in files.

Genomic features found in a genome can be stored in what we call annotation files. These are text files that record information about the different features of the genome (genes and other regions of interest, such as promoters). They are designed mainly to be read by software, although most annotation formats can be read by humans too. Annotation files are not exclusive to genomic DNA: they can also be used to annotate single genes or single protein sequences. In the case of proteins, instead of marking genetic regions of interest, one can mark, for example, secondary structure regions or catalytic residues.

Typically, a genome annotation file records the location of each gene, the strand on which it is found and, sometimes, functional annotation (that is, the putative function of that gene or protein). Genomes downloaded from public databases often already contain annotation information together with the sequence data, for example in GFF or EMBL format.
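As a rough illustration of how a single record can carry location, strand and function, here is a minimal Python sketch that reads one GFF3-style line. GFF3 uses nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The example line is made up for illustration: the names `example_seq` and `example_source` and the attribute values are assumptions, not taken from a real file.

```python
from collections import namedtuple

# The nine tab-separated columns of a GFF3 feature line.
GffRecord = namedtuple(
    "GffRecord",
    "seqid source type start end score strand phase attributes",
)


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = columns
    # Column 9 holds key=value pairs separated by semicolons.
    attributes = {}
    for pair in attr_text.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return GffRecord(seqid, source, ftype, int(start), int(end),
                     score, strand, phase, attributes)


# Hypothetical line, written for this example only.
LINE = ("example_seq\texample_source\tCDS\t337\t2799\t.\t+\t0\t"
        "ID=cds2;Name=thrA;product=aspartokinase I/homoserine dehydrogenase I")

record = parse_gff3_line(LINE)
print(record.start, record.end, record.strand, record.attributes["Name"])
```

Here the location is given by the start and end columns, the strand by the `+` or `-` column, and the functional annotation by the free-form attributes at the end of the line.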

Let’s have a look at a section of an annotation file.

The St.tab file, when opened in a text editor, looks something like this:

FT   CDS             190..255
FT                   /blastp_file="../old_whole_genome/blastp/St.tab.seq.00001.out"
FT                   /class="3.1.18"
FT                   /colour=7
FT                   /ec_orthologue="LPT_ECOLI"
FT                   /fasta_file="../old_whole_genome/fasta/St.tab.seq.00001.out"
FT                   /gene="STY0001"
FT                   /gene="thrL"
FT                   /hth_file="../old_whole_genome/hth/CORBA-St.tab.seq.00001.out"
FT                   /note="Orthologue of E. coli thrL (LPT_ECOLI); Fasta hit
FT                   to LPT_ECOLI (21 aa), 86% identity in 21 aa overlap"
FT   CDS             337..2799
FT                   /blastp_file="../old_whole_genome/blastp/St.tab.seq.00002.out"
FT                   /class="3.1.18"
FT                   /colour=7
FT                   /ec_orthologue="AK1H_ECOLI"
FT                   /fasta_file="../old_whole_genome/fasta/St.tab.seq.00002.out"
FT                   /gene="STY0002"
FT                   /gene="thrA"
FT                   /hth_file="../old_whole_genome/hth/CORBA-St.tab.seq.00002.out"
FT                   /note="Orthologue of E. coli thrA (AK1H_ECOLI); Fasta hit
FT                   to AK1H_ECOLI (820 aa), 94% identity in 820 aa overlap"
FT                   /product="aspartokinase I/homoserine dehydrogenase I"


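Note that each coding sequence (CDS) is clearly marked, and that the numbers on the same line as the CDS label give its start and end positions in the genome (for example, 190..255). The lines beginning with a slash, such as /gene or /product, are qualifiers that describe the feature above them.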
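To show how regular this layout is, the following Python sketch pulls the feature key, the coordinates and the qualifiers out of a few lines in the same style. It is a simplified illustration, not how Artemis reads the file: it assumes plain `start..end` locations and ignores qualifier values that continue over more than one line. The function name `parse_features` and the trimmed excerpt are ours.

```python
import re

# A short excerpt in the style of the listing above (qualifiers trimmed).
EXAMPLE = """\
FT   CDS             190..255
FT                   /colour=7
FT                   /gene="STY0001"
FT                   /gene="thrL"
FT   CDS             337..2799
FT                   /gene="STY0002"
FT                   /gene="thrA"
FT                   /product="aspartokinase I/homoserine dehydrogenase I"
"""

# Matches a feature line such as "FT   CDS             190..255".
FEATURE_RE = re.compile(r"^FT\s{3}(\S+)\s+(\d+)\.\.(\d+)\s*$")
# Matches a qualifier line such as 'FT                   /gene="thrL"'.
QUALIFIER_RE = re.compile(r"^FT\s+/(\w+)=(.*)$")


def parse_features(text):
    """Return one dict per feature with its key, start, end and qualifiers."""
    features = []
    for line in text.splitlines():
        feature = FEATURE_RE.match(line)
        if feature:
            features.append({
                "key": feature.group(1),
                "start": int(feature.group(2)),
                "end": int(feature.group(3)),
                "qualifiers": {},
            })
            continue
        qualifier = QUALIFIER_RE.match(line)
        if qualifier and features:
            name = qualifier.group(1)
            value = qualifier.group(2).strip().strip('"')
            # A qualifier may repeat (two /gene lines), so collect values in a list.
            features[-1]["qualifiers"].setdefault(name, []).append(value)
    return features


for f in parse_features(EXAMPLE):
    length = f["end"] - f["start"] + 1  # coordinates are inclusive
    print(f["key"], f["start"], f["end"], length, f["qualifiers"].get("gene"))
# CDS 190 255 66 ['STY0001', 'thrL']
# CDS 337 2799 2463 ['STY0002', 'thrA']
```

The computed lengths agree with the notes in the listing: 66 bases is 22 codons (21 amino acids plus a stop codon), and 2463 bases is 821 codons (820 amino acids plus a stop codon).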

It is important to note that FASTA, EMBL, GenBank and similar files are essentially text files with specific formatting. This means that the file name extension (for example, the .fa or .embl we add at the end of a file name) doesn’t need to match the format; it could be .txt, and Artemis will still be able to read the file, as long as the text it contains is formatted correctly.
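The idea that the content, not the extension, identifies the format can be sketched in a few lines of Python. This is only an illustration and not a description of how Artemis detects formats. It relies on common conventions: FASTA records start with `>`, EMBL entries with an `ID` line, GenBank entries with `LOCUS`, GFF3 files with `##gff-version`, and a bare feature table with `FT` lines. The function name `guess_format` and the returned labels are our own.

```python
def guess_format(text):
    """Guess a sequence or annotation format from the first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("ID   "):
            return "embl"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("##gff-version"):
            return "gff"
        if line.startswith("FT   "):
            return "embl-feature-table"
        return "unknown"
    return "unknown"


# The file name plays no part in the decision: only the text is inspected.
print(guess_format(">seq1 example\nACGTACGT\n"))           # fasta
print(guess_format("FT   CDS             190..255\n"))     # embl-feature-table
```

A file called `genome.txt` whose first line starts with `>` would therefore still be recognised as FASTA by this sketch.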