Genome Annotation Files

© Wellcome Genome Campus Advanced Courses and Scientific Conferences

In this article, we learn how information about genes and proteins found in genomes can be stored in files.

Features found in a genome can be stored in what we call annotation files. These are text files that record information about different features of the genome (genes and other regions of interest, such as promoters) in a form designed to be read by software, although most annotation files can be read by humans too. Annotation files are not exclusive to genomic DNA: they can also be used to annotate single genes or single protein sequences. In the case of proteins, instead of genetic regions of interest, one can indicate, for example, secondary structure regions or catalytic residues.

Typically, a genome annotation file records the location of each gene and the strand on which it is found, and sometimes it also includes functional annotation (that is, the putative function of that gene or protein). Genomes downloaded from public databases often already contain annotation information together with the sequence data, for example in GFF or EMBL format.
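To make "location, strand and function" concrete, here is a minimal sketch of how one line of a GFF3 file could be read in Python. GFF3 lines have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The example line is made up for illustration: the sequence name `example_seq`, the source `example_source` and the attribute names are assumptions, not taken from a real file.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    type: str
    start: int
    end: int
    strand: str
    attributes: dict


def parse_gff3_line(line):
    """Split one GFF3 feature line into its main fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line needs 9 tab-separated columns")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    # Column 9 holds key=value pairs separated by semicolons.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if "=" in pair)
    return GffFeature(seqid, ftype, int(start), int(end), strand, attributes)


# Hypothetical line, invented for this sketch.
EXAMPLE = ("example_seq\texample_source\tCDS\t190\t255\t.\t+\t0\t"
           "ID=STY0001;Name=thrL;product=thr operon leader peptide")

feature = parse_gff3_line(EXAMPLE)
print(feature.start, feature.end, feature.strand, feature.attributes["product"])
```

The `+` in the strand column marks the forward strand; a gene on the reverse strand would carry `-` instead.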

Let’s have a look at a section of an annotation.

The file, when opened in a text editor, looks something like this:

FT   CDS             190..255
FT                   /blastp_file="../old_whole_genome/blastp/"
FT                   /class="3.1.18"
FT                   /colour=7
FT                   /ec_orthologue="LPT_ECOLI"
FT                   /fasta_file="../old_whole_genome/fasta/"
FT                   /gene="STY0001"
FT                   /gene="thrL"
FT                   /hth_file="../old_whole_genome/hth/"
FT                   /note="Orthologue of E. coli thrL (LPT_ECOLI); Fasta hit
FT                   to LPT_ECOLI (21 aa), 86% identity in 21 aa overlap"
FT                   /product="thr operon leader peptide"
FT   CDS             337..2799
FT                   /blastp_file="../old_whole_genome/blastp/"
FT                   /class="3.1.18"
FT                   /colour=7
FT                   /ec_orthologue="AK1H_ECOLI"
FT                   /fasta_file="../old_whole_genome/fasta/"
FT                   /gene="STY0002"
FT                   /gene="thrA"
FT                   /hth_file="../old_whole_genome/hth/"
FT                   /note="Orthologue of E. coli thrA (AK1H_ECOLI); Fasta hit
FT                   to AK1H_ECOLI (820 aa), 94% identity in 820 aa overlap"
FT                   /product="aspartokinase I/homoserine dehydrogenase I"

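Note that each coding sequence (CDS) is clearly marked, and that the numbers on the same line as the CDS label give its start and end positions in the genome. For example, 190..255 means the feature spans bases 190 to 255. The lines that follow, each starting with a slash, are qualifiers that describe the feature, such as its gene name and product.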
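The layout of the excerpt can be read with a few lines of Python. This is a simplified sketch, not how Artemis itself parses files: it assumes plain `start..end` locations (no `complement(...)` or `join(...)`), and the function name `parse_features` is made up for this example. The sample text is a shortened copy of the excerpt above.

```python
import re

# Shortened copy of the excerpt; whitespace between fields may vary.
SAMPLE = """\
FT CDS 190..255
FT /colour=7
FT /gene="STY0001"
FT /gene="thrL"
FT /note="Orthologue of E. coli thrL (LPT_ECOLI); Fasta hit
FT to LPT_ECOLI (21 aa), 86% identity in 21 aa overlap"
FT /product="thr operon leader peptide"
FT CDS 337..2799
FT /gene="STY0002"
FT /gene="thrA"
FT /product="aspartokinase I/homoserine dehydrogenase I"
"""

# Assumption: only simple "start..end" locations are handled.
FEATURE_RE = re.compile(r"^FT\s+(\S+)\s+(\d+)\.\.(\d+)\s*$")
QUALIFIER_RE = re.compile(r"^FT\s+/(\w+)=(.*)$")


def parse_features(text):
    """Return a list of features with type, start, end and qualifiers."""
    features = []
    last_key = None
    for line in text.splitlines():
        match = FEATURE_RE.match(line)
        if match:
            features.append({
                "type": match.group(1),
                "start": int(match.group(2)),
                "end": int(match.group(3)),
                "qualifiers": {},
            })
            last_key = None
            continue
        if not features:
            continue  # ignore anything before the first feature
        qual = QUALIFIER_RE.match(line)
        if qual:
            last_key = qual.group(1)
            value = qual.group(2).strip('"')
            # A qualifier such as /gene may appear more than once.
            features[-1]["qualifiers"].setdefault(last_key, []).append(value)
        elif line.startswith("FT") and last_key:
            # Continuation of a long qualifier value (for example /note).
            values = features[-1]["qualifiers"][last_key]
            values[-1] = (values[-1] + " " + line[2:].strip()).strip('"')
    return features


for feat in parse_features(SAMPLE):
    length = feat["end"] - feat["start"] + 1
    print(feat["type"], feat["start"], feat["end"], length, feat["qualifiers"]["gene"])
```

The length is `end - start + 1` because both positions are inclusive, so the first CDS is 66 bases long.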

It is important to note that FASTA, EMBL, GenBank and similar formats are essentially text files with specific formatting. This means that the file name extension (the .fa or .embl we add at the end of the file name) does not need to be .fasta or .embl; it could be .txt, and Artemis will still be able to read the file, as long as the text it contains is formatted correctly.
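The idea that the content, not the extension, defines the format can be illustrated with a small sketch that looks at the first non-blank line of a file. It relies on well-known conventions: FASTA records start with `>`, EMBL entries with an `ID` line, GenBank entries with `LOCUS`, and GFF3 files with `##gff-version`. The function name `guess_format` is hypothetical, and this is not a description of how Artemis detects formats.

```python
def guess_format(text):
    """Guess a sequence or annotation format from the first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("ID ") or line.startswith("FT "):
            # "ID" opens a full EMBL entry; "FT" opens a bare feature table
            # like the excerpt shown in this article.
            return "embl"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("##gff-version"):
            return "gff"
        return "unknown"
    return "unknown"


# The same check works whatever the file is called, e.g. "genome.txt".
print(guess_format(">seq1\nACGTACGT\n"))
print(guess_format("FT CDS 190..255\n"))
```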

You can download the full file from here (we recommend use of Chrome or Firefox browsers for downloading data files):

You may need to copy and paste the link in your internet browser.

This article is from the free online course Bacterial Genomes II: Accessing and Analysing Microbial Genome Data Using Artemis.
