Genome Annotation Files

Learn more about genome annotation files.
© Wellcome Genome Campus Advanced Courses and Scientific Conferences

In this article, we learn how information about genes and proteins found in genomes can be stored in files.

Genomic features can be stored in files that we call annotation files. These are text files that record the different features of a genome (genes and other regions of interest, such as promoters) in a form designed to be read by software, although most annotation files can be read by humans too. Annotation files are not exclusive to genomic DNA: they can also be used to annotate single genes or single protein sequences. For proteins, instead of marking genetic regions of interest, an annotation can mark, for example, secondary structure regions or catalytic residues.

Typically, a genome annotation file records the location of each gene and the strand on which it is found. Sometimes it also includes functional annotation, that is, the putative function of the gene or its protein product. Genomes downloaded from public databases often already contain annotation information together with the sequence data, for example in GFF or EMBL format.

Let’s have a look at a section of an annotation.

The St.tab file, when opened in a text editor, looks something like this:

FT   CDS             190..255
FT                   /blastp_file="../old_whole_genome/blastp/St.tab.seq.00001.out"
FT                   /class="3.1.18"
FT                   /colour=7
FT                   /ec_orthologue="LPT_ECOLI"
FT                   /fasta_file="../old_whole_genome/fasta/St.tab.seq.00001.out"
FT                   /gene="STY0001"
FT                   /gene="thrL"
FT                   /hth_file="../old_whole_genome/hth/CORBA-St.tab.seq.00001.out"
FT                   /note="Orthologue of E. coli thrL (LPT_ECOLI); Fasta hit
FT                   to LPT_ECOLI (21 aa), 86% identity in 21 aa overlap"
FT                   /product="thr operon leader peptide"
FT   CDS             337..2799
FT                   /blastp_file="../old_whole_genome/blastp/St.tab.seq.00002.out"
FT                   /class="3.1.18"
FT                   /colour=7
FT                   /ec_orthologue="AK1H_ECOLI"
FT                   /fasta_file="../old_whole_genome/fasta/St.tab.seq.00002.out"
FT                   /gene="STY0002"
FT                   /gene="thrA"
FT                   /hth_file="../old_whole_genome/hth/CORBA-St.tab.seq.00002.out"
FT                   /note="Orthologue of E. coli thrA (AK1H_ECOLI); Fasta hit
FT                   to AK1H_ECOLI (820 aa), 94% identity in 820 aa overlap"
FT                   /product="aspartokinase I/homoserine dehydrogenase I"

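Note that each coding sequence (CDS) is clearly marked, and that the numbers on the same line as the CDS label give its start and end positions in the genome. For example, 190..255 means the feature spans bases 190 to 255. The lines beginning with a slash (such as /gene and /product) are qualifiers that describe the feature above them.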
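To make the layout of the excerpt concrete, here is a minimal Python sketch that reads feature table lines like the ones above into a list of dictionaries. It is an illustration only, not how Artemis parses files. The function name parse_features and the shortened EXCERPT are assumptions made for this example, and the sketch handles only simple locations of the form start..end or complement(start..end).

```python
import re

# Simple locations only: "190..255" or "complement(190..255)".
LOCATION = re.compile(r"^(complement\()?(\d+)\.\.(\d+)\)?$")

# Shortened copy of the excerpt above (some qualifiers left out for brevity).
EXCERPT = """\
FT CDS 190..255
FT /gene="STY0001"
FT /gene="thrL"
FT /note="Orthologue of E. coli thrL (LPT_ECOLI); Fasta hit
FT to LPT_ECOLI (21 aa), 86% identity in 21 aa overlap"
FT /product="thr operon leader peptide"
FT CDS 337..2799
FT /gene="STY0002"
FT /gene="thrA"
FT /product="aspartokinase I/homoserine dehydrogenase I"
"""


def parse_features(text):
    """Hypothetical helper: turn FT lines into a list of feature dicts."""
    features = []
    open_key = None  # qualifier whose quoted value continues on the next line
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if not body:
            continue
        if open_key is not None:
            # Continuation of a quoted value that wrapped onto a new line.
            values = features[-1]["qualifiers"][open_key]
            values[-1] += " " + body
            if body.endswith('"'):
                open_key = None
        elif body.startswith("/"):
            if not features:
                continue  # qualifier with no feature before it: skip
            key, _, value = body[1:].partition("=")
            features[-1]["qualifiers"].setdefault(key, []).append(value)
            closed = len(value) > 1 and value.endswith('"')
            if value.startswith('"') and not closed:
                open_key = key
        else:
            # A new feature: a key such as CDS followed by its location.
            parts = body.split()
            match = LOCATION.match(parts[1]) if len(parts) == 2 else None
            if match is None:
                continue  # location syntax this sketch does not handle
            features.append({
                "type": parts[0],
                "start": int(match.group(2)),
                "end": int(match.group(3)),
                "strand": "-" if match.group(1) else "+",
                "qualifiers": {},
            })
    # Remove the surrounding double quotes from qualifier values.
    for feature in features:
        for values in feature["qualifiers"].values():
            values[:] = [v.strip('"') for v in values]
    return features


features = parse_features(EXCERPT)
for f in features:
    genes = "/".join(f["qualifiers"].get("gene", []))
    length = f["end"] - f["start"] + 1  # positions are inclusive
    print(f["type"], genes, f["start"], f["end"], f["strand"], length)
```

Because both ends of a location are included, the thrL feature at 190..255 is 66 bases long. A qualifier such as /gene can appear more than once for the same feature, which is why each qualifier is stored as a list.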

It is important to note that FASTA, EMBL, GenBank and similar files are essentially text files with specific formatting. This means that the file name extension (such as the .fa, .fasta or .embl we add at the end of the file name) is only a convention: it could be .txt, and Artemis will still be able to read the file, as long as the text it contains is correctly formatted.
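The idea that the content, not the extension, defines the format can be sketched in a few lines of Python. The function guess_format below is a hypothetical helper written for this illustration. It is not how Artemis detects formats, and it relies only on the well-known opening lines of each format: ">" for FASTA, "ID" for an EMBL entry, "LOCUS" for GenBank, "##gff-version" for GFF, and "FT" for a bare feature table like the one shown above.

```python
def guess_format(text):
    """Guess a sequence or annotation format from the first non-blank line.

    Hypothetical helper for illustration; real tools check far more.
    """
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("##gff-version"):
            return "gff"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID "):
            return "embl"
        if line.startswith("FT "):
            return "embl-feature-table"
        return "unknown"
    return "unknown"


# The file name plays no part in the decision; only the text is inspected.
print(guess_format(">seq1\nACGTACGT\n"))
print(guess_format("FT CDS 190..255\nFT /gene=\"thrL\"\n"))
```

The same text would be classified the same way whether it was saved as a .fa, .embl or .txt file.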

You can download the full file from here (we recommend using Chrome or Firefox to download data files): ftp://ftp.sanger.ac.uk/pub/resources/coursesandconferences/Online_Courses/Course3/data/S_typhi.tab

You may need to copy and paste the link into your internet browser.

This article is from the free online course Bacterial Genomes II: Accessing and Analysing Microbial Genome Data Using Artemis.

