In this article we learn how information about genes and proteins found in genomes can be stored in files.
Genomic features found in a genome can be stored in what we call annotation files. These are text files that record the different features of the genome (genes and other regions of interest, such as promoters) in a form designed mainly to be read by software, although most annotation files can be read by humans too. Annotation files are not exclusive to genomic DNA; they can also be used to annotate single genes or single protein sequences. In the case of proteins, instead of genetic regions of interest, one can indicate, for example, secondary structure regions or catalytic residues.
Typically, a genome annotation file records the location of each gene and the strand on which it is found, and sometimes it also includes functional annotation (that is, the putative function of that gene or protein). Often, genomes downloaded from public databases already contain annotation information together with the sequence data. This might be in GFF or EMBL format.
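To make "location, strand and functional annotation" concrete, here is a minimal sketch of reading one feature line in GFF3 format, which has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The example line, the `GffFeature` type and the `parse_gff3_line` function are made up for illustration and do not come from the course data.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    """The fields of a GFF3 feature line that this sketch keeps."""
    seqid: str
    feature_type: str
    start: int
    end: int
    strand: str
    attributes: dict


def parse_gff3_line(line):
    """Split one GFF3 feature line into its location, strand and attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has 9 tab-separated columns")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    # Attributes are key=value pairs separated by semicolons.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
    return GffFeature(seqid, ftype, int(start), int(end), strand, attributes)


# Hypothetical line, for illustration only.
example = "genome1\texample\tgene\t190\t255\t.\t+\t.\tID=gene0001;Name=thrL"
feature = parse_gff3_line(example)
print(feature.start, feature.end, feature.strand, feature.attributes["Name"])
```

In practice a dedicated library is usually a better choice than hand-written parsing; the sketch is only meant to show what kind of information each column carries.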
Let’s have a look at a section of an annotation.
The St.tab file, when opened in a text editor, looks something like this:
FT   CDS             190..255
FT                   /blastp_file="../old_whole_genome/blastp/St.tab.seq.00001.out"
FT                   /class="3.1.18"
FT                   /colour=7
FT                   /ec_orthologue="LPT_ECOLI"
FT                   /fasta_file="../old_whole_genome/fasta/St.tab.seq.00001.out"
FT                   /gene="STY0001"
FT                   /gene="thrL"
FT                   /hth_file="../old_whole_genome/hth/CORBA-St.tab.seq.00001.out"
FT                   /note="Orthologue of E. coli thrL (LPT_ECOLI); Fasta hit
FT                   to LPT_ECOLI (21 aa), 86% identity in 21 aa overlap"
FT                   /product="thr operon leader peptide"
FT   CDS             337..2799
FT                   /blastp_file="../old_whole_genome/blastp/St.tab.seq.00002.out"
FT                   /class="3.1.18"
FT                   /colour=7
FT                   /ec_orthologue="AK1H_ECOLI"
FT                   /fasta_file="../old_whole_genome/fasta/St.tab.seq.00002.out"
FT                   /gene="STY0002"
FT                   /gene="thrA"
FT                   /hth_file="../old_whole_genome/hth/CORBA-St.tab.seq.00002.out"
FT                   /note="Orthologue of E. coli thrA (AK1H_ECOLI); Fasta hit
FT                   to AK1H_ECOLI (820 aa), 94% identity in 820 aa overlap"
FT                   /product="aspartokinase I/homoserine dehydrogenase I"
Note that the CDS (coding sequence) features are clearly marked, and that the numbers on the same line as the CDS label give the start and end positions of the feature in the genome.
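Because the layout is so regular, the feature lines can also be read by a short script. The following is a minimal sketch, not a full EMBL parser: the names `parse_features` and `simple_span` are hypothetical, the excerpt is a shortened copy of the two CDS features in the section above, and only plain `start..end` locations are handled (EMBL locations can also use forms such as `complement(...)` or `join(...)`, which this sketch ignores).

```python
import re

# Shortened copy of the two CDS features (some qualifiers omitted).
EXCERPT = '''\
FT   CDS             190..255
FT                   /colour=7
FT                   /gene="STY0001"
FT                   /gene="thrL"
FT                   /note="Orthologue of E. coli thrL (LPT_ECOLI); Fasta hit
FT                   to LPT_ECOLI (21 aa), 86% identity in 21 aa overlap"
FT                   /product="thr operon leader peptide"
FT   CDS             337..2799
FT                   /gene="STY0002"
FT                   /gene="thrA"
FT                   /product="aspartokinase I/homoserine dehydrogenase I"
'''


def parse_features(text):
    """Return one dict per feature, built from EMBL-style FT lines."""
    features = []
    name = None          # qualifier currently being read
    open_quote = False   # True while a quoted value continues on the next line
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if not body:
            continue
        if open_quote:
            # Continuation of a quoted value that wrapped onto this line.
            values = features[-1]["qualifiers"][name]
            values[-1] += " " + body.rstrip('"')
            open_quote = not body.endswith('"')
        elif body.startswith("/"):
            # A qualifier such as /gene="thrL" or /colour=7.
            name, _, value = body[1:].partition("=")
            closed = len(value) > 1 and value.endswith('"')
            open_quote = value.startswith('"') and not closed
            features[-1]["qualifiers"].setdefault(name, []).append(value.strip('"'))
        else:
            # A new feature: key (e.g. CDS) followed by its location.
            key, location = body.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


def simple_span(location):
    """Return (start, end) for a plain 'start..end' location, else None."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    return (int(match.group(1)), int(match.group(2))) if match else None


for feat in parse_features(EXCERPT):
    print(feat["key"], simple_span(feat["location"]), feat["qualifiers"]["gene"])
```

Each qualifier is stored as a list because, as the excerpt shows, the same qualifier (here `/gene`) can appear more than once for a single feature.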
It is important to note that FASTA, EMBL, GenBank and similar formats are essentially text files with specific formatting. This means that the file name extension (that is, the .fa or .embl we add at the end of the file name) doesn’t need to be .fasta or .embl; it could be .txt, and Artemis will still be able to read the file, as long as the text it contains is formatted correctly.
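The idea that the content, not the extension, determines the format can be illustrated with a small sketch that guesses the format from the first non-blank line of a file. This is an assumption-laden illustration and is not how Artemis itself decides: the function name `guess_format` is hypothetical, and it relies only on well-known line prefixes (`>` for FASTA headers, `LOCUS` for GenBank, `ID` or `FT` for EMBL-style files, `##gff-version` for GFF).

```python
def guess_format(text):
    """Guess a sequence or annotation format from the first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("##gff-version"):
            return "gff"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith(("ID ", "FT ")):
            return "embl"
        return "unknown"
    return "unknown"


# The file name plays no part in the decision: only the text is inspected.
print(guess_format(">my_sequence\nACGTACGT\n"))
print(guess_format("FT   CDS             190..255\n"))
```

A file saved as `genome.txt` would therefore be recognised in exactly the same way as one saved as `genome.embl`, provided the text inside follows the format.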
You can download the full file from here (we recommend using the Chrome or Firefox browsers for downloading data files): ftp://ftp.sanger.ac.uk/pub/resources/coursesandconferences/Online_Courses/Course3/data/S_typhi.tab
You may need to copy and paste the link into your internet browser.
© Wellcome Genome Campus Advanced Courses and Scientific Conferences