Illumina MiSeq Output

Reading and writing DNA and protein sequences

In this article, you will learn about the most basic format for storing DNA and protein sequences: the FASTA file.

DNA sequence data are commonly stored in text files, sometimes also called flat files. These are files that can be opened in almost any text editor. The most common type of file is called a FASTA file, in which sequences are stored in FASTA format.

The name FASTA derives from a software package written in the mid-1980s that searches quickly through large collections of sequence data. The name is short for “FAST-All”, reflecting that the software handles both nucleotide (FAST-N) and protein (FAST-P) comparisons.

The FASTA format must, at a minimum, have a header (always preceded by a “>”) in the first line of the file, and the sequence starting in the second line. The header includes some minimal information about the sequence. For example, the hpcC gene from Escherichia coli, with accession number X81322.1, can be represented as follows:

>X81322.1 E.coli hpcC gene 
GAAGTAGAAGGCGTGGGCCGCCTGGTGAACCGAATTGTTGAGTGAGGAAACAGCGAAATG
AAAAAAGTAAATCATTGGATCAACGGCAAAAATGTTGCAGGTAACGACTACTTCCTGACC
ACCAATCCGGCAACGGGTGAAGTGCTGGCGGATGTGGCCTCTGGCGGTGAAGCGGAGATC
AATCAGGCGGTAGCGACAGCGAAAGAGGCGTTCCCGAAATGGGCCAATCTGCCGATGAAA
GAGCGTGCGCGCCTGATGCGCCGTCTGGGCGATCTGATCGACCAGAACGTGCCAGAGATC
GCCGCGATGGAAACCGCGGACACGGGCCTGCCGATCCATCAGACCAAAAATGTGTTGATC
CCACGCGCTTCTCACAACTTTGAATTTTTCGCGGAAGTCTGCCAGCAGATGAACGGCAAG
ACTTATCCGGTCGACGACAAGATGCTCAACTACACGCTGGTGCAGCCGGTAGGCGTTTGT
GCACTGGTGTCACCGTGGAACGTGCCGTTTATGACCGCCACCTGGAAGGTCGCGCCGTGT
CTGGCGCTGGGCATTACCGCGGTGCTGAAGATGTCCGAACTCTCCCCGCTGACCGCTGAC
CGCCTGGGTGAGCTGGCGCTGGAAGCCGGTATTCCGGCGGGCGTTCTGAACGTGGTACAG
GGCTACGGCGCAACCGCAGGCGATGCGCTGGTCCGTCATCATGACGTGCGTGCCGTGTCG
TTCACCGGCGGTACGGCGACCGGGCGCAATATCATGAAAAACGCCGGGCTGAAAAAATAC
TCCATGGAACTGGGCGGTAAATCGCCGGTGCTGATTTTTGAAGATGCCGATATTGAGCGC
GCGCTGGACGCCGCCCTGTTCACCATCTTCTCGATCAACGGCGAGCGCTGCACCGCCGGT
TCGCGCATCTTTATTCAACAAAGCATCTACCCGGAATTCGTGAAATTTGCCGAACGCGCC
AACCGTGTGCGCGTGGGCGATCCGACCGATCCGAATACCCAGGTTGGGGCGCTTATCAGC
CAGCAACACTGGGAAAAAGTCTCCGGCTATATCCGTCTGGGCATTGAAGAAGGCGCCACC
CTGCTGGCGGGCGGCCCGGATAAACCGTCTGACCTGCCTGCACACCTGAAAGGCGGCAAC
TTCCTGCGCCCAACGGTGCTGGCGGACGTAGATAACCGTATGCGCGTTGCCCAGGAAGAG
ATTTTCGGGCCGGTCGCCTGCCTGCTGCCGTTTAAAGACGAAGCCGAAGCGTTACGCCTG
GCAAACGACGTGGAGTATGGCCTCGCGTCGTACATCTGGACACAGGATGTCAGCAAAGTG
CTGCGTCTGGCGCGCGGCATTGAAGCAGGCATGGTGTTCGTCAACACCCAGTTCGTGCGT
GACCTGCGCCACGCATTTGGCGGCGTAAAACCTCGCACCGGGCGTGAAGGCGGTGGATAC
AGTTCGAAGTGTTCGCGGAAATGAAGAAGAACGTCTGCATTCCATGGCGGACCATCCCA

You can find the corresponding entry in the NCBI database by searching for its accession number, X81322.1.
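The header-plus-sequence layout described above is simple enough to read with a few lines of Python. The following is a minimal sketch, not a reference implementation: the function name `parse_fasta` is hypothetical, and the code assumes that any text appearing before the first “>” line can be ignored. Only the first two sequence lines of the hpcC entry are used, to keep the example short.

```python
def parse_fasta(text):
    """Yield (header, sequence) tuples from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            # Sequence lines are joined; text before the first header is ignored.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# First two sequence lines of the hpcC entry shown above.
example = """>X81322.1 E.coli hpcC gene
GAAGTAGAAGGCGTGGGCCGCCTGGTGAACCGAATTGTTGAGTGAGGAAACAGCGAAATG
AAAAAAGTAAATCATTGGATCAACGGCAAAAATGTTGCAGGTAACGACTACTTCCTGACC
"""

records = list(parse_fasta(example))
header, sequence = records[0]
accession = header.split()[0]  # first word of the header
print(accession)
print(len(sequence))
```

Because a new record starts at every “>” line, the same function also handles files that contain several sequences one after another.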

Files containing FASTA sequences are commonly named with the extension “.fa” or “.fasta”. For example, if I were to save the above sequence into a file, I could call it “Ecoli_hpcC.fasta”. It is not compulsory to call a DNA sequence file *.fasta or *.fa (where * represents any combination of letters or numbers used to name a file). I could call it “Ecoli_hpcC.mickeymouse” if I wanted, but the name would mean absolutely nothing to other people. The world of bioinformatics is full of conventions that are really unwritten rules. We choose to follow them to ease communication and to share data with other scientists.
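Writing a FASTA file is as simple as reading one. The sketch below is illustrative only: `format_fasta` is a hypothetical helper, the header and sequence are made up for the demonstration, and the width of 60 letters per line simply mirrors the DNA example above rather than any requirement of the format.

```python
import os
import tempfile


def format_fasta(header, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width` letters."""
    lines = [">" + header]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical 80-letter sequence, used only to demonstrate the wrapping.
record = format_fasta("example_1 demo sequence", "ACGT" * 20, width=60)

# Save under the conventional ".fasta" extension in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as handle:
    handle.write(record)
```

The resulting file has a header line followed by one line of 60 letters and one of 20, and it opens in any text editor.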

When describing the length of a DNA sequence, we talk in terms of “bases” or “base pairs”. The term “base pairs” implies that the molecule is double-stranded, but the choice of term makes no difference to the length: a 100-base molecule is the same length as a 100-base-pair molecule.
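In practice, the length is obtained by counting the letters of the single strand written in the file, leaving out the header and the line breaks. A small sketch, assuming the text holds exactly one record (the name `sequence_length` is hypothetical):

```python
def sequence_length(fasta_record):
    """Count sequence letters in one FASTA record, ignoring the header and line breaks."""
    lines = fasta_record.strip().splitlines()
    sequence_lines = [line.strip() for line in lines if not line.startswith(">")]
    return sum(len(line) for line in sequence_lines)


# First two sequence lines of the hpcC entry (60 letters each).
record = """>X81322.1 E.coli hpcC gene
GAAGTAGAAGGCGTGGGCCGCCTGGTGAACCGAATTGTTGAGTGAGGAAACAGCGAAATG
AAAAAAGTAAATCATTGGATCAACGGCAAAAATGTTGCAGGTAACGACTACTTCCTGACC
"""

n = sequence_length(record)
print(f"{n} bases ({n} bp if the molecule is double-stranded)")
```

The number is the same whether it is reported in bases or in base pairs; only the unit changes.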

It is important to note that FASTA sequences are not restricted to DNA sequences; they can also be used to represent protein sequences, in which each letter represents a single amino acid. Here is an example of a FASTA file for a protein sequence:

>CAA57102.1 dehydrogenase enzyme [Escherichia coli]
MKKVNHWINGKNVAGNDYFLTTNPATGEVLADVASGGEAEINQAVATAKEAFPKWANLPMKERARLMRRL
GDLIDQNVPEIAAMETADTGLPIHQTKNVLIPRASHNFEFFAEVCQQMNGKTYPVDDKMLNYTLVQPVGV
CALVSPWNVPFMTATWKVAPCLALGITAVLKMSELSPLTADRLGELALEAGIPAGVLNVVQGYGATAGDA
LVRHHDVRAVSFTGGTATGRNIMKNAGLKKYSMELGGKSPVLIFEDADIERALDAALFTIFSINGERCTA
GSRIFIQQSIYPEFVKFAERANRVRVGDPTDPNTQVGALISQQHWEKVSGYIRLGIEEGATLLAGGPDKP
SDLPAHLKGGNFLRPTVLADVDNRMRVAQEEIFGPVACLLPFKDEAEALRLANDVEYGLASYIWTQDVSK
VLRLARGIEAGMVFVNTQFVRDLRHAFGGVKPRTGREGGGYSSKCSRK

In the example above, the protein FASTA sequence of an Escherichia coli dehydrogenase is shown. This protein has accession number “CAA57102.1” as shown in the header of the FASTA entry.
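Since DNA and protein records share the same layout, the only clue to the type of sequence is the set of letters it uses. The sketch below shows one rough way to guess the type. It is a heuristic, not a rule of the format: the name `guess_sequence_type` and the choice of A, C, G, T, U and N as the nucleotide letters are assumptions made for this example, and a short protein made up only of those letters (all of which are also valid amino acid codes) would be misclassified.

```python
# Assumed nucleotide alphabet: the four DNA bases, U for RNA, N for "unknown".
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence):
    """Guess 'nucleotide' or 'protein' from the letters used (heuristic only)."""
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("empty sequence")
    if letters <= NUCLEOTIDE_LETTERS:
        return "nucleotide"
    return "protein"


dna = "GAAGTAGAAGGCGTGGGCCG"      # start of the hpcC gene sequence
protein = "MKKVNHWINGKNVAGNDYFL"  # start of the dehydrogenase sequence

print(guess_sequence_type(dna))      # nucleotide
print(guess_sequence_type(protein))  # protein
```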

This article is from the free online course:

Bacterial Genomes: From DNA to Protein Function Using Bioinformatics

Wellcome Genome Campus Advanced Courses and Scientific Conferences