Reading and writing DNA and protein sequences
In this article, you will learn about the most basic format for storing DNA and protein sequences: the FASTA file.
DNA sequence data are commonly stored in text files, sometimes also called flat files. These are files that can be opened in almost any text editor. The most common type of file is called a FASTA file, in which sequences are stored in FASTA format.
The name FASTA derives from a software package written in the mid-1980s that searches quickly through large collections of sequence data. The software is called FASTA, but is also known as FAST-N (nucleotide) and FAST-P (protein).
A FASTA file must, at a minimum, have a header line (always beginning with “>”) as the first line of the file, with the sequence starting on the second line. The header includes some minimal information about the sequence. For example, the hpcC gene from Escherichia coli, with accession number X81322.1, can be represented as follows:
>X81322.1 E.coli hpcC gene
GAAGTAGAAGGCGTGGGCCGCCTGGTGAACCGAATTGTTGAGTGAGGAAACAGCGAAATG
AAAAAAGTAAATCATTGGATCAACGGCAAAAATGTTGCAGGTAACGACTACTTCCTGACC
ACCAATCCGGCAACGGGTGAAGTGCTGGCGGATGTGGCCTCTGGCGGTGAAGCGGAGATC
AATCAGGCGGTAGCGACAGCGAAAGAGGCGTTCCCGAAATGGGCCAATCTGCCGATGAAA
GAGCGTGCGCGCCTGATGCGCCGTCTGGGCGATCTGATCGACCAGAACGTGCCAGAGATC
GCCGCGATGGAAACCGCGGACACGGGCCTGCCGATCCATCAGACCAAAAATGTGTTGATC
CCACGCGCTTCTCACAACTTTGAATTTTTCGCGGAAGTCTGCCAGCAGATGAACGGCAAG
ACTTATCCGGTCGACGACAAGATGCTCAACTACACGCTGGTGCAGCCGGTAGGCGTTTGT
GCACTGGTGTCACCGTGGAACGTGCCGTTTATGACCGCCACCTGGAAGGTCGCGCCGTGT
CTGGCGCTGGGCATTACCGCGGTGCTGAAGATGTCCGAACTCTCCCCGCTGACCGCTGAC
CGCCTGGGTGAGCTGGCGCTGGAAGCCGGTATTCCGGCGGGCGTTCTGAACGTGGTACAG
GGCTACGGCGCAACCGCAGGCGATGCGCTGGTCCGTCATCATGACGTGCGTGCCGTGTCG
TTCACCGGCGGTACGGCGACCGGGCGCAATATCATGAAAAACGCCGGGCTGAAAAAATAC
TCCATGGAACTGGGCGGTAAATCGCCGGTGCTGATTTTTGAAGATGCCGATATTGAGCGC
GCGCTGGACGCCGCCCTGTTCACCATCTTCTCGATCAACGGCGAGCGCTGCACCGCCGGT
TCGCGCATCTTTATTCAACAAAGCATCTACCCGGAATTCGTGAAATTTGCCGAACGCGCC
AACCGTGTGCGCGTGGGCGATCCGACCGATCCGAATACCCAGGTTGGGGCGCTTATCAGC
CAGCAACACTGGGAAAAAGTCTCCGGCTATATCCGTCTGGGCATTGAAGAAGGCGCCACC
CTGCTGGCGGGCGGCCCGGATAAACCGTCTGACCTGCCTGCACACCTGAAAGGCGGCAAC
TTCCTGCGCCCAACGGTGCTGGCGGACGTAGATAACCGTATGCGCGTTGCCCAGGAAGAG
ATTTTCGGGCCGGTCGCCTGCCTGCTGCCGTTTAAAGACGAAGCCGAAGCGTTACGCCTG
GCAAACGACGTGGAGTATGGCCTCGCGTCGTACATCTGGACACAGGATGTCAGCAAAGTG
CTGCGTCTGGCGCGCGGCATTGAAGCAGGCATGGTGTTCGTCAACACCCAGTTCGTGCGT
GACCTGCGCCACGCATTTGGCGGCGTAAAACCTCGCACCGGGCGTGAAGGCGGTGGATAC
AGTTCGAAGTGTTCGCGGAAATGAAGAAGAACGTCTGCATTCCATGGCGGACCATCCCA
You can find the corresponding entry in the NCBI database by searching for the accession number X81322.1.
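The two rules described above (a header line beginning with “>”, followed by the sequence) are enough to read a FASTA file with a few lines of code. The following is only a minimal sketch in Python, not a reference implementation: the function name `parse_fasta` and the variable names are made up for illustration, and it assumes that a sequence may be split over several lines and that blank lines can be ignored. Only the first two sequence lines of the hpcC example are used to keep it short.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from lines of FASTA text.

    Hypothetical helper for illustration only.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without gaps
        else:
            raise ValueError("sequence data found before any '>' header")
    if header is not None:
        yield header, "".join(chunks)


# First two sequence lines of the hpcC entry, used as a small test input.
example = """>X81322.1 E.coli hpcC gene
GAAGTAGAAGGCGTGGGCCGCCTGGTGAACCGAATTGTTGAGTGAGGAAACAGCGAAATG
AAAAAAGTAAATCATTGGATCAACGGCAAAAATGTTGCAGGTAACGACTACTTCCTGACC
"""

records = list(parse_fasta(example.splitlines()))
header, sequence = records[0]
accession = header.split()[0]  # first word of the header
print(accession, len(sequence))
```

To read a real file, the same function could be given an open file handle instead of `example.splitlines()`, since a file handle also yields one line at a time.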
Files containing FASTA sequences are commonly given the extension “.fa” or “.fasta”. For example, if I were to save the above sequence to a file, I could call it “Ecoli_hpcC.fasta”. It is not compulsory to name a DNA sequence file *.fasta or *.fa (where * stands for any combination of characters used to name a file). I could call it “Ecoli_hpcC.mickeymouse” if I wanted, but the extension would mean nothing to other people. The world of bioinformatics is full of conventions that are really unwritten rules. We choose to follow them to ease communication and to share data with other scientists.
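Writing a FASTA file is just as simple as naming one. The sketch below, again in Python, saves a single record under the conventional “.fasta” extension. The function name `write_fasta` is hypothetical, the default line width of 60 letters is an assumption chosen to match the DNA example in this article rather than a requirement of the format, and the file is written to a temporary directory so that the example leaves nothing behind in your working folder.

```python
import os
import tempfile


def write_fasta(path, header, sequence, width=60):
    """Write one FASTA record: a '>' header line, then wrapped sequence lines.

    Hypothetical helper; `width` is a convention, not a rule of the format.
    """
    with open(path, "w") as handle:
        handle.write(">" + header + "\n")
        # Slice the sequence into consecutive pieces of at most `width` letters.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# The first 70 bases of the hpcC sequence, used as a small test input.
fragment = (
    "GAAGTAGAAGGCGTGGGCCGCCTGGTGAACCGAATTGTTGAGTGAGGAAACAGCGAAATG"
    "AAAAAAGTAA"
)

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "Ecoli_hpcC.fasta")
write_fasta(out_path, "X81322.1 E.coli hpcC gene (first 70 bases)", fragment)

# Read the file back to check what was written.
with open(out_path) as handle:
    written_lines = handle.read().splitlines()
print(written_lines[0])
```

The file on disk then contains the header line, one full line of 60 bases, and a final shorter line with the remaining 10 bases.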
When describing the length of a DNA sequence, we talk in terms of “bases” or “base pairs”. The difference is that “base pairs” refers to a double-stranded molecule, in which each base is paired with its partner on the opposite strand. This distinction has no consequence for the length: a 100-base molecule is the same length as a 100-base-pair molecule.
It is important to note that the FASTA format is not restricted to DNA sequences. It can also be used to represent protein sequences, in which each letter stands for a single amino acid. Here is an example of a FASTA file for a protein sequence:
>CAA57102.1 dehydrogenase enzyme [Escherichia coli]
MKKVNHWINGKNVAGNDYFLTTNPATGEVLADVASGGEAEINQAVATAKEAFPKWANLPMKERARLMRRL
GDLIDQNVPEIAAMETADTGLPIHQTKNVLIPRASHNFEFFAEVCQQMNGKTYPVDDKMLNYTLVQPVGV
CALVSPWNVPFMTATWKVAPCLALGITAVLKMSELSPLTADRLGELALEAGIPAGVLNVVQGYGATAGDA
LVRHHDVRAVSFTGGTATGRNIMKNAGLKKYSMELGGKSPVLIFEDADIERALDAALFTIFSINGERCTA
GSRIFIQQSIYPEFVKFAERANRVRVGDPTDPNTQVGALISQQHWEKVSGYIRLGIEEGATLLAGGPDKP
SDLPAHLKGGNFLRPTVLADVDNRMRVAQEEIFGPVACLLPFKDEAEALRLANDVEYGLASYIWTQDVSK
VLRLARGIEAGMVFVNTQFVRDLRHAFGGVKPRTGREGGGYSSKCSRK
In the example above, the protein FASTA sequence of an Escherichia coli dehydrogenase is shown. This protein has accession number “CAA57102.1” as shown in the header of the FASTA entry.
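Because DNA and protein records share the same format, nothing in the file itself states which kind of sequence it holds. One rough way to tell them apart is to look at the letters used. The Python sketch below is only a heuristic under stated assumptions: the names `DNA_LETTERS` and `guess_sequence_type` are invented for illustration, DNA is assumed to use only A, C, G, T and N (for an unknown base), and anything else is treated as protein. A very short protein made up solely of alanine (A), cysteine (C), glycine (G) and threonine (T) would therefore be misclassified as DNA.

```python
# Assumption: DNA sequences use only these letters (N marks an unknown base).
DNA_LETTERS = set("ACGTN")


def guess_sequence_type(sequence):
    """Return "DNA" or "protein" based on the letters in the sequence.

    Hypothetical heuristic for illustration; it cannot be fully reliable.
    """
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("empty sequence")
    # If every letter is a DNA letter, call it DNA; otherwise call it protein.
    return "DNA" if letters <= DNA_LETTERS else "protein"


dna_guess = guess_sequence_type("GAAGTAGAAGGCGTGGGCCG")
protein_guess = guess_sequence_type("MKKVNHWINGKNVAGNDYFL")
print(dna_guess)      # DNA
print(protein_guess)  # protein
```

In practice, the header and the source database are a more dependable indication of the sequence type than the letters alone.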
© Wellcome Genome Campus Advanced Courses and Scientific Conferences