
What is a FASTA file?

A screen image of Illumina MiSeq Output
© Wellcome Genome Campus Advanced Courses and Scientific Conferences

In this article, you will learn the most basic format for the storage of DNA and protein sequences, the FASTA file.

FASTA File for DNA Sequence Data

DNA sequence data are commonly stored in text files, sometimes also called flat files. These are files that can be opened in almost any text editor. The most common type of file is called a FASTA file, in which sequences are stored in FASTA format.

The name FASTA derives from a software package written in the mid-1980s that searches quickly through large collections of sequence data. The software is called FASTA (“FAST-All”), with the related programs FAST-N for nucleotide searches and FAST-P for protein searches.

The FASTA format must, at a minimum, have a header (always preceded by a “>”) in the first line of the file, and the sequence starting in the second line. The header includes some minimal information about the sequence. For example, the hpcC gene from Escherichia coli, with accession number X81322.1, can be represented as follows:

>X81322.1 E.coli hpcC gene
GAAGTAGAAGGCGTGGGCCGCCTGGTGAACCGAATTGTTGAGTGAGGAAACAGCGAAATG
AAAAAAGTAAATCATTGGATCAACGGCAAAAATGTTGCAGGTAACGACTACTTCCTGACC
ACCAATCCGGCAACGGGTGAAGTGCTGGCGGATGTGGCCTCTGGCGGTGAAGCGGAGATC
AATCAGGCGGTAGCGACAGCGAAAGAGGCGTTCCCGAAATGGGCCAATCTGCCGATGAAA
GAGCGTGCGCGCCTGATGCGCCGTCTGGGCGATCTGATCGACCAGAACGTGCCAGAGATC
GCCGCGATGGAAACCGCGGACACGGGCCTGCCGATCCATCAGACCAAAAATGTGTTGATC
CCACGCGCTTCTCACAACTTTGAATTTTTCGCGGAAGTCTGCCAGCAGATGAACGGCAAG
ACTTATCCGGTCGACGACAAGATGCTCAACTACACGCTGGTGCAGCCGGTAGGCGTTTGT
GCACTGGTGTCACCGTGGAACGTGCCGTTTATGACCGCCACCTGGAAGGTCGCGCCGTGT
CTGGCGCTGGGCATTACCGCGGTGCTGAAGATGTCCGAACTCTCCCCGCTGACCGCTGAC
CGCCTGGGTGAGCTGGCGCTGGAAGCCGGTATTCCGGCGGGCGTTCTGAACGTGGTACAG
GGCTACGGCGCAACCGCAGGCGATGCGCTGGTCCGTCATCATGACGTGCGTGCCGTGTCG
TTCACCGGCGGTACGGCGACCGGGCGCAATATCATGAAAAACGCCGGGCTGAAAAAATAC
TCCATGGAACTGGGCGGTAAATCGCCGGTGCTGATTTTTGAAGATGCCGATATTGAGCGC
GCGCTGGACGCCGCCCTGTTCACCATCTTCTCGATCAACGGCGAGCGCTGCACCGCCGGT
TCGCGCATCTTTATTCAACAAAGCATCTACCCGGAATTCGTGAAATTTGCCGAACGCGCC
AACCGTGTGCGCGTGGGCGATCCGACCGATCCGAATACCCAGGTTGGGGCGCTTATCAGC
CAGCAACACTGGGAAAAAGTCTCCGGCTATATCCGTCTGGGCATTGAAGAAGGCGCCACC
CTGCTGGCGGGCGGCCCGGATAAACCGTCTGACCTGCCTGCACACCTGAAAGGCGGCAAC
TTCCTGCGCCCAACGGTGCTGGCGGACGTAGATAACCGTATGCGCGTTGCCCAGGAAGAG
ATTTTCGGGCCGGTCGCCTGCCTGCTGCCGTTTAAAGACGAAGCCGAAGCGTTACGCCTG
GCAAACGACGTGGAGTATGGCCTCGCGTCGTACATCTGGACACAGGATGTCAGCAAAGTG
CTGCGTCTGGCGCGCGGCATTGAAGCAGGCATGGTGTTCGTCAACACCCAGTTCGTGCGT
GACCTGCGCCACGCATTTGGCGGCGTAAAACCTCGCACCGGGCGTGAAGGCGGTGGATAC
AGTTCGAAGTGTTCGCGGAAATGAAGAAGAACGTCTGCATTCCATGGCGGACCATCCCA
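
To make the header-then-sequence layout concrete, here is a minimal sketch of how such a file could be read in Python. The function name parse_fasta is my own choice for illustration, not part of any library, and the sketch assumes that a line starting with “>” opens a new record and that every following line belongs to that record's sequence. For real work, an established library such as Biopython is usually the safer option.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence.

    Hypothetical helper for illustration only.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the '>' marker
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[header] += line  # sequence may span several lines
    return records


# The first two sequence lines of the hpcC example, used as input
example = [
    ">X81322.1 E.coli hpcC gene",
    "GAAGTAGAAGGCGTGGGCCGCCTGGTGAACCGAATTGTTGAGTGAGGAAACAGCGAAATG",
    "AAAAAAGTAAATCATTGGATCAACGGCAAAAATGTTGCAGGTAACGACTACTTCCTGACC",
]
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

The same function accepts an open file handle, because iterating over a text file yields its lines one at a time.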

You can find the database entry in the NCBI database under accession number X81322.1.

Files containing FASTA sequences are commonly given the extension “.fa” or “.fasta”. For example, if I were to save the above sequence into a file, I could call it “Ecoli_hpcC.fasta”. It is not compulsory to name a DNA sequence file *.fasta or *.fa (where * represents any combination of letters or numbers used to name a file). I could call it “Ecoli_hpcC.mickeymouse” if I wanted, but that extension would mean nothing to other people. The world of bioinformatics is full of conventions that are really unwritten rules. We choose to follow them to ease communication and to share data with other scientists.
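
As a small illustration of saving a sequence under the conventional extension, the sketch below formats a record and writes it to a file called “Ecoli_hpcC.fasta”. The helper format_fasta, the 60-character line width and the toy sequence are assumptions made for this example. The format itself does not fix a line width, and the file is written to a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Return a FASTA record with the sequence wrapped at `width` characters.

    Hypothetical helper for illustration only.
    """
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + chunks) + "\n"


# Toy data, not a real gene: 140 bases made by repeating "ACGT"
text = format_fasta("toy_record example sequence", "ACGT" * 35)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Ecoli_hpcC.fasta"  # conventional ".fasta" extension
    path.write_text(text)
    written = path.read_text()

print(written)
```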

When describing the length of DNA sequences, we talk in terms of “bases” or “base pairs”. The difference is that “base pairs” implies a double-stranded molecule, so both strands are counted. This nomenclature has no consequence for the length, however: a 100-base molecule is the same length as a 100-base pair molecule.
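
In practice, the length of a FASTA sequence is simply the number of sequence characters, ignoring the header and the line breaks. The short sketch below shows one way to count them. The function name sequence_length and the toy record are made up for this example, and the code assumes the input holds a single record.

```python
def sequence_length(fasta_lines):
    """Count sequence characters, skipping header lines and surrounding whitespace.

    Hypothetical helper; assumes the lines belong to a single record.
    """
    return sum(
        len(line.strip()) for line in fasta_lines if not line.startswith(">")
    )


# Toy record, not real data: 10 + 5 sequence characters
toy = [">toy_record (hypothetical)", "ACGTACGTAC", "GGTTA"]
n = sequence_length(toy)
print(f"{n} bases, or {n} bp if the molecule is double-stranded")
```

The number is the same whichever unit is used to report it.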

It is important to note that FASTA sequences are not restricted to DNA sequences. They can also be used to represent protein sequences, in which each letter represents a single amino acid. Here is an example of a FASTA file for a protein sequence.

>CAA57102.1 dehydrogenase enzyme [Escherichia coli]
MKKVNHWINGKNVAGNDYFLTTNPATGEVLADVASGGEAEINQAVATAKEAFPKWANLPMKERARLMRRL
GDLIDQNVPEIAAMETADTGLPIHQTKNVLIPRASHNFEFFAEVCQQMNGKTYPVDDKMLNYTLVQPVGV
CALVSPWNVPFMTATWKVAPCLALGITAVLKMSELSPLTADRLGELALEAGIPAGVLNVVQGYGATAGDA
LVRHHDVRAVSFTGGTATGRNIMKNAGLKKYSMELGGKSPVLIFEDADIERALDAALFTIFSINGERCTA
GSRIFIQQSIYPEFVKFAERANRVRVGDPTDPNTQVGALISQQHWEKVSGYIRLGIEEGATLLAGGPDKP
SDLPAHLKGGNFLRPTVLADVDNRMRVAQEEIFGPVACLLPFKDEAEALRLANDVEYGLASYIWTQDVSK
VLRLARGIEAGMVFVNTQFVRDLRHAFGGVKPRTGREGGGYSSKCSRK

In the example above, the protein FASTA sequence of an Escherichia coli dehydrogenase is shown. This protein has accession number “CAA57102.1” as shown in the header of the FASTA entry.
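
The sketch below shows one way to pull the accession number out of a header and to guess whether a sequence is DNA or protein. Both helpers are hypothetical. split_header assumes the common convention that the identifier is the first whitespace-delimited token of the header, and looks_like_dna is only a rough heuristic based on the letters A, C, G, T and N, so a protein made up solely of those letters would be misclassified.

```python
def split_header(header_line: str):
    """Split a FASTA header into (identifier, description).

    Assumes the identifier is the first whitespace-delimited token.
    """
    text = header_line.lstrip(">").strip()
    identifier, _, description = text.partition(" ")
    return identifier, description.strip()


def looks_like_dna(sequence: str) -> bool:
    """Rough heuristic: True if every letter is one of A, C, G, T or N."""
    return set(sequence.upper()) <= set("ACGTN")


accession, description = split_header(
    ">CAA57102.1 dehydrogenase enzyme [Escherichia coli]"
)
print(accession)                     # identifier token from the header
print(looks_like_dna("GAAGTAGAAG"))  # start of the hpcC DNA example
print(looks_like_dna("MKKVNHWING"))  # start of the protein example
```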

This article is from the free online course Bacterial Genomes I: From DNA to Protein Function Using Bioinformatics.