Genomes in chunks

Using multi-FASTA files.
A photo of two oranges with the one in front split into a half and two quarters.  The split orange's flesh is facing forwards.
In this Step we will learn how to represent large multi-piece genomes in one file.

As is the case for genes and proteins, whole genome sequences can also be stored in FASTA format. A fully sequenced genome consisting of only one chromosome (as is the case for many bacterial genomes) can be represented in a FASTA file that contains a single entry for the full genome sequence: a header line starting with “>”, followed by the sequence on the lines below it. More often than not, however, a genome is known only in chunks (that is to say, gaps of unknown size remain between the sequenced pieces) or it has more than one chromosome. In these cases, the genome sequence is stored in a multi-FASTA file.

Multi-FASTA files have one FASTA entry for each chunk (chromosome or scaffold) of DNA. An example (mock) of a multi-FASTA section is shown below.

>Futuris learnis bacterium - Chr1
>Futuris learnis bacterium - Chr2

(Please note that this is a dummy example, and real bacterial chromosomes are much larger than what is represented here!)
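
To make the structure of a multi-FASTA file more concrete, here is a minimal Python sketch that splits multi-FASTA text into its separate entries. It is only an illustration, not a reference implementation: the function name `parse_multi_fasta` is hypothetical, and the short sequences under the two mock headers are invented placeholders with no biological meaning.

```python
def parse_multi_fasta(text):
    """Return a list of (header, sequence) pairs from multi-FASTA text."""
    entries = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined, so a wrapped sequence becomes one string.
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


# Mock input: the headers come from the example above,
# the sequences are invented placeholders.
mock = """>Futuris learnis bacterium - Chr1
ATGCGTACGT
TTGACC
>Futuris learnis bacterium - Chr2
GGCATTA
"""

entries = parse_multi_fasta(mock)
for header, sequence in entries:
    print(header, len(sequence))
```

Each “>” line starts a new entry, and every line up to the next “>” belongs to that entry's sequence. For real projects, an established library such as Biopython is usually a safer choice than a hand-written parser.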

Multi-FASTA files are not limited to storing the genomic DNA of just one organism per file. Remember that we established earlier that genes and proteins can also be stored as multi-FASTA, so it is not uncommon for DNA or protein sequences from different organisms to be stored in one file. For instance, if we want to collect all the sequences of a virulence protein from different bacteria, we could collect them all in one multi-FASTA file.
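
As a rough sketch of that last idea, the Python snippet below gathers sequences from several organisms into a single multi-FASTA text. Everything in it is assumed for illustration: the function name `to_multi_fasta`, the organism names, the header layout and the short protein sequences are all made up, and the 60-character line width is simply a common convention rather than a requirement.

```python
def to_multi_fasta(records, width=60):
    """Format (header, sequence) pairs as multi-FASTA text."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        # Wrap long sequences over several lines of at most `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical records: one virulence protein from two made-up bacteria.
records = [
    ("virulence protein X | Hypothetical bacterium A", "MKTLLV"),
    ("virulence protein X | Hypothetical bacterium B", "MKSLIV"),
]

fasta_text = to_multi_fasta(records)
print(fasta_text)
```

The resulting text has one “>” header per organism, so a single file can hold the whole collection.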

