Genomes in chunks
In this Step we will learn how to represent large multi-piece genomes in one file.
As is the case for genes and proteins, whole genome sequences can also be stored in FASTA format. A fully sequenced genome consisting of only one chromosome (as is the case for many bacterial genomes) can be represented in a FASTA file that contains a single entry (a header line starting with “>”, followed by the sequence on the next lines) for the full genome sequence. More often than not, however, the genome is known only in chunks (that is to say, some gaps of unknown size are present) or the genome has more than one chromosome. In these cases, the genome sequence is stored in a multi-FASTA file.
Multi-FASTA files have one FASTA entry for each chunk (chromosome or scaffold) of DNA. An example (mock) of a multi-FASTA section is shown below.
>Futuris learnis bacterium - Chr1
TGGATTCGCACTCCTCCAGCTTATAGACCACCAAATGCCCCTATCCTATCAACACTTCCG
GAGACTACTGTTGTTAGACGACGAGGCAGGTCCCCTAGAAGAAGAACTCCCTCGCCTCGC
AGACGAAGGTCTCAATCGCCGCGTCGCAGAAGATCTCAATCTCGGGAATCTCAATGTTAG
TATTCCTTGGACTCATAAGGTGGGGAACTTTACTGGGCTTTATTCTTCTACTGTACCTGT
CTTTAATCCTCATTGGAAAACACCATCTTTTCCTAATATACATTTACACCAAGACATTAT
CAAAAAATGTGAACAGTTTGTAGGCCCACTCACAGTTAATGAGAAAAGAAGATTGCAATT
GATTATGCCTGCCAGGTTTTATCCAAAGGTTACCAAATATTTACCATTGGATAAGGGTAT
TAAACCTTATTATCCAGAACATCTAGTTAATCATTACTTCCAAACTAGACACTATTTACA
CACTCTATGGAAGGCGGGTATATTATATAAGAGAGAAACAACACATAGCGCCTCATTTTG
TGGGTCACCATA
>Futuris learnis bacterium - Chr2
TATGGTGACCCACAAAATGAGGCGCTATGTGTTGTTTCTCTCTTATATAATATACCCGCC
TTCCATAGAGTGTGTAAATAGTGTCTAGTTTGGAAGTAATGATTAACTAGATGTTCTGGA
TAATAAGGTTTAATACCCTTATCCAATGGTAAATATTTGGTAACCTTTGGATAAAACCTG
GCAGGCATAATCAATTGCAATCTTCTTTTCTCATTAACTGTGAGTGGGCCTACAAACTGT
TCACATTTTTTGATAATGTCTTGGTGTAAATGTATATTAGGAAAAGATGGTGTTTTCCAA
TGAGGATTAAAGACAGGTACAGTAGAAGAATAAAGCCCAGTAAAGTTCCCCACCTTATGA
GTCCAAGGAATACTAACATTGAGATTCCCGAGATTGAGATCTTCTGCGACGCGGCGATTG
AGACCTTCGTCTGCGAGGCGAGGGAGTTCTTCTTCTAGGGGACCTGCCTCGTCGTCTAAC
AACAGTAGTCTCCGGAAGTGTTGATAGGATAGGGGCATTTGGTGGTCTATAAGCTGGAGG
AGTGCGAATCCA
(Please note that this is a dummy example, and real bacterial chromosomes are much larger than what is shown here!)
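To make the structure concrete, here is a minimal Python sketch of how a multi-FASTA file like the mock example above could be read into memory, one sequence per entry. The function name `read_multi_fasta` and the shortened sequences are assumptions made for illustration; established libraries such as Biopython offer more robust parsers.

```python
from io import StringIO


def read_multi_fasta(handle):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new entry starts: remember its header and begin collecting lines.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines belong to the most recent header.
            records[header].append(line)
    # Join the wrapped sequence lines of each entry into one string.
    return {name: "".join(parts) for name, parts in records.items()}


# Shortened stand-in for the mock genome (illustrative data only).
example = StringIO(
    ">Futuris learnis bacterium - Chr1\n"
    "TGGATTCGCA\n"
    "CTCCTCCAGC\n"
    ">Futuris learnis bacterium - Chr2\n"
    "TATGGTGACC\n"
)

genome = read_multi_fasta(example)
for name, seq in genome.items():
    print(name, len(seq))
```

Note that this sketch assumes every header in the file is unique; if two entries shared the same header, the later one would replace the earlier one.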
Multi-FASTA files are not limited to storing genomic DNA from just one organism per file. Remember that we established earlier that genes and proteins can also be stored as multi-FASTA, so it is not uncommon for DNA or protein sequences from different organisms to be stored in one file. For instance, if we want to collect all sequences of a virulence protein from different bacteria, we could collect them all in one multi-FASTA file.
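As a rough illustration of collecting sequences from several organisms into one file, the sketch below builds multi-FASTA text from a list of (header, sequence) pairs. The function name `to_multi_fasta`, the organism labels, and the short protein sequences are all made up for this example, and the 60-character line width simply mirrors the mock genome shown earlier.

```python
def to_multi_fasta(records, width=60):
    """Format (header, sequence) pairs as multi-FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        # Wrap the sequence into fixed-width lines.
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical virulence protein sequences from different bacteria.
virulence_proteins = [
    ("virulence protein - Bacterium A", "MKTLLVAGAVSLLAG"),
    ("virulence protein - Bacterium B", "MKSLLIAGAVALLSG"),
]

print(to_multi_fasta(virulence_proteins), end="")
```

The resulting text could then be written to a single file, giving one FASTA entry per organism.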
© Wellcome Genome Campus Advanced Courses and Scientific Conferences