Definitions of File Formats Used in this Course


Throughout this course we will mention and use different file formats, which are the outputs of different stages of the genomics pipeline. You don't have to read or know about these formats in advance, but you can always return to this step for more information when they are mentioned in the course material.

Figure: Workflow graph of how the files are created as outputs of the different pipeline steps.

| File format extension | Definition / explanation |
| --- | --- |
| fasta | FASTA files are used to store nucleotide or amino acid sequences |
| fastq | FASTQ files are used to store nucleotide sequences along with a quality score for each nucleotide of the sequence |
| sam | A SAM (Sequence Alignment/Map) file is a human-readable text format in bioinformatics for storing alignment data |
| bam | A BAM (Binary Alignment/Map) file is the binary format for storing alignment data |
| vcf | A VCF (Variant Call Format) file enables the structured representation and sharing of genetic variation data obtained from various sequencing experiments |
| csv | CSV stands for Comma-Separated Values, a tabular format with a comma (,) between each value |

More details on these file formats will be given throughout the course. Here is some introductory information:

FASTA

FASTA files are used to store nucleotide or amino acid sequences.

The general structure of a FASTA file is illustrated below:

    >sample01                  <-- NAME OF THE SEQUENCE
    AGCGTGTACTGTGCATGTCGATG    <-- SEQUENCE ITSELF

Each sequence is represented by a name, which always starts with the character >, followed by the actual sequence.

A FASTA file can contain several sequences, for example:

    >sample01
    AGCGTGTACTGTGCATGTCGATG
    >sample02
    AGCGTGTACTGTGCATGTCGATG

A sequence can also span multiple lines. Separate sequences can always be identified by the > character that starts each name line. For example, this contains the same sequences as above:

    >sample01       <-- FIRST SEQUENCE STARTS HERE
    AGCGTGTACTGT
    GCATGTCGATG
    >sample02       <-- SECOND SEQUENCE STARTS HERE
    AGCGTGTACTGT
    GCATGTCGATG
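The rule described above (a name line starting with >, followed by one or more sequence lines) can be sketched in a few lines of Python. This is only a minimal illustration: the function name `parse_fasta` is our own, and established libraries handle many more edge cases.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its full sequence."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: everything after ">" is the name.
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            # Sequence lines are joined, so multi-line records work too.
            sequences[name] += line
    return sequences


example = """>sample01
AGCGTGTACTGT
GCATGTCGATG
>sample02
AGCGTGTACTGT
GCATGTCGATG
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```

Both records come out as the same 23-nucleotide sequence, whether the input is written on one line or split over several.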

FASTQ

FASTQ files are used to store nucleotide sequences along with a quality score for each nucleotide of the sequence. These files are the typical format obtained from next-generation sequencing (NGS) platforms such as Illumina and Nanopore (after basecalling).

The file format is as follows:

    @SEQ_ID                    <-- SEQUENCE NAME
    AGCGTGTACTGTGCATGTCGATG    <-- SEQUENCE
    +                          <-- SEPARATOR
    %%).1***-+*''))**55CCFF    <-- QUALITY SCORES

In FASTQ files each sequence is always represented across 4 lines. The quality scores are encoded in a compact form, using a single character per base. Each character represents a score that typically varies between 0 and 40 (see Illumina's Quality Score Encoding). Single characters are used to encode the quality scores because this saves space when storing these large files. Software that works on FASTQ files automatically converts these characters into their scores, so we don't have to worry about doing this conversion ourselves.
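To make the character encoding less mysterious, here is a small Python sketch of the conversion that software performs for us. It assumes the widely used "Phred+33" encoding, where the score is the character's ASCII code minus 33. That offset is an assumption for this example (older Illumina files used a different one), and `decode_qualities` is a name we made up.

```python
def decode_qualities(quality_string, offset=33):
    """Convert each quality character into its numeric score.

    Assumes Phred+33 encoding by default: score = ASCII code - 33.
    """
    return [ord(character) - offset for character in quality_string]


# Quality line from the FASTQ example above.
quality = "%%).1***-+*''))**55CCFF"
scores = decode_qualities(quality)

print(scores)
print("lowest:", min(scores), "highest:", max(scores))
```

Under this assumption the first bases of the example read have low scores and the last ones have scores in the thirties, all within the 0 to 40 range mentioned above.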

The quality value in common use is called a Phred score and it represents the probability that the respective base is an error. For example, a base with quality 20 has a 1% probability of being an error, and a base with quality 30 has a 0.1% probability of being an error. Typically, a Phred score threshold of >20 or >30 is used when applying quality filters to sequencing reads.
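The relationship between a Phred score Q and the error probability P is P = 10^(-Q/10). The short Python sketch below (the function names are our own) reproduces the figures quoted above.

```python
import math


def phred_to_error_probability(score):
    """Probability that a base with this Phred score is wrong."""
    return 10 ** (-score / 10)


def error_probability_to_phred(probability):
    """Inverse conversion: error probability back to a Phred score."""
    return -10 * math.log10(probability)


for score in (10, 20, 30, 40):
    probability = phred_to_error_probability(score)
    print(f"Q{score}: {probability:.4f} ({probability * 100:.2f}% chance of error)")
```

Q20 gives 0.01 (1%) and Q30 gives 0.001 (0.1%), so every 10 points on the Phred scale makes an error ten times less likely.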

Because FASTQ files tend to be quite large, they are often compressed to save space. The most common compression format is called gzip and uses the extension .gz. To look at a gzip file, we can use the command zcat, which decompresses the file and prints the output as text.
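Compressed FASTQ files can also be read directly from a script, without decompressing them first. The sketch below uses Python's built-in `gzip` module; the function `read_fastq` and the file name `example.fq.gz` are made up for this illustration, and the reader assumes the strict 4-lines-per-read layout described above.

```python
import gzip
import os
import tempfile


def read_fastq(path):
    """Yield (name, sequence, quality) for each 4-line FASTQ record."""
    # Use gzip for .gz files and the normal open() otherwise.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a blank line) stops the reader
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # separator line starting with "+"
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading "@"


# Write a tiny compressed file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fq.gz")
with gzip.open(path, "wt") as handle:
    handle.write("@SEQ_ID\nAGCGTGTACTGTGCATGTCGATG\n+\n%%).1***-+*''))**55CCFF\n")

reads = list(read_fastq(path))
for name, sequence, quality in reads:
    print(name, sequence, quality)
```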

SAM

A SAM (Sequence Alignment/Map) file is a plain text file format used in bioinformatics to record data from DNA or RNA sequence alignment, specifically the results of aligning sequencing reads to a reference genome or transcriptome. SAM files are a human-readable text representation of alignment data that are frequently used as a bridge between alignment tools and downstream analysis tools.
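Because SAM is plain text, an alignment line can be split apart with ordinary string handling. Each alignment line has 11 mandatory tab-separated fields, optionally followed by extra tags, while header lines start with @. The sketch below is only an illustration: the read shown is invented, `parse_sam_line` is our own name, and real analyses normally rely on dedicated tools.

```python
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])  # these fields are integers
    record["TAGS"] = columns[11:]  # any optional fields after column 11
    return record


# Invented example: a 23-base read aligned at position 100 of "chr1".
line = "read1\t0\tchr1\t100\t60\t23M\t*\t0\t0\tAGCGTGTACTGTGCATGTCGATG\t%%).1***-+*''))**55CCFF"
record = parse_sam_line(line)
print(record["QNAME"], record["RNAME"], record["POS"], record["CIGAR"])
```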

BAM

SAM files can be readily converted to BAM (Binary Alignment/Map) format, which is a binary version of the same alignment data that takes up less space.

VCF

A VCF (Variant Call Format) file is a standard file format used in bioinformatics to record information about genetic variants, specifically differences in DNA sequences. VCF files are commonly used in the study of DNA sequencing data, such as those generated by whole-genome sequencing, whole-exome sequencing, or targeted sequencing investigations. These files provide a standardised means to represent and distribute information about genetic variations, making them essential for tasks such as variant calling, annotation, and downstream genetic analysis.
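VCF files are also tab-separated text. Lines starting with # are headers, and each remaining line describes one variant using eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by per-sample columns. The sketch below shows how one such line could be read in Python; the variant is invented and `parse_vcf_line` is a hypothetical helper, not part of any library.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict of its eight fixed columns."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, columns[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are comma-separated
    info = {}
    if record["INFO"] != ".":  # "." means no INFO values
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True  # flags have no "=value"
    record["INFO"] = info
    return record


# Invented example: an A-to-G change at position 12345 of "chr1".
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;SOMATIC"
variant = parse_vcf_line(line)
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"], variant["INFO"])
```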

CSV

CSV stands for Comma-Separated Values and, as the name suggests, is a tabular format with a comma (,) between each value:

    sample,fastq_1,fastq_2
    sample1,fastq_1.fq.gz,fastq_2.fq.gz
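A table like this one is easy to read from a script. As a minimal sketch, Python's built-in `csv` module can turn each row into a dictionary keyed by the column names in the first line. Here the table is held in a string (the variable name `sample_sheet` is our own) so that the example runs without any files.

```python
import csv
import io

# The same two lines as above: a header row followed by one sample.
sample_sheet = "sample,fastq_1,fastq_2\nsample1,fastq_1.fq.gz,fastq_2.fq.gz\n"

# DictReader uses the first line as column names for every later row.
rows = list(csv.DictReader(io.StringIO(sample_sheet)))

for row in rows:
    print(row["sample"], row["fastq_1"], row["fastq_2"])
```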

© Wellcome Connecting Science
This article is from the free online course Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets
