This content is taken from the Wellcome Genome Campus Advanced Courses and Scientific Conferences's online course, Bacterial Genomes: From DNA to Protein Function Using Bioinformatics. Join the course to learn more.

BLAST, a tool for homology annotation

In this article, you will learn about BLAST, a tool used to find similar sequences in biological databases.

BLAST (Basic Local Alignment Search Tool) is one of the most commonly used tools for finding sequences that are similar to a sequence of interest, known as the query. Most biological databases have a BLAST server to search through their datasets. BLAST is fast: it can compare a query sequence with anything from hundreds to millions of database sequences in a short time.

BLAST works in three steps. First, it ‘chops’ the query sequence into short overlapping ‘words’ of typically 3-4 amino acids for proteins or 10-12 nucleotides for DNA sequences (Fig 1A). Second, it uses these short words to look for perfect matches across all the entries in the database (Fig 1B). Third, when a match is found, it tries to extend the alignment outwards from the word by comparing the letters on either side of it in the query and in the database entry. For each new pair of letters, it evaluates whether the pair is a good match (Fig 1C). A good match increases the score and a bad match reduces it. The score for each possible pair of amino acids or nucleotides comes from a precomputed table (a substitution matrix, such as BLOSUM62 for proteins) that is built into BLAST.

Figure 1 - The three steps of a BLAST alignment (simplified). A) chop query into short words; B) find exact matches in the database; C) extend matches and assign a score.
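
To make the first two steps concrete, here is a minimal Python sketch of chopping a query into words and looking up exact matches. It illustrates the idea only and is not how BLAST itself is implemented: the function names and the two-entry toy database are hypothetical, and real protein searches also accept similar (not only identical) words.

```python
from collections import defaultdict


def chop_into_words(sequence, word_size):
    """Step A: return (offset, word) for every overlapping word in the sequence."""
    return [(i, sequence[i:i + word_size])
            for i in range(len(sequence) - word_size + 1)]


def index_database(database, word_size):
    """Map each word to the (entry name, offset) positions where it occurs."""
    index = defaultdict(list)
    for name, subject in database.items():
        for offset, word in chop_into_words(subject, word_size):
            index[word].append((name, offset))
    return index


def find_seeds(query, index, word_size):
    """Step B: list exact word matches as (query offset, entry name, subject offset)."""
    seeds = []
    for q_offset, word in chop_into_words(query, word_size):
        for name, s_offset in index.get(word, []):
            seeds.append((q_offset, name, s_offset))
    return seeds


# Hypothetical toy protein database, invented for illustration only.
database = {
    "entry_1": "MKVLAAGIVG",
    "entry_2": "TTPLAAGWQR",
}
index = index_database(database, word_size=3)
seeds = find_seeds("QLAAGI", index, word_size=3)
for seed in seeds:
    print(seed)
```

Indexing the database once means that each query word is found with a single dictionary lookup rather than a scan of every entry, which hints at why a word-based search can be fast.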

The extension step continues until the score falls a set amount below the best score reached so far. At this point, the extension stops and the best-scoring alignment is recorded with its score. The results are then presented as a list of alignments with associated scores. The alignments with the highest scores are the most likely to be true matches, or homologues, of the query sequence. Other result parameters are also reported, such as the E-value (expectation value) and the percentage identity. The E-value is the number of hits with an equal or better score that would be expected by chance, given the length of the query sequence and the size of the database. The lower the E-value, the less likely it is that the hit arose by chance.
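
The extension step and the E-value can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it extends in one direction only, uses plain match and mismatch scores instead of a full score table, and stops once the score has fallen more than `max_drop` below the best score seen (one common way to apply a cutoff). The E-value function uses the standard formula E = K x m x n x exp(-lambda x S), where m is the query length, n the database length and S the score; the values passed for `k` and `lam` below are arbitrary placeholders, not the constants BLAST uses.

```python
import math


def extend_right(query, subject, q_end, s_end, seed_score,
                 match=1, mismatch=-1, max_drop=2):
    """Extend an alignment to the right of a seed, one letter pair at a time.

    q_end and s_end are the offsets just past the seed in each sequence.
    Returns (best score, number of letters added to reach that score).
    """
    score = best_score = seed_score
    best_length = 0
    step = 0
    while q_end + step < len(query) and s_end + step < len(subject):
        same = query[q_end + step] == subject[s_end + step]
        score += match if same else mismatch
        step += 1
        if score > best_score:
            # New best: remember how far the alignment reaches.
            best_score, best_length = score, step
        elif best_score - score > max_drop:
            # Score has dropped too far below the best: stop extending.
            break
    return best_score, best_length


def expected_hits(score, query_length, database_length, k, lam):
    """E-value: hits with at least this score expected by chance."""
    return k * query_length * database_length * math.exp(-lam * score)


# A 3-letter seed "ACG" matches at offset 0 in both toy sequences.
print(extend_right("ACGTACGTTT", "ACGTACGAAA", 3, 3, seed_score=3))  # (7, 4)

# Placeholder constants, chosen only to show how E responds to its inputs.
small_db = expected_hits(50, 100, 1_000, k=0.1, lam=0.3)
large_db = expected_hits(50, 100, 2_000, k=0.1, lam=0.3)
print(large_db / small_db)
```

Doubling the database size doubles the E-value for the same score, which is why the same alignment can look more or less convincing depending on the database searched.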

There are different flavours of the BLAST programme, depending on whether the query is a nucleotide or a protein sequence and on the nature (nucleotide or protein) of the database being searched. If the query is a nucleotide sequence and we are searching for matches in a nucleotide database, the programme to use is BLASTn. Similarly, if the query is a protein sequence and we are looking for matches in a protein database, the programme to use is BLASTp.

But what should you do when the query and the database are of different types? For example, what if you have a protein query and want to find the best matches in a nucleotide database? In order to make an alignment, query and subject need to be of the same nature. Helpfully, BLAST can translate all the entries in the nucleotide database into protein sequences - each sequence is translated in all 6 possible reading frames (3 on each strand). The resulting “translated database” is then used as the subject for the search. This flavour of BLAST is called tBLASTn. In the reverse scenario, when a nucleotide sequence is the query and you want to search a protein database, the query is translated in the 6 possible frames and each translation is then aligned to the entries in the protein database. This is called BLASTx.
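
Translation in the 6 possible frames can be illustrated with a short Python sketch. It assumes the standard genetic code and unambiguous A, C, G and T bases; the function names and frame labels (`+1` to `-3`) are illustrative choices, not BLAST's own.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; a trailing partial codon is ignored.

    Stop codons become '*' and codons with unexpected letters become 'X'.
    """
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate both strands starting at offsets 0, 1 and 2."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Only one of the six translations usually corresponds to a real protein, but because the correct frame and strand are not known in advance, all six are compared against the protein sequences.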

In this article you learnt the basics of how BLAST works. In the next Step, we will demonstrate how to use BLAST and how to interpret its results. Finally, you will have the chance to put your newly acquired knowledge to the test by performing some guided BLAST searches of your own.
