This content is taken from the online course Bacterial Genomes: From DNA to Protein Function Using Bioinformatics, by Wellcome Genome Campus Advanced Courses and Scientific Conferences.
Cups and mugs can be quite different but they all serve a similar purpose

Similar is not the same but it helps!

In this article you will be introduced to the concept of homology annotation, a way to infer the function of a sequence from its similarity to sequences whose function is already known.

Imagine that you have never seen a coffee mug before. However, you have drunk water from a tumbler glass and a paper cup, and have even used your hands as a vessel. When you see an “unknown item” (i.e. a coffee mug), despite its unfamiliar form, with a handle sticking out of one side, you can infer from its shape that it might (or at least could) be used for the same purpose of holding or drinking a liquid. Very intuitive, right?

Assigning the same function to things that look similar is innate to human nature. The same process can be used for protein sequences.

Now imagine that we have a large collection of protein sequences (e.g. 1000), but we know the function of only 20 of them. We can use a similar approach to that used with the coffee mug to infer the functions of the rest: sequences that look similar have a good chance of doing the same thing.

This is called homology annotation. The principle that enables scientists to use similarity to infer function is that a given sequence, or slight variations of it, is conserved throughout evolution. In general terms, the more similar two sequences are, the more likely they are to be related. Consequently, homology annotation is based on the comparison of DNA or proteins at the sequence level, that is, on comparing the nucleotide or amino acid sequences of potentially related genes or proteins.
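The idea of transferring a function from the most similar known sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequences and function names are made-up toy data, the sequences are assumed to be already aligned to the same length, and the 40% identity threshold is an arbitrary value chosen for the example, not a recommended cut-off. Real annotation tools use proper alignment and statistical scoring rather than a simple count of identical positions.

```python
from typing import Dict, Optional, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions that carry the same residue.

    Assumes the two sequences are already aligned to the same length;
    gap characters ("-") never count as a match.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def annotate_by_similarity(
    query: str,
    known: Dict[str, str],
    threshold: float = 40.0,  # arbitrary cut-off, for illustration only
) -> Tuple[Optional[str], float]:
    """Return (inferred function, best identity) for the query sequence.

    `known` maps a sequence to its function. The inferred function is
    prefixed with "putative" because it is a prediction, not a result
    confirmed by experiment. Returns (None, score) below the threshold.
    """
    best_function: Optional[str] = None
    best_score = 0.0
    for sequence, function in known.items():
        score = percent_identity(query, sequence)
        if score > best_score:
            best_function, best_score = function, score
    if best_function is None or best_score < threshold:
        return None, best_score
    return f"putative {best_function}", best_score


# Hypothetical toy data: two "known" sequences and their functions.
KNOWN = {
    "MKTAYIAKQR": "toy kinase",
    "GSHMLEDPVA": "toy transporter",
}

print(annotate_by_similarity("MKTAYLAKQR", KNOWN))  # ('putative toy kinase', 90.0)
print(annotate_by_similarity("WWWWWWWWWW", KNOWN))  # (None, 0.0)
```

The first query differs from the toy kinase at one position out of ten, so it inherits that function with the “putative” label. The second query resembles neither known sequence, so no function is assigned.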

Protein sequences that confer function are often found in blocks of conservation called protein domains. These regions have a defined three-dimensional structure or motif (shape) that can function and evolve independently from the rest of the protein sequence. They are found in proteins throughout nature, and any given protein sequence can have more than one protein domain. Using motif similarity to infer function relies on the principle that when two proteins share a conserved function, their overall similarity at the amino acid level may be lost, but the conservation of their protein domains remains.
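As a rough illustration of searching for a conserved block rather than comparing whole sequences, the sketch below scans toy sequences for a short pattern using a regular expression. The pattern used is the Walker A (P-loop) nucleotide-binding motif, written in PROSITE style as [AG]-x(4)-G-K-[ST]; the three sequences are invented for this example, and the function name `find_motif` is hypothetical. A short pattern like this is far simpler than a full protein domain, so treat this as a sketch of the principle only.

```python
import re
from typing import List, Tuple

# Walker A (P-loop) motif, PROSITE-style [AG]-x(4)-G-K-[ST],
# rewritten as a regular expression: "x(4)" becomes ".{4}".
WALKER_A = re.compile(r"[AG].{4}GK[ST]")


def find_motif(sequence: str, pattern: "re.Pattern[str]") -> List[Tuple[int, str]]:
    """Return (0-based start position, matched residues) for each hit."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Hypothetical toy sequences: the first two differ overall but both
# contain the motif; the third does not contain it.
toy_sequences = {
    "protein_1": "MTEYKLVVVGAGGVGKSALTIQ",
    "protein_2": "PLSNDAHHHHGKTFEE",
    "protein_3": "MKLLNPEDRWQ",
}

for name, seq in toy_sequences.items():
    hits = find_motif(seq, WALKER_A)
    label = "putative nucleotide-binding" if hits else "no motif found"
    print(name, hits, label)
```

Here `protein_1` and `protein_2` share very little overall sequence, yet both carry the motif and would receive the same tentative label, while `protein_3` receives none.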

However, there are exceptions to every rule: two sequences or motifs that are similar to each other can have different roles in different organisms, or even in different compartments of the same cell. It is therefore important to remember that an inferred function is only a prediction. This is why protein names or functional descriptions are commonly accompanied by the words “putative” or “potential”. To be certain of the function of a protein, it must be confirmed by experiment.

For a more in-depth view of how these secondary databases can aid the annotation of full genomes, we recommend this review article entitled “Protein function annotation by homology-based inference” by Loewenstein et al.

In the next sections you will learn about tools and databases that use similarity between sequences and motifs to help researchers assign function to proteins.
