Similar is not the same but it helps! (A praise for homology annotation)
[Image: a variety of cups and mugs with one larger blue jug]
© Wellcome Genome Campus Advanced Courses and Scientific Conferences

In this article you will be introduced to the concept of homology annotation, a way to infer the function of a sequence from its similarity to sequences whose function is already known.

Imagine that you have never seen a coffee mug before. However, you have drunk water from a tumbler glass and a paper cup, and have even used your hands as a vessel. When you see this “unknown item” (i.e. the coffee mug), with its unfamiliar form and a handle sticking out of one side, its shape still lets you infer that it might (or at least could) be used for the same purpose of holding or drinking a liquid. Very intuitive, right?

Assigning the same function to things that look similar is innate to human nature. The same process can be used for protein sequences.

Now imagine that we have a large collection of protein sequences (e.g. 1000), and that we know the function of only 20 of them. We can use a similar approach to that used with the coffee mug to infer the functions of the rest: sequences that look similar have a good chance of doing the same thing.

This is called homology annotation. The principle that enables scientists to use similarity to infer function is the conservation of a given sequence, or slight variations of it, throughout evolution. In general terms, the more similar two sequences are, the more likely they are to be related. Consequently, homology annotation is based on the comparison of DNA or proteins at the sequence level, that is, on comparing the nucleotide or amino acid sequences of potentially related genes or proteins.
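The idea of transferring a function from the most similar known sequence can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular annotation tool: it assumes the sequences are already aligned and of equal length, the function names `percent_identity` and `annotate` are hypothetical, the example sequences and their functions are invented, and the 40% identity threshold is an arbitrary choice made for the example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned, non-gap positions that carry the same residue.

    Assumes `a` and `b` are already aligned to the same length,
    with '-' marking a gap.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # ignore columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def annotate(query: str, known: dict, threshold: float = 40.0) -> str:
    """Transfer the function of the most similar known sequence to `query`.

    `known` maps aligned sequences to functional descriptions.
    The threshold is an illustrative assumption, not a recommended value.
    """
    best_function = None
    best_score = 0.0
    for sequence, function in known.items():
        score = percent_identity(query, sequence)
        if score > best_score:
            best_function, best_score = function, score
    if best_function is None or best_score < threshold:
        return "hypothetical protein"
    # The function is inferred, not measured, so it is labelled as putative.
    return f"putative {best_function}"


# Invented toy data: two "known" proteins and their functions.
known_proteins = {
    "MKTAYIAKQR": "kinase",
    "MSLLPQWERT": "transporter",
}

print(annotate("MKTAYLAKQR", known_proteins))
print(annotate("GGGGGGGGGG", known_proteins))
```

The first query differs from the known "kinase" sequence at a single position, so it inherits that description with a "putative" prefix; the second query resembles neither known sequence and is left without an inferred function.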

Protein sequences that confer function are often found in blocks of conservation called protein domains. These regions have a defined three-dimensional structure or motif (shape) that can function and evolve independently from the rest of the protein sequence. These blocks of conservation are found in proteins throughout nature, and any given protein sequence can have more than one protein domain. Using motif similarity to infer function relies on the principle that when two proteins have a conserved function, their protein domains are expected to remain conserved even if their overall similarity at the amino acid level has been lost.
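To illustrate how a conserved block can be recognised in sequences that otherwise differ, here is a small Python sketch that searches for a short motif with a regular expression. It uses the PROSITE-style pattern [AG]-x(4)-G-K-[ST] (the P-loop, or Walker A, nucleotide-binding motif) purely as an example. The protein names and sequences are invented, the helper `find_motif` is hypothetical, and a single regular expression is a far cruder stand-in for the statistical models that domain databases generally rely on.

```python
import re

# PROSITE-style pattern [AG]-x(4)-G-K-[ST] written as a regular expression:
# A or G, then any four residues, then G, K, and finally S or T.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(sequence: str, pattern: re.Pattern = P_LOOP) -> list:
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Invented toy sequences: protA and protB differ overall but share the motif.
proteins = {
    "protA": "MSTLRGPSGSGKSTLLNA",
    "protB": "MKVLAAGIVGLPNVGKSTLF",
    "protC": "MKTAYIAKQRQISFVKSH",
}

for name, sequence in proteins.items():
    hits = find_motif(sequence)
    label = "motif found" if hits else "no motif"
    print(name, label, hits)
```

In this toy data, protA and protB have little else in common, yet both contain the motif, while protC does not. As with any similarity-based inference, a match only suggests a function and does not confirm it.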

However, there are exceptions to every rule: two sequences or motifs that are similar to each other can have different roles in different organisms, or even in different compartments of the same cell. It is therefore important to remember that a function inferred from similarity is only a prediction. This is why protein names or functional descriptions are commonly accompanied by the words “putative” or “potential”. To be certain of the function of a protein, it must be confirmed by experiment.

For a more in-depth view of how secondary databases can aid the annotation of full genomes, we recommend the review article “Protein function annotation by homology-based inference” by Loewenstein et al.

In the next sections you will learn about tools and databases that use similarity between sequences and motifs to help researchers assign function to proteins.

This article is from the free online course Bacterial Genomes I: From DNA to Protein Function Using Bioinformatics.
