Other protein databases: Pfam, InterPro and Phobius
In this Step you will learn about other protein databases that are useful in our search for protein function.
Two of the most popular secondary databases recognise conserved protein domains within a protein sequence. These databases are Pfam and InterPro, and both are hosted by EMBL-EBI. Pfam is a manually curated database, which means that a human researcher builds the different “families” into which proteins with the same conserved domains are classified. In short, Pfam is one large database of protein families, each of which groups proteins that share conserved domains.
InterPro, on the other hand, is a much larger collection of many databases. It takes a large number (about eleven!) of protein domain recognition algorithms and centralises them into one tool. The individual algorithms are highly specialised and diverse in their predictions, and it would take a long time to go through all of them one at a time. InterPro saves us time and much ‘clicking’ and ‘copying/pasting’ by providing a single portal to query all of these valuable databases. Although Pfam and InterPro arose independently, they are now interlinked: Pfam is one of the databases used by InterPro when searching for conserved domains, and Pfam entries have a tab containing the InterPro description for the same conserved domain, as shown in this example.
Other secondary databases search for signatures associated with protein sub-cellular localisation or protein sorting. These signatures define whether a protein is retained in the cytoplasm, positioned in the cellular membrane, or secreted to the extracellular space. Not surprisingly, these protein signatures are conserved and can therefore be analysed in a similar way to conserved domains. For example, by comparing the sequences of various secreted proteins, it is possible to find out which part of the sequence determines a secretory fate for a particular protein.
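To make the idea of "comparing sequences to find the conserved part" concrete, here is a minimal sketch in Python. It assumes the sequences have already been aligned to the same length, and it simply reports, for each column, the most common residue and the fraction of sequences that carry it. The function names, the 0.75 cut-off and the four toy sequences are invented for illustration only; they are not real proteins, and this is not the method used by any of the databases described in this Step.

```python
from collections import Counter


def column_conservation(aligned):
    """Return (most common residue, fraction) for every alignment column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    # zip(*aligned) walks through the alignment one column at a time.
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(aligned)))
    return result


def conserved_positions(aligned, min_fraction=0.75):
    """Return 1-based positions where one residue reaches min_fraction."""
    return [
        position
        for position, (_, fraction) in enumerate(column_conservation(aligned), start=1)
        if fraction >= min_fraction
    ]


# Hypothetical, made-up N-terminal fragments (NOT real proteins).
toy_alignment = ["MKTLLAS", "MRSLLAD", "MKQLIAE", "MKNLLGK"]
print(conserved_positions(toy_alignment))
```

With the toy alignment above, positions 1, 2, 4, 5 and 6 are reported as conserved, whereas positions 3 and 7 vary in every sequence. Real tools use far richer statistical models, but the underlying principle of looking for what stays the same across related proteins is similar.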
Two protein signatures are particularly relevant: signal peptides and transmembrane domains. Signal peptides are short (~20 amino acids) sequences located at the N-terminus of proteins, and they act as tags that direct the localisation of newly synthesised proteins. In bacteria, signal peptides direct the protein across the plasma membrane into the periplasm or to the extracellular space.
Proteins that are embedded in the cellular membrane have critical roles, given that they reside at the interface between the bacterium and the environment (or host). Some of them are transport channels, whereas others are signal receptors. They are commonly called transmembrane (TM) proteins, and this characteristic is conferred by a specific amino acid sequence that forms the transmembrane section of the protein.
Again, algorithms can be applied to detect signal peptides and TM signatures. A great tool for the prediction of these features is Phobius. It predicts both signal peptides and TM domains simultaneously and has a helpful graphical output.
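To give a feel for how an algorithm can spot a TM signature, the sketch below uses a much simpler technique than Phobius: a sliding-window hydropathy average based on the Kyte-Doolittle scale. Membrane-spanning segments are rich in hydrophobic residues, so a window of 19 residues with an average score of about 1.6 or more is a commonly used rule of thumb for a candidate TM segment. This is not the Phobius method, which relies on a more sophisticated statistical model; the function names and the toy sequence are our own assumptions for illustration.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    # Unknown symbols (e.g. X) are scored as 0.0; this is a simplification.
    scores = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_segments(sequence, window=19, threshold=1.6):
    """Return 1-based (start, end) spans covered by windows above threshold."""
    profile = hydropathy_profile(sequence, window)
    segments = []
    start = None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i  # first window of a hydrophobic stretch
        elif value < threshold and start is not None:
            # Last qualifying window began at i - 1 and ends window - 1 later.
            segments.append((start + 1, i - 1 + window))
            start = None
    if start is not None:  # stretch runs to the end of the sequence
        segments.append((start + 1, len(profile) - 1 + window))
    return segments


# Hypothetical toy sequence: a 20-leucine core flanked by charged residues.
toy_protein = "D" * 10 + "L" * 20 + "K" * 10
print(candidate_tm_segments(toy_protein))
```

Note that the reported span is the union of all windows that pass the threshold, so it tends to overshoot the true hydrophobic core by several residues on each side. A sequence shorter than the window yields no prediction at all. Dedicated predictors such as Phobius handle these cases, and the distinction between signal peptides and TM helices, far better than this sketch.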
If you are interested in learning more about the methods behind Phobius, we encourage you to read this PubMed publication: “Advantages of combined transmembrane topology and signal peptide prediction–the Phobius web server” by Käll L, Krogh A, and Sonnhammer EL.
The databases we have described above are just a small sample of the hundreds of secondary databases dedicated to the prediction of conserved domains in proteins. We chose to show you Pfam and InterPro because they are good examples of large-scale analysis applied to pre-existing datasets, with the objective of retrieving information to aid the investigation of proteins of unknown function. We also chose Phobius because of its broad applicability to the prediction of protein localisation, which, when combined with protein function, can be highly informative. In addition, these three examples are popular with researchers and are well-maintained tools.
In the coming Steps, we will demonstrate how to use these and other databases.
© Wellcome Genome Campus Advanced Courses and Scientific Conferences