Learn how to use public databases to collect information about a protein sequence

4.8
Hello, I’m Martin Aslett, and I work for the Wellcome Genome Campus Advanced Courses [and Scientific Conferences] Team, based at the Wellcome Genome Campus. In this activity, we will use public databases and a protein amino acid sequence to find out more about potential protein function. Let’s use this Salmonella enterica entry as an example. The short description line in the header of the FASTA-formatted sequence suggests that this sequence has no known function. I have this sequence saved in a Notepad file, which is plain text format on a PC. This is best for cutting and pasting into web pages.
41.2
Let’s now use this sequence in the different secondary databases that we’ve already seen, to see whether we can learn more about its function. I’m now going to use InterProScan to look for conserved domains in my sequence of interest. I open my internet browser and can either search for InterProScan or type in the URL that is now appearing on the screen.
76.1
This is the InterPro page. In this search box, I can paste my sequence and then search for protein domains and other features of interest.
90.1
I go to the file I have opened in Notepad, copy the sequence and then paste it into the search box. Note that I included the header line starting with the greater-than symbol (>). Most websites ignore this line, but you might find with some searches that you need to remove it. InterProScan searches a large set of databases. I will click Submit to start the search and then come back to the results later. While this is running, I’ll explore other databases so I can later compare the results from all of them. The first database I will look at is Pfam. This is a conserved protein domain database.
136.5
I’ll open a new tab and then I can either search for Pfam or, again, type in the address that is appearing on the screen now.
157.6
This is the Pfam page. As you can see, there are many options. I will go to the Sequence Search. Again, I will paste my protein sequence into the search box and hit Go to search. This may take a while, depending on your internet connection and how many jobs are in the queue. On average, this search should take around 15 to 30 seconds. The results page shows one hit. This is the LrgB domain, represented by this green box, which extends over almost the entire protein sequence. The grey bar behind it represents the whole of our protein sequence. You’ll note the conserved domain is slightly shorter than the entire protein.
201.8
An important parameter to look at is the expectation value, or E-value. This is a measure of how good the match is between the conserved domain and my protein sequence. We’re not discussing E-values in any detail at the moment. Suffice it to say that the lower the E-value is, the better the match. Now I want to find out more about this conserved domain, as it might give me some more information about my sequence of interest. I’ll click on the name of the domain. On this page, we find the Pfam description. This includes a detailed description of the domain, a link to InterPro and sometimes literature references. These are commonly the papers that the original curator used to curate this domain.
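The rule "the lower the E-value, the better the match" can be sketched in a few lines of Python. This is only an illustration: the hit names, the E-values and the 1e-3 cut-off are hypothetical and are not the results shown in the video.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    name: str
    evalue: float


def best_hit(hits, max_evalue=1e-3):
    """Return the hit with the lowest E-value at or below max_evalue, or None."""
    significant = [h for h in hits if h.evalue <= max_evalue]
    return min(significant, key=lambda h: h.evalue) if significant else None


# Hypothetical hits, for illustration only; these are not real search results.
hits = [
    DomainHit("domain_A", 2.5e-40),
    DomainHit("domain_B", 0.8),
    DomainHit("domain_C", 1.0e-5),
]

top = best_hit(hits)
print(top.name if top else "no significant hit")
```

The cut-off is a judgment call: a stricter value returns fewer but more reliable hits.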
249.9
The description of this domain is interesting. The thing to highlight is that it is involved in both murein hydrolase activity and penicillin tolerance. This is interesting because in certain bacteria, it may be involved in penicillin resistance. One other thing to note is that, according to this description, proteins with this domain are potential membrane proteins. This means that they are likely to have contact with either the extracellular space or the host itself. Now I will use another secondary database to find out whether my sequence of interest has transmembrane domains. This will indicate whether it is likely to be a membrane protein. I open another browser tab and can either search for Phobius or type the address that appears on screen.
308.9
This is the Phobius page. In the submission section I can either paste my sequence or choose a file from my computer. This is useful if the sequence is long or if you’re using multiple sequences. You’ll notice there are three output formats: short, long without graphics, or long with graphics. I’ll choose the default option of long with graphics and click Submit. The results appear very quickly. Here we find the name of the submitted sequence, whether it has transmembrane domains, and whether it has a signal peptide. We also find the coordinates of the signal peptide and transmembrane domains. In the graphical display, you’ll notice that there are grey bars. These represent potential transmembrane domains.
359.6
Some of these are very low, indicating a low probability of them being transmembrane domains. These won’t be counted in our search. The red line indicates the probability of each residue being part of a signal peptide. This shows that the first 23 amino acids are part of a signal peptide. After this, the probability quickly drops to zero. The green and blue lines respectively represent the probability of these regions being cytoplasmic or non-cytoplasmic. We’ll now go back to the InterPro query and compare the results. The InterPro search has now finished. Similar to the Pfam output, the conserved domains are represented with bars that span the length of the conserved domain with respect to the full length of the protein sequence.
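To get a feel for why a stretch of sequence looks like a membrane-spanning segment, a simple sliding-window hydropathy scan can help. The sketch below is not how Phobius works (Phobius uses a hidden Markov model). It is a much cruder stand-in based on the Kyte-Doolittle hydropathy scale, with the commonly used window of 19 residues and a threshold of 1.6. The function names are made up for this example.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Mean Kyte-Doolittle hydropathy for every window of the given length."""
    return [
        sum(KD[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def hydrophobic_windows(seq, window=19, threshold=1.6):
    """1-based start positions of windows whose mean hydropathy exceeds threshold."""
    scores = window_hydropathy(seq, window)
    return [i + 1 for i, score in enumerate(scores) if score > threshold]


# First 70 residues of NP_456741.1, taken from the FASTA entry used in this step.
SEQ = "MMTYIWWSLPLTLAVFFAARRLAAHFKMPLLNPLLVAMVVIIPFLLLTGIPYEHYFKGSEVLNDLLQPAV"

scores = window_hydropathy(SEQ)
print(f"{len(scores)} windows, highest mean hydropathy {max(scores):.2f}")
print("hydrophobic window starts:", hydrophobic_windows(SEQ))
```

A scan like this cannot tell a signal peptide from a transmembrane helix, because both are hydrophobic. That is one reason to use a dedicated predictor such as Phobius.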
405.9
Not surprisingly, the InterProScan results include Pfam matches, such as the first one here, but also matches to other databases such as PANTHER and TIGRFAMs. You’re encouraged to investigate these databases on your own. In the unintegrated signatures panel, we find some results that back up our previous searches. We find that InterProScan uses TMhelix, but also integrates Phobius results. In addition, there are a number of integrated algorithms which predict a signal peptide at the N-terminus. Finally, we can go to UniProt and search for this protein based on its accession number. Again, I will open a new tab in my browser. You can either search for UniProt or, as I’m doing, type in the URL.
467.9
UniProt is the central reference database for protein sequences and their functional information. I will type in the accession number for our protein, NP_456741, and hit Search. Our search shows us the UniProt accession code. If I click on this, we come to the full page for our protein. As you can see, the results of our searches are backed up in this entry, such as the transmembrane domains and family domains like Pfam. In summary, this activity shows that we can use publicly available secondary databases to find clues about the potential function of an amino acid sequence.
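The same lookup can be prepared in a script rather than typed into the search box. The sketch below only builds a search URL and does not contact the server. It assumes that UniProt's REST search endpoint at `rest.uniprot.org/uniprotkb/search` accepts `query`, `format` and `size` parameters, so check the current UniProt API documentation before relying on it. The helper name `uniprot_search_url` is made up for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint for UniProtKB searches; verify against the UniProt API docs.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_search_url(term, size=5):
    """Build a free-text UniProtKB search URL for the given term."""
    params = {"query": term, "format": "json", "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


# The accession typed into the search box in the video.
url = uniprot_search_url("NP_456741")
print(url)
```

The URL can be opened in a browser or fetched with any HTTP client to see which UniProt entries match the accession.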
511.2
We’ve used four different resources: first InterPro, then Pfam, Phobius and UniProt, to run searches and find clues as to what the function of our protein of previously unknown function might be. The databases we’ve used are merely a small selection of those available online. We encourage you to do your own searches to find other databases which may be more relevant to the proteins whose function you are investigating. We hope you found this video enjoyable. Please add ideas, suggestions, or questions in the comments section. We look forward to hearing back from you.

In this video, Martin Aslett demonstrates how to use public databases to collect information about a protein sequence. These pieces of information will assist us in inferring the potential function of a protein.

In this video, you will learn how evidence of a protein’s likely function can be accumulated using online searches of resources such as InterPro, Pfam and Phobius.

Note that the InterPro web interface that appears in the video is not the same as the one currently accessible online. This is not a problem, since the same query boxes and output information are displayed.

For this example we will use a Salmonella enterica protein. You can find the protein sequence below.

>NP_456741.1 hypothetical protein STY2412
MMTYIWWSLPLTLAVFFAARRLAAHFKMPLLNPLLVAMVVIIPFLLLTGIPYEHYFKGSEVLNDLLQPAV
VALAYPLYEQLHQIRARWKSIISICFVGSLVAMITGTSVALLMGATPEIAASVLPKSVTTPIAMAVGGSI
GGIPAISAVCVIFVGILGAVFGHTLLNAMHIRTKAARGLAMGTASHALGTARCAELDYQEGAFSSLALVI
CGIITSLVAPFLFPLILAVMR
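If a website does not accept the header line, or you want to check the sequence before pasting it, the record can be split into its parts with a few lines of Python. This is a minimal sketch for a single record, assuming the header sits on its own line starting with `>`, as in standard FASTA. The function name `parse_fasta_record` is made up for this example.

```python
def parse_fasta_record(text):
    """Split a single FASTA record into (identifier, description, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with a '>' header line")
    header = lines[0][1:]
    # The identifier is the text up to the first space; the rest is description.
    identifier, _, description = header.partition(" ")
    # Join the remaining lines so the sequence has no line breaks.
    sequence = "".join(lines[1:])
    return identifier, description, sequence


RECORD = """>NP_456741.1 hypothetical protein STY2412
MMTYIWWSLPLTLAVFFAARRLAAHFKMPLLNPLLVAMVVIIPFLLLTGIPYEHYFKGSEVLNDLLQPAV
VALAYPLYEQLHQIRARWKSIISICFVGSLVAMITGTSVALLMGATPEIAASVLPKSVTTPIAMAVGGSI
GGIPAISAVCVIFVGILGAVFGHTLLNAMHIRTKAARGLAMGTASHALGTARCAELDYQEGAFSSLALVI
CGIITSLVAPFLFPLILAVMR
"""

identifier, description, sequence = parse_fasta_record(RECORD)
print(identifier)
print(description)
print(len(sequence), "residues")
```

The `sequence` value holds only the amino acid letters, so it can be pasted into search boxes that reject the header line.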

To get the most from this step, we recommend that you try to replicate the steps. You can do so by pausing the video and performing the tasks in your internet browser or you can watch it first and replicate the steps later.

This article is from the free online

Bacterial Genomes I: From DNA to Protein Function Using Bioinformatics

Created by
FutureLearn - Learning For Life
