Skip to 0 minutes and 6 seconds Hello everyone. My name is Anna Protasio. I am a researcher at the Wellcome Sanger Institute. And today, we are going to learn how to retrieve a gene entry from a repository. We are going to use two repositories today, namely NCBI, hosted by the NIH in the US, and the European Nucleotide Archive. First, we’re going to start by navigating to NCBI, by typing www.ncbi.nlm.nih.gov. This is the page where we arrive. And we have a dropdown menu of databases that we can use. Today, we are going to use the Nucleotide database, found here. And in the search box, we are going to type E.coli HpcC. This is the name of the gene we are going to use today for this demonstration.

Skip to 1 minute and 7 seconds Now we’re going to click Search.

Skip to 1 minute and 14 seconds The results are back. And notice that some of the entries have the name complete chromosome in them. We are not interested in these entries, as we’re interested only in a gene entry, which is found in third place in this case. And we’re going to click on this entry. And now, we are shown the GenBank entry for this gene. Notice that on the left-hand side, you have tags such as locus, definition, accession. And they are populated by these codes. So here, we have the accession number. We have the definition of the gene. We also have some keywords. And also, importantly, we have links to, for example, the organism Escherichia coli.

Skip to 1 minute and 56 seconds And we also have PubMed links to get more information about this entry. We are not going to do this today, but you are encouraged to do so yourself in your own time. At the bottom of the page, we have the DNA sequence for this gene. We also have the protein sequence, which is the conceptual translation for this gene. We are now going to download this entry into a file. And for this, we go to the Send To dropdown menu. We’re going to leave Complete Record. We’re going to choose File, and leave GenBank as the format. And then click on Create File. This is going to automatically download the sequence into our Downloads folder.
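The same download can also be scripted. The following is a minimal sketch, not something shown in the video: it assumes NCBI's E-utilities efetch service and uses only the Python standard library. The accession value is a hypothetical placeholder, because the demonstration does not read the accession number aloud; replace it with the one shown on the entry page.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities "efetch" service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Build the URL that returns one Nucleotide record as plain text."""
    params = {
        "db": "nuccore",     # the Nucleotide database
        "id": accession,
        "rettype": rettype,  # "gb" asks for the GenBank flat file
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def download_record(accession: str, filename: str, rettype: str = "gb") -> None:
    """Fetch the record over the network and save it as a text file."""
    with urlopen(build_efetch_url(accession, rettype)) as response:
        text = response.read().decode("utf-8")
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical placeholder: replace with the accession shown on the entry page.
ACCESSION = "YOUR_ACCESSION"
print(build_efetch_url(ACCESSION))
```

Calling download_record(ACCESSION, "sequence.gb") with a real accession would save the GenBank flat file. Passing rettype="fasta" instead requests the sequence in FASTA format.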

Skip to 2 minutes and 41 seconds To access that sequence, we need to open it in a text editor. In this example, I am working on a PC, but you will also have similar software on your Mac or Linux machines. I’m going to find the file. And observe here that the extension .gb has been lost. We will need to open this in WordPad in our Windows example.

Skip to 3 minutes and 9 seconds For that, I am going to open WordPad.

Skip to 3 minutes and 21 seconds And from here, I am going to open the file directly from the Downloads folder. Notice that the file does not appear in the list. This is because the extension is not necessarily compatible with WordPad, but we can choose to show all documents. And here is our sequence. I click on our sequence, and then click Open. And here’s our entry. Notice that this file has the same format as you observed in the web browser, but some of the links are removed. This is because this is now a flat file, and it only contains text. But this is a good way of keeping a record of your sequence of interest.

Skip to 4 minutes and 4 seconds I’m now going to close this file. And I’m going to show you how to download a FASTA sequence for this gene. We’re going to use the same dropdown menu, Send To. But instead of choosing Complete Record, we’re going to choose Coding Sequences. And in the format dropdown, we will have two options. We can download the nucleotide, or we can download the protein. In this instance, we’re going to download the nucleotide. We’re going to click on Create File, and another file is being downloaded. We’re going to do the same procedure. We’re going to open WordPad, or if you already have it open, you can just use it directly.

Skip to 4 minutes and 48 seconds And from here, going to the Downloads folder, and again, I need to change to see all documents. And this is our other sequence that has a text document type. And this is a FASTA sequence. The first line has a greater-than symbol (>) followed by the name of the entry, which is quite long in this case. It could be as short as just the accession number. Then all the sequence follows.
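To make the FASTA layout concrete, here is a small sketch of how such a file could be read in Python: a line starting with > opens a record, and every following line is joined into the sequence. The example record is made up for illustration and is not the hpcC entry.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return a mapping of header line (without '>') to the full sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]       # the name of the entry
            records[header] = []
        elif header is not None:
            records[header].append(line)  # one chunk of the sequence
    return {name: "".join(chunks) for name, chunks in records.items()}


# Made-up example record, for illustration only.
EXAMPLE = """>example_entry made-up description
ATGACCGGT
TTAGCA
"""

for name, sequence in parse_fasta(EXAMPLE).items():
    print(name, len(sequence), sequence)
```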

Skip to 5 minutes and 22 seconds In order to download a protein sequence, we can repeat the same steps. Go to Send To, choose Coding Sequences, and choose the protein file.

Skip to 5 minutes and 34 seconds This also downloads as a sequence file. This is the third file that we downloaded. We can open it again in WordPad.

Skip to 5 minutes and 48 seconds And this is our protein sequence.

Skip to 5 minutes and 52 seconds Notice that the sequence is different this time. It’s amino acids rather than bases.
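If you are unsure which of the downloaded files holds bases and which holds amino acids, a rough check on the letters can help. This is only a heuristic sketch: the set of letters treated as bases is an assumption, and a very short protein made only of A, C, G and T would be misclassified.

```python
# Letters assumed to indicate a nucleotide sequence (N = unknown base).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def looks_like_nucleotide(sequence: str) -> bool:
    """Return True if every letter in the sequence is a base symbol."""
    letters = set(sequence.upper())
    return bool(letters) and letters <= NUCLEOTIDE_LETTERS


# Made-up example sequences, for illustration only.
print(looks_like_nucleotide("ATGACCGGT"))  # bases only
print(looks_like_nucleotide("MKTLLVAG"))   # contains amino acid letters
```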

Skip to 6 minutes and 0 seconds We are now going to retrieve the same gene entry from a different repository, the European Nucleotide Archive. For that, we will navigate to the site by typing www.ebi.ac.uk/ena. This is a search site. In the text search, we’re going to type exactly the same thing: E.coli HpcC. And we’re going to click Search.

Skip to 6 minutes and 44 seconds And here, it comes back with two results. And we’re going to click on this link. And here, we have a similar entry to what we had before. Don’t be confused by the different layout; all the same information is here. We’re going to click Text. This will automatically download a file for us. And from here, we can find it in the Finder. This is our file. And we can open it with WordPad.

Skip to 7 minutes and 15 seconds And this is our entry. This is equivalent to the GenBank entry that we downloaded earlier, and all the same information is here. It’s just encoded slightly differently. So instead of having all the words, such as accession or pubmed, we have just two-letter codes, but the same information is there. And at the bottom, you will also have the nucleotide sequence and the amino acid sequence.
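As a rough guide to reading the two-letter codes, the sketch below pairs some common EMBL line codes with the GenBank keyword that carries approximately the same information. The pairing is an approximation for orientation, not an official conversion table, and the example lines are made up rather than taken from the hpcC entry.

```python
# Approximate correspondence between EMBL line codes and GenBank keywords.
EMBL_TO_GENBANK = {
    "ID": "LOCUS",
    "AC": "ACCESSION",
    "DE": "DEFINITION",
    "KW": "KEYWORDS",
    "OS": "ORGANISM",
    "RX": "PUBMED",    # RX lines hold reference cross-references
    "FT": "FEATURES",
    "SQ": "ORIGIN",    # start of the sequence block
}


def label_embl_lines(text: str) -> list[tuple[str, str]]:
    """Pair each recognised EMBL line with its approximate GenBank keyword."""
    labelled = []
    for line in text.splitlines():
        code = line[:2]  # the two-letter code starts every line
        if code in EMBL_TO_GENBANK:
            labelled.append((EMBL_TO_GENBANK[code], line[2:].strip()))
    return labelled


# Made-up example lines, for illustration only.
EXAMPLE = """AC   EXAMPLE1;
XX
DE   Made-up example entry
XX
OS   Escherichia coli
"""

for keyword, content in label_embl_lines(EXAMPLE):
    print(f"{keyword:<12}{content}")
```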

Skip to 7 minutes and 43 seconds If we wanted to just download the FASTA sequence, we can use the link FASTA at the end. We can repeat the same procedure, but now the extension might present some problems, so we are going to have to choose to open it with WordPad. And here’s our sequence.
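The Text and FASTA links can also be reached from a script. The sketch below assumes the ENA Browser API address pattern shown in the code, which is not mentioned in the video, so check it against the current ENA documentation before relying on it. The accession is again a hypothetical placeholder.

```python
from urllib.request import urlopen

# Assumed base address of the ENA Browser API.
ENA_API = "https://www.ebi.ac.uk/ena/browser/api"


def build_ena_url(accession: str, record_format: str = "embl") -> str:
    """Build the URL for the flat-file ("embl") or "fasta" view of an entry."""
    if record_format not in ("embl", "fasta"):
        raise ValueError("record_format must be 'embl' or 'fasta'")
    return f"{ENA_API}/{record_format}/{accession}"


def fetch_ena_record(accession: str, record_format: str = "embl") -> str:
    """Download the entry over the network and return it as text."""
    with urlopen(build_ena_url(accession, record_format)) as response:
        return response.read().decode("utf-8")


# Hypothetical placeholder: replace with the accession shown on the entry page.
ACCESSION = "YOUR_ACCESSION"
print(build_ena_url(ACCESSION))
print(build_ena_url(ACCESSION, "fasta"))
```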

Skip to 8 minutes and 8 seconds It is possible to change the extension of the files just by using the rename option in your software. In this demonstration, we’ve shown you how to search for a gene entry using NCBI and ENA, as well as how to download the sequences onto your computer. If you have any comments, please leave them in the comments section below the activity. We hope you have enjoyed this demonstration and hope to hear from you soon.
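Renaming can also be done from a script, which is handy when several downloads have lost their extension. This is a minimal sketch using Python's pathlib; the file name "sequence" and the temporary folder are illustrative stand-ins for a file in your Downloads folder.

```python
import tempfile
from pathlib import Path


def change_extension(path: Path, new_suffix: str) -> Path:
    """Rename a file so it ends with new_suffix (for example ".gb")."""
    target = path.with_suffix(new_suffix)  # adds or replaces the extension
    path.rename(target)
    return target


# Demonstration in a temporary folder, using a made-up file.
with tempfile.TemporaryDirectory() as folder:
    downloaded = Path(folder) / "sequence"       # file with no extension
    downloaded.write_text(">demo\nATG\n")
    renamed = change_extension(downloaded, ".gb")
    renamed_name = renamed.name
    still_exists = renamed.exists()
    print(renamed_name)
```

Changing the extension only changes the file name. The contents stay plain text and can still be opened in any text editor.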

How to retrieve DNA/protein sequences from public repositories

In this video we demonstrate how to download DNA and/or protein sequences from public repositories. We use two popular repositories: NCBI (National Center for Biotechnology Information) and ENA (European Nucleotide Archive).

Why do we need to know this? Searching for a given gene or protein in a database is where a lot of research starts. For example, imagine you are reading a research paper or a textbook on antibiotic resistance in a particular E.coli strain. The paper tells you that the gene or protein responsible for the resistance is called “hpcC”; you want to know the sequence of this gene or protein. How do you find it? Where do you get this information from?

In this video, you will learn where and how to obtain the gene and protein sequences for a gene of interest.

To get the most from this step, we recommend that you try to replicate the steps. You can do so by pausing the video and performing the tasks in your internet browser or you can watch it first and replicate the steps later.

