How to retrieve DNA/protein sequences from public repositories

Hello, everyone. My name is Anna Protasio. I am a researcher at the Wellcome Sanger Institute. And today we are going to learn how to retrieve a gene entry from a repository. We are going to use two repositories today, namely NCBI, hosted by the NIH in the US, and the European Nucleotide Archive. First, we’re going to start by navigating to the NCBI website. This is the page where we arrive. And we have a drop-down menu of databases that we can use. Today we are going to use the Nucleotide database, found here. And in the search box, we are going to type E. coli hpcC.
This is the name of the gene we are going to use today for this demonstration. Now we’re going to click Search.
The results are back. And notice that some of the entries have the name “complete chromosome” in them. We are not interested in these entries, as we’re interested only in a gene entry, which is found in third place in this case. We’re going to click on this entry. Now we are showing the GenBank entry for this gene. Notice that on the left-hand side, you have tags such as LOCUS, DEFINITION, ACCESSION. And they are populated by these codes. So here we have the accession number. We have the definition of the gene. We also have some keywords. And also, importantly, we have links to, for example, the organism Escherichia coli.
And we also have PubMed links to get more information about this entry. We are not going to do this today, but you’re encouraged to do so yourself in your own time. At the bottom of the page, we have the DNA sequence for this gene. We also have the protein sequence, which is the conceptual translation for this gene. We are now going to download this entry into a file. And for this we go to the Send to drop-down menu. We’re going to leave Complete Record. We’re going to choose File, and leave GenBank as the format. And then click on Create File. This is going to automatically download a sequence into our Downloads folder.
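The Send to download can, in principle, also be scripted. The sketch below is not part of the video: it assumes NCBI’s E-utilities efetch service and only builds the request URL for a GenBank-format flat file from the nucleotide database. The accession `XX000000` is a made-up placeholder, since the transcript does not state the real accession number, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="gb"):
    """Build an efetch URL for one nucleotide record.

    rettype="gb" asks for the GenBank flat file and rettype="fasta"
    for the FASTA sequence of the whole record.
    """
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,      # accession number of the entry
        "rettype": rettype,   # record format
        "retmode": "text",    # plain text, like the downloaded file
    }
    return EFETCH_BASE + "?" + urlencode(params)


# Placeholder accession: replace it with the one shown on the entry page.
print(build_efetch_url("XX000000"))
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should return the same kind of flat file as the Create File button, assuming the accession exists.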
To access that sequence, we need to open it in a text editor. In this example, I am working on a PC. But you will also have similar software on your Mac or Linux machines. I’m going to find the file. And observe here that the extension .gb has been lost. We will need to open this in WordPad in our Windows example.
For that, I am going to open WordPad.
And from here, I am going to open the file directly from the Downloads folder. Notice that the file does not appear in the list. This is because the extension is not necessarily compatible with WordPad. But we can choose to show all documents. And here is our sequence. I click on our sequence and then click Open. And here’s our entry. Notice that this file has the same format as you observed in the web browser. But some of the links are removed. This is because this is now a flat file, and it only contains text. But this is a good way of keeping a record of your sequence of interest.
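Because the flat file is plain text, the tags on the left-hand side can also be read by a short script. The following is a minimal sketch, assuming the usual GenBank layout in which a top-level keyword starts in the first column and its value starts in column 13. The sample record is entirely made up for illustration, and the function name is hypothetical.

```python
def parse_genbank_header(text):
    """Collect top-level fields (LOCUS, DEFINITION, ...) up to FEATURES."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("FEATURES") or line.startswith("ORIGIN"):
            break  # the annotation table and the sequence are not parsed here
        keyword = line[:12].strip()
        value = line[12:].strip()
        if line[:1].strip():
            # A keyword starting in column 1 opens a new top-level field.
            current = keyword
            fields[current] = value
        elif current and not keyword:
            # Twelve blank columns: continuation of the current field.
            fields[current] += " " + value
        else:
            # Indented sub-keyword such as ORGANISM: skipped in this sketch.
            current = None
    return fields


# Made-up record, only to show the layout of the flat file.
SAMPLE = """LOCUS       XX000000                1500 bp    DNA     linear   BCT
DEFINITION  Example organism example gene, complete
            cds.
ACCESSION   XX000000
KEYWORDS    .
SOURCE      Example organism
  ORGANISM  Example organism
            Bacteria.
FEATURES             Location/Qualifiers
"""

header = parse_genbank_header(SAMPLE)
print(header["ACCESSION"])
print(header["DEFINITION"])
```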
I’m now going to close this file. And I’m going to show you how to download a FASTA sequence for this gene. We’re going to use the same drop-down menu, Send to, but instead of choosing Complete Record, we’re going to choose Coding Sequences. And in the format menu, we will have two options. We can download the nucleotide, or we can download the protein. In this instance, we’re going to download the nucleotide. We’re going to click on Create File. And another file has been downloaded. We’re going to follow the same procedure. We’re going to open WordPad, or if you already have it open, you can just use it directly.
And from here, I go to the Downloads folder, and again I need to change the filter to see all documents. And this is our other sequence. It has a Text Document type. And this is a FASTA sequence. The first line has a greater-than symbol (>), followed by the name of the entry, which is quite long in this case. It could be as short as the accession number. And then all the sequence follows.
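The FASTA layout just described (a header line starting with `>`, then the sequence on the following lines) is simple enough to read with a few lines of code. This is a minimal sketch, not a full-featured parser; the function name and the example record are made up for illustration.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined into one string per record.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example: the sequence is wrapped over two lines, as in real files.
example = """>XX000000 example gene
ATGGCC
TAA
"""
for name, seq in read_fasta(example):
    print(name, len(seq))
```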
In order to download the protein sequence, we can repeat the same steps. We go to Send to, choose Coding Sequences, and select the protein option.

This downloads as a sequence file as well. This is the third file that we have downloaded. We can open it again in WordPad.
And this is our protein sequence. Notice that the sequence is different. This time it’s amino acids rather than bases.
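The protein file is the conceptual translation of the nucleotide coding sequence: each group of three bases (a codon) is read as one amino acid. The sketch below illustrates that relationship using the standard genetic code. It is a simplification and an assumption on our part, not the procedure NCBI uses: for example, it does not handle alternative bacterial start codons such as GTG or TTG, which are read as methionine at the start of a gene.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGAAATGGTGA"))  # prints MKW
```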
In the second part of this little tutorial, we are going to access the same sequence that we used in NCBI, but this time in a different repository called the European Nucleotide Archive, or ENA for short. For that, I’ve pointed my internet browser to the ENA website. Or you can use an internet search engine to look for EBI ENA and it will do the job for you. Here we have a number of different search boxes. And I’ve already looked for E. coli hpcC. We can also use some of the options down here. We’re going to click on the search box. And here we can enter the term that we want to look for, in our case E. coli hpcC.
I’m going to hit the Search button. It comes back with two entries. They are the same. And we’re going to click on one of them. And this is the entry that EBI holds for this particular gene. Here on the right-hand side, we have a number of different options that we can look at. We can ask EBI to show us additional attributes of this gene. And here we will have information such as the strain, keywords, and other information that is important for bioinformatics. All this information that we can access here is also present in NCBI. So they are essentially two mirror databases.
In the same way that we downloaded the nucleotide and record data for this particular sequence from NCBI, we can do the same in ENA. And we can do this through the Download area here. And we can download the EMBL file. That will hold the annotation.
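As with NCBI, the ENA download can be approached from a script. The sketch below is an assumption and not something shown in the video: it relies on the ENA Browser API exposing records under `/ena/browser/api/embl/` and `/ena/browser/api/fasta/`, and it only builds the address. The accession is a placeholder and the function name is hypothetical.

```python
# Base address of the ENA Browser API (assumed for this sketch).
ENA_API_BASE = "https://www.ebi.ac.uk/ena/browser/api"


def build_ena_url(accession, record_format="embl"):
    """Build a download URL for an ENA record in EMBL or FASTA format."""
    if record_format not in ("embl", "fasta"):
        raise ValueError("record_format must be 'embl' or 'fasta'")
    return f"{ENA_API_BASE}/{record_format}/{accession}"


# Placeholder accession: replace it with the one shown on the entry page.
print(build_ena_url("XX000000"))
print(build_ena_url("XX000000", "fasta"))
```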
It downloaded automatically onto my computer, and I can open it. And because now I’ve switched to a Mac, to give you another flavour of how these things look on other systems, this opened automatically in the TextEdit programme. And you can see here that the file looks rather similar to the previous one, but still has some differences. So instead of, for example, having the word ACCESSION here or a PubMed entry, it has a list of two-letter codes. But all the same information is there. At the bottom of the file, as I had for my NCBI file, I also have the nucleotide sequence of the gene, as well as the protein sequence that the gene encodes.
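The two-letter codes in the EMBL file correspond roughly to the keywords of the GenBank file. The sketch below maps a few of them (ID, AC, DE, KW, OS) onto their approximate GenBank counterparts and reads them from the header of a flat file. The mapping is a simplification, and the sample record is made up for illustration.

```python
# Approximate correspondence between EMBL line codes and GenBank keywords.
EMBL_TO_GENBANK = {
    "ID": "LOCUS",
    "AC": "ACCESSION",
    "DE": "DEFINITION",
    "KW": "KEYWORDS",
    "OS": "ORGANISM",
}


def parse_embl_header(text):
    """Collect selected header fields from EMBL flat-file text."""
    fields = {}
    for line in text.splitlines():
        code = line[:2]            # the two-letter line code
        value = line[5:].strip()   # content starts after the code and spaces
        if code in ("FH", "SQ"):
            break  # feature table or sequence reached: header is over
        if code in EMBL_TO_GENBANK and value:
            name = EMBL_TO_GENBANK[code]
            # Repeated codes (for example two DE lines) are joined together.
            fields[name] = fields[name] + " " + value if name in fields else value
    return fields


# Made-up record, only to show the layout of the flat file.
EMBL_SAMPLE = """ID   XX000000; SV 1; linear; genomic DNA; STD; PRO; 1500 BP.
XX
AC   XX000000;
XX
DE   Example organism example gene,
DE   complete cds.
XX
KW   .
XX
OS   Example organism
"""

print(parse_embl_header(EMBL_SAMPLE)["DEFINITION"])
```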
I am going to close this file now. And I’m going to show you what the FASTA file entry looks like. So you can click on FASTA and it will download automatically. We can open it again. And here is the FASTA entry for the nucleotide sequence of that particular gene. We hope you found this tutorial helpful. And if you have any comments or suggestions, please use the box underneath the activity. And we hope to see you back very soon.

In this video we demonstrate how to download DNA and/or protein sequences from public repositories. We use two popular repositories: NCBI (National Center for Biotechnology Information) and ENA (European Nucleotide Archive).

Why do we need to know this? Searching for a given gene or protein in a database is where a lot of research starts. For example, imagine you are reading a research paper or a textbook on antibiotic resistance in a particular E. coli strain. The paper tells you that the gene or protein responsible for the resistance is called “hpcC”; you want to know the sequence of this gene or protein. How do you find it? Where do you get this information from?

In this video, you will learn where and how to obtain the gene and protein sequences for a gene of interest.

To get the most from this step, we recommend that you try to replicate the steps. You can do so by pausing the video and performing the tasks in your internet browser or you can watch it first and replicate the steps later.

This article is from the free online course Bacterial Genomes I: From DNA to Protein Function Using Bioinformatics.
