Skip to 0 minutes and 10 seconds
In this video, I'm going to show you how you can analyse flu sequences yourself on the web. You don't need any special software for this, just an ordinary web browser. And the site that you need to point your browser at is the National Center for Biotechnology Information in the USA. The URL for that is as you can see at the top of the screen. Now, NCBI stands for National Center for Biotechnology Information. NLM is the National Library of Medicine of the USA. And that's part of the National Institutes of Health. And of course it's a government site, hence the .gov.

Skip to 1 minute and 2 seconds
So once you find the site, you can explore many very interesting things about bioinformatics and computational biology. Things that are not just relevant to flu and other viruses, but to the whole of biology. But to find the specialist part that deals with flu sequences, you need to scroll down to the bottom of the page here. And in this little list here, where it says Featured, you'll see, three from the bottom, Influenza Virus. So if we click on Influenza Virus, this takes us through to the Influenza Virus Resource, which is a part of the NCBI website that's dedicated just to the analysis of influenza virus. Now again, there are many things that you can explore here for yourself.

Skip to 1 minute and 46 seconds
But what we're going to do today is just look at the database section. So on this side here, I'm going to click on the database link. And that takes us through to the influenza virus database part of the Influenza Virus Resource. So the first thing we're going to do is retrieve a protein sequence for a haemagglutinin from the 1977 Russian flu pandemic. So what we then do is look here: we're looking for influenza type A. That's in the type column. In the host choice box, we just want flus that infect humans. You can see as you scroll down here that there are very many species from which flu sequences have been isolated.

Skip to 2 minutes and 29 seconds
But we'll just take human ones. We'll have a look at the haemagglutinin gene. You will see the segments listed. Each protein is, of course, encoded in a different segment. And there are some segments that encode more than one protein here: PB1 and PB1-F2, which is a variant, and PA and PA-X and so on. But we will choose HA here, and we'll choose subtype H1N1, because that's the subtype of the 1977 Russian flu pandemic. If I were to click Add Query now, then we would get hundreds of sequences of all of the human H1N1 across all the years. But we just want ones from 1977 in the first instance. So I'm going to set Collection Date from 1977 to 1977.

Skip to 3 minutes and 21 seconds
So only sequences from that one year will be retrieved. And we're only looking for full length sequences, so click on that. And we also click this box here to remove duplicates. So now I've got my query ready. I hit Add Query, and it will then query the influenza virus database. So here our answer has come back already. It says that in the year 1977, there are four non-duplicate sequences for haemagglutinin from humans. So we can now look at those by clicking on Show Results here. This button here. Show Results. And we'll now get a list of those four sequences. So they were all collected in 1977. Two of them were collected in Russia.

Skip to 4 minutes and 7 seconds
These are the USSR 90 and 92 strains. And there was then one in China, in Tientsin, also 1977. And then there was a Hong Kong 117 strain as well. So we reckon that Russian flu started in the Soviet Union in 1977, so we'll choose this USSR 92 strain to look at in more detail. So I click on that link here, which is the accession number. The accession number is the database reference for that particular protein. So this takes us through to a page which gives us information about the haemagglutinin sequence from this USSR 92 strain. As you can see, there's quite a lot of information here. Much of it is really quite technical.

Skip to 4 minutes and 53 seconds
At the bottom here, we see the protein sequence. That's the sequence of amino acids: each one of these letters represents an amino acid in the protein chain. And we can see here also that we have information like country of collection Russia, collection date 1977, the name of the strain and so forth. Now how can we find how similar this haemagglutinin protein is to the haemagglutinin proteins of other strains? The way we do that is by using a tool which we call BLAST.

Skip to 5 minutes and 24 seconds
So BLAST takes protein sequences (in fact it also takes nucleic acid sequences, but in this case we're dealing with a protein sequence) and compares them to the rest of the database of protein sequences. So up here in the right hand column, we can see a Run BLAST link. So if we click on Run BLAST there, it takes us through to the BLAST query window. You see that the sequence that we're using as a query, the haemagglutinin from the USSR 92 strain from 1977, is already entered there. And it's looking for protein sequences in general. This is the default protein sequence database; there are other more specialised ones.

Skip to 6 minutes and 5 seconds
And then in order to make our search run a bit faster, we're going to restrict it to flu sequences. So I can just click there and type in influenza, and then we're going to look for influenza A virus. So click there. That confines the search to influenza A. The rest you can leave at the defaults. There's no need to change any other search parameters. And now we click the BLAST button. So BLAST is now working on our sequence, comparing it to the database of sequences at the National Center for Biotechnology Information in the USA. This is called the GenBank database. It's quite famous. So now we have the result of our query.

Skip to 6 minutes and 46 seconds
So we started with the haemagglutinin protein from USSR 92 and we queried it against the influenza virus proteins from GenBank. And this top line here confirms that we submitted a haemagglutinin protein. It's showing that the protein is a member of the haemagglutinin superfamily. So it tells us it looks generally like other haemagglutinin proteins. And here are our actual BLAST hits, which are the closest matches. And the top match is of the protein to itself. If we click on this link here, we can see that in fact, here is our query protein, and it's found itself in the database. And it's 100% identical. That's not that informative, of course. We're interested in what other things it's similar to.

Skip to 7 minutes and 30 seconds
Not just similar to itself. And here we see that the next top hit is another sequence from the Russian flu pandemic. One sampled a year later, in Memphis, in the USA. This is the Memphis 13 strain. The Memphis 13 strain isn't quite exactly the same as the Russian one. There's one amino acid difference. This is an example of antigenic drift: the gradual accumulation of changes in a protein to avoid the host immune system. You can see here that there are 565 out of 566 identities between the Russian flu and the American flu from a year later. And the difference is in this position here. A valine in the Russian sequence has actually been replaced by a phenylalanine in the American sequence.
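The "565 out of 566 identities" figure is simply a count of matching positions in the alignment. As a rough sketch of that counting step (not how BLAST itself works, which also has to find the alignment), the following Python compares two sequences that are assumed to be already aligned and of equal length. The function name `compare_aligned` and the ten-letter fragments are made up for illustration; they are not the real haemagglutinin sequences.

```python
def compare_aligned(query: str, subject: str):
    """Count identities between two already-aligned sequences.

    Returns (identities, length, differences), where differences is a
    list of (1-based position, query residue, subject residue).
    No gap handling: a '-' is treated like any other mismatching symbol.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    differences = [
        (pos, q, s)
        for pos, (q, s) in enumerate(zip(query, subject), start=1)
        if q != s
    ]
    identities = len(query) - len(differences)
    return identities, len(query), differences


# Made-up fragments: V is valine, F is phenylalanine.
toy_query = "MKAKLLVLLC"
toy_subject = "MKAKLLFLLC"

identities, length, differences = compare_aligned(toy_query, toy_subject)
print(f"{identities}/{length} identities")
for pos, q, s in differences:
    print(f"position {pos}: {q} -> {s}")
```

Run on two real aligned haemagglutinin sequences, the same function would report the position and the residues of each substitution.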

Skip to 8 minutes and 17 seconds
However, one of the other things that we see, which is interesting, is that not very far down the hit list there is something which is not from the Russian flu pandemic or shortly afterwards, but actually from much older sequences. Here we have the Roma sequence of 1949 from Italy. And we have others. We have Albany, New York, from 1948. We have FFW from 1951 and so on. If we look at the Italian sequence, the Roma sequence from 1949, we can see that again, it's really very close. It's only got six changes out of 566. There's one there, there's another one there.

Skip to 9 minutes and 1 second
And there's another one there. And there's another one. So you can look and find the six differences between the top line, which is the Russian flu sequence from 1977, and the bottom line, which is the Italian sequence, the Roma 1949 sequence. Now these sequences are some 28 years apart, but they're very similar. And this is one of the reasons why it's difficult to understand the origins of the 1977 pandemic flu strain. Because over 28 years' worth of evolution, there should actually be rather more changes in a flu protein than we see in this particular comparison here. We can illustrate this in a little more detail by doing something called phylogenetic tree building.

Skip to 9 minutes and 49 seconds
And in the second part of this demonstration, I'm going to show you how you can build a phylogenetic tree for the haemagglutinin for the Russian flu pandemic of 1977.
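To put the two comparisons from the video side by side, here is a small back-of-the-envelope calculation in Python. It uses only the numbers quoted above (one change out of 566 for Memphis, six out of 566 for Roma 1949) and assumes, for illustration, that "a year later" means exactly one year. A single pairwise comparison is not a proper estimate of an evolutionary rate, so this is only a sketch of why the Roma result looks surprising.

```python
SITES = 566  # alignment length quoted in the video


def percent_identity(identities: int, length: int) -> float:
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * identities / length


def changes_per_site_per_year(changes: int, length: int, years: float) -> float:
    """Crude rate: observed changes divided by sites and by elapsed years."""
    return changes / length / years


# USSR 1977 vs Memphis: 565 of 566 identical, assumed to be 1 year apart.
memphis_identity = percent_identity(565, SITES)
memphis_rate = changes_per_site_per_year(1, SITES, 1)

# USSR 1977 vs Roma 1949: six changes out of 566, 28 years apart.
roma_identity = percent_identity(SITES - 6, SITES)
roma_rate = changes_per_site_per_year(6, SITES, 28)

print(f"Memphis: {memphis_identity:.2f}% identical, {memphis_rate:.5f} changes/site/year")
print(f"Roma:    {roma_identity:.2f}% identical, {roma_rate:.5f} changes/site/year")
```

On these figures the apparent rate of change across the 28-year gap is several times lower than across the one-year gap, which is the oddity the video points out.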

Building a Phylogenetic Tree: The NCBI database

In this step, we’ll visit the website of the US National Center for Biotechnology Information (NCBI), which is part of the US National Library of Medicine, itself one of the National Institutes of Health based at Bethesda, Maryland.

You can access the NCBI website at


NCBI has been the world’s principal resource for genetic information for nearly 30 years. Its gene repository, GenBank, contains millions of files of sequence information, entirely open to public and scientists alike.

NCBI also provides some free online tools which anyone can learn to use to do their own analyses.

Phylogenetics is the science of how organisms are related. Phylogenetic trees are like family trees: things that are closely related are close together and things that are less related are further apart. So, in a family tree, you might have your brothers and sisters on the same branch of the tree as yourself, and your cousins on the next nearest branch. Your grandparents, who are the ancestors you share with your cousins, would sit at the point where those two branches join. This point is called a node.
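One way to picture a node is as the first ancestor that two members of the tree have in common. The sketch below is a toy Python version of the family tree described above. The names in `PARENT` and the helper functions are invented for illustration, and each couple is treated as a single entry to keep the tree simple.

```python
# Each entry maps a member of the tree to the node directly above it.
PARENT = {
    "you": "your parents",
    "your sibling": "your parents",
    "your cousin": "your aunt and uncle",
    "your parents": "your grandparents",
    "your aunt and uncle": "your grandparents",
}


def lineage(name: str) -> list:
    """Return the path from a member up to the root of the tree."""
    path = [name]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def common_node(a: str, b: str):
    """Return the nearest node shared by the lineages of a and b, or None."""
    ancestors_of_b = set(lineage(b))
    for node in lineage(a):
        if node in ancestors_of_b:
            return node
    return None


print(common_node("you", "your sibling"))
print(common_node("you", "your cousin"))
```

The closer two members are, the fewer steps it takes to reach their shared node, which is the sense in which closely related things sit close together on the tree.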

Similarly we can draw trees to represent how viruses are related. In this demonstration, we’ll be using this technique to demonstrate why it is thought that the 1977 pandemic H1N1 virus was actually a laboratory escapee, on the grounds that it is very close on the phylogenetic tree (almost an identical twin, we might say) to older viruses from the 1950s. This demonstration will be split into two steps.

In this first one, we’ll just be looking at how to find some influenza genomes to use in our analysis. To do this, we’ll be retrieving some influenza genome sequences from the GenBank database, and then looking for similar sequences using BLAST software.
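The database search in the video amounts to applying a set of filters: type A, human host, HA segment, subtype H1N1, collection year 1977, full-length sequences only, and duplicates removed. The Python below sketches that logic on a handful of invented records. It is not the NCBI implementation or API, and the `Record` fields, strain names and sequences are all placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    strain: str        # placeholder strain label
    virus_type: str    # "A", "B", ...
    host: str
    segment: str       # e.g. "HA"
    subtype: str       # e.g. "H1N1"
    year: int          # collection year
    full_length: bool
    sequence: str      # placeholder protein sequence


def select_1977_human_h1n1_ha(records):
    """Apply the filters from the video and drop duplicate sequences."""
    seen = set()
    selected = []
    for r in records:
        matches = (
            r.virus_type == "A"
            and r.host == "Human"
            and r.segment == "HA"
            and r.subtype == "H1N1"
            and r.year == 1977
            and r.full_length
        )
        # Keep only the first record carrying a given sequence.
        if matches and r.sequence not in seen:
            seen.add(r.sequence)
            selected.append(r)
    return selected


# Invented example records.
records = [
    Record("strain-1", "A", "Human", "HA", "H1N1", 1977, True, "MKAKLLVLLC"),
    Record("strain-2", "A", "Human", "HA", "H1N1", 1977, True, "MKAKLLVLLC"),  # duplicate
    Record("strain-3", "A", "Human", "HA", "H1N1", 1978, True, "MKAKLLFLLC"),  # wrong year
    Record("strain-4", "A", "Swine", "HA", "H1N1", 1977, True, "MKAKLLVLLY"),  # wrong host
    Record("strain-5", "A", "Human", "HA", "H1N1", 1977, False, "MKAKLL"),     # partial
    Record("strain-6", "A", "Human", "HA", "H1N1", 1977, True, "MKARLLVLLC"),
]

for r in select_1977_human_h1n1_ha(records):
    print(r.strain)
```

Removing the year filter would be the equivalent of clicking Add Query without setting the collection date, which in the video returns hundreds of sequences.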

For those interested in exploring the subject of phylogenetics further, a link to an introductory textbook is given below.

This video is from the free online course:

Influenza: How the Flu Spreads and Evolves

Lancaster University