We are here with Rob Finn, who works at the European Bioinformatics Institute and is going to talk to us about protein families. Hello, Rob. Hi, Anna. Can you tell us about your role? Yeah, sure. So I run a large team at the EBI called the Sequence Families Team. It is actually an umbrella for a number of different resources. We have protein families resources, and there’s InterPro and Pfam there. We have RNA families, which is Rfam and RNAcentral. And then I also run the EBI’s Metagenomics Analysis Platform, which uses those resources to analyse microbial community DNA to try and work out who’s there and what they’re doing.
You’ve mentioned a number of databases that your team covers. Can you tell us more about Pfam? So Pfam is the grandfather resource. It’s been around for quite a long time, about 20 years. So it’s been there since the start of bioinformatics. And its primary role has always been to enable us to transfer information from the few experimentally characterised proteins to the many others that come off these large-scale sequencing projects. And so proteins can be considered as being built up of functional units, called domains.
And what we do is we try and model those individual units, such that we have these little models that we can then scan new sequences against and say, OK, I’ve seen that instance before. Label it with that. And that gives us an idea of function. So if you’ve seen something that, say, binds DNA, and I see that domain in another sequence, I know that’s likely to bind DNA. So are the domains found in these sequences conserved? Yeah, so we rely on that. Over time, evolution makes changes. But there’s always an underlying signal.
And that’s really what we’re trying to encapsulate in our mathematical models, called profile hidden Markov models, where we actually encapsulate the types of evolutionary sequence variation, so the parts that are the same and the parts that are different. And importantly, we model the insertions and deletions within those sequences, which gives us very sensitive models. How do you build a profile hidden Markov model? The way we work is we take a few examples where we know that they’re related, a few sequence examples. We take those, build a profile HMM around them, and then search over and over again to expand the set of sequences. And this is really why Pfam is very good.
So that training set, as long as it’s reasonably representative of that space, is very good. So what we find, and this is what we’ve learnt over time, is that these profile HMMs are actually very good at modelling evolution. What we find in evolution is you get subtle changes over time. And so as long as you’ve got good representatives across the phylogenetic tree of sequences, that’s enough for these profile HMMs to detect the intervening steps, because as I’ve alluded to, evolution is a continuum. And so you don’t suddenly find massive changes. You see small, incremental changes where, depending on natural variation or selection pressure, you just see a slight change of a family over the course of time.
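To make the build-then-search loop described above more concrete, here is a minimal sketch in Python. It is not how Pfam or any real HMM software works internally: it swaps the profile HMM for a much simpler ungapped position-specific scoring profile, so it has no insert or delete states and only handles sequences of equal length. The function names, the toy sequences, and the score threshold are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(seqs, pseudocount=1.0):
    """Build per-column log-odds scores against a uniform background.

    Simplification: sequences are assumed to be pre-aligned and of equal
    length, so there are no insert or delete states as in a profile HMM.
    """
    background = 1.0 / len(ALPHABET)
    total = len(seqs) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        profile.append({
            a: math.log(((counts[a] + pseudocount) / total) / background)
            for a in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum the per-column log-odds scores for one candidate sequence."""
    return sum(column[residue] for column, residue in zip(profile, seq))


def iterative_search(seed, database, threshold, max_rounds=5):
    """Rebuild the profile from all members found so far and search again,
    stopping when a round adds nothing new or max_rounds is reached."""
    members = list(seed)
    for _ in range(max_rounds):
        profile = build_profile(members)
        hits = [s for s in database
                if s not in members and score(profile, s) >= threshold]
        if not hits:
            break
        members.extend(hits)
    return members


# Hypothetical toy data: two related seed sequences and a small database.
seed = ["ACDEFG", "ACDEFH"]
database = ["ACDEYH", "ACDWYH", "WWWWWW"]
family = iterative_search(seed, database, threshold=4.0)
print(family)
```

In this toy run, the more divergent sequence `ACDWYH` scores below the threshold against the seed-only profile and is only picked up in the second round, once `ACDEYH` has been added. That is the "intervening steps" idea in miniature: each round of searching broadens what the model can recognise.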
Can the same profile HMM be used to identify similar sequences in bacteria and in humans? Yeah, so there are core housekeeping genes, for example those that you use for replicating DNA, where the genes are conserved all the way across. And a single HMM is capable of detecting those similarities all the way. There are other protein families and domains that are specific to a particular clade. And that’s what makes those things different. So this is one of the applications of Pfam. If you are interested in having a drug target against a bacterium, if you can find a particular domain that’s only found in bacteria and not in humans, then that’s likely to mean that the drug won’t be cross-reactive.
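The comparison described above, finding domains present in bacteria but absent from humans, comes down to set operations on domain annotations. The sketch below assumes the annotations are already available as a mapping from protein to domain labels; the protein names, domain labels, and the helper `domains_in` are hypothetical and do not correspond to real Pfam entries.

```python
def domains_in(annotations):
    """Collect the set of distinct domain labels across all proteins."""
    return {domain for domains in annotations.values() for domain in domains}


# Hypothetical annotations: protein name -> list of domain labels.
bacterial = {
    "bact_protein_1": ["replication_domain", "cell_wall_domain"],
    "bact_protein_2": ["toxin_domain"],
}
human = {
    "human_protein_1": ["replication_domain"],
    "human_protein_2": ["kinase_domain"],
}

# Domains conserved across both (the housekeeping case).
shared = domains_in(bacterial) & domains_in(human)

# Domains seen only in the bacterial set: candidate targets to look at further.
bacteria_only = domains_in(bacterial) - domains_in(human)

print(sorted(shared))
print(sorted(bacteria_only))
```

A domain appearing in the bacteria-only set is only a starting point. Whether it makes a good drug target depends on much more than its absence from the human annotations.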
Has Pfam been used to identify any families involved in pathogenicity in bacteria? So there are a number of examples where, certainly, we’ve discovered a family or we’ve found extensive use of that family in different bacteria. Examples include immunity proteins, where a bacterium actually secretes a particular protein that then kills all the other bacteria around it. And then it has a cognate partner that stops that protein from working against itself. And so therefore, it can exist without destroying itself. And then you can have related bacteria that have this cognate partner. So that allows a bacterium to invade a particular environment, let’s say the human gut, and that then leads to pathogenicity. Thank you, Rob. That was fascinating.