Skip to 0 minutes and 11 seconds Hi! My name is Tony Smith. In this lesson, we’re going to look at a practical application of data mining in the world of biology. Knowledge discovery with biological data, or so-called bioinformatics. Now, there are many different types of biological problems that we might want to study, many different data types. I’m going to look at a subset that’s quite common, called “sequence analysis”. Sequence of nucleotides that make up genes or sequences of amino acids that make up proteins – in fact, the latter. We’re going to look at a very easily stated sequence problem for proteins. It goes like
Skip to 0 minutes and 46 seconds this: given a freshly produced protein, which portion of it is the signal peptide? Now, what does this mean? Well, you might remember from high school biology that along your DNA there are nucleotide sequences called genes. Genes get copied with messenger RNA to produce a transcript, and the transcript is used to string together amino acids into a polypeptide chain, which is a protein. Proteins perform some function in a cell, and, in order to do that, they have to be transported to where they’re going to perform that function, and, through that transport, they have to pass through a membrane.
Skip to 1 minute and 20 seconds In so doing, what happens is the 20 or 30 or so amino acids at the beginning of the protein – called the signal peptide – open up a translocation channel that allows the protein to pass through the membrane. In so doing, the signal peptide portion gets cleaved off. The signal peptide is kind of like a key that opens a door for a protein, and, if we know what the key is, it gives us an idea as to what the function of the protein might be. We want to predict where the signal peptide ends. Where is the cleavage point? We first ask ourselves what’s our general goal? Do we want an accurate prediction or do we want an explanatory model?
Skip to 2 minutes and 1 second Something that gives us some knowledge. We’ll have to ask what features might be relevant in predicting the cleavage site. So what features do we need to generate from the data we’re given? What approach are we going to take? What learning algorithms in Weka might we use, and how are we going to know if the model produced by Weka is any good? How do we know if we’re successful? Here are 10 or so instances of new proteins. As you can see, they’re sequences of letters where each letter corresponds to a different type of amino acid. M is Methionine, A is Alanine, S is Serine, and so on.
Skip to 2 minutes and 33 seconds About 25 or 30 residues along from the beginning of the protein, marked in red here, is the cleavage site. That’s the beginning of the mature protein, the part that survives after cleavage. That’s what we’re trying to predict: which of those residues is the cleavage site? What properties do we think are relevant? Do we want properties of the entire signal peptide or just properties around the cleavage site? We might get some domain knowledge from a biologist to help us out, or we might do some ad hoc statistical analysis to look for things that might correlate with the cleavage site.
Skip to 3 minutes and 8 seconds For example, given the 1400 examples in our dataset, we might find that there’s a very tightly clustered length, with the mean length of 24. Knowing the position of a residue might be useful in predicting whether or not it’s the cleavage site. If we look at the residue at the start of the protein and, perhaps, the three residues immediately upstream of the cleavage site and the three residues downstream from it, there might be some useful information there, some context.
Skip to 3 minutes and 35 seconds In fact, if we do a histogram of the upstream region of the data we’ve got, we’ll see that it looks like the letter A, Alanine, and perhaps the letter L and maybe S, as well, seem to be quite frequent around the cleavage site. So that could be useful.
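The histogram idea can be sketched in a few lines of Python. This is only an illustration: the three sequences and cleavage indices below are made-up toy data, not the lecture's dataset, and the function name `upstream_counts` is hypothetical. The cleavage index is taken to be the 0-based position of the first residue of the mature protein.

```python
from collections import Counter

# Hypothetical toy data (NOT the lecture's dataset):
# (protein sequence, 0-based index of the first residue of the mature protein)
examples = [
    ("MKLLVLAASAQE", 9),
    ("MRSLLALVALAGT", 11),
    ("MKTLVLLSASADP", 10),
]

def upstream_counts(examples, width=3):
    """Count residue types in the `width` positions immediately upstream of each cleavage site."""
    counts = Counter()
    for seq, cleave in examples:
        counts.update(seq[max(0, cleave - width):cleave])
    return counts

counts = upstream_counts(examples)
# Print a crude text histogram, most frequent residue first.
for residue, n in counts.most_common():
    print(residue, "#" * n)
```

On real data, a residue that dominates this count is a candidate feature, although a raw frequency alone says nothing about whether it separates cleavage sites from other positions.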
Skip to 3 minutes and 51 seconds When we don’t have much domain knowledge, we might come up with a set of features that include the position of the residue being considered; the residues at each position, three either side of the cleavage point; and then for each residue that we know is the cleavage site, we’ll put that in the class of yes this is the cleavage point; and we’ll just get some negative instances by randomly choosing some other residues and producing the same information. We might do this inside a spreadsheet. Here’s an example. Each column is an attribute and each row is one instance of a residue. We record all this information. This can be saved in a comma-separated version in most spreadsheet packages.
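As a rough sketch of the same feature table built in code rather than a spreadsheet, the snippet below produces one positive row and one randomly chosen negative row per protein and writes them as CSV. The column names, the `-` padding symbol, the 0-based positions, and the example sequence are all assumptions made for illustration; the lecture does not specify them.

```python
import csv
import io
import random

# Hypothetical column names; the lecture's spreadsheet headers are not given.
HEADER = ["position", "m3", "m2", "m1", "site", "p1", "p2", "p3", "class"]

def window_features(seq, pos, flank=3):
    """Residues at offsets -flank..+flank around pos; '-' pads positions past either end."""
    def at(i):
        return seq[i] if 0 <= i < len(seq) else "-"
    return [at(pos + off) for off in range(-flank, flank + 1)]

def make_instances(seq, cleave, rng, n_negative=1):
    """One 'yes' row for the known cleavage site plus n_negative random 'no' rows."""
    rows = [[cleave] + window_features(seq, cleave) + ["yes"]]
    others = [i for i in range(len(seq)) if i != cleave]
    for pos in rng.sample(others, n_negative):
        rows.append([pos] + window_features(seq, pos) + ["no"])
    return rows

rng = random.Random(0)  # fixed seed so the negative choices are repeatable
rows = make_instances("MKLLVLAASAQE", 9, rng)  # toy sequence, not real data

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerows(rows)
print(buffer.getvalue())
```

Changing `n_negative` changes the class balance of the resulting file, which matters when reading accuracy figures later on.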
Skip to 4 minutes and 27 seconds Weka, of course, can load a CSV file. We’re going to go ahead and load this data into Weka and have a go at seeing if we can predict the cleavage site from it. I’ve loaded up the dataset that I just showed you into Weka. We see here we’ve got the features: the length, or the position of the amino acid in question; which residue is at the –3, –2, and –1 positions; the residue at the cleavage site; and the residues at positions 1, 2, and 3 downstream. And I’ve recorded whether this is an example of the cleavage site or a randomly chosen other residue that’s not.
Skip to 5 minutes and 4 seconds Now, if I go straight to classify, I want an explanatory model, so I’m going to go for a C4.5 decision tree. I’ll go down to trees, load up J48, which is C4.5, and, under the default settings of 10-fold cross-validation, I’m just going to go ahead and start up Weka. It comes back pretty quickly. If we look at the accuracy, we’ll see we’ve got 78-79% accuracy. That’s pretty good considering other state-of-the-art software for predicting the signal peptide cleavage point performs at about 80-85% accuracy. So we’ve already done really well, but is this model any good? Now, if we look at the true positive rates for the two classes.
Skip to 5 minutes and 42 seconds Here we’ve got the Yes and No class, and if we look at the true positive rates, they’re around 80%, so that’s pretty good. Let’s take a look at the decision tree produced. I’ll just pop up the visualization of it. Enlarge that a little bit. Fit to the Screen. Now, there’s a couple of reasons why this decision tree suggests we haven’t come up with a very good model. One is it’s very wide and very shallow, and it’s highly branching. Each of these tests seems to produce a lot of very small subsets. This suggests that what we’ve done is that we’ve actually found a model that overfits the data. Now, what does that mean? Well, let me give you an example.
Skip to 6 minutes and 24 seconds Machine learning algorithms are trying their best to get predictive accuracy, and it’s often very easy for learning algorithms to find some model that will work. There are two reasons why we might get good performance for the wrong reasons. One is sparseness of data, and another is overfitting the data. Let’s look at each of these problems and see if we can figure out what’s going on with our example here. Data sparseness is another form of overfitting, but it’s specifically because we don’t have enough instances to figure out the true underlying relationship. Consider this very small dataset here. What I’ve done is that I’ve rolled two dice – six-sided game dice – and I’ve tossed a coin. Two dice, one coin.
Skip to 7 minutes and 6 seconds I’ve recorded the outcomes. I rolled a 3 with one die, a 5 with another, and a heads with the coin. I did that four times and recorded the four instances here. Now, we know that there are six possible outcomes for rolling a die. I’ve got two dice. Two outcomes for a coin toss. That’s 6 x 6 x 2. That’s 72 possible instances we could’ve had, but we only have 4. I give these four instances to Weka. I say come up with a rule that allows me to predict the coin toss from the roll of the dice.
Skip to 7 minutes and 37 seconds It comes up with a model: if Die1 > 2 then the outcome of the coin toss is heads, otherwise it’s tails. That fits the data we’ve got here. 100% correct, but, of course, if we had additional instances, then hopefully Weka would see that there’s no correlation, these are random outcomes. This is the problem of overfitting due to data sparseness. This is a real problem with our signal peptide, because we’ve recorded 7 different residues around the cleavage site, so each of them can be 1 of 20 residues. That’s 20^7 possible patterns. We’ve got the position, there’s about 60 different integers there. The two class values.
Skip to 8 minutes and 16 seconds That’s 153 billion possible instances of which we have 1400 positive ones and an equal number of negative ones. A tiny fraction. That’s data sparseness. Overfitting, in general, can be indicated when the model is overly complex, such that the tests practically uniquely identify instances. The model splits instances into lots of very small subsets, and a telltale sign of this is the model is complex, highly branching. That’s what we see from our example here. We can usually tell if we’ve been overfitting. If we just get some more data, if we tried to predict it based on the tree we learned, we’d get poor performance. Of course, we don’t often have extra data.
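The sparseness arithmetic from the dice example and the signal peptide dataset can be checked directly. The figures below (20 residue types, 7 window positions, about 60 position values, 2 classes, 1400 instances per class) are the ones quoted in the lecture; only the variable names are made up.

```python
# Dice-and-coin toy problem: two six-sided dice and one coin.
dice_space = 6 * 6 * 2
dice_seen = 4

# Signal peptide problem, using the counts quoted in the lecture.
residue_types = 20   # possible amino acids at each recorded position
window = 7           # residues recorded around the cleavage site
positions = 60       # approximate number of distinct position values
classes = 2          # cleavage site: yes or no

peptide_space = residue_types ** window * positions * classes
peptide_seen = 1400 * 2  # 1400 positives plus an equal number of negatives

print(f"dice: {dice_seen} of {dice_space} possible instances")
print(f"peptides: {peptide_seen} of {peptide_space:,} possible instances")
print(f"fraction of peptide space covered: {peptide_seen / peptide_space:.2e}")
```

Both problems have far fewer observed instances than possible ones, but the gap for the peptide window features is many orders of magnitude larger.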
Skip to 9 minutes and 8 seconds Given these characteristics of an overfitting model, I would look at the decision tree we’ve got here and suggest that it is overfitting. One way to test that is I’ve actually prepared a dataset with three times as many negative instances. I’ll just go back and load up file two here, sigdata2. That’s the same as data1, only with three times as many negative instances. We’ll just go back to Classify under the same default settings. We’ll go ahead and start it up. Now, if we look at the accuracy, we’ll see it’s even gone up, 82.5%. But, if we look at the true positive rate of the cleavage class, it’s actually down to almost 50%.
Skip to 9 minutes and 48 seconds That is practically a coin toss in its accuracy in predicting the
Skip to 9 minutes and 54 seconds very thing we’re interested in: is this the cleavage site? This doesn’t look like a very fruitful way of going about trying to predict the cleavage site. Our amino acid context approach appears to be overfitting the data. What else could we try? Well, we might look for a different set of features that capture the more general properties of signal peptides. A more informed approach, which we might learn about by consulting an expert, a biologist, is we assume that the cleavage occurs because of physical forces at the molecular level. That is, amino acids have electro-chemical properties. We might create features that capture those physicochemical properties of amino acids around the cleavage site or of the signal peptide as a whole.
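To see how accuracy can rise while the true positive rate for the cleavage class collapses, here is a small sketch. The confusion-matrix counts are hypothetical, chosen only so that the resulting rates roughly match the figures quoted in the lecture; they are not Weka's actual output.

```python
def rates(tp, fn, tn, fp):
    """Return (accuracy, true positive rate, true negative rate) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    tpr = tp / (tp + fn)   # fraction of real cleavage sites that were found
    tnr = tn / (tn + fp)   # fraction of non-cleavage residues correctly rejected
    return accuracy, tpr, tnr

# Hypothetical counts: 1400 positives and 1400 negatives (balanced data).
balanced = rates(tp=1120, fn=280, tn=1092, fp=308)

# Hypothetical counts: 1400 positives and 4200 negatives (three times as many).
skewed = rates(tp=700, fn=700, tn=3920, fp=280)

for name, (acc, tpr, tnr) in [("balanced", balanced), ("skewed", skewed)]:
    print(f"{name}: accuracy={acc:.3f} TPR={tpr:.3f} TNR={tnr:.3f}")
```

With three negatives for every positive, getting the majority class right is enough to push overall accuracy up even though half of the real cleavage sites are missed, which is why the per-class rate is the number to watch.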
Skip to 10 minutes and 39 seconds We can get some domain knowledge from the experts. What kind of knowledge would we get? Well, this diagram here shows a distribution of the amino acids at positions relative to the cleavage site. If we look at the –1 position, that’s the amino acid immediately upstream of the cleavage site. Here the size of the letters is proportional to the frequency of the amino acid type at that position. We’ll see at the –1 position, there’s a lot of A’s, quite a few G’s, S’s, some C’s and T’s. At the –3 position, we see A’s, V’s, S’s, and T’s. Also, in the region 5 to 15 upstream, we see there’s a lot of L’s, V’s, and A’s. What’s going on here?
Skip to 11 minutes and 23 seconds What are the electro-chemical properties of A’s and L’s and V’s that we might exploit to capture this non-uniform distribution in these relative positions? It turns out that amino acids have well-known types. They can be molecules that tend to not like being near water. They’re called hydrophobic. You see on the right side of this Venn diagram, we’ve got A, V, P, M, L, F. These are all hydrophobic amino acids. On the other side, we’ve got the hydrophilic ones, the ones that like to be near water. We also have some amino acids that are positively charged and some are negatively charged. This affects whether or not they stick together, of course. And then the rest are not really very charged.
Skip to 12 minutes and 10 seconds There are residues with small side chains, the bit of the molecule that distinguishes one residue from another. A, V, P, G, C, N, and S there all have small side chains, and the other ones are somewhat larger. These are the kinds of properties we could record about the molecule around the cleavage site. In fact, biologists know of the physicochemical properties around signal peptides, and they talk about three regions called the C-region, the H-region, and the N-region. Now, the C-region is just those 3, 4, 5, 6 residues immediately upstream of the cleavage site. The residue at position –3 is usually uncharged, and the residue at position –1 usually has a small side chain. Adjacent to that upstream is the H-region, about 8 residues long.
Skip to 12 minutes and 59 seconds That was all the L’s and V’s we saw. It tends to be a hydrophobic region. Then, above that, to the beginning of the protein is the N-region, which tends to be positively charged. This is information we can use to construct more informed features. The possible features we might include are the size, the charge, the polarity, and the general hydrophobicity of regions of the signal peptide, especially at position –1 and –3, because they seem to be quite distinct. We might compute the total hydrophobicity in an approximate H-region, about 5 to 15 upstream of the cleavage site. We might look at the total charge, polarity, and hydrophobicity in the C-region and so on. Then record whether or not that’s the cleavage site.
Skip to 13 minutes and 49 seconds So for a couple of randomly chosen residues which are not the cleavage site, we’ll compute these same features. In fact, I’ve created
Skip to 13 minutes and 58 seconds a dataset which just includes the following four features: the position, as we had before – the same as the length we had in the previous dataset – the overall hydropathy of the approximate H-region, the side-chain size for the –1 residue, and the charge of the –3 residue. If we go back to Weka here, we’ll just load in file 3, the one I prepared here. I’ll just load it in. Here we can see the position, the charge at the –3 position, whether or not it’s small in the –1 position, and the overall hydrophobicity here of the H-region, which you’ll see is a numeric value.
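The four informed features could be computed along the following lines. This is a sketch under stated assumptions, not the lecturer's actual procedure: it uses the Kyte-Doolittle hydropathy scale (the lecture does not say which scale was used), takes the small-residue set A, V, P, G, C, N, S as listed in the lecture, treats only K and R as positive and D and E as negative (histidine is counted as neutral here), and approximates the H-region as the ten residues at offsets –15 to –6. The function name and the example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; higher means more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
SMALL = set("AVPGCNS")                         # small side chains, as listed in the lecture
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}    # simplifying assumption: all others neutral

def informed_features(seq, pos, h_far=15, h_near=5):
    """Position, summed H-region hydropathy, smallness at -1, and charge at -3 for residue pos."""
    h_region = seq[max(0, pos - h_far):max(0, pos - h_near)]
    return {
        "position": pos,
        "h_hydropathy": round(sum(KYTE_DOOLITTLE[r] for r in h_region), 1),
        "small_m1": pos >= 1 and seq[pos - 1] in SMALL,
        "charge_m3": CHARGE.get(seq[pos - 3], 0) if pos >= 3 else 0,
    }

toy = "MKRLLLVVLALLAVSSASAQDE"            # hypothetical sequence, cleavage before index 19
site = informed_features(toy, 19)         # features for the assumed cleavage site
other = informed_features(toy, 5)         # features for an arbitrary non-site residue
print(site)
print(other)
```

Each instance is now described by a handful of general properties rather than by the exact residues present, so far fewer distinct instances are possible than with the seven-residue window.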
Skip to 14 minutes and 31 seconds There are charts of general hydrophobicity for amino acids, and I’ve just summed them up for a region upstream of the cleavage site. Let’s go back to J48. It’s still all set up here for 10-fold cross-validation. We’ll start her off under the default settings. If we look at our accuracy here, we’ve got – holy smokes – 91.5% accuracy. That’s great! Now, is this all just because we’re predicting one class? We look at the true positive rate, and we’ll see we’ve got an average true positive rate of almost 92%. That’s quite good. But, we might ask ourselves, are we overfitting the data? Now, if we look at the model, it’s going to be quite small, because we don’t have very many features.
Skip to 15 minutes and 14 seconds Maybe this is a little on the big side. (Fit to screen here.) We might wonder, are we overfitting the data? Have we got a problem of data sparseness? Well, once again, I can generate three times as many negative instances to see if we’re just getting a sort of random outcome. We’ll go back to Preprocess here, open the file sigdata4. It’s the same as sigdata3, but with three times as many negative instances. I’ll load them all in. We’ve got 5,620 instances. I’ll go back to Classify. Same default settings. Go ahead and start it up, and let’s look at the accuracy first of all. Accuracy has gone up to almost 94%, but let’s look at those true positive rates.
Skip to 15 minutes and 59 seconds Here, we see that our average true positive rate for our two classes still remains high, 94%. This indicates, in fact, that the model has been relatively good at discriminating between cleavage sites and non-cleavage sites. In fact, if we look at the model, if we visualize the tree, we can see a number of features here. At the top of the tree, it’s looked at the H-region, which we knew was useful in predicting the cleavage site, and then it’s looked at the smallness of the –1 position and so on. Overall, this looks like it might possibly be capturing, in a formal model, the general principles biologists told us all about.
Skip to 16 minutes and 40 seconds When we’re doing bioinformatics, there are several considerations for data mining. We have to ask ourselves what’s our overall goal? Do we want predictive accuracy or explanatory power? How do we prepare the data to generate features which are actually going to be useful for solving our problem? How can we evaluate how good the model is that we get, knowing that Weka’s going to do its best to come up with a highly accurate model, and that it may do so under spurious circumstances? Most importantly, bioinformatics is an instance where data mining really is a collaborative experience. So seek expert advice whenever you can.
Signal peptide prediction
Tony Smith introduces signal peptide prediction, an application of data mining to a problem in bioinformatics. A sequence of amino acids that makes up a protein begins with an initial portion of 20 or 30 amino acids called the “signal peptide” that unlocks a membrane for the protein to pass through. The problem is to determine the “cleavage point” where the signal peptide ends. An important question is whether we seek an accurate prediction or an explanatory model. One potentially useful feature is the length of the signal peptide; another is the amino acids immediately upstream and immediately downstream of the cleavage point. Overfitting is a problem, and domain knowledge from experts is an important ingredient for success – data mining is a collaborative process.
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.