This content is taken from The Open University & Persontyle's online course, Advanced Machine Learning. Join the course to learn more.
Skip to 0 minutes and 1 second OK, so now we’re going to look at the first of the two missing data example exercises. In this one we’re going to be looking at expectation-maximisation and using that for estimating missing data values. Now, again, we’ve produced our own manual implementation of the expectation-maximisation algorithm, so that interested students can have a look at the code and see the mathematics that we looked at in the article in action, and see how it’s implemented and how it works. There’s discussion of how it works in the comments in the code itself. There are also a lot of comments discussing what we do in today’s exercise.

Skip to 0 minutes and 46 seconds So have a read through them when you look at the code script yourself. But, as always, if you’re not interested in looking at the mathematics and the implementation of the mathematics of the EM algorithm, it’s possible to just use what we’ve given you as you would any third-party implementation. And that’s what we’ll be doing in this video. So let’s look at the code.

Skip to 1 minute and 13 seconds We set up our data.

Skip to 1 minute and 19 seconds Now, what we’ve got here is from the careers data set, the Duncan dataset, which has careers from 1950 America. This time the columns that we’re using are type, income, and prestige. So we’re going to presume, like in the other cases, that we’re trying to estimate prestige based on type and income. The headache this time is that some of the values in the Type column are missing, and they’re represented by NA in the data. Now the Type column has three different classes: Prof, WC, and BC. That stands for professional, white collar, and blue collar. Now, the EM algorithm is really useful in the situation where your missing data comes from discrete or nominal-type variables.

Skip to 2 minutes and 21 seconds And that’s exactly what we’re seeing here. It can be used in other cases, but we prefer the MCMC missing data algorithm that we’ll show in the next example exercise for situations where the missing values are real valued. So let’s go through and see how to use the EM algorithm to try to work out these missing nominal values. And then we’ll also look at what we can do with the results of the EM algorithm. First thing we’re going to do is to work out which rows of the Type column contain missing values.

Skip to 3 minutes and 1 second And we’ll also work out, or let the computer work out, what values the Type column actually has. Here we see Prof, WC, BC. So now we have that stored for the computer to work with. Now, what we’re going to do is get the EM algorithm to work out probability distributions for each of the missing values. Easy enough to do here: just run the EM function with the data set. To use our manually implemented code, of course, you have to source this file, because it’s at the bottom. Now we can run it.
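The two bookkeeping steps just described can be sketched in a few lines of R. This is only an illustration on a toy data frame with made-up values, not the Duncan data, and the names `jobs`, `missing_rows`, and `type_levels` are hypothetical rather than taken from the course script.

```r
# Toy stand-in for the course data (hypothetical values, not the Duncan data).
jobs <- data.frame(
  type     = factor(c("prof", "wc", NA, "bc", NA, "prof"),
                    levels = c("prof", "wc", "bc")),
  income   = c(62, 41, 55, 18, 22, 76),
  prestige = c(82, 45, 60, 15, 20, 90)
)

# Row indices where the type value is missing.
missing_rows <- which(is.na(jobs$type))

# The values the type column can take (NA is not counted as a level).
type_levels <- levels(jobs$type)

print(missing_rows)
print(type_levels)
```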

Skip to 3 minutes and 39 seconds And what we get as a result are probability distributions for each of the missing values. So let’s have a look at those. Here we see the missing values were in rows 5, 6, 11, 41, and 45. And here we have probability distributions over the three values these type variables can take: professional, WC, or BC. So for example, for the missing value in row 5, the EM algorithm says 76% likely to be professional, 16% WC, white collar, 6% blue collar, and similarly for the other lines. Now, because we actually have the true values here, we can have a look at how well this performed.

Skip to 4 minutes and 32 seconds We see that, yes, row 5, it thought it was professional. It gave a very high probability to the missing value being professional, and it was. Row 6, it does badly. It thought it was blue collar but, really, it was professional. Row 11, similarly. And then the final two rows give very high probabilities to blue collar. And they were blue collar.

Skip to 4 minutes and 59 seconds Now, of course, it’s working out an estimate of whether the row, or the career, was professional, white collar, or blue collar based only on the income associated with that row. We’re not letting it see, for example, the prestige value, because that’s the variable we want to predict. And we’re also not letting it see the education value, because we’re not using that in this example. So it did OK but not brilliantly. But reasonably good, better than just guessing, certainly. Now, at any rate, those are the values that we got out. What can we do with these things?
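To make the idea concrete, here is a minimal sketch of EM for a missing nominal value that, as in the video, looks only at income. It is not the course's implementation: it assumes income is normally distributed within each type, the function name `em_impute_type` is hypothetical, and the data are toy values.

```r
# Toy data (hypothetical values): the last two rows have a missing type.
jobs <- data.frame(
  type   = factor(c("prof", "prof", "prof", "wc", "wc", "wc",
                    "bc", "bc", "bc", NA, NA),
                  levels = c("prof", "wc", "bc")),
  income = c(70, 80, 75, 45, 50, 40, 15, 20, 25, 78, 18)
)

# Sketch of EM with a Gaussian income model per type (an assumption).
em_impute_type <- function(income, type, n_iter = 50) {
  lv   <- levels(type)
  miss <- is.na(type)
  n    <- length(income)

  # Responsibilities: one row per observation, one column per type.
  resp <- matrix(0, n, length(lv), dimnames = list(NULL, lv))
  resp[cbind(which(!miss), as.integer(type[!miss]))] <- 1  # known types
  resp[miss, ] <- 1 / length(lv)                           # uniform start

  for (i in seq_len(n_iter)) {
    # M-step: weighted type proportions, means and standard deviations.
    nk    <- colSums(resp)
    prior <- nk / n
    mu    <- colSums(resp * income) / nk
    sigma <- sqrt(colSums(resp * outer(income, mu, "-")^2) / nk)

    # E-step: posterior over types for each row with a missing type.
    for (r in which(miss)) {
      p <- prior * dnorm(income[r], mean = mu, sd = sigma)
      resp[r, ] <- p / sum(p)
    }
  }

  out <- resp[miss, , drop = FALSE]
  rownames(out) <- which(miss)
  out
}

post <- em_impute_type(jobs$income, jobs$type)
print(round(post, 3))
```

Each row of `post` is a probability distribution over the three types for one missing value, which is the same shape of output the video describes.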

Skip to 5 minutes and 42 seconds Well, we can actually use them in any supervised learning algorithm that allows for weighted data input, which is to say that each row is given a weight. Now, what we’re going to do is to set up such a weighted data input. We’re going to create a data set that includes all the rows with no missing data. And then we’re going to add three rows for each row that had a missing type value. These three rows will correspond to the type value being professional, white collar, or blue collar.

Skip to 6 minutes and 29 seconds Once we’ve done that, we’ll create a weight vector. All of the original data rows that had no missing columns will be given a weight of one. All of the rows that had a missing type column will be given a weight that corresponds to the probability the EM algorithm gave of the type value taking the particular value specified in our new row. So we remember row five was missing the type value. And the EM algorithm said this missing value had a 76% chance of being professional, 16% chance of being white collar, 6% chance of being blue collar. And those are the weights we’ll give to the three rows corresponding to the completion of that row with a missing value.

Skip to 7 minutes and 22 seconds The first completion will say that the missing value is professional. It will be given a weight of 0.76. The second completion will correspond to the missing value being specified as white collar and will be given a weight of 0.16, and so forth. Now, once we’ve done that, we’re able to use this weighted data, as I said, in any supervised learning algorithm that accepts weighted data. Now, in theory, almost any supervised learning algorithm will accept weighted data. The question is whether the implementations you’re working with allow you to give weighted data.
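The construction of the weighted data set described above might look like the following in R. This is a sketch under assumptions: the data and the posterior probabilities in `post` are made-up values standing in for the EM output, and all variable names are hypothetical.

```r
# Toy data (hypothetical values): rows 3 and 5 have a missing type.
jobs <- data.frame(
  type     = factor(c("prof", "wc", NA, "bc", NA),
                    levels = c("prof", "wc", "bc")),
  income   = c(62, 41, 55, 18, 22),
  prestige = c(82, 45, 60, 15, 20)
)

# Hypothetical EM output: one row per missing value, one column per type.
post <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.2, 0.7),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("prof", "wc", "bc")))

miss <- which(is.na(jobs$type))
lv   <- levels(jobs$type)

# Rows with no missing data keep a weight of one.
complete <- jobs[-miss, ]
complete$weight <- 1

# Each row with a missing type is completed once per possible type,
# weighted by the probability EM gave to that type.
completions <- jobs[rep(miss, each = length(lv)), ]
completions$type   <- factor(rep(lv, times = length(miss)), levels = lv)
completions$weight <- as.vector(t(post))

weighted <- rbind(complete, completions)
rownames(weighted) <- NULL
print(weighted)
```

Because the weights of the three completions of a row sum to one, the total weight equals the number of rows in the original data.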

Skip to 8 minutes and 11 seconds Now, in our case, the linear models given by the `lm` function do accept weighted data. They’re also so simple to use that we might as well make use of them. Just to make things a little bit more interesting, we’ll create two types of linear regression models. One, just with an additive formula, where we’re predicting prestige based on type plus income. And one with a multiplicative formula, where we predict prestige based on type times income. Now, the difference here is that the first is going to just give different intercepts for each type, that’s to say, prestige will be based on

Skip to 8 minutes and 57 seconds some intercept if the type is professional, some other intercept if the type is white collar, and some third intercept value if the type is blue collar. And then plus some coefficient times income, where all the different types get the same income coefficient. When we’re doing the multiplicative formula, all the different types will not only have a different intercept. They’ll also have a different income coefficient. That is to say, not only will they all have different intercepts, they’ll also have different slopes for the line. So let’s do this. You’ll see it’s just like a normal call to the `lm` function when we’re doing ordinary least squares, where we give the formula and the data that the variables come from, with one addition.
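The two formulas can be sketched with `lm` as follows. The `weights` argument of `lm` is real; the data frame `weighted`, its values, and the names `fit_add` and `fit_mult` are hypothetical stand-ins for the course script.

```r
# Toy weighted data (hypothetical values). The last three rows are the
# three completions of one row whose type was missing.
weighted <- data.frame(
  type     = factor(c("prof", "prof", "prof", "wc", "wc", "wc",
                      "bc", "bc", "bc", "prof", "wc", "bc"),
                    levels = c("prof", "wc", "bc")),
  income   = c(62, 76, 80, 41, 50, 36, 18, 22, 30, 55, 55, 55),
  prestige = c(82, 90, 88, 45, 52, 40, 15, 20, 26, 60, 60, 60),
  weight   = c(rep(1, 9), 0.7, 0.2, 0.1)
)

# Additive formula: a separate intercept per type, one shared income slope.
fit_add <- lm(prestige ~ type + income, data = weighted, weights = weight)

# Multiplicative formula: a separate intercept and income slope per type.
fit_mult <- lm(prestige ~ type * income, data = weighted, weights = weight)

print(coef(fit_add))   # 4 coefficients
print(coef(fit_mult))  # 6 coefficients
```

In R's formula syntax `type * income` expands to `type + income + type:income`, which is where the per-type slopes come from.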

Skip to 9 minutes and 47 seconds We also have to specify weights equals our weights vector. Now, we didn’t do a split into training and validation, or training, validation, and test. So what we can do to evaluate the performance of these two models is to use a statistical evaluation score like the BIC. Now, to do that, of course, we need to be able to work out the likelihood of the data, or the likelihood of the model given the data. That means we need to produce an error distribution around the regression curve and use that to work out the probability of the data given the model.

Skip to 10 minutes and 30 seconds That’s exactly what I do here.

Skip to 10 minutes and 35 seconds And what I’m going to do is to–

Skip to 10 minutes and 43 seconds first of all, before we calculate the BIC scores, we will draw the models that we get. Here is the additive formula. You’ll note that there are different heights, but they all have the same slope. The points, by the way, they’re green, red, and blue, based on the types. But the grey ones were missing.

Skip to 11 minutes and 10 seconds The multiplicative formula, once again, they all have different intercepts. There are different heights. But they also have different slopes. You see the green one in particular is markedly different in the two cases. And the green one corresponds to the white collar workers. OK? So those are the plots. Let’s actually evaluate these models statistically.

Skip to 11 minutes and 35 seconds To do this, we’re going to calculate the BIC score. And to do that we need to work out the log likelihood of the models, which is to say the log probability of the data given the models. We do this here. You can go through these lines in detail yourselves.

Skip to 12 minutes and 0 seconds Once we’ve done that, it’s a simple matter of plugging this log likelihood into the BIC formulas that you’ll see in the article on evaluation of statistical models from week two. We give the log likelihood, the number of rows in the data set, and the number of free parameters in each model. And of course, the one that gives different coefficients for the income for each type will have more parameters. It’s more complex.
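As a rough illustration of this calculation, the sketch below uses the common definition BIC = k ln(n) - 2 ln(L), with a Gaussian error distribution around the regression line and each row's log density multiplied by its weight. These are assumptions: the week-two article may scale or count things differently, so the numbers need not match those in the video. The helper `weighted_bic` and the toy data are hypothetical.

```r
# Toy weighted data (hypothetical values); the last three rows complete
# one row with a missing type, so there are 10 original rows.
weighted <- data.frame(
  type     = factor(c("prof", "prof", "prof", "wc", "wc", "wc",
                      "bc", "bc", "bc", "prof", "wc", "bc"),
                    levels = c("prof", "wc", "bc")),
  income   = c(62, 76, 80, 41, 50, 36, 18, 22, 30, 55, 55, 55),
  prestige = c(82, 90, 88, 45, 52, 40, 15, 20, 26, 60, 60, 60),
  weight   = c(rep(1, 9), 0.7, 0.2, 0.1)
)
fit_add  <- lm(prestige ~ type + income, data = weighted, weights = weight)
fit_mult <- lm(prestige ~ type * income, data = weighted, weights = weight)

# Hypothetical helper: BIC from a weighted Gaussian log likelihood.
weighted_bic <- function(fit, n) {
  w <- weights(fit)    # the row weights passed to lm
  r <- residuals(fit)  # observed minus fitted prestige
  # Weighted maximum-likelihood estimate of the error standard deviation.
  sigma <- sqrt(sum(w * r^2) / sum(w))
  # Log probability of the data given the model, weighted per row.
  loglik <- sum(w * dnorm(r, mean = 0, sd = sigma, log = TRUE))
  # Free parameters: regression coefficients plus the error sd (assumed).
  k <- length(coef(fit)) + 1
  k * log(n) - 2 * loglik
}

n_rows   <- sum(weighted$weight)  # number of original rows
bic_add  <- weighted_bic(fit_add,  n_rows)
bic_mult <- weighted_bic(fit_mult, n_rows)
print(c(additive = bic_add, multiplicative = bic_mult))
```

With all weights equal to one, this helper gives the same value as R's built-in `BIC` on an `lm` fit, which is a useful sanity check.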

Skip to 12 minutes and 30 seconds Let’s output the result to the console and see what we have. There we go.

Skip to 12 minutes and 45 seconds That didn’t work very well when we’re doing it line by line, it wasn’t very clear. So let’s just output the BIC scores for the first and second model. There we go. BIC for the first model, the additive model, 13.3. BIC for the second model, the multiplicative model, 20.9. Lower is better. So we would choose the additive model over the multiplicative model. OK, so we did, actually, quite a lot more there than just look at the expectation-maximisation missing data algorithm. But we did do that. We also ended up looking at additive and multiplicative linear regression formulas, and also used the BIC score. I think this is the first time we’ve used a statistical evaluation method.

Skip to 13 minutes and 33 seconds Perhaps it’s the second, but at any rate, it gives you a bit of practise at using one of these statistical evaluation methods, as well as the normal holdout and cross-validation methods.

# Missing Data: Missing Value Imputation Exercise 1 (EM)

A code exercise for using the EM algorithm to impute missing data values. The associated code is in the Missing Data Ex1.R file. Interested students are encouraged to replicate what we go through in the video themselves in R, but note that this is an optional activity intended for those who want practical experience in R and machine learning.

We look at how the EM algorithm can be used to impute missing values of discrete-valued variables. We work through how to use the EM algorithm to do this, and look at the weighted data the algorithm outputs and how this can be used to fit a statistical model. In this case we use this weighted data to generate two OLS models, one with an additive formula and one with a multiplicative formula (we explain what this means for the models in the video). We also take the chance to use a statistical evaluation method for model selection, choosing our preferred model based on the models' BIC scores.

We have written a manual implementation of the expectation-maximization algorithm that can be used to impute missing values of discrete-valued variables for this data. Interested students can examine it to see the mathematics of this technique and how that mathematics is implemented. Students who are not interested can simply use this code as they would any third-party library. In either case, the code needs to be sourced before it can be used.

Note that the car R package is used in this exercise. You will need to have it installed on your system. You can install packages using the install.packages function in R.