Skip to 0 minutes and 1 second All right. So now we’re going to look at the second of the two missing data example exercises. This time we’re going to be using Markov chain Monte Carlo. And in particular, we’re going to be using the Metropolis within Gibbs missing data algorithm that we described in the article.
Skip to 0 minutes and 24 seconds Exactly like in the first exercise, we’ll use this as an opportunity to look at a few other things. In fact, we’re going to look at the same sorts of things that we did in the first missing data example. Because we’re going to be generating a weighted data set again, we’ll use the linear regression formula to create some linear regression models from this weighted data set. We’ll again create an additive and a multiplicative one and compare them, and we’ll compare them statistically like we did last time, except that instead of the BIC score, we’ll use the AIC score. OK. Let’s look at the code. Now there is an important difference this time when we prepare the data.
Skip to 1 minute and 5 seconds It’s the same data as last time except that it’s now not only the type variable that has some missing values, it’s also the income variable. So we have missing values not only in a discrete or nominal variable, but also in a real valued variable. And that makes the expectation maximisation algorithm less appealing to use, which is why we will use the MCMC alternative. Once again, we’ll start off by finding which rows and columns contain missing values.
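The step of finding which rows and columns contain missing values can be sketched in base R as follows. This is a minimal illustration only: the data frame `df` and its values are made up for the example and are not the exercise data, and the course code may locate the missing items differently.

```r
# Toy data frame standing in for the exercise data (values are invented).
df <- data.frame(
  income   = c(12000, NA, 8000, 15000, NA),
  type     = factor(c("prof", "bc", NA, "wc", "bc")),
  prestige = c(68, 35, 41, 52, 30)
)

# Row and column position of every missing cell.
missing_cells <- which(is.na(df), arr.ind = TRUE)

# Rows with at least one missing value, and columns with at least one.
missing_rows <- which(!complete.cases(df))
missing_cols <- names(df)[colSums(is.na(df)) > 0]
```

Here `missing_cells` has one row per missing item, which is a convenient shape when each missing item later needs its own set of sampled values.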
Skip to 1 minute and 45 seconds Here we go.
Skip to 1 minute and 52 seconds And then we will use Metropolis within Gibbs. Now, once again, we’ve created our own manual version that you can find at the bottom of this code. You can go through it yourself if you’re interested in looking at the mathematical implementation, or you can just use it as a third party implementation. We need to source the file to be able to use it, so we do that. Now to use an MCMC algorithm, you need to specify both the burn and the samples. The burn is the number of samples that we throw away at the beginning because they are contaminated by the initial random values.
Skip to 2 minutes and 30 seconds And then the samples is the number of samples we collect after the burn to make use of in estimating the probability distributions of the variables of interest. In this case, the missing values in our features. Now this is just a simple implementation of Metropolis within Gibbs. In more complicated ones, there will be additional parameters that you may need to specify. We will just need these two. So let’s run our manual implementation.
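To make the roles of the burn and the samples concrete, here is a small self-contained sketch of a Metropolis within Gibbs sampler for one discrete and one real valued missing item. It is not the course's implementation: the toy model, the numbers in `prior`, `mu` and `shift`, the observed value `y_obs`, and the function name `metropolis_within_gibbs` are all assumptions made up for illustration.

```r
set.seed(1)

# Hypothetical toy model (all numbers invented for illustration):
#   type k in {1, 2, 3}, income x | k ~ Normal(mu[k], 1),
#   observed response y | x, k ~ Normal(x + shift[k], 1).
prior <- c(0.5, 0.3, 0.2)
mu    <- c(2, 5, 8)
shift <- c(0, 1, 2)
y_obs <- 7

# Log of the unnormalised joint density of the two missing items given y_obs.
log_target <- function(k, x) {
  log(prior[k]) + dnorm(x, mu[k], 1, log = TRUE) +
    dnorm(y_obs, x + shift[k], 1, log = TRUE)
}

metropolis_within_gibbs <- function(burn, samples, step = 1) {
  # Random initial values: the reason the first draws are thrown away.
  k <- sample(3, 1)
  x <- rnorm(1, 0, 10)
  type   <- integer(samples)
  income <- numeric(samples)
  for (i in seq_len(burn + samples)) {
    # Gibbs step: draw the discrete item from its exact conditional.
    lw <- sapply(1:3, log_target, x = x)
    k  <- sample(3, 1, prob = exp(lw - max(lw)))
    # Metropolis step: random-walk proposal for the real valued item.
    x_new <- x + rnorm(1, 0, step)
    if (log(runif(1)) < log_target(k, x_new) - log_target(k, x)) x <- x_new
    # Keep the draw only once the burn is over.
    if (i > burn) {
      type[i - burn]   <- k
      income[i - burn] <- x
    }
  }
  data.frame(type = type, income = income)
}

draws <- metropolis_within_gibbs(burn = 100, samples = 200)
```

The discrete item can be sampled exactly because its conditional distribution has only three outcomes to enumerate, whereas the real valued item has no convenient closed-form conditional in general, so it gets a Metropolis accept or reject step instead.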
Skip to 3 minutes and 2 seconds And here we’re generating the samples. We’ve gone through generating 100 burn samples. And now it has generated the 200 samples we’ll use.
Skip to 3 minutes and 14 seconds Have a quick look at them.
Skip to 3 minutes and 19 seconds What we have here are values for the missing items in our features, 200 instances of each.
Skip to 3 minutes and 37 seconds Now, once again, what we’re going to want to do is to create a new weighted data set where all the rows of the original data set that did not contain a missing item will be given a weight of 1. And then those 200 samples will replace the rows with missing items, and each of those will be given a weight of 1 over 200. Once we’ve done that, we’ve got a weighted data set that we can then plug into any supervised learning algorithm that accepts weighted data sets. We will use the linear regression models exactly like last time, using an additive and a multiplicative version. So two linear regression models, one with an additive formula, one with a multiplicative formula.
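The construction of the weighted data set described here can be sketched as below. This is an illustration under stated assumptions: the toy data frame is invented, and random draws stand in for real MCMC samples of the missing items, so only the weighting scheme (1 for complete rows, 1 over the number of samples for each imputed copy) reflects the exercise.

```r
set.seed(2)
n_samples <- 200

# Toy data (made-up values): rows 2 and 4 each have a missing item.
df <- data.frame(
  income   = c(12, NA, 8, 15),
  type     = factor(c("prof", "bc", "wc", NA), levels = c("bc", "prof", "wc")),
  prestige = c(68, 35, 41, 52)
)
complete <- complete.cases(df)

# Complete rows keep a weight of 1.
weighted <- cbind(df[complete, ], weight = 1)

# Each incomplete row is replaced by n_samples copies, each weighted
# 1 / n_samples. Random draws stand in here for the MCMC samples.
for (r in which(!complete)) {
  copies <- df[rep(r, n_samples), ]
  if (is.na(df$income[r])) copies$income <- rnorm(n_samples, 10, 2)
  if (is.na(df$type[r])) {
    copies$type <- factor(sample(levels(df$type), n_samples, replace = TRUE),
                          levels = levels(df$type))
  }
  weighted <- rbind(weighted, cbind(copies, weight = 1 / n_samples))
}
```

With this scheme every original row contributes a total weight of 1, so the weights sum to the number of rows in the original data set.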
Skip to 4 minutes and 37 seconds Once we’ve done that, we can work out the log probabilities of the data given the model, which is to say the log likelihood of the model given the data. And we can calculate the AIC scores using the formula from week two in the evaluation of statistical models article. And we can output that to the console, though like last time it got a bit messy. So maybe we could just directly output the AIC scores for the two models. AIC one, AIC two. Smaller is better. So, just like last time, it turns out that the additive model outperforms the multiplicative model, no doubt in virtue of its simplicity and the small data set.
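A sketch of fitting the two weighted linear regression models and scoring them follows. Several things here are assumptions rather than the course's code: the synthetic data, the helper name `weighted_aic`, the standard form AIC = 2k - 2 log L, and the choice to treat each weight as a fractional count when summing log densities. The formula in the week two article may differ in detail.

```r
set.seed(3)

# Synthetic weighted data (invented for illustration, not the exercise data).
n <- 60
d <- data.frame(
  income = runif(n, 5, 25),
  type   = factor(sample(c("bc", "prof", "wc"), n, replace = TRUE))
)
d$prestige <- 20 + 1.5 * d$income + 10 * (d$type == "prof") + rnorm(n, 0, 5)
d$weight   <- sample(c(1, 1 / 200), n, replace = TRUE, prob = c(0.8, 0.2))

# Additive formula: separate intercepts per type, one shared slope.
additive <- lm(prestige ~ income + type, data = d, weights = weight)
# Multiplicative formula: separate intercepts and separate slopes per type.
multiplicative <- lm(prestige ~ income * type, data = d, weights = weight)

# AIC = 2k - 2 log L, with each row's log density scaled by its weight,
# i.e. weights are treated as fractional counts (an assumption of this sketch).
weighted_aic <- function(fit, data) {
  w      <- data$weight
  res    <- data$prestige - fitted(fit)
  sigma  <- sqrt(sum(w * res^2) / sum(w))   # weighted ML estimate of the sd
  loglik <- sum(w * dnorm(res, 0, sigma, log = TRUE))
  k      <- length(coef(fit)) + 1           # coefficients plus the variance
  2 * k - 2 * loglik
}

aic1 <- weighted_aic(additive, d)
aic2 <- weighted_aic(multiplicative, d)
```

Because the multiplicative model contains the additive one as a special case, its log likelihood can never be lower, so it only wins on AIC if the gain in fit outweighs the penalty for its two extra coefficients.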
Skip to 5 minutes and 32 seconds So the AIC and the BIC appear to agree on this. Now let’s also plot the two resulting models just like we did in the last exercise.
Skip to 5 minutes and 49 seconds Let’s zoom in because the graphs are a little small in this resolution. OK. So just like last time, with the additive formula, the lines have different intercepts but the same slope. With the multiplicative formula, different intercepts and different slopes. Just like last time, the additive formula performs better according to our statistical evaluation. And just like last time, the colours of the points correspond to type: blue collar is red, white collar is green, and professional is blue.
Skip to 6 minutes and 27 seconds And I think we’re done.
Missing Data: Missing Value Imputation Exercise 2 (MCMC)
A code exercise for using the Metropolis within Gibbs MCMC algorithm to impute missing data values. The associated code is in the Missing Data Ex2.R file. Interested students are encouraged to replicate what we go through in the video themselves in R, but note that this is an optional activity intended for those who want practical experience in R and machine learning.
We look at how the Metropolis within Gibbs algorithm can be used to impute missing values of both discrete and real valued variables. Like in the previous exercise, we work through how to use the algorithm to do this, and look at the weighted data the algorithm outputs and how this can be used to fit a statistical model. Again we use our weighted data to generate two OLS models, one with an additive formula and one with a multiplicative formula. We also again use a statistical evaluation method for model selection, this time choosing our preferred model based on the models’ AIC scores.
We have written a manual implementation of the Metropolis within Gibbs MCMC algorithm that can be used to impute missing values of discrete and real valued variables for this data. Interested students can examine it to see the mathematics of this technique and how that mathematics is implemented. Other students can simply use this code as they would any third party library. In either case, this code will need to be sourced before it can be used.
Note that the car R package is used in this exercise. You will need to have it installed on your system. You can install packages using the install.packages function in R.
© Dr Michael Ashcroft