# Advanced Machine Learning: Missing Value Imputation

## Missing Data Imputation

When data is missing at random, the best option is often to *impute*, or estimate, the missing values. We have already seen two powerful approaches that can be used for this: the Expectation Maximization and Metropolis within Gibbs Markov Chain Monte Carlo algorithms. If you need to, re-read the *Clustering: Gaussian Mixture Models* and *Topic Modeling: Latent Dirichlet Allocation* articles in week three, which explain these algorithms. Here we will go through an application of these algorithms to a missing data problem.

## Expectation Maximization

There are a number of ways that the EM algorithm could be applied to impute missing nominal data values. We choose the simplest and most common. Let us assume we have the following data, where '–' marks a missing value:

\(X_1\) | \(X_2\) | \(X_3\) |
---|---|---|
A | a | 2.4 |
A | a | 3.4 |
– | a | 3.1 |
B | a | 1.7 |
B | a | 2.9 |
B | a | 2.1 |
– | – | 1.1 |
A | b | 3.8 |
A | b | 2.9 |
– | b | 4.7 |
B | b | 1.4 |
B | b | 2.4 |
B | – | 2.7 |
– | b | 2.1 |

We parameterize a simple model of the joint distribution. \(X_1\) is modeled with a single parameter:

\(X_1\)=A | \(X_1\)=B |
---|---|
\(p_1\) | \(1-p_1\) |

\(X_2\) is modeled conditional on \(X_1\):

\(X_1\) | \(X_2\)=a | \(X_2\)=b |
---|---|---|
A | \(p_2\) | \(1-p_2\) |
B | \(p_3\) | \(1-p_3\) |

\(X_3\) is modeled as normally distributed, conditional on \(X_1\) and \(X_2\):

\(X_1\) | \(X_2\) | \(X_3\): \(\mu\) | \(X_3\): \(\sigma^2\) |
---|---|---|---|
A | a | \(\mu_1\) | \(\sigma_1^2\) |
A | b | \(\mu_2\) | \(\sigma_2^2\) |
B | a | \(\mu_3\) | \(\sigma_3^2\) |
B | b | \(\mu_4\) | \(\sigma_4^2\) |
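Written out, the probability the model assigns to a complete row \((x_1, x_2, x_3)\) is the product of one entry from each table, and it is exactly this product that appears in the weight calculations below:

\[
P(x_1, x_2, x_3) = P(X_1 = x_1)\, P(X_2 = x_2 \mid X_1 = x_1)\, \mathcal{N}(x_3 \mid \mu_{x_1 x_2}, \sigma_{x_1 x_2}^2)
\]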

The EM algorithm starts with initial values for these parameters:

Parameter | Initial Value |
---|---|
\(p_1\) | .4 |
\(p_2\) | .5 |
\(p_3\) | .6 |
\(\mu_1\) | 2.9 |
\(\mu_2\) | 3.35 |
\(\mu_3\) | 2.23 |
\(\mu_4\) | 1.9 |
\(\sigma_1^2\) | .5 |
\(\sigma_2^2\) | .41 |
\(\sigma_3^2\) | .37 |
\(\sigma_4^2\) | .5 |
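To make the model density concrete, here is a minimal Python sketch using these initial values; all names are our own, not from any library. One assumption to flag: the worked weights in the table below appear to come out as printed if the spread parameters above are treated as standard deviations rather than variances, so that is what this sketch does.

```python
import math

# Initial parameter values from the table above.
p1 = 0.4                # P(X1 = A)
p2, p3 = 0.5, 0.6       # P(X2 = a | X1 = A) and P(X2 = a | X1 = B)
mu    = {("A", "a"): 2.9, ("A", "b"): 3.35, ("B", "a"): 2.23, ("B", "b"): 1.9}
# Spread of X3 per (X1, X2) group, treated here as standard deviations
# (an assumption: this choice reproduces the article's worked weights).
sigma = {("A", "a"): 0.5, ("A", "b"): 0.41, ("B", "a"): 0.37, ("B", "b"): 0.5}

def normal_pdf(x, m, s):
    """Density at x of a normal with mean m and standard deviation s."""
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def row_probability(x1, x2, x3):
    """P(X1=x1) * P(X2=x2 | X1=x1) * N(x3; mu, sigma) for a complete row."""
    prob_x1 = p1 if x1 == "A" else 1 - p1
    prob_x2 = p2 if x1 == "A" else p3
    if x2 == "b":
        prob_x2 = 1 - prob_x2
    return prob_x1 * prob_x2 * normal_pdf(x3, mu[(x1, x2)], sigma[(x1, x2)])

print(row_probability("A", "a", 3.1))  # ~= (.4)(.5)(.74), i.e. ~0.147
```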

We can now perform the E step. Each row with one or more missing nominal values is expanded into one weighted row per possible completion, where the weight is the conditional probability of that completion given the observed values and the current parameters. Rows with no missing values keep a weight of 1. With the initial parameter values this gives:

\(X_1\) | \(X_2\) | \(X_3\) | Weight |
---|---|---|---|
A | a | 2.4 | 1 |
A | a | 3.4 | 1 |
A | a | 3.1 | \(\frac{(.4)(.5)(.74)}{(.4)(.5)(.74)+(.6)(.6)(.07)} = .85\) |
B | a | 3.1 | \(\frac{(.6)(.6)(.07)}{(.4)(.5)(.74)+(.6)(.6)(.07)} = .15\) |
B | a | 1.7 | 1 |
B | a | 2.9 | 1 |
B | a | 2.1 | 1 |
A | a | 1.1 | .0043 |
A | b | 1.1 | \(6.82 \times 10^{-7}\) |
B | a | 1.1 | .067 |
B | b | 1.1 | .93 |
A | b | 3.8 | 1 |
A | b | 2.9 | 1 |
A | b | 4.7 | .99996 |
B | b | 4.7 | .00004 |
B | b | 1.4 | 1 |
B | b | 2.4 | 1 |
B | a | 2.7 | .77 |
B | b | 2.7 | .23 |
A | b | 2.1 | .009 |
B | b | 2.1 | .991 |
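Here is a sketch of how those weights arise, repeating the model density from the previous snippet so that it stands alone: each candidate completion's joint probability is normalized over all completions consistent with the row's observed values.

```python
import math

p1, p2, p3 = 0.4, 0.5, 0.6
mu    = {("A", "a"): 2.9, ("A", "b"): 3.35, ("B", "a"): 2.23, ("B", "b"): 1.9}
sigma = {("A", "a"): 0.5, ("A", "b"): 0.41, ("B", "a"): 0.37, ("B", "b"): 0.5}

def row_probability(x1, x2, x3):
    # Same model density as in the previous sketch.
    prob_x1 = p1 if x1 == "A" else 1 - p1
    prob_x2 = p2 if x1 == "A" else p3
    if x2 == "b":
        prob_x2 = 1 - prob_x2
    s = sigma[(x1, x2)]
    pdf = math.exp(-((x3 - mu[(x1, x2)]) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    return prob_x1 * prob_x2 * pdf

def completion_weights(x1, x2, x3):
    """E step for one row: enumerate all completions of the missing
    nominal values (passed as None) and normalize their probabilities."""
    candidates = [(a, b)
                  for a in ([x1] if x1 is not None else ["A", "B"])
                  for b in ([x2] if x2 is not None else ["a", "b"])]
    probs = {c: row_probability(c[0], c[1], x3) for c in candidates}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

print(completion_weights(None, "a", 3.1))   # ~{('A','a'): .85, ('B','a'): .15}
print(completion_weights(None, None, 1.1))  # four rows, ~.0043 / .067 / .93 / tiny
```

Small differences from the table (for example .857 rather than .85) come from rounding in the article's intermediate values.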

In the M step, the parameters are re-estimated from this weighted table: weighted proportions for \(p_1\), \(p_2\) and \(p_3\), and weighted means and variances for the parameters of the normal distributions. The E and M steps are then repeated until the parameter values converge. The final weighted dataset can be used with any supervised learning algorithm that accepts instance weights, at least *in theory*; if you are relying on a third party implementation you will have to check whether such functionality is supported.
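As one example, many scikit-learn estimators accept per-row weights through the `sample_weight` argument of `fit`. Here is a minimal sketch using a few of the completed rows above; the 0/1 encoding of \(X_2\), the choice of classifier, and the use of \(X_1\) as the target are our own illustrative choices.

```python
from sklearn.tree import DecisionTreeClassifier

# A few completed rows from the E-step table: features are (X2, X3) with
# X2 encoded as a=0 / b=1, the target is X1, and sample_weight carries
# the imputation weights.
X = [[0, 2.4], [0, 3.4], [0, 3.1], [0, 3.1], [0, 1.7], [1, 4.7], [1, 4.7]]
y = ["A", "A", "A", "B", "B", "A", "B"]
w = [1.0, 1.0, 0.85, 0.15, 1.0, 0.99996, 0.00004]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y, sample_weight=w)
print(clf.predict([[0, 3.0]]))  # classify a new row with X2=a, X3=3.0
```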

## Metropolis within Gibbs MCMC

In this case we will look at a situation where we have missing real values too. Let us assume we have the following data:

\(X_1\) | \(X_2\) | \(X_3\) |
---|---|---|
A | a | 2.4 |
A | a | 3.4 |
– | a | 3.1 |
B | a | 1.7 |
B | a | 2.9 |
B | a | 2.1 |
– | – | 1.1 |
A | b | 3.8 |
A | b | 2.9 |
– | b | 4.7 |
B | b | – |
B | b | 2.4 |
B | – | – |
– | b | 2.1 |

We use the same model as before. To get started, we assign an initial value to each missing field, giving, for example:

\(X_1\) | \(X_2\) | \(X_3\) |
---|---|---|
A | a | 2.4 |
A | a | 3.4 |
A | a | 3.1 |
B | a | 1.7 |
B | a | 2.9 |
B | a | 2.1 |
B | b | 1.1 |
A | b | 3.8 |
A | b | 2.9 |
B | b | 4.7 |
B | b | 1.8 |
B | b | 2.4 |
B | a | 2.9 |
A | b | 2.1 |

Each iteration, we visit each missing value in turn and propose a candidate value to replace the one currently assigned. For the nominal variables \(X_1\) and \(X_2\), which each take only two values, the candidate function can simply propose the *other* value to the one currently assigned. For \(X_3\), we can use a candidate function that, for example, adds to the currently assigned value a number sampled from a normal distribution with mean 0 and variance 0.5.

We then work out the values we would estimate for the parameters of our three distributions given (i) the currently assigned value, and (ii) the candidate value. Using these we work out two probabilities: \(P_{old}\), the probability of the values in that row using the currently assigned value and the parameters estimated from it; and \(P_{new}\), the probability of the values in that row using the candidate value and the parameters estimated from it. We replace the currently assigned value with the candidate value with probability \(\min(1, \frac{P_{new}}{P_{old}})\).

Having done this for all missing values, we have generated our first sample. We then proceed to generate many, many samples, throwing away the first \(n\) since they are contaminated by our initial assignment. The retained samples are used to create a weighted dataset that we can use with a supervised learning algorithm, as in the EM case. Since each sample consists of the entire dataset (with missing values completed), each row in each sample has a weight of 1 over the number of samples retained.

Deciding how many samples to take and how many initial samples to throw away (as well as other options, such as whether to 'thin' the samples) is an art form in itself. We suggest the interested student refer to the literature, of which there is a huge amount.
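To tie the pieces together, here is a minimal, self-contained Python sketch of this procedure. It assumes maximum-likelihood re-estimation of the parameters from the fully assigned dataset and uses the initial assignment from the table above; the burn-in and sample counts, and all names, are our own illustrative choices rather than recommendations.

```python
import math
import random

# The dataset above, with None marking a missing value.
data = [["A", "a", 2.4], ["A", "a", 3.4], [None, "a", 3.1], ["B", "a", 1.7],
        ["B", "a", 2.9], ["B", "a", 2.1], [None, None, 1.1], ["A", "b", 3.8],
        ["A", "b", 2.9], [None, "b", 4.7], ["B", "b", None], ["B", "b", 2.4],
        ["B", None, None], [None, "b", 2.1]]
missing = [(r, c) for r, row in enumerate(data)
           for c, v in enumerate(row) if v is None]

# The initial assignment from the table above.
init = {(2, 0): "A", (6, 0): "B", (6, 1): "b", (9, 0): "B",
        (10, 2): 1.8, (12, 1): "a", (12, 2): 2.9, (13, 0): "A"}
for (r, c), v in init.items():
    data[r][c] = v

def estimate_parameters(rows):
    """Maximum-likelihood estimates of all parameters from fully assigned data."""
    p1 = sum(r[0] == "A" for r in rows) / len(rows)
    p2 = sum(r[:2] == ["A", "a"] for r in rows) / sum(r[0] == "A" for r in rows)
    p3 = sum(r[:2] == ["B", "a"] for r in rows) / sum(r[0] == "B" for r in rows)
    normals = {}
    for g in [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")]:
        xs = [r[2] for r in rows if (r[0], r[1]) == g]
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / len(xs)
        normals[g] = (m, max(v, 1e-6))   # floor the variance for safety
    return p1, p2, p3, normals

def row_probability(row, params):
    """Probability of one row under the model, as in the EM section."""
    p1, p2, p3, normals = params
    x1, x2, x3 = row
    pa = p1 if x1 == "A" else 1 - p1
    pb = p2 if x1 == "A" else p3
    if x2 == "b":
        pb = 1 - pb
    m, v = normals[(x1, x2)]
    return pa * pb * math.exp(-((x3 - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

def sweep(rows):
    """One Metropolis-within-Gibbs pass over all missing values."""
    for r, c in missing:
        old = rows[r][c]
        if c == 0:                        # nominal: propose the other value
            cand = "B" if old == "A" else "A"
        elif c == 1:
            cand = "b" if old == "a" else "a"
        else:                             # real: random walk with variance 0.5
            cand = old + random.gauss(0, math.sqrt(0.5))
        p_old = row_probability(rows[r], estimate_parameters(rows))
        rows[r][c] = cand
        p_new = row_probability(rows[r], estimate_parameters(rows))
        ratio = p_new / p_old if p_old > 0 else 1.0
        if random.random() >= min(1, ratio):
            rows[r][c] = old              # reject: restore the current value

random.seed(0)
burn_in, kept = 200, 500                  # illustrative choices only
samples = []
for i in range(burn_in + kept):
    sweep(data)
    if i >= burn_in:
        samples.append([row[:] for row in data])
# Each row of each retained sample gets weight 1 / len(samples).
print(len(samples), "samples; e.g. the completion of the (B, -, -) row:",
      samples[-1][12])
```

Re-estimating all parameters from scratch for every proposal is wasteful; an incremental update would be used in practice, but it keeps the sketch close to the description above.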
