• # Advanced Machine Learning: Missing Value Imputation

We examine how Expectation Maximization and MCMC techniques can be used for missing data imputation in advanced machine learning.

## Missing Data Imputation

When data is missing at random the best option is often to impute, or estimate, the missing values. We have already seen two powerful approaches that can be used for this: Expectation Maximization and the Metropolis within Gibbs Markov Chain Monte Carlo algorithms. If you need to, re-read the Clustering: Gaussian Mixture Models and Topic Modeling: Latent Dirichlet Allocation articles in week three that explain these algorithms. Here we will go through an application of these algorithms to a missing data problem.

## Expectation Maximization

There are a number of ways that the EM algorithm could be applied to impute missing nominal data values. We choose the simplest and most common. Let us assume we have the following data:
$$X_1$$ $$X_2$$ $$X_3$$
A a 2.4
A a 3.4
a 3.1
B a 1.7
B a 2.9
B a 2.1
1.1
A b 3.8
A b 2.9
b 4.7
B b 1.4
B b 2.4
B 2.7
b 2.1
So we have three features: two nominal valued features, $$X_1$$ and $$X_2$$, and a real valued feature, $$X_3$$. We specify some parameteric statistical model that relates the features. Using the chain rule of probability, we know that $$P(X_1,X_2,X_3)=P(X_1)P(X_2$$|$$X_1)P(X_3$$|$$X_1,X_2)$$ allowing us to work with three conditional distributions.
In this case, we will assume a conditional Gaussian relationship between the nominal features and the real valued feature $$X_3$$. We also assume that nominal features can only take values that are seen. We will model them with appropriate categorical/conditional categorical distributions.
Accordingly, we have the following distributions:
$P(X_1)$
 $$X_1$$=A $$X_1$$=B $$p_1$$ 1-$$p_1$$
$$P(X_2$$|$$X_1)$$
 $$X_1$$ $$X_2$$=a $$X_2$$=b A $$p_2$$ 1-$$p_2$$ B $$p_3$$ 1-$$p_3$$
$$P(X_3$$|$$X_1,X_2)$$
 $$X_1$$ $$X_2$$ $$X_3 : \mu$$ $$X_3 : \sigma^2$$ A a $$\mu_1$$ $$\sigma_1^2$$ A b $$\mu_2$$ $$\sigma_2^2$$ B a $$\mu_3$$ $$\sigma_3^2$$ B b $$\mu_4$$ $$\sigma_4^2$$
We provide some initial values for these parameters using standard estimation techniques from the known data (values rounded to two decimal spaces where required):
 Parameter Initial Value $$p_1$$ .4 $$p_2$$ .5 $$p_3$$ .6 $$\mu_1$$ 2.9 $$\mu_2$$ 3.35 $$\mu_3$$ 2.23 $$\mu_4$$ 1.9 $$\sigma_1^2$$ .5 $$\sigma_2^2$$ .41 $$\sigma_3^2$$ .37 $$\sigma_4^2$$ .5
The EM algorithm will proceed to a local optimum based on these initial values, and in some cases this will not be the global optimum so it can be worthwhile running the algorithm multiple times starting with randomly assigned initial values and chosing the best solution. A solution can be evaluated by calculating the probability of the known data given the parameter values arrived at by the EM algorithm.
We begin by (i) providing all possible completions of the data, and (ii) providing wieghts for each row in this new dataset, where the weight is the probability density of the ‘completed’ missing values otherwise, where this is calculated using the probability distributions and parameter values above (using Bayes theorem to take into account $$X_3$$, which means dividing the total by the sum of all the probability density values for all completions of the row). Note that this means the weight is 1 if there are no missing values in the row. See the first case of rows involving completed values in the data below for an example of how the values were calculated.
$$X_1$$ $$X_2$$ $$X_3$$ Weight
A a 2.4 1
A a 3.4 1
A a 3.1 $$\frac{(.4)(.5)(.74)}{(.4)(.5)(.74)+(.6)(.6)(.07)} = .85$$
B a 3.1 $$\frac{(.6)(.6)(.07)}{(.4)(.5)(.74)+(.6)(.6)(.07)} = .15$$
B a 1.7 1
B a 2.9 1
B a 2.1 1
A a 1.1 .0043
A b 1.1 $$6.82e^{-7}$$
B a 1.1 .067
B b 1.1 .93
A b 3.8 1
A b 2.9 1
A b 4.7 .99996
B b 4.7 .00004
B b 1.4 1
B b 2.4 1
B a 2.7 .77
B b 2.7 .23
A b 2.1 .009
B b 2.1 .991
We then proceed to iteratively (i) recalculate the parameters given the weighted data (the expectation step of the EM algorithm), and (ii) recalculate the weights given the calculated parameters (the maximization step of the EM algorithm). Eventually we will converge on a given set of weight values, which provide us with a weighted data set we can use in a supervised learning algorithm. Note that while almost all supervised learning algorithms can deal with weighted data in theory, if you are relying on a third party implementation you will have to check whether such functionality is supported.

## Metropolis within Gibbs MCMC

In this case we will look at a situation where we have missing real values too. Let us assume we have the following data:
$$X_1$$ $$X_2$$ $$X_3$$
A a 2.4
A a 3.4
a 3.1
B a 1.7
B a 2.9
B a 2.1
1.1
A b 3.8
A b 2.9
b 4.7
B b
B b 2.4
B
b 2.1
Once again, the algorithm requires that we specify some parameteric statistical model that relates the features and we assume the same as in the EM example above.
The first step in the Metropolis within Gibbs algorithm is to assign random (valid) values to all unknown cases. We might assign $$X_3$$ values by sampling from a uniform distribution within the interval given by the known $$X_3$$ values. Let us assume we get the following (assigned values in italics):
$$X_1$$ $$X_2$$ $$X_3$$
A a 2.4
A a 3.4
A a 3.1
B a 1.7
B a 2.9
B a 2.1
B b 1.1
A b 3.8
A b 2.9
B b 4.7
B b 1.8
B b 2.4
B a 2.9
A b 2.1
We then proceed iteratively through the unknown/assigned values. These are $$X_1$$ row three, $$X_1$$ row seven, $$X_2$$ row seven, $$X_1$$ row ten, $$X_3$$ row eleven, $$X_2$$ row thirteen, $$X_3$$ row thirteen, and $$X_1$$ row fourteen.
For each, we generate a candidate value different from the currently assigned value (all other missing values are assumed to take whatever their currently assigned value is). Candidate functions must be symmetric, so for variables $$X_1$$ and $$X_2$$ our candidate function can generate the other value to the one currently assigned. For $$X_3$$, we can use a candidate function that, for example, adds to the currently assigned value a number sampled from a normal distribution with mean 0 and variance 0.5.
We then work out the values we would estimate for the parameters of our three distributions given (i) the currently assigned value; and (ii) the candidate value. Using these we would work out the probabilities of $$P_{old}$$, the values in that row with the currently assigned value and the parameters estimated using the currently assigned value, and $$P_{new}$$, the values in that row with the candidate value replacing the currently assigned value and the parameters estimated using the candidate value. We replace the currently assigned value with the candidate value with the probability $$min(1,\frac{P_{new}}{P_{old}})$$.
Having done that for all missing values, we have generated our first sample. We would proceed to generate many, many samples! And we would throw away the first $$n$$ samples since they are contaminated by our initial assignment. We would use the retained samples to create a weighted dataset that we could use with a supervised learning algorithm as in the EM case. Since each sample consists of the entire dataset (with missing values completed), each row in each sample would have a weight of 1 over the number of samples taken.
Deciding the number of samples to take, and how many initial samples to throw away (as well as other options, such as whether to ‘thin’ the samples) is an artform in itself. We suggest the interested student refer to the literature, of which there is a huge amount.  