Missing Data Basics
We now turn to the problem of missing data and how to address it. Data is missing when some values in your dataset are absent or unknown. This is a problem because most machine learning algorithms require complete training data, that is, data without missing values. We will first look at the different types of missing data, and then at some methods for dealing with it so that you can use your data with machine learning algorithms.
Types of Missing Data
There are three types of missing data. It is important to know which type you are dealing with when addressing missing values in your dataset.
Missing Completely at Random (MCAR)
Data values are MCAR if the event that causes them to be missing is independent of all the variables in the data. Accordingly, there is no relationship between the presence of missing values and the values of either the present or absent variables.
In such a situation, the rows that have missing values can be discarded without biasing the data. Effectively, using only the complete rows amounts to using a random subset of the original data.
Unfortunately, data is seldom MCAR. For example, you might think that if we had a dataset containing humidity and temperature readings, with occasional missing values in the humidity readings due to (say) electricity outages, then these values would be MCAR. But if there is any pattern in the electricity outages - perhaps they occur during day-time peak use - that is at all correlated with the readings from the humidity or temperature sensors, then the data is not MCAR.
An example of MCAR would be if we printed out the data in random order and then spilled coffee on some rows. If we tried to use the printed dataset and could not read the humidity values because of the coffee stains, these values would be MCAR.
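The coffee-stain scenario can be simulated to check the claim that complete rows form a random subset. The following is a minimal sketch, assuming made-up sensor readings and a 20% stain probability (all numbers and variable names are illustrative, not taken from a real dataset). It uses only the Python standard library.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 50_000

# Hypothetical readings: humidity falls as temperature rises (assumed relationship).
temperature = [rng.gauss(30.0, 8.0) for _ in range(n)]
humidity = [80.0 - 0.9 * t + rng.gauss(0.0, 5.0) for t in temperature]

# MCAR: every humidity reading is lost with the same 20% chance,
# whatever the temperature or humidity happens to be.
humidity_observed = [None if rng.random() < 0.2 else h for h in humidity]

# Keep only the rows where humidity can still be read.
complete_rows = [(t, h) for t, h in zip(temperature, humidity_observed) if h is not None]
complete_temperature = [t for t, _ in complete_rows]
complete_humidity = [h for _, h in complete_rows]

print(f"temperature mean: all rows {fmean(temperature):.2f}, "
      f"complete rows {fmean(complete_temperature):.2f}")
print(f"humidity mean:    all rows {fmean(humidity):.2f}, "
      f"complete rows {fmean(complete_humidity):.2f}")
```

Under these assumptions the two pairs of means should agree closely, which is what we expect when the discarded rows are a random sample of the data.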
Missing at Random (MAR)
Surprisingly, data values that are MAR are not missing at random - there is a pattern present! Rather, we say data is MAR if the event that caused the values to be missing is independent of the variable whose values are missing, given the other, non-missing, variables. For example, if a faulty humidity sensor fails when and only when the temperature is above 50 degrees, and both humidity and temperature are in the data, then the missing humidity readings are MAR. Likewise, the readings are MAR if the sensor fails because of electricity outages, provided these outages are independent of the humidity level given the temperature.
With MAR data, we can use the observed values in the data to provide an unbiased estimate of the missing values. The quality of our estimates will, of course, depend on how much information the other variables contain about the missing values.
Throwing away rows that contain MAR data will result in a biased dataset. In our example, it would make it appear that the temperature readings were never higher than 50 degrees, even though we actually observed such readings!
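Both points can be illustrated with a small simulation of the faulty sensor. This is a minimal sketch under stated assumptions: the readings are synthetic, humidity is assumed to depend linearly on temperature, and `fit_line` is a hypothetical helper written for this example. Simple regression stands in here for "using the observed values"; it is one possible estimator, not the only one.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

rng = random.Random(1)
n = 50_000
temperature = [rng.gauss(40.0, 10.0) for _ in range(n)]
humidity = [90.0 - t + rng.gauss(0.0, 5.0) for t in temperature]

# MAR: the humidity sensor fails exactly when temperature > 50,
# and temperature itself is always recorded.
humidity_observed = [None if t > 50.0 else h for t, h in zip(temperature, humidity)]

# Option 1: throw away incomplete rows.
kept_t = [t for t, h in zip(temperature, humidity_observed) if h is not None]
kept_h = [h for h in humidity_observed if h is not None]
max_kept_temperature = max(kept_t)
deleted_mean = fmean(kept_h)

# Option 2: predict the missing humidity from the observed temperature.
slope, intercept = fit_line(kept_t, kept_h)
filled = [intercept + slope * t if h is None else h
          for t, h in zip(temperature, humidity_observed)]
filled_mean = fmean(filled)

true_mean = fmean(humidity)
print(f"hottest temperature left after deletion: {max_kept_temperature:.1f}")
print(f"humidity mean: true {true_mean:.2f}, "
      f"after deletion {deleted_mean:.2f}, after regression fill {filled_mean:.2f}")
```

In this simulation the regression fill works because the linear relationship was built in and continues to hold above 50 degrees; with real data that would be an assumption to check. Filling with predicted values also understates the spread of humidity, since the predictions carry no noise.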
Missing not at Random (MNAR)
Data values are MNAR if the event that causes a value to be missing depends on the variable whose values are missing, even after the observed variables are taken into account. For instance, the humidity readings are MNAR if the humidity sensor fails whenever the humidity is high, or if it fails because of power outages that are caused by high humidity.
There is no way to deal with MNAR values that completely avoids bringing bias into the data. In reality, many data scientists use various methods for dealing with missing values regardless and hope for the best, but you have been warned!
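To see why the warning matters, the sketch below simulates a sensor that fails whenever humidity exceeds 60 and then tries the same two remedies: deleting rows, and predicting the missing values from temperature. The data, the threshold of 60 and the `fit_line` helper are all assumptions made for illustration.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

rng = random.Random(2)
n = 50_000
temperature = [rng.gauss(40.0, 10.0) for _ in range(n)]
humidity = [90.0 - t + rng.gauss(0.0, 5.0) for t in temperature]

# MNAR: whether a humidity value is missing depends on that value itself.
humidity_observed = [None if h > 60.0 else h for h in humidity]

kept_t = [t for t, h in zip(temperature, humidity_observed) if h is not None]
kept_h = [h for h in humidity_observed if h is not None]
deleted_mean = fmean(kept_h)

# Regression fill, fitted on the rows that survived.
slope, intercept = fit_line(kept_t, kept_h)
filled = [intercept + slope * t if h is None else h
          for t, h in zip(temperature, humidity_observed)]
filled_mean = fmean(filled)

true_mean = fmean(humidity)
print(f"humidity mean: true {true_mean:.2f}, "
      f"after deletion {deleted_mean:.2f}, after regression fill {filled_mean:.2f}")
```

Under these assumptions both estimates fall below the true mean. The regression fill is closer than deletion, because temperature carries some information about humidity, but it cannot recover what the missingness mechanism has hidden.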
Dealing with Missing Data
Given the types of missing data, how do people deal with missing values? Here we divide missing data techniques into three types.
One approach is to remove either the rows or the columns that contain missing values. Removing a column, i.e. a variable, obviously reduces the number of variables being used. Removing rows also reduces the information available in the data, since you throw away the non-missing values of the observed variables in those rows as well.
Removing the whole variable will result in an unbiased dataset, but one which might contain a lot less information (if the variable was important for the problem). Throwing away rows will result in unbiased data in the MCAR case. Otherwise it will bias the data.
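The two kinds of removal can be sketched on a tiny table. The rows below are invented for illustration, with None marking a missing humidity reading.

```python
# A hypothetical five-row dataset; None marks a missing value.
rows = [
    {"temperature": 21.5, "humidity": 60.2},
    {"temperature": 35.0, "humidity": None},
    {"temperature": 52.3, "humidity": None},
    {"temperature": 48.1, "humidity": 41.0},
    {"temperature": 55.7, "humidity": None},
]

# Row removal: keep only rows in which every value is present.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Column removal: keep only variables that are present in every row.
complete_columns = [c for c in rows[0] if all(r[c] is not None for r in rows)]
columns_removed = [{c: r[c] for c in complete_columns} for r in rows]

print(len(complete_rows))   # 2 rows survive, and three temperature readings are lost
print(complete_columns)     # ['temperature']
print(len(columns_removed)) # all 5 rows survive, but humidity is gone entirely
```

If you work with pandas, the same two operations are available as DataFrame.dropna with axis=0 (rows) and axis=1 (columns).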
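A second approach is to estimate the missing values with basic statistics, such as the mean or mode of the variable whose values are missing. This is unfortunately popular but is, in general, a bad idea, and we suggest it be avoided: filling every gap with the same value artificially shrinks the variable's variance and weakens its relationships with the other variables.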
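A minimal sketch of what mean imputation does to a variable, even in the most favourable (MCAR) case. The data is synthetic, the 30% missing rate is an assumption, and `std` and `correlation` are small helpers written for this example.

```python
import math
import random
from statistics import fmean

def std(xs):
    """Population standard deviation."""
    m = fmean(xs)
    return math.sqrt(fmean([(x - m) ** 2 for x in xs]))

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = fmean(xs), fmean(ys)
    cov = fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (std(xs) * std(ys))

rng = random.Random(4)
n = 50_000
temperature = [rng.gauss(30.0, 8.0) for _ in range(n)]
humidity = [80.0 - 0.9 * t + rng.gauss(0.0, 5.0) for t in temperature]

# MCAR: 30% of humidity readings are lost at random.
humidity_observed = [None if rng.random() < 0.3 else h for h in humidity]

# Mean imputation: replace every missing value with the observed mean.
observed_mean = fmean([h for h in humidity_observed if h is not None])
humidity_filled = [observed_mean if h is None else h for h in humidity_observed]

print(f"mean:        true {fmean(humidity):.2f}, filled {fmean(humidity_filled):.2f}")
print(f"std dev:     true {std(humidity):.2f}, filled {std(humidity_filled):.2f}")
print(f"correlation: true {correlation(temperature, humidity):.3f}, "
      f"filled {correlation(temperature, humidity_filled):.3f}")
```

In this setup the mean survives, but the standard deviation and the correlation with temperature are both pulled towards zero, which is the kind of distortion a downstream model would inherit.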
A final set of approaches attempts to estimate the missing values using advanced statistical methods, such as expectation maximization or MCMC sampling. This will produce unbiased data in the MCAR and MAR cases, but will result in biased data in the MNAR case. As noted above, it is often used in this final case anyway, on the assumption that the bias will be small and will do less harm than the information lost by discarding the variables with missing values.
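As a taste of the model-based approach, here is a minimal sketch of expectation maximization for the simplest case we could think of: two variables assumed to be jointly normal, with temperature always observed and humidity MAR given temperature. The function name `em_bivariate_normal`, the synthetic data and the iteration count are assumptions made for this illustration; it is not necessarily the formulation used in later steps.

```python
import random
from statistics import fmean

def em_bivariate_normal(xs, ys, n_iter=50):
    """EM estimates of mean(y), var(y) and cov(x, y) when some ys are None.

    Assumes (x, y) is bivariate normal, x is fully observed, y is MAR given x.
    """
    mu_x = fmean(xs)
    s_xx = fmean([(x - mu_x) ** 2 for x in xs])
    # Start from the complete-case estimates.
    obs_x = [x for x, y in zip(xs, ys) if y is not None]
    obs_y = [y for y in ys if y is not None]
    mean_obs_x = fmean(obs_x)
    mu_y = fmean(obs_y)
    s_yy = fmean([(y - mu_y) ** 2 for y in obs_y])
    s_xy = fmean([(x - mean_obs_x) * (y - mu_y) for x, y in zip(obs_x, obs_y)])
    for _ in range(n_iter):
        # E-step: expected y and y^2 for every row under the current parameters.
        beta = s_xy / s_xx
        cond_var = s_yy - beta * s_xy  # variance of y given x
        ey, ey2 = [], []
        for x, y in zip(xs, ys):
            if y is None:
                m = mu_y + beta * (x - mu_x)  # conditional mean of y given x
                ey.append(m)
                ey2.append(m * m + cond_var)
            else:
                ey.append(y)
                ey2.append(y * y)
        # M-step: re-estimate parameters from the expected sufficient statistics.
        mu_y = fmean(ey)
        s_yy = fmean(ey2) - mu_y ** 2
        s_xy = fmean([x * e for x, e in zip(xs, ey)]) - mu_x * mu_y
    return mu_y, s_yy, s_xy

rng = random.Random(3)
n = 20_000
temperature = [rng.gauss(40.0, 10.0) for _ in range(n)]
humidity = [90.0 - t + rng.gauss(0.0, 5.0) for t in temperature]
# MAR: humidity is missing exactly when temperature > 50.
humidity_observed = [None if t > 50.0 else h for t, h in zip(temperature, humidity)]

em_mean, em_var, em_cov = em_bivariate_normal(temperature, humidity_observed)
complete_case_mean = fmean([h for h in humidity_observed if h is not None])
true_mean = fmean(humidity)
print(f"humidity mean: true {true_mean:.2f}, "
      f"complete rows only {complete_case_mean:.2f}, EM {em_mean:.2f}")
```

Note that EM here estimates the parameters of the joint distribution rather than filling in individual cells; imputed values can then be drawn from the fitted conditional distribution of humidity given temperature. The good behaviour in this run relies on the normality and MAR assumptions holding, which they do by construction in the simulation.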
In the next steps we will look at the use of expectation maximization and Metropolis-within-Gibbs MCMC for estimating missing values in data. If you need to, read the Clustering: Gaussian Mixture Models and Topic Modeling: Latent Dirichlet Allocation articles from week 3 to refresh your memory of how these algorithms work.
© Dr Michael Ashcroft