Dealing with missing data

A summary of different ways to deal with missing data
Figure: A table of data containing numbers and text, with numbered rows and labelled columns. One data item is highlighted in yellow and labelled NaN.

We’ve seen in the previous videos various ways in which we might encounter missing data, and different ways in which we can handle this.

In this article we look at some of those ways of dealing with missing data, and how they can be implemented using Python.

Representing missing values

Missing data can be represented in several ways, depending on the context and on the computing platform on which the data was produced or recorded. These are the common ones you’ll encounter in Python.

NaN

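This stands for ‘Not a Number’. In NumPy it is produced when a floating-point calculation has no defined numerical answer, for example dividing zero by zero, so a NaN flag is stored in place of a result. (Plain Python raises a ZeroDivisionError for a division by zero instead, although NaN is still available as float('nan').) Pandas also uses NaN to mark missing entries in numeric columns.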
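As a small illustration of where NaN comes from and how it behaves, the sketch below compares plain Python with NumPy. The variable names are only for illustration, and np.errstate is used here just to silence the warnings NumPy would otherwise print.

```python
import numpy as np

# Plain Python raises an exception on division by zero; it never returns NaN.
try:
    1 / 0
    plain_python_result = "no error"
except ZeroDivisionError:
    plain_python_result = "ZeroDivisionError"
print(plain_python_result)

# NumPy floating-point division follows IEEE 754 rules instead.
with np.errstate(divide="ignore", invalid="ignore"):
    zero_over_zero = np.float64(0.0) / np.float64(0.0)  # undefined result
    one_over_zero = np.float64(1.0) / np.float64(0.0)   # infinite result

print(np.isnan(zero_over_zero))  # True
print(np.isinf(one_over_zero))   # True

# NaN is not equal to anything, including itself,
# so test for it with np.isnan() rather than ==.
nan_equals_itself = bool(zero_over_zero == zero_over_zero)
print(nan_equals_itself)         # False
```

Because NaN never compares equal to itself, a check such as value == np.nan will always be False, which is why the dedicated test functions exist.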

None

None is a built-in Python object that means ‘there is nothing here’, or ‘this is an empty value’. Unlike NaN, which is a special floating-point value, None is not a number at all.
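In practice Pandas treats both markers as missing. The short sketch below (with a made-up heights Series) shows a None being stored as NaN in a numeric column, and pd.isna() recognising either form.

```python
import numpy as np
import pandas as pd

# A None placed in a column of numbers is stored as NaN,
# so the column keeps a floating-point dtype.
heights = pd.Series([2.2, None, 5.1])
print(heights.dtype)              # float64

# pd.isna() reports both None and NaN as missing.
print(pd.isna(None))              # True
print(pd.isna(np.nan))            # True

# isna() on a Series gives one True/False flag per entry.
missing_mask = heights.isna()
print(missing_mask.tolist())      # [False, True, False]
```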

In other programming languages this concept might be referred to as:

NA

or

NULL

NA in particular is used by the programming language R, a common platform for analysing data, and means ‘Not Available’. Pandas is able to read in NA values and interpret them as missing data.
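To see this in action, the sketch below reads a small made-up CSV from a string. It assumes the default behaviour of pd.read_csv(), which recognises markers such as NA and NULL. The word "missing" is a hypothetical custom marker, included to show that anything outside the default list has to be declared with the na_values argument.

```python
import io
import pandas as pd

# Made-up data: one "NA", one "NULL" and one custom marker "missing".
csv_text = """names,height,yield
1,2.2,4.5
2,NA,5.2
3,4.6,NULL
4,5.1,missing
"""

# With the defaults, "NA" and "NULL" are read as missing values,
# but "missing" is kept as an ordinary string.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.isna().sum())

# Declaring the custom marker makes Pandas treat it as missing too.
df_custom = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
print(df_custom.isna().sum())
```

Note that an undeclared marker does more than hide one missing value: it also forces the whole column to be read as text rather than numbers.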

Dropping data

Once you’ve established that there are missing items in your data, you need to decide how to deal with them. As discussed in the videos, this depends on how much data you have, and how much of it is missing.
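A quick way to find out how much is missing is to count the gaps per column and per row. The sketch below does this on a small made-up table; the column names are purely illustrative.

```python
import numpy as np
import pandas as pd

# Small made-up table with two missing entries.
df = pd.DataFrame([[1, 2.2, 4.5],
                   [2, 3.3, np.nan],
                   [3, None, 6.7],
                   [4, 5.1, 7.2]],
                  columns=['names', 'height', 'yield'])

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fraction of each column that is missing (mean of True/False flags).
missing_fraction = df.isna().mean()
print(missing_fraction)

# Number of rows that contain at least one missing value.
rows_with_missing = int(df.isna().any(axis=1).sum())
print(rows_with_missing)  # 2
```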

Dropping data items

If you have a lot of training data, and just a few missing pieces here and there, or some data items with several fields missing, the best thing to do might be to just drop those items of data in which a null or missing value appears.

As we’ve seen, generally speaking your data will be in a matrix or table format, with each row representing a single piece of data and the columns representing the features.

In Pandas there’s an easy way to drop any rows from a table that contain missing values, using the dropna() function. Here’s a simple example using some dummy data that we will create using a Pandas DataFrame:

import pandas as pd
import numpy as np

# Dummy data: one NaN in 'yield' and one None in 'height'
df = pd.DataFrame([[1, 2.2, 4.5],
                   [2, 3.3, np.nan],
                   [3, None, 6.7],
                   [4, 5.1, 7.2]],
                  columns=['names', 'height', 'yield'])

print(df)
#    names  height  yield
# 0      1     2.2    4.5
# 1      2     3.3    NaN
# 2      3     NaN    6.7
# 3      4     5.1    7.2

# Drop every row that contains at least one missing value
df_new = df.dropna()

print(df_new)
#    names  height  yield
# 0      1     2.2    4.5
# 3      4     5.1    7.2

You should notice that dropna() omits all rows in which any of the values are missing (in this case NaN or None). While this is useful if you have the odd missing value here and there, if you have lots of missing values in a particular field or feature, one of the other approaches might be preferable to make sure you aren’t throwing away more data than you need to.
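If dropping every incomplete row is too aggressive, dropna() has options for being more selective. The sketch below uses the same dummy data with two of them: subset, which only looks at the named columns, and thresh, which keeps rows that have at least a given number of non-missing values. The variable names and the threshold of 2 are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2.2, 4.5],
                   [2, 3.3, np.nan],
                   [3, None, 6.7],
                   [4, 5.1, 7.2]],
                  columns=['names', 'height', 'yield'])

# Only drop rows where 'yield' is missing; a gap in 'height' is tolerated.
only_yield_required = df.dropna(subset=['yield'])
print(list(only_yield_required.index))  # [0, 2, 3]

# Keep rows that have at least 2 non-missing values.
# Every row here has at least 2, so nothing is dropped.
at_least_two = df.dropna(thresh=2)
print(len(at_least_two))                # 4
```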

Dropping features

Sometimes you might find that a specific feature column has a lot of missing values. Perhaps some sensor was faulty during collection, or the data was corrupted in some other way. In this case you will need to assess how important this particular field is likely to be to your machine learning problem. If it’s impossible to recollect the data, and you think you have enough other features to tackle your problem, you might consider dropping that feature entirely from your dataset.

Again, it’s easy to do this in Pandas, using the name of the feature and the drop() function. In this example we have added an extra column called ‘width’, but half of its values are missing. To make a new DataFrame without the problem column, we give drop() the name of the column and pass axis=1 to specify that we want to drop a column rather than a row:

# Dummy data: half of the 'width' values are missing
df = pd.DataFrame([[1, 2.2, 2.3, 4.5],
                   [2, 3.3, np.nan, 5.2],
                   [3, 4.6, np.nan, 6.7],
                   [4, 5.1, 4.4, 7.2]],
                  columns=['names', 'height', 'width', 'yield'])

print(df)
#    names  height  width  yield
# 0      1     2.2    2.3    4.5
# 1      2     3.3    NaN    5.2
# 2      3     4.6    NaN    6.7
# 3      4     5.1    4.4    7.2

# axis=1 means 'width' is a column label, not a row label
df_new = df.drop('width', axis=1)

print(df_new)
#    names  height  yield
# 0      1     2.2    4.5
# 1      2     3.3    5.2
# 2      3     4.6    6.7
# 3      4     5.1    7.2
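
With many features it can be tedious to pick out problem columns by eye. One possible approach, sketched below, is to drop every column whose fraction of missing values exceeds a cut-off. The cut-off of 0.4 and the name max_missing_fraction are assumptions made for this example, not a recommended setting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2.2, 2.3, 4.5],
                   [2, 3.3, np.nan, 5.2],
                   [3, 4.6, np.nan, 6.7],
                   [4, 5.1, 4.4, 7.2]],
                  columns=['names', 'height', 'width', 'yield'])

# Hypothetical cut-off: drop a feature if more than 40% of it is missing.
max_missing_fraction = 0.4

# Fraction of missing values in each column.
missing_fraction = df.isna().mean()

# Labels of the columns that exceed the cut-off ('width' is 50% missing).
columns_to_drop = missing_fraction[missing_fraction > max_missing_fraction].index

# drop(columns=...) is equivalent to drop(..., axis=1).
df_new = df.drop(columns=columns_to_drop)
print(list(df_new.columns))  # ['names', 'height', 'yield']
```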

Replacing values

In many cases, especially with smaller datasets, it’s preferable not to throw away whole datapoints or features. An alternative strategy is to replace the missing values in a particular column with some fixed value. For example, using the data from the previous example, we can use the Pandas function fillna() and the column name to replace all the NaN values in the ‘width’ column with zero:

df = pd.DataFrame([[1, 2.2, 2.3, 4.5],
                   [2, 3.3, np.nan, 5.2],
                   [3, 4.6, np.nan, 6.7],
                   [4, 5.1, 4.4, 7.2]],
                  columns=['names', 'height', 'width', 'yield'])

print(df)
#    names  height  width  yield
# 0      1     2.2    2.3    4.5
# 1      2     3.3    NaN    5.2
# 2      3     4.6    NaN    6.7
# 3      4     5.1    4.4    7.2

df['width'] = df['width'].fillna(0)

print(df)
#    names  height  width  yield
# 0      1     2.2    2.3    4.5
# 1      2     3.3    0.0    5.2
# 2      3     4.6    0.0    6.7
# 3      4     5.1    4.4    7.2

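Replacing with zero can be useful in some cases, but it often isn’t ideal: a zero may lie well outside the range of the real values (here, widths between 2.3 and 4.4) and so distort the feature. It can be better to use the mean of the other values instead: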

df = pd.DataFrame([[1, 2.2, 2.3, 4.5],
                   [2, 3.3, np.nan, 5.2],
                   [3, 4.6, np.nan, 6.7],
                   [4, 5.1, 4.4, 7.2]],
                  columns=['names', 'height', 'width', 'yield'])

print(df)
#    names  height  width  yield
# 0      1     2.2    2.3    4.5
# 1      2     3.3    NaN    5.2
# 2      3     4.6    NaN    6.7
# 3      4     5.1    4.4    7.2

df['width'] = df['width'].fillna(df['width'].mean())

print(df)
#    names  height  width  yield
# 0      1     2.2   2.30    4.5
# 1      2     3.3   3.35    5.2
# 2      3     4.6   3.35    6.7
# 3      4     5.1   4.40    7.2

Depending on the data in question, it may be better to use the median or the mode here. For the median, simply replace mean() in the code above with median(). The mode needs one extra step: mode() returns a Series rather than a single number, because several values can be tied for most frequent, so select one of them first, for example with mode()[0]. Replacing values in this way requires judgement, as well as some knowledge of what the feature represents, so use it with care!
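
As a rough sketch of both variants on the same dummy data, the code below fills the ‘width’ column once with its median and once with its mode. The names df_median and df_mode are illustrative. Because mode() returns a Series, the first entry is taken with .iloc[0]; picking the first of several tied values is an arbitrary choice made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2.2, 2.3, 4.5],
                   [2, 3.3, np.nan, 5.2],
                   [3, 4.6, np.nan, 6.7],
                   [4, 5.1, 4.4, 7.2]],
                  columns=['names', 'height', 'width', 'yield'])

# Median of the non-missing widths (2.3 and 4.4), roughly 3.35.
width_median = df['width'].median()
df_median = df.assign(width=df['width'].fillna(width_median))
print(df_median)

# mode() ignores NaN and returns all tied values in sorted order.
# Here 2.3 and 4.4 each appear once, so the first entry is 2.3.
width_mode = df['width'].mode().iloc[0]
df_mode = df.assign(width=df['width'].fillna(width_mode))
print(df_mode)
```

With only two distinct observed values the mode is not very informative here; it tends to be more useful for categorical features or columns with many repeated values.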

This article is from the free online

Machine Learning for Image Data

Created by
FutureLearn - Learning For Life
