Transformations

© Coventry University. CC BY-NC 4.0

We have seen that the NumPy library in Python provides useful functions for summarising and counting values in a dataset.

Another common operation is to transform a variable to a new scale so that variables on different scales can be meaningfully compared.

For example, one variable might only have tiny numbers on a scale of 0.001 to 0.005, whereas another variable might have huge numbers on a scale of 1 billion to 5 billion. Comparing such values directly is difficult, so it’s helpful to transform the two variables to the same scale. In a transformation, each value in the array is mapped to a new value.

Here are some examples of different transformations commonly applied to data.

Scaling to values between zero and one

For this type of transformation, we transform the maximum value to one and the minimum value to zero. All other values are transformed proportionally in between, so a value halfway between the minimum and maximum will be transformed to 0.5. To do this, we subtract the minimum from each value and divide the result by the range (the maximum minus the minimum). Consider the heights (metres) example below. Here, 0.86 is the minimum value and will therefore be transformed to zero, while 2.02 is the maximum value and will therefore be transformed to one.

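import numpy as np

height = np.array([0.86, 2.02, 1.87, 1.44, 1.80])
# Subtract the minimum, then divide by the range (maximum minus minimum)
scaled_height = (height - np.min(height)) / (np.max(height) - np.min(height))
print(scaled_height)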
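
As a minimal sketch, the same formula can be wrapped in a function and applied to two variables on very different scales, like the tiny and huge numbers described earlier. The function name `min_max_scale` and the sample values are hypothetical and chosen only for illustration. The check for identical values is an added assumption, since the formula would otherwise divide by zero.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values so the minimum maps to 0 and the maximum maps to 1."""
    values = np.asarray(values, dtype=float)
    value_range = np.max(values) - np.min(values)
    if value_range == 0:
        # All values are identical, so the formula would divide by zero.
        raise ValueError("cannot scale an array whose values are all equal")
    return (values - np.min(values)) / value_range

# Hypothetical sample data on two very different scales.
tiny = np.array([0.001, 0.002, 0.003, 0.005])
huge = np.array([1e9, 2e9, 3e9, 5e9])

scaled_tiny = min_max_scale(tiny)
scaled_huge = min_max_scale(huge)
print(scaled_tiny)
print(scaled_huge)
```

After scaling, both arrays run from zero to one, so their values can be compared directly even though the original units differ by many orders of magnitude.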

Centring and standardising

Centring is the process of subtracting the average from each value so that the transformed values have an average of zero. Centred values are therefore signed: a value above the original average becomes positive, and a value below it becomes negative.

Standardising is the process of dividing the centred values by the standard deviation so that the transformed values have an average of zero and a standard deviation of one. Standardised values are commonly known as z-scores.

Consider the weights (kg) example below. Notice how the whole array is transformed in one line of code.

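import numpy as np

weight = np.array([15, 112, 106, 91, 85])
# Centring: subtract the mean from every value
centered_weight = weight - np.mean(weight)
# Standardising: divide the centred values by the standard deviation
standardised_weight = centered_weight / np.std(weight)
print(np.mean(weight))
print(centered_weight)
print(standardised_weight)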
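
The following sketch checks the claim that standardised values have an average of zero and a standard deviation of one. It also shows one detail worth knowing: `np.std` divides by the number of values by default (`ddof=0`), whereas the sample standard deviation uses `ddof=1`. The variable names `z_scores` and `z_scores_sample` are illustrative, and which version of the standard deviation is appropriate depends on the analysis.

```python
import numpy as np

weight = np.array([15, 112, 106, 91, 85])

# np.std uses ddof=0 by default (the population standard deviation).
z_scores = (weight - np.mean(weight)) / np.std(weight)

# The standardised values should have mean 0 and standard deviation 1,
# up to floating-point rounding.
print(np.isclose(np.mean(z_scores), 0.0))
print(np.isclose(np.std(z_scores), 1.0))

# With the sample standard deviation (ddof=1) the divisor is larger,
# so the resulting z-scores are slightly smaller in magnitude.
z_scores_sample = (weight - np.mean(weight)) / np.std(weight, ddof=1)
print(z_scores_sample)
```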

Construct new variables from old

Body mass index (BMI) is a measure designed to roughly classify people as underweight, normal weight, overweight or obese. It is calculated as the weight (in kg) of a person divided by the square of that person’s height (in metres). We can do this calculation in Python for every person at once, because NumPy applies the arithmetic element by element. Note that ** is the ‘to the power of’ operator in Python.

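import numpy as np

height = np.array([0.87, 2.02, 1.87, 1.78, 1.80])
weight = np.array([15, 112, 106, 91, 85])
# Element-wise: each weight is divided by the square of the matching height
bmi = weight / (height ** 2)
print(bmi)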
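
As a rough sketch of the classification step, the BMI values can be turned into category labels with `np.digitize`. The cut-offs of 18.5, 25 and 30 are an assumption here: they are the commonly used adult thresholds and are not given in this article, so treat the labels as illustrative only.

```python
import numpy as np

height = np.array([0.87, 2.02, 1.87, 1.78, 1.80])
weight = np.array([15, 112, 106, 91, 85])
bmi = weight / (height ** 2)

# Assumed cut-offs (not from this article): below 18.5 is underweight,
# 18.5 up to 25 is normal weight, 25 up to 30 is overweight, 30 and above is obese.
cutoffs = np.array([18.5, 25.0, 30.0])
labels = np.array(["underweight", "normal weight", "overweight", "obese"])

# np.digitize returns, for each BMI value, the index of the interval it falls into.
category = labels[np.digitize(bmi, cutoffs)]
print(category)
```

This is another example of constructing a new variable from old ones: the `category` array has one label for each value in `bmi`.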

We have seen how to use basic mathematical ideas in Python to calculate summary statistics and to transform variables. Analysis of data therefore involves both transformations (each value in the array is transformed to a new value) and summary values (one value is calculated to summarise all values in an array). Notice that these calculations and transformations involve mathematical expressions (formulas), so we need to be sure exactly how Python will evaluate them. We’ll look at this in the next step.

This article is from the free online course Get ready for a Masters in Data Science and AI.