This content is taken from Coventry University's online course, Get ready for a Masters in Data Science and AI.

Transformations

We have seen that the NumPy library in Python provides useful functions for summarising and counting values in a dataset.

Another common operation is to transform a variable to a new scale so that variables on different scales can be meaningfully compared.

For example, one variable might contain only tiny numbers in the range 0.001 to 0.005, whereas another might contain huge numbers in the range 1 billion to 5 billion. It is therefore helpful to transform the two variables to the same scale. In a transformation, each value in the array is mapped to a new value.

Here are some examples of different transformations commonly applied to data.

Scaling to values between zero and one

For this type of transformation, we transform the maximum value to one and the minimum value to zero. All other values are transformed proportionally in between, so a value halfway between the minimum and the maximum is transformed to 0.5. Consider the heights (in metres) example below. The minimum value, 0.86, is transformed to zero, and the maximum value, 2.02, is transformed to one.

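import numpy as np
height = np.array([0.86, 2.02, 1.87, 1.44, 1.80])  # heights in metres
# Subtract the minimum, then divide by the range (maximum minus minimum)
scaled_height = (height - np.min(height)) / (np.max(height) - np.min(height))
print(scaled_height)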
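
The same formula can be wrapped in a small helper so that variables on very different scales end up on the same zero-to-one scale. The following is only a sketch: the function name min_max_scale and the two sample arrays are made up for illustration and are not part of the course material.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array so its minimum maps to 0 and its maximum to 1."""
    values = np.asarray(values, dtype=float)
    value_range = np.max(values) - np.min(values)
    if value_range == 0:
        # All values are identical, so there is no spread to rescale.
        raise ValueError("cannot scale an array whose values are all equal")
    return (values - np.min(values)) / value_range

# Hypothetical variables on very different scales (made-up numbers).
tiny = np.array([0.001, 0.003, 0.005])
huge = np.array([1e9, 3e9, 5e9])

scaled_tiny = min_max_scale(tiny)
scaled_huge = min_max_scale(huge)
print(scaled_tiny)
print(scaled_huge)
```

After scaling, both arrays run from zero to one, so their values can be compared directly even though the original numbers differ by many orders of magnitude.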

Centring and standardising

Centring is the process of subtracting the average from each value so that the transformed values have an average of zero. Values above the original average become positive and values below it become negative.

Standardising is the process of dividing the centred values by the standard deviation so that the transformed values have an average of zero and a standard deviation of one. Standardised values are commonly known as z-scores.

Consider the weights (kg) example below. Notice how the whole array is transformed in one line of code.

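import numpy as np

weight = np.array([15, 112, 106, 91, 85])  # weights in kg
# Centring: subtract the mean from every value
centered_weight = weight - np.mean(weight)
# Standardising: divide the centred values by the standard deviation
standardised_weight = (weight - np.mean(weight)) / np.std(weight)
print(np.mean(weight))
print(centered_weight)
print(standardised_weight)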
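
As a quick check, the standardised values should have an average of zero and a standard deviation of one. The sketch below verifies this for the weights example. It also shows one detail worth knowing: np.std divides by the number of values by default (ddof=0, the population standard deviation), whereas passing ddof=1 gives the sample standard deviation. The variable names z_population and z_sample are chosen here for illustration only.

```python
import numpy as np

weight = np.array([15, 112, 106, 91, 85])  # weights in kg

# np.std uses ddof=0 by default (the population standard deviation).
z_population = (weight - np.mean(weight)) / np.std(weight)

# ddof=1 gives the sample standard deviation, which is slightly larger.
z_sample = (weight - np.mean(weight)) / np.std(weight, ddof=1)

print(np.mean(z_population))       # very close to 0
print(np.std(z_population))        # very close to 1
print(np.std(z_sample, ddof=1))    # very close to 1
```

Because of floating-point rounding, the printed average may be a tiny number such as 1e-17 rather than exactly zero.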

Construct new variables from old

Body mass index (BMI) is a measure designed to roughly classify people as underweight, normal weight, overweight or obese. It is calculated as a person’s weight (in kg) divided by the square of their height (in metres). We can do this calculation in Python. Note that ** is the ‘to the power of’ operator in Python.

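import numpy as np
height = np.array([0.87, 2.02, 1.87, 1.78, 1.80])  # heights in metres
weight = np.array([15, 112, 106, 91, 85])  # weights in kg
# BMI = weight divided by height squared, calculated element by element
bmi = weight / (height ** 2)
print(bmi)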
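
To go from BMI values to the rough categories mentioned above, one option is to bin the values with np.digitize. This is a sketch under stated assumptions: the cut-offs of 18.5, 25 and 30 are the commonly cited adult thresholds and do not come from this article, and the names cutoffs, labels and categories are illustrative.

```python
import numpy as np

height = np.array([0.87, 2.02, 1.87, 1.78, 1.80])  # heights in metres
weight = np.array([15, 112, 106, 91, 85])          # weights in kg
bmi = weight / (height ** 2)

# Assumed cut-offs (commonly cited adult thresholds, not from the article).
cutoffs = np.array([18.5, 25.0, 30.0])
labels = np.array(["underweight", "normal weight", "overweight", "obese"])

# np.digitize returns, for each BMI, the index of the bin it falls into:
# 0 for values below 18.5, 1 for 18.5 up to 25, 2 for 25 up to 30, 3 for 30+.
bin_index = np.digitize(bmi, cutoffs)
categories = labels[bin_index]

print(np.round(bmi, 1))
print(categories)
```

The new variable categories is itself a transformation of the old ones: every person's pair of height and weight values is mapped to one label.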

We have seen how to use basic mathematical ideas in Python to calculate summary statistics and to transform variables. In this way, we see that analysis of data involves transformations (each value in the array is transformed to a new value) and summary values (one value is calculated to summarise all values in an array). Notice that these calculations and transformations involved mathematical expressions (formulas). We, therefore, need to be sure exactly how Python will evaluate these. We’ll look at this in the next step.
