What operations can you do?
Mathematics is a key foundation of data science. We see mathematics whenever we summarise or transform data, investigate relationships between variables and describe algorithms.
You might remember that last week we saw that high-level questions need to be broken down into smaller, simpler and more precise component questions, otherwise referred to as low-level questions. Answering these low-level questions often involves operations such as comparing, counting, averaging, classifying, clustering, predicting, and fitting models. Some of these operations are simple mathematical operations, whereas others require more complex calculations.
Two simple types of mathematical operations we often use are the calculation of summary statistics and the transformation of data. Here we will see how to carry these out using Python.
A summary statistic is a single value that is calculated from (and represents) all numbers in an array. Examples include the total, average, median, maximum or minimum.
To provide some examples and motivation, we will roll a single six-sided fair dice. The term ‘fair’ just means that each side is equally likely to appear on any roll of the dice.
Rather than roll real dice, we can use Python to ‘roll’ the dice and perform the calculations. The functions we use are part of the NumPy library in Python.
import numpy as np
myrolls = np.random.randint(low=1, high=7, size=60)
print(myrolls)
Here the variable myrolls is a NumPy array (a bit like a list, here containing only whole numbers). Note that the value given for high is exclusive, so high=7 produces values from 1 to 6. The output shows the value obtained on each of the 60 rolls of the dice. We should see roughly 10 of each value, one to six.
You might remember, perhaps from school, that the average of an array of numbers is just the sum divided by how many there are. This is exactly what is calculated below using np.sum to find the sum and np.size to give how many numbers there are.
myaverage = np.sum(myrolls)/np.size(myrolls)
print(myaverage)
The average is calculated so often that NumPy already has a function that does this: np.mean.
myaverage = np.mean(myrolls)
print(myaverage)
What value were you expecting to see for the average? Well, 1/6 of the time we should see value 1, another 1/6 of the time we should see value 2, and so on. So the value we expect to see for the average is \((1/6)\times(1+2+3+4+5+6)=3.5\). We don’t see exactly 3.5 because each time we roll the dice the result is a random value, but if we conduct a large number of rolls we should see an average value somewhere close to 3.5.
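To see the average settle close to 3.5 as the number of rolls grows, we can repeat the experiment with more rolls. The sketch below is only an illustration: it assumes NumPy's newer np.random.default_rng generator with an arbitrary seed of 42 (neither is used elsewhere in this step) so that the run is repeatable, and the names rng, expected and averages are our own.

```python
import numpy as np

# Assumption: a seeded generator, used only so the run is repeatable.
rng = np.random.default_rng(42)

# Theoretical average of one fair six-sided dice: (1+2+3+4+5+6)/6.
expected = np.mean(np.arange(1, 7))

averages = {}
for n_rolls in (60, 6000, 600000):
    # As with np.random.randint, high is exclusive, so values are 1 to 6.
    rolls = rng.integers(low=1, high=7, size=n_rolls)
    averages[n_rolls] = np.mean(rolls)
    print(n_rolls, averages[n_rolls])

print(expected)
```

With only 60 rolls the average can be noticeably away from 3.5; with hundreds of thousands of rolls it should be very close.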
Maximum, minimum and standard deviation
NumPy has functions for finding the maximum and minimum of an array of numbers, and also for calculating the standard deviation (which measures how spread out the numbers are).
The function np.max returns the maximum value in an array of numbers. The function np.min returns the minimum value. The function np.std returns the standard deviation. If you have not heard of the term ‘standard deviation’ before, have a look at the Maths is Fun explanation listed in the references at the end of this step.
mymax = np.max(myrolls)
mymin = np.min(myrolls)
mystd = np.std(myrolls)
print(mymax)
print(mymin)
print(mystd)
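To make the standard deviation less of a black box, the sketch below works it out step by step on a small made-up array (one of each face, chosen purely for illustration) and compares the result with np.std. By default np.std divides by the number of values; passing ddof=1 divides by one less than that, which is the form often used for a sample. The variable names are our own.

```python
import numpy as np

faces = np.array([1, 2, 3, 4, 5, 6])  # illustrative data: one of each face

# Step by step: average squared distance from the mean, then the square root.
deviations = faces - np.mean(faces)
variance = np.mean(deviations ** 2)
std_by_hand = np.sqrt(variance)

std_numpy = np.std(faces)            # default: divides by N
std_sample = np.std(faces, ddof=1)   # divides by N - 1 instead

print(std_by_hand)
print(std_numpy)
print(std_sample)
```

The first two printed values agree, and the ddof=1 version is slightly larger.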
Sorting and counting occurrences
Suppose we wish to sort the values in an array into increasing order.
sorted_rolls = np.sort(myrolls)
print(sorted_rolls)
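One detail worth knowing is that np.sort returns a sorted copy and leaves the original array untouched. The small sketch below, using a made-up array of six rolls, shows this, and also shows how the sorted array relates to the minimum, maximum and median mentioned earlier.

```python
import numpy as np

rolls = np.array([4, 1, 6, 2, 6, 3])  # illustrative data

sorted_rolls = np.sort(rolls)  # a new, sorted array
print(rolls)                   # the original order is unchanged
print(sorted_rolls)

# The first and last sorted values are the minimum and maximum.
print(sorted_rolls[0], sorted_rolls[-1])

# The median is the middle of the sorted values; with an even number
# of values, np.median averages the two middle ones.
median_roll = np.median(rolls)
print(median_roll)
```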
This confirms the minimum and maximum values. We can also quickly count (by eye) the number of ones, twos, etc. In this way, we build up a frequency table showing how many times each unique value occurs in a dataset. Well, you guessed it, NumPy has a function that does this too: np.unique.
(unique_rolls, counts) = np.unique(myrolls, return_counts=True)
print(unique_rolls)
print(counts)
Notice that, in this case, the function np.unique returns two arrays as its result: unique_rolls and counts. This function will be extremely useful when we look at how to build a bar chart in a later step.
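As a small illustration of how the two arrays fit together, the sketch below uses a made-up array of six rolls and pairs each unique value with its count to print a simple frequency table.

```python
import numpy as np

rolls = np.array([3, 1, 3, 6, 1, 3])  # illustrative data

unique_rolls, counts = np.unique(rolls, return_counts=True)

# Pair each unique value with the number of times it occurs.
for value, count in zip(unique_rolls, counts):
    print(value, count)

# Values that never occur (here 2, 4 and 5) are simply absent from the table,
# and the counts always add up to the total number of rolls.
total = np.sum(counts)
print(total)
```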
Suppose we use Python to roll two dice at once (call them dice A and dice B).
myrollsA = np.random.randint(low=1, high=7, size=60)
print(myrollsA)
myrollsB = np.random.randint(low=1, high=7, size=60)
print(myrollsB)
We wish to see how often dice A has a higher value than dice B, how often dice A has a lower value than dice B, and how often the two dice have the same value. We could count these directly from the results above. But we can get the computer to do the calculations for us.
mysum_win = np.sum(myrollsA > myrollsB)
mysum_loss = np.sum(myrollsA < myrollsB)
mysum_draw = np.sum(myrollsA == myrollsB)
print(mysum_win)
print(mysum_loss)
print(mysum_draw)
Here, the ‘>’, ‘<’ and ‘==’ are the comparison operators greater than, less than and equal to. Beware that equal to has two equal signs rather than one.
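It may help to see why summing a comparison gives a count. Comparing two arrays produces an array of True and False values, one per position, and np.sum treats True as 1 and False as 0. The sketch below uses two short made-up arrays so the result can be checked by eye.

```python
import numpy as np

rollsA = np.array([1, 4, 6, 2])  # illustrative data
rollsB = np.array([3, 4, 2, 5])

# Element-by-element comparison gives an array of True/False values.
a_higher = rollsA > rollsB
print(a_higher)

# Summing counts the True values, because True counts as 1 and False as 0.
wins = np.sum(rollsA > rollsB)
losses = np.sum(rollsA < rollsB)
draws = np.sum(rollsA == rollsB)
print(wins, losses, draws)
```

Every position is a win, a loss or a draw, so the three counts add up to the number of rolls.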
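What value were you expecting to see for the number of wins, losses and draws? There are 36 equally likely pairs of values for the two dice. In 6 of them the dice show the same value, and the remaining 30 split evenly into 15 where dice A is higher and 15 where dice B is higher. Because random numbers are involved, we are unlikely to see exactly 15/36 of the rolls having a win as the outcome, or exactly 6/36 of the rolls having a draw as the outcome. But we should see something close to \((15/36)\times60=25\) wins, the same number of losses, and \((6/36)\times60=10\) draws.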
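The fractions 15/36 and 6/36 can be checked by listing all 36 equally likely pairs of values and counting. The sketch below does this with np.meshgrid, which is one convenient way to build every combination of two arrays; the variable names are our own and this is only an illustrative check.

```python
import numpy as np

faces = np.arange(1, 7)  # the values 1 to 6

# Two 6x6 grids that together hold every (dice A, dice B) pair.
gridA, gridB = np.meshgrid(faces, faces)

n_win = np.sum(gridA > gridB)    # pairs where dice A is higher
n_draw = np.sum(gridA == gridB)  # pairs where the values match
n_loss = np.sum(gridA < gridB)   # pairs where dice A is lower
print(n_win, n_draw, n_loss)

# Expected counts in 60 rolls of the pair of dice.
n_rolls = 60
expected_wins = n_win * n_rolls / 36
expected_draws = n_draw * n_rolls / 36
print(expected_wins, expected_draws)
```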
We have seen how to use basic mathematical ideas in Python to calculate summary statistics (average, standard deviation, maximum and minimum), sort data and count occurrences, and compare values stored in two arrays. The NumPy library provides functions that do useful calculations for us. All operations are performed directly on data in an array. In the next step, you’ll get to experiment with carrying out an analysis only by counting.
Maths is fun. (n.d.). Standard deviation and variance. https://www.mathsisfun.com/data/standard-deviation.html
© Coventry University. CC BY-NC 4.0