Relationships and patterns
It is natural to look for relationships and patterns to emerge from a data set. While there are formal statistical tests and machine learning techniques to establish such relationships, we are only going to consider informal, non-automated approaches at this stage. Generally, we want to look at two features in a data set and try to identify some kind of pattern or correlation between them.
If both variables are numerical, we can draw a scatter plot as described in earlier steps. We might try to draw a trend line, also known as a line of best fit. There is a positive correlation when one feature value increases and so does the other feature value. For example, when the length of a train journey increases, the ticket price also increases. There is a negative correlation when one feature value increases and the other feature value decreases. For example, when the average daily wind speed increases, the use of fossil fuels in Scottish power stations decreases.
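To make the idea of a trend line concrete, here is a minimal Python sketch that fits a least-squares line of best fit and computes a correlation coefficient for the two examples above. The numbers are invented purely for illustration and are not real measurements; the function names trend_line and correlation are our own, not part of any library.

```python
from statistics import mean


def trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of best fit."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def correlation(xs, ys):
    """Return the Pearson correlation coefficient, a value between -1 and 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


# Invented values for illustration only (not real data).
journey_km = [10, 25, 40, 60, 80]
ticket_price = [3.0, 6.5, 9.0, 14.0, 18.5]

wind_speed = [2, 4, 6, 8, 10]
fossil_fuel_use = [50, 44, 35, 30, 21]

slope, intercept = trend_line(journey_km, ticket_price)
r_positive = correlation(journey_km, ticket_price)
r_negative = correlation(wind_speed, fossil_fuel_use)

print(f"trend line: price = {slope:.2f} * km + {intercept:.2f}")
print(f"journey length vs price: r = {r_positive:.2f}")
print(f"wind speed vs fossil fuel use: r = {r_negative:.2f}")
```

A coefficient close to 1 suggests a positive correlation and a value close to -1 a negative one. It is still worth looking at the scatter plot itself, because a single number can hide outliers or a curved relationship.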
If both variables are categorical, we might examine the data with the help of a contingency table. We should try to look for distinctions between the category combinations. For example, are people who live on their own more likely to own a pet? This kind of analysis is similar to calculating conditional probabilities in Maths.
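As a rough illustration, a contingency table can be built by counting how often each combination of categories occurs. The sketch below uses only the Python standard library. The survey responses are invented, and the helper conditional_share is a hypothetical name chosen for this example.

```python
from collections import Counter

# Invented survey responses for illustration only:
# (living arrangement, pet ownership)
responses = [
    ("alone", "pet"), ("alone", "no pet"), ("alone", "pet"),
    ("with others", "no pet"), ("with others", "pet"),
    ("with others", "no pet"), ("with others", "no pet"),
    ("alone", "pet"),
]

# Count each combination of categories.
table = Counter(responses)
rows = sorted({row for row, _ in responses})
cols = sorted({col for _, col in responses})


def conditional_share(row, col):
    """Share of the given row category that falls in the given column category."""
    row_total = sum(table[(row, c)] for c in cols)
    return table[(row, col)] / row_total


# Print the contingency table, one row per living arrangement.
print(f"{'':12}" + "".join(f"{c:>8}" for c in cols))
for row in rows:
    print(f"{row:12}" + "".join(f"{table[(row, c)]:>8}" for c in cols))

for row in rows:
    print(f"share of '{row}' who own a pet: {conditional_share(row, 'pet'):.2f}")
```

Comparing the shares row by row is the informal equivalent of comparing conditional probabilities: here, the proportion of pet owners given each living arrangement.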
Finally, when one variable is numerical and the other variable is categorical, we should calculate the median value for each category, along with a measure of the dispersion within the category, such as the interquartile range. Are there marked differences between some categories? For example, do young people spend more time watching YouTube videos than older people?
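The per-category summary described above might be sketched as follows. The viewing times are invented for illustration, and the interquartile range is used here as one possible measure of dispersion; other measures, such as the range or standard deviation, would also do.

```python
from statistics import median, quantiles

# Invented daily viewing times in minutes, for illustration only:
# (age group, minutes watched)
observations = [
    ("young", 30), ("young", 45), ("young", 60), ("young", 90), ("young", 120),
    ("older", 10), ("older", 20), ("older", 25), ("older", 40), ("older", 60),
]

# Group the numerical values by category.
groups = {}
for category, minutes in observations:
    groups.setdefault(category, []).append(minutes)

# For each category, record the median and the interquartile range (IQR).
summary = {}
for category, values in groups.items():
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    summary[category] = (median(values), q3 - q1)

for category, (mid, iqr) in summary.items():
    print(f"{category}: median = {mid} minutes, IQR = {iqr} minutes")
```

If the medians differ by much more than the spread within each category, the difference is probably worth a closer look. If the spreads overlap heavily, the apparent difference may not mean much.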
What are we looking for, really? We want to discover something interesting. We might come to the data set with some intuition about likely findings, or we might explore with an open mind. Either approach is fine, but we need data to back up any interpretation we might make.
© University of Glasgow