This content is taken from Coventry University's online course, Get ready for a Masters in Data Science and AI.

Analysing different types of data

Data is a means to an end. Be aware of the actual target of your analysis, the endpoint you’re aiming for.

As emphasised in Week 1, we need to ensure our research question is well defined, as this will help us target our data collection and analysis. Let’s look at an example and consider how the data collected relates to the research question it is meant to address.

Example: WHO contact tracing

The World Health Organization (WHO) proposes contact tracing as a means to control the COVID-19 outbreak. Contact tracing involves gathering data about individuals and their peers.

Earlier this week, we looked at types of structured data and how they can be plotted. When collecting data, we need to keep in mind how we can turn it into structured data and what information the data actually contains. Data discussed in this step will be structured using the concepts we studied in the earlier Step 2.8: Representing concepts mathematically.

The data gathered is of different types: categorical variables such as names or the presence of fever, and quantitative variables such as age or exposure frequency.

Unique ID   Age   Full name        Fever
1           42    Alishia Ford     y
2           20    Morgan Derrick   y
3           65    Ronny West       n

In Python, this data can be represented as a dictionary keyed by unique ID, with the y/n fever values stored as booleans:

data = {
  1: {'Age': 42, 'Full Name': 'Alishia Ford', 'Fever': True},
  2: {'Age': 20, 'Full Name': 'Morgan Derrick', 'Fever': True},
  3: {'Age': 65, 'Full Name': 'Ronny West', 'Fever': False}
}
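The distinction between categorical and quantitative variables can also be checked in code. The following is a minimal sketch, not part of the original course material: the helper name variable_kind is hypothetical, and it assumes that numbers count as quantitative while everything else (including booleans and names) counts as categorical.

```python
from numbers import Number

data = {
    1: {'Age': 42, 'Full Name': 'Alishia Ford', 'Fever': True},
    2: {'Age': 20, 'Full Name': 'Morgan Derrick', 'Fever': True},
    3: {'Age': 65, 'Full Name': 'Ronny West', 'Fever': False},
}

def variable_kind(value):
    """Hypothetical helper: label a value as categorical or quantitative."""
    # bool is a subclass of int in Python, so it must be tested first
    if isinstance(value, bool):
        return 'categorical'
    if isinstance(value, Number):
        return 'quantitative'
    # Assumption: anything non-numeric (e.g. a name) is treated as categorical
    return 'categorical'

# Inspect the fields of one record; all records share the same fields
first_record = next(iter(data.values()))
kinds = {field: variable_kind(value) for field, value in first_record.items()}
print(kinds)
```

Checking for bool before Number matters here, because Python would otherwise treat True and False as the integers 1 and 0.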

With the above data, we may answer a number of questions. For example, let’s compute what percentage of our data subjects reported ‘fever’ as a symptom:

fever = [key for key in data if data[key]['Fever']]  # IDs of subjects with fever
fever_percentage = 100 * len(fever) / len(data)
print(round(fever_percentage, 1))
66.7

We first select the data subjects who reported fever, then divide their number by the total number of data subjects, which gives 66.7%.

Another piece of information contained in our data is the subjects’ names. We can ask: what is the average length of first names? To answer this question, we’d write the following:

names = [data[key]['Full Name'].split() for key in data]  # each full name split into words
first_name_lengths = [len(name[0]) for name in names]
avg_length = sum(first_name_lengths) / len(first_name_lengths)
print(avg_length)
6.0

Here we have created a list of names, where each list element is itself a list of words (the parts of one full name).

  1. The first name is the first element of each list of words (Python lists use index 0 for the first element).
  2. We then create a new list (called first_name_lengths) holding the lengths of the first names, which keeps the code easy to read.
  3. Finally, we divide the sum of these lengths by their count, which gives an average first-name length of 6.

Starting from data gathered to answer a research question about contact tracing, we have been able to compute answers to further questions. This often happens, and we need to be aware of the following situations:

  • We have gathered data and still cannot answer the original question. For example, we do not know how the data subjects relate to each other, and thus cannot necessarily trace contacts.
  • We have gathered data, and are now able to answer additional questions. As shown above, we can answer questions about names, which need not relate at all to the question of contact tracing.
  • While gathering data, we have collected additional information that must not be included. For instance, the data subjects’ names may raise privacy concerns.

It is important to be aware of these situations when processing data.
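To make the first and third situations concrete, here is a rough sketch in Python. It is an illustration under stated assumptions, not part of the original material: the 'Contacts' field, its values, and the helpers contacts_of and without_field are all hypothetical, added only to show what data would be needed to trace contacts and how a sensitive field could be left out.

```python
# Records as in the example, plus a HYPOTHETICAL 'Contacts' field with
# made-up values listing the unique IDs each subject has been in contact with
data = {
    1: {'Age': 42, 'Full Name': 'Alishia Ford', 'Fever': True, 'Contacts': [2]},
    2: {'Age': 20, 'Full Name': 'Morgan Derrick', 'Fever': True, 'Contacts': [1, 3]},
    3: {'Age': 65, 'Full Name': 'Ronny West', 'Fever': False, 'Contacts': [2]},
}

def contacts_of(records, unique_id):
    """Return the IDs of the recorded contacts of one data subject."""
    # Without a field like 'Contacts', this question cannot be answered at all
    return records[unique_id].get('Contacts', [])

def without_field(records, field):
    """Return a copy of the records with one field left out of every record."""
    return {
        uid: {key: value for key, value in record.items() if key != field}
        for uid, record in records.items()
    }

names_removed = without_field(data, 'Full Name')

print(contacts_of(data, 2))
print(names_removed[1])
```

Dropping the name field alone does not guarantee that the data subjects cannot be identified; it only shows how a field that should not be processed can be excluded before analysis.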


References

World Health Organization. (2020). Contact tracing in the context of COVID-19. https://www.who.int/publications/i/item/contact-tracing-in-the-context-of-covid-19
