Where will we find our data?
Now it’s your turn to select an interesting data set and ask some relevant research questions. In this step, you want to:
- identify your data set
- save a local copy
- start looking through the data
- propose a few research questions
Let’s walk through each of these stages in turn.
Identify your data set
As you know, many data sets are available online. You might search a large repository like Kaggle or UCI to find some interesting, publicly released data. This has the advantage that the data set is likely to have been cleaned and checked, since many other people will have inspected it before you.
Alternatively, you might choose to generate or extract your own data set. This could be the result of a questionnaire you conducted, or some local information you can access and summarize efficiently. When generating your own custom data, be careful to respect data privacy regulations. Are you allowed to process this data? Can you release it, or must it remain within the organization? Does it involve any personally identifiable information? If so, have the data subjects given their consent for you to use this data?
Save a local copy
Ideally, your data set should comprise a single CSV file which you can download and store on your laptop. This file can be uploaded to the interactive notebook server for analysis and processing using the techniques you explored earlier in the course.
Start looking through the data
This is the initial phase of exploratory data analysis. At this stage, you simply want to know some rough numbers. How many rows of data are there? Each row should correspond to a single observation. How many columns of data are there? Each column should correspond to a single feature, measured as part of an observation.
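As a rough illustration, the row and column counts can be read straight from a CSV file. The sketch below uses only Python's standard library and a tiny invented sample (the names `SAMPLE`, `height_cm` and `weight_kg` are hypothetical). The tools you used earlier in the course may well do this in fewer lines.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
# With a real file, use open("your_file.csv", newline="") instead of io.StringIO.
SAMPLE = """name,height_cm,weight_kg
Ana,162,58
Ben,180,82
Cai,171,69
"""

with io.StringIO(SAMPLE) as f:
    reader = csv.reader(f)
    header = next(reader)  # the first line holds the column labels
    rows = list(reader)    # each remaining line is one observation

n_rows = len(rows)
n_cols = len(header)
print(f"{n_rows} rows, {n_cols} columns")
print("columns:", header)
```

This assumes the first line of the file is a header row. If your file has no header, the first observation would be miscounted as labels.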
Have the columns been labelled sensibly, or do you need to add some labels by hand? Are there missing values in some data columns, or is the data set complete?
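One possible way to check for missing values is to count the empty cells in each column. This is a minimal sketch using invented data, and it assumes that a missing value appears as an empty field. Real data sets may use other markers such as `NA` or `-1`, so check the documentation for your data.

```python
import csv
import io

# Hypothetical sample data with two gaps, invented for illustration only.
SAMPLE = """name,height_cm,weight_kg
Ana,162,58
Ben,,82
Cai,171,
Dee,158,51
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)

# Count the empty cells in each column.
# Assumption: an empty string (or an absent field) means "missing".
missing = {col: 0 for col in reader.fieldnames}
for row in rows:
    for col in reader.fieldnames:
        value = row[col]
        if value is None or value.strip() == "":
            missing[col] += 1

print(missing)
```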
Look for numerical data and find out the range of values. Look for categorical data and find out the number of distinct categories. Then think about the distributions of these values. This should help you to consider possible research questions.
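The checks described above might look something like this for one numerical column and one categorical column. The library-loan data and the column names `day` and `books_borrowed` are made up for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data, invented for illustration only.
SAMPLE = """day,books_borrowed
Mon,12
Fri,30
Tue,9
Fri,27
Mon,15
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Numerical column: find the smallest and largest values.
borrowed = [int(r["books_borrowed"]) for r in rows]
value_range = (min(borrowed), max(borrowed))

# Categorical column: count how often each category occurs.
day_counts = Counter(r["day"] for r in rows)
n_categories = len(day_counts)

print("range:", value_range)
print("number of categories:", n_categories)
print("counts per category:", dict(day_counts))
```

The per-category counts give a first, rough view of the distribution. A histogram or bar chart would be a natural next step.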
Propose a few research questions
Generally, your research questions will involve some kind of relationship between multiple features in the data. Does weight correlate with height? Do more people borrow books from the library on Fridays? Is older music more likely to be played on the radio at night-time? These kinds of questions will emerge as you look at your data. Write down a list of two or three questions you might investigate more closely.
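To make the first example question concrete, here is a small sketch that computes the Pearson correlation coefficient between two numerical columns. The `pearson` helper and the sample values are written for this illustration and are not taken from any particular data set. A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none.

```python
import csv
import io
import math

# Hypothetical sample data, invented for illustration only.
SAMPLE = """height_cm,weight_kg
150,52
160,58
170,71
180,79
"""


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
heights = [float(r["height_cm"]) for r in rows]
weights = [float(r["weight_kg"]) for r in rows]

r = pearson(heights, weights)
print(f"correlation between height and weight: {r:.3f}")
```

Remember that a correlation alone does not show that one feature causes the other. It is a starting point for a closer investigation.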
© University of Glasgow