Where will we find our data?

There are many data sets available online. Alternatively, you may choose to generate your own data. How and where will you acquire your data set?
© University of Glasgow

Now it’s your turn to select an interesting data set and ask some relevant research questions. In this step, you want to:

  1. identify your data set
  2. save a local copy
  3. start looking through the data
  4. propose a few research questions

Let’s go through each of these stages in turn.

Identify your data set

As you know, many data sets are available online. You might search a large repository such as Kaggle or the UCI Machine Learning Repository to find some interesting, publicly released data. This has the advantage that a popular data set is likely to have been cleaned and checked, since many other people will have inspected it before you. It is still worth checking the data yourself.

Alternatively, you might choose to generate or extract your own data set. This could be the result of a questionnaire you conducted, or some local information you can access and summarize efficiently. When generating your own data, be careful to respect data privacy regulations. Are you allowed to process this data? Can you release it, or must it remain within the organization? Does it involve any personally identifiable information? If so, have the data subjects given their consent for you to use this data?

Save a local copy

Ideally, your data set should comprise a single CSV file which you can download and store on your laptop. This file can be uploaded to the interactive notebook server for analysis and processing using the techniques you explored earlier in the course.
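As a minimal sketch of what loading such a file might look like, the example below reads CSV text into a list of rows using only Python's standard library. The column names and values are made up for illustration, and the file name `my_data.csv` in the comment is hypothetical. In a notebook you might prefer a data analysis library instead.

```python
import csv
import io

# Hypothetical sample standing in for the contents of a downloaded CSV file.
SAMPLE_CSV = """height_cm,weight_kg,day
170,65,Mon
182,80,Fri
165,58,Fri
"""


def load_rows(handle):
    """Read CSV data from an open file handle into a list of dictionaries."""
    return list(csv.DictReader(handle))


# With a real file you would write something like:
#     with open("my_data.csv", newline="") as f:
#         rows = load_rows(f)
rows = load_rows(io.StringIO(SAMPLE_CSV))

# Each row maps column names to values. Values are read as text at this stage.
print(rows[0])
```

Note that every value is loaded as a string, so numerical columns need converting before you do any arithmetic with them.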

Start looking through the data

This is the initial phase of exploratory data analysis. At this stage, you simply want to know some rough numbers. How many rows of data are there? Each row should correspond to a single observation. How many columns of data are there? Each column should correspond to a single feature, measured as part of an observation.
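One way to get these rough numbers is sketched below. It assumes the first line of the file is a header row of column labels, which may not be true for your data set. The sample data is invented for illustration.

```python
import csv
import io

# Hypothetical sample data; the first line is assumed to be a header row.
SAMPLE_CSV = """height_cm,weight_kg,day
170,65,Mon
182,80,Fri
165,58,Fri
"""

reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)      # column labels
data_rows = list(reader)   # one list of values per observation

n_rows = len(data_rows)    # number of observations
n_cols = len(header)       # number of features

print(f"{n_rows} rows, {n_cols} columns")
```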

Have the columns been labelled sensibly, or do you need to add some labels by hand? Are there missing values in some data columns, or is the data set complete?
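The following sketch lists the column labels and counts the missing values in each column. It assumes that a missing value appears as an empty cell; real data sets may use other markers, such as `NA` or `?`, which you would need to handle separately. The sample data is invented for illustration.

```python
import csv
import io

# Hypothetical sample data with two empty cells.
SAMPLE_CSV = """height_cm,weight_kg,day
170,65,Mon
182,,Fri
165,58,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
columns = list(rows[0].keys())
print("Column labels:", columns)

# Count cells that are empty (or absent) in each column.
missing = {
    col: sum(1 for row in rows if (row[col] or "").strip() == "")
    for col in columns
}

for col, count in missing.items():
    print(f"{col}: {count} missing")
```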

Look for numerical data and find out the range of values, from the minimum to the maximum. Look for categorical data and find out how many distinct categories there are. Then think about how these values are distributed. This should help you to consider possible research questions.
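As a rough sketch, the range of a numerical column and the categories in a categorical column could be found as shown below. The column names `height_cm` and `day` and all the values are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: one numerical and one categorical column.
SAMPLE_CSV = """height_cm,day
170,Mon
182,Fri
165,Fri
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Numerical column: convert the text values to numbers, then find the range.
heights = [float(row["height_cm"]) for row in rows]
height_range = (min(heights), max(heights))

# Categorical column: count how often each distinct category occurs.
day_counts = Counter(row["day"] for row in rows)
n_categories = len(day_counts)

print("Height range:", height_range)
print("Distinct days:", n_categories)
print("Counts per day:", dict(day_counts))
```

The counts per category give a first impression of the distribution, for example whether one category dominates the data.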

Propose a few research questions

Generally, your research questions will involve some kind of relationship between two or more features in the data. Does weight correlate with height? Do more people borrow books from the library on Fridays? Is older music more likely to be played on the radio at night-time? These kinds of questions will emerge as you look at your data. Write down a list of two or three questions you might investigate more closely.
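To illustrate how a question like "Does weight correlate with height?" might be checked, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The `pearson` helper and the height and weight values are invented for this example and are not real measurements. A value near 1 suggests a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 66.0, 75.0, 81.0]

r = pearson(heights_cm, weights_kg)
print(f"Correlation between height and weight: {r:.3f}")
```

Remember that a correlation on its own does not show that one feature causes the other.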

This article is from the free online course Getting Started with Teaching Data Science in Schools.