Getting numerical data into Python
Suppose we have a question, some data, and we know a little bit of Python. How do we get our data into Python in order to start exploring it?
Getting data into Python is usually quite easy. Here we’re going to focus on numerical data, rather than text, audio or video.
For tiny datasets, we can just type (or copy-and-paste) the data directly into our Python code. For larger datasets, we can import data from a text file, an Excel spreadsheet, a database or from many other sources. For huge datasets, referred to as Big Data, we would need a different approach because the data may not fit onto one computer.
The table below shows the number of births in the USA on each day of the week, covering the years 2000-2014.
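|Day of the week|Number of births|
|Monday|9316001|
|Tuesday|10274874|
|Wednesday|10109130|
|Thursday|10045436|
|Friday|9850199|
|Saturday|6704495|
|Sunday|5886889|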
Since this table is such a tiny dataset (even though the counts are quite large), we can type this data directly into Python. In the Python code below, we store the two columns separately as two Python lists.
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
births = [9316001, 10274874, 10109130, 10045436, 9850199, 6704495, 5886889]
print(weekdays)
print(births)
A list is a data structure that holds several values and keeps them in the order that they are given. We recognise a list from the square brackets. The items in the list can be mixtures of strings, numbers, and data of other types (even lists, so you could have lists of lists).
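As a rough illustration of these properties, the short sketch below uses the births list from above. The names mixed and rows are made up for this example.

```python
births = [9316001, 10274874, 10109130, 10045436, 9850199, 6704495, 5886889]

# Lists keep their order, so positions are meaningful. Counting starts at 0.
first_count = births[0]    # Monday's count
last_count = births[-1]    # Sunday's count (negative indexes count from the end)

# A list can mix strings, numbers and other types.
mixed = ["Monday", 9316001, True]

# A list can also contain other lists. Here each inner list is one table row.
rows = [["Monday", 9316001], ["Tuesday", 10274874]]
tuesday_count = rows[1][1]  # second row, second item

print(len(births), first_count, last_count, tuesday_count)
```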
For a tiny dataset, it’s OK to store the data within your Python code, but for anything larger it’s not a good idea. There’s always a danger of miscopying data or forgetting where the data came from. Also, if you wish to update your dataset, you will have to change the values in your code.
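If we do type data in by hand, a quick sanity check can catch some copying mistakes. The sketch below is one possible approach rather than a standard recipe: it checks that the two lists are the same length and then pairs them in a dictionary (the name births_by_day is our own choice).

```python
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
births = [9316001, 10274874, 10109130, 10045436, 9850199, 6704495, 5886889]

# A mismatch in length usually means a value was dropped or duplicated while copying.
if len(weekdays) != len(births):
    raise ValueError("weekdays and births must have the same number of items")

# Pairing the lists in a dictionary keeps each count attached to its day.
births_by_day = dict(zip(weekdays, births))

# Look up the day whose count is largest.
busiest_day = max(births_by_day, key=births_by_day.get)
print(busiest_day, births_by_day[busiest_day])
```

A check like this will not spot a mistyped digit, which is one more reason to read larger datasets from their original file.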
Getting a bit more advanced
Python is one of the most commonly used programming languages for data science. One reason for this is that it is easy to get started performing data analysis using these four Python libraries: NumPy, Pandas, SciPy and Matplotlib. They are part of a collection of Python libraries known as the SciPy ecosystem. They provide data structures and functions for loading, storing, processing, analysing and plotting data.
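To give a flavour of what these libraries offer, here is a minimal NumPy sketch using the weekday counts from the table above. The variable names are our own, and the split into the first five and last two values assumes the counts are ordered Monday to Sunday.

```python
import numpy as np

# Weekday counts, Monday to Sunday, stored as a NumPy array instead of a list.
births = np.array([9316001, 10274874, 10109130, 10045436, 9850199, 6704495, 5886889])

total = births.sum()

# Arithmetic applies to every element at once, so no loop is needed.
share = births / total

# Slices work as they do for lists: first five values, then the last two.
weekday_mean = births[:5].mean()
weekend_mean = births[5:].mean()

print(share.round(3))
print(weekday_mean, weekend_mean)
```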
The Pandas library provides a function for reading data directly from a comma separated values (CSV) file. That file could be a local file or a file on a website.
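The sketch below shows the idea on a very small scale. So that it runs without any file on disk, the CSV text is held in a string and wrapped in io.StringIO, which stands in for a local file. The column names day and births are our own choice for this example.

```python
import io
import pandas as pd

# Stand-in for a small local CSV file: a header line, then one row per day.
csv_text = """day,births
Monday,9316001
Tuesday,10274874
Wednesday,10109130
Thursday,10045436
Friday,9850199
Saturday,6704495
Sunday,5886889
"""

# read_csv accepts a file path, a URL or a file-like object such as StringIO.
table = pd.read_csv(io.StringIO(csv_text))

print(table.shape)    # (number of rows, number of columns)
print(table.dtypes)   # the type pandas inferred for each column
```

With a real file we would pass its path instead, for example pd.read_csv("births_by_day.csv"), where the filename is hypothetical.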
Have a look at the original dataset US_births_2000-2014_SSA.csv, which the table above summarises. Notice how there is one header line (giving the names of the columns of data) and then one row for each day. The Python code below reads the data from this CSV file directly into Python, ready for further processing.
import pandas as pd
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv'
mydata = pd.read_csv(url)
print(mydata)
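Once daily rows are loaded, a summary by day of the week could be produced along the following lines. This is a sketch under assumptions: the helper total_by_column is our own, the column names day_of_week and births are guesses at the file's header (check the real names with mydata.columns), and the sample rows are made up so that the code runs without a network connection.

```python
import pandas as pd

def total_by_column(data, group_col, value_col):
    """Sum value_col within each distinct value of group_col."""
    return data.groupby(group_col)[value_col].sum()

# Made-up rows for illustration only; they are not taken from the real file.
sample = pd.DataFrame({
    "day_of_week": [1, 2, 1, 2],
    "births": [10, 20, 30, 40],
})

totals = total_by_column(sample, "day_of_week", "births")
print(totals)
```

If the real file does use those column names, the same call on the full dataset would be total_by_column(mydata, "day_of_week", "births").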
Of course, Pandas has functions for reading from all sorts of different sources. CSV is commonly used because it’s easy to read and write using Microsoft Excel or any text editor. In the next step, we’ll go through an example of how you can use a CSV file in Python.
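As a small taste of how simple the CSV format is, the sketch below writes a two-row table to CSV text and reads it straight back. It uses io.StringIO in place of a real file, and the column names are our own choice.

```python
import io
import pandas as pd

original = pd.DataFrame({
    "day": ["Saturday", "Sunday"],
    "births": [6704495, 5886889],
})

# index=False leaves out pandas' row labels, so the text holds only our two columns.
csv_text = original.to_csv(index=False)
print(csv_text)

# Reading the same text back gives a table with the same columns and values.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

The printed text is exactly what we would see if we opened the equivalent file in a text editor: a header line followed by one comma-separated line per row.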
In summary, for a very small amount of data, it is often quick and easy to type (or copy, paste and edit) data directly into a Python script, but for larger datasets, we can read the data directly from a file.
FiveThirtyEight. (2020, June 26). FiveThirtyEight / data. GitHub. https://github.com/fivethirtyeight/data
FiveThirtyEight. (n.d.). US_births_2000-2014_SSA.csv [Dataset]. GitHub. https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
Hoffman, C. (2018). What is a CSV file, and how do I open it? How-To Geek. https://www.howtogeek.com/348960/what-is-a-csv-file-and-how-do-i-open-it/
© Coventry University. CC BY-NC 4.0