This content is taken from the Purdue University & The Center for Science of Information's online course, Introduction to R for Data Science. Join the course to learn more.

Skip to 0 minutes and 12 seconds So before we go discussing the data that we’re gonna work with directly, you’re gonna notice the first time you download the data that you don’t have a program that can uncompress it once it’s downloaded. So let’s go get an uncompression program first. You might already have one but chances are you don’t. For instance, let’s go to www.7-zip.org. This is one place you can get an uncompression program that’ll be able to uncompress the file we’re gonna download. So I’m gonna click on the 64-bit download here. I’m gonna save that file.

Skip to 0 minutes and 59 seconds And I’m gonna install this app in my program files. Again, you may have a different one that you prefer, but you need something that can open compressed files, including the .bz2 file we’re about to download.

Skip to 1 minute and 11 seconds So once you’ve got that downloaded and installed, we can go take a look at the data expo file that we’re gonna look at. You can just search for data expo 2009 in Google if you like, or if you want to go directly to the website you can put in stat-computing.org/dataexpo/2009. The data expo is sponsored by the American Statistical Association, and they have data expos every few years. They’re not offered every single year, but every two or three years, a data expo is offered. And the data expo from 2009 is about airline on-time performance. It’s got about 29 variables from many, many, many flights. Gigabytes and gigabytes of data here. We’re just gonna download one year’s worth today.

Skip to 2 minutes and 1 second So that you can see how to open the data inside R. So once you’ve gotten to the data site for the data expo 2009, you can click on the link to the data here and you can download the data for a particular year. You’re gonna notice that the data isn’t in plain CSV format, it’s in a compressed file (bzip2 format, with a .bz2 extension). So the file we’re gonna download will be called 2008.csv.bz2, and that uncompression program is going to let us uncompress the file so we just have a regular CSV file. But first, let’s click on the 2008 file and save it.

Skip to 2 minutes and 40 seconds You’ll notice this is large data. It’s 100 megabytes even when it’s compressed. So it might take just a little time to download to your computer.

Skip to 2 minutes and 52 seconds I’ve already downloaded it once earlier, and that’s why mine says 2008 with a (1) after it, because I’ve already got a 2008.csv.bz2 file in that same directory.

Skip to 3 minutes and 8 seconds Okay, our file has finished downloading, so I’m gonna go to that folder. I’m gonna click on the 2008.csv.bz2 file. And I’m right-clicking now, I’m gonna click on the right mouse button and I’m gonna open it with 7-Zip. Okay, I’m gonna open that archive. There’s the CSV file that I want. So I’m gonna single-click on it and extract it.

Skip to 3 minutes and 36 seconds It says Copy to my Downloads folder, that’s great. It’ll show you the progress as it’s uncompressing. See, if I just downloaded the raw CSV file over the web, it would take a lot of time, and so that’s why it comes in a compressed format.
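If you'd rather script the extraction step than click through 7-Zip, the same thing can be sketched in a few lines of Python. This is only an illustration using Python's standard bz2 module, not something shown in the video. The function name decompress_bz2 and the chunk size are made up for this example, and it assumes the .bz2 file is already in your folder.

```python
import bz2
from pathlib import Path


def decompress_bz2(src: Path, dest: Path, chunk_size: int = 1024 * 1024) -> int:
    """Stream-decompress a .bz2 file to dest; return the number of bytes written.

    Reading in chunks keeps memory use small even when the
    uncompressed file is hundreds of megabytes.
    """
    written = 0
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Read decompressed data one chunk at a time until the stream is empty.
        while chunk := fin.read(chunk_size):
            fout.write(chunk)
            written += len(chunk)
    return written
```

For the file in this video, a call would look like decompress_bz2(Path("2008.csv.bz2"), Path("2008.csv")), run from the folder where the download was saved.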

Skip to 3 minutes and 56 seconds Okay, so the file is uncompressed now. Let’s go compare here. If I single-click on the compressed file and leave my mouse over it, it’ll show me that that file is 108 MB. If I click on the file I’ve just uncompressed there and mouse over it, it’ll show me that’s 657 MB, more than half a gigabyte of data, just for the 2008 airline flights that we’re going to work with. It’s pretty impressive that compression shrank the file by a factor of more than six for the download.
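The "factor of more than six" is just the uncompressed size divided by the compressed size. Here is a small sketch of that arithmetic in Python, using the two sizes read off the screen in the video. The function name compression_ratio is made up for this example.

```python
def compression_ratio(compressed_size: float, uncompressed_size: float) -> float:
    """Return how many times larger the uncompressed file is than the compressed one."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return uncompressed_size / compressed_size


# Sizes in MB as reported in the video: 108 MB compressed, 657 MB uncompressed.
ratio = compression_ratio(108, 657)
print(f"Compression ratio: {ratio:.2f}")  # prints "Compression ratio: 6.08"
```

Both sizes have to be in the same unit (here, megabytes) for the ratio to mean anything.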

Download Airline Data on Windows

Download the data set here: http://stat-computing.org/dataexpo/2009/

In the comments below, describe what type of data you are most interested in learning about in terms of airline flights (e.g. how many flights landed in Chicago over the course of a year).
