This content is taken from the Purdue University & The Center for Science of Information's online course, Introduction to R for Data Science.

Skip to 0 minutes and 12 seconds Okay, one thing we haven’t done yet is download data from multiple years of the airline data set and then use it all together in R. So, let’s go back to the Data Expo 2009 website. Let’s go to Download the Data, and download the 2008, 2007, and 2006 data. While that’s going, I’m just going to reopen RStudio. And I’m gonna start a new file called moreconcepts.r. You’ll notice that since I closed RStudio, I don’t have anything in my global environment. I don’t have any variables, I’m just giving myself a fresh start. So let’s wait for the downloads to finish, and then we’ll come back in a minute.

Skip to 0 minutes and 57 seconds So now that the downloads have finished, if I go into my downloads folder, I should see a 2006.csv.bz2 file, a 2007.csv.bz2 file, and the same for 2008. If I just double click on those, they should open on my Mac. I’ve already got an uncompression utility built in. And on Windows, again, we went and got such an uncompression utility and already used it before. So, just go back to the very beginning of the course and remind yourself how we did that if you don’t remember. So let’s give ourselves a temporary data frame here by reading in this first data set from 2006. And while that’s loading in, I’m gonna copy and paste to prepare to load in the next data frame from 2007.

Skip to 1 minute and 42 seconds And then one more from 2008. Okay, so I’m gonna have three temporary data frames. It’s gonna take a second to run. I’ll just pause this while it’s running, but I wanna go run each of those three lines. All right, now we’ve got all three of those loaded in, and that took a few minutes. Notice, it makes sense that it took a little while, right, because each of those data frames has more than 7 million rows and 29 variables. If I go take a look at the dimensions of any of these individual ones, I’m gonna see that. I get the same number of rows and columns as we see over here in our little environment window, all right?
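The loading step described above might look roughly like the sketch below. The transcript does not give the variable names, so tempDF2006, tempDF2007, and tempDF2008 are assumptions, and tiny stand-in files are written to a temporary folder so the code runs without the real downloads.

```r
# Stand-in for the real downloads: write three tiny CSV files to a temporary
# folder. With the real data, point read.csv at the uncompressed 2006.csv,
# 2007.csv, and 2008.csv in your downloads folder instead.
dataDir <- tempdir()
for (yr in 2006:2008) {
  tiny <- data.frame(Year = yr, Month = 1:2, DepDelay = c(5, -3))
  write.csv(tiny, file.path(dataDir, paste0(yr, ".csv")), row.names = FALSE)
}

# One temporary data frame per year (these names are hypothetical).
tempDF2006 <- read.csv(file.path(dataDir, "2006.csv"))
tempDF2007 <- read.csv(file.path(dataDir, "2007.csv"))
tempDF2008 <- read.csv(file.path(dataDir, "2008.csv"))

# dim() reports rows and columns. The stand-ins are tiny; the real files
# each have more than 7 million rows and 29 columns.
dim(tempDF2006)
```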

Skip to 2 minutes and 18 seconds So, let’s go build ourselves a new data frame that has all three of these together. And to do that, we can use something called rbind. Rbind is a row bind. It’s gonna take all of the rows of the first data frame, then all of the rows of the second data frame, then all of the rows of the third data frame, and bind those into one really large data frame. And then it’ll name the columns of this new myDF with the same names as the columns from these three smaller data frames.
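A minimal sketch of the row bind, using small made-up data frames in place of the three full years (the tempDF names are assumptions; myDF is the name used in the video):

```r
# Small stand-ins for the three yearly data frames (made-up values).
tempDF2006 <- data.frame(Year = 2006, DepDelay = c(5, -3))
tempDF2007 <- data.frame(Year = 2007, DepDelay = c(12, 0))
tempDF2008 <- data.frame(Year = 2008, DepDelay = c(-1, 30))

# rbind stacks the rows in the order given: 2006 first, then 2007, then 2008.
myDF <- rbind(tempDF2006, tempDF2007, tempDF2008)

# The row counts add up, and the column names carry over unchanged.
nrow(myDF)
names(myDF)
```

Note that rbind on data frames expects the inputs to share the same column names, which is the case for the yearly airline files.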

Skip to 2 minutes and 46 seconds Okay, so when this is done row binding all of these together, we should have a new data frame with more than 21 million rows, right? And we see that it finished, and indeed, it’s got more than 21 million rows there. And it didn’t take as long to do that as it did to load each of the three smaller data frames, because it’s already got all three of those data frames in memory, in the RAM. So, it didn’t have to go out to the disk to get those. So, there is no sense in our keeping these three smaller data frames around anymore.

Skip to 3 minutes and 15 seconds And so, I can just go remove each of those now from R’s memory so that all I’m working with is just this new larger data frame. And I can go check and see that it looks very much like the data frame we’ve been dealing with in the past from the 2008 flights. This time it’s got all the 2006 flights, the 2007 flights, and the 2008 flights. Here at the head, I see the first six flights from 2006. If I go look at the tail, I’m gonna see the 2008 stuff: the tail will show me the last six flights from 2008.
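The cleanup and the head/tail check could be sketched as follows, again with small made-up data frames and hypothetical tempDF names standing in for the real ones:

```r
# Seven made-up rows per year, so head() and tail() each land in a single year.
tempDF2006 <- data.frame(Year = rep(2006, 7), DayofMonth = 1:7)
tempDF2007 <- data.frame(Year = rep(2007, 7), DayofMonth = 1:7)
tempDF2008 <- data.frame(Year = rep(2008, 7), DayofMonth = 1:7)
myDF <- rbind(tempDF2006, tempDF2007, tempDF2008)

# Remove the three smaller data frames; myDF keeps its own copy of the rows.
rm(tempDF2006, tempDF2007, tempDF2008)

# head() shows the first six rows (all 2006 here);
# tail() shows the last six rows (all 2008 here).
head(myDF)
tail(myDF)
```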

Skip to 3 minutes and 48 seconds If you wanna go make sure that you’ve got all three years in there, since you’re not seeing the 2007 stuff, you can go look at the year column, and just look at all the unique entries in the year column. And of course, you see there are three unique entries. There’s the 2006 data, the 2007 data, and the 2008 data. Unique just says go through an entire vector, in this case the year column from myDF, and tell me all the unique entries that are in there.
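As a small illustration, assuming the year column is named Year as it is in the Data Expo 2009 files, the check might look like this on a made-up myDF:

```r
# Made-up stand-in for the combined data frame: three rows per year.
myDF <- data.frame(Year = rep(c(2006, 2007, 2008), each = 3),
                   DepDelay = c(5, -3, 0, 12, 7, -2, 30, 1, -4))

# unique() walks the whole Year vector and returns each distinct value once,
# in the order in which it first appears.
unique(myDF$Year)
```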

Assembling Multiple Years of Airline Data

When downloading and opening 2006.csv, 2007.csv, and 2008.csv, your computer might add a (1) after the year (e.g. 2008(1).csv), since we already downloaded that data set earlier in the course. If you receive a “file not found” error, either rename the downloaded file to remove the (1), or add the (1) to the file name in your R command.
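One way to guard against this is to check which file name actually exists before reading it. The sketch below is only an illustration: findYearFile is a hypothetical helper, not part of the course code, and the demo folder is a stand-in for your downloads folder.

```r
# Hypothetical helper: return whichever of the two likely file names exists.
findYearFile <- function(year, dir = ".") {
  candidates <- file.path(dir, paste0(year, c(".csv", "(1).csv")))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) stop("No file found for ", year, " in ", dir)
  found[1]
}

# Demo in a temporary folder that contains only the renamed download.
demoDir <- file.path(tempdir(), "renamedDownload")
dir.create(demoDir, showWarnings = FALSE)
write.csv(data.frame(Year = 2008), file.path(demoDir, "2008(1).csv"),
          row.names = FALSE)

# The helper falls back to the (1) name, and read.csv then finds the file.
path <- findYearFile(2008, demoDir)
basename(path)
tempDF2008 <- read.csv(path)
```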
