This content is taken from the Purdue University & The Center for Science of Information's online course, Introduction to R for Data Science. Join the course to learn more.

Skip to 0 minutes and 12 seconds Before, we’d used the table command up here to find out how many flights there were within different months. Let’s use the table command in a couple of other ways here. Okay, if I go look at my DF and I go find all of the origin cities, I could make a table of the results there. And what would happen is R would go through every possible origin city for a flight and find out how many times that city appeared. It’s gonna go through Detroit and find how many times it appeared, Los Angeles, Boise, Idaho, Columbus, Ohio, Miami, Florida.

Skip to 0 minutes and 49 seconds It’s gonna go through every single city whose airport appears, every airline code, every airport code there, and build a table of how many times each one of those appears. It’s gonna do it really fast, as well. Now the results come back in alphabetical order. We’re still getting used to looking at how our output is. Again, if the screen were really wide, you’d have just two rows for your output. You’d have a row of the airport codes and the counts that go with them. Now since your screen isn’t really wide, you’ve got another row of airport codes and the counts that go with them, and another row of airport codes and the counts that go with them.

Skip to 1 minute and 24 seconds And already, you see it’s probably doing the right thing because there’s Atlanta. Wow, it’s got 414,000 flights just in 2008. Then you might say to yourself, I wonder which airport had the most altogether. You can take that table and sort the results. Of course, it’s not gonna sort by what’s in the header, but rather by what the values actually are, the numeric values. Okay, so if you sort, indeed, Atlanta actually had the most flights. Chicago O’Hare had 350,000, Dallas Fort Worth had 281,000, and so on. Where’s our Indy? Indy’s right here with 42,000 flights, which is what we had thought earlier. Okay, so let’s make a note of what we’ve done here.
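The table-then-sort step described above can be sketched as follows. This is a minimal illustration on made-up data, not the course's actual file: the data frame name `myDF` and the column name `Origin` are assumptions, and the seven rows are invented so the example runs on its own.

```r
# Toy stand-in for the 2008 flight data. The names myDF and Origin are
# assumptions, and these rows are made up for illustration.
myDF <- data.frame(Origin = c("ATL", "ORD", "ATL", "IND", "DFW", "ATL", "ORD"))

# Count how many times each airport code appears as an origin.
# The result is ordered alphabetically by airport code.
origin_counts <- table(myDF$Origin)
origin_counts

# sort() orders the table by its counts, not by the airport codes.
sorted_counts <- sort(origin_counts)
sorted_counts

# After an ascending sort, the busiest origin is the last entry.
busiest <- names(sorted_counts)[length(sorted_counts)]
busiest
```

On the real data, the same two calls would be expected to put Atlanta at the end of the sorted output, as described in the video. Passing `decreasing = TRUE` to `sort()` would list the busiest airport first instead.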

Skip to 2 minutes and 6 seconds Let’s build a table that shows how many flights are the origin throughout 2008. I said how many flights, I mean how many cities are the origin for the flights throughout 2008. Okay, then we sort the results and see that Atlanta is most often used as the city of origin for flights. Let’s look at another example with table. I’ve just saved my code, and I’m encouraging you again to save early and often. Let’s suppose that you wanna take all of the departure times, and we’re gonna cut them using some different breaks, okay? We’re gonna chop the departure times up according to different values, okay? So I haven’t told you what the breaks are yet, okay? Let’s go look at the departure times.

Skip to 2 minutes and 52 seconds The head of the departure times is like this, right?

Skip to 2 minutes and 55 seconds This one left at 8:03 in the evening, that’s 2003.

Skip to 2 minutes and 57 seconds And this one left at 7:54 in the morning and so on. So a natural thing to do would be to break every 100 values, right? Because these are listed as four-digit numbers here, right? So if I broke it at 0, 100, 200, 300, 400, 500, 600, that would be like breaking it at midnight, 1 AM, 2 AM, 3 AM, 4 AM, 5 AM, and so on. So let’s build a sequence of what breaks we want. Okay, let’s make a sequence from 0 to 2400 by 100. Those would be natural places to chop up our data. So here are some sample departure times.
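The sequence of breaks described above can be built with `seq()`. The variable name `hour_breaks` is just an illustrative choice, not something taken from the video.

```r
# Hourly break points for times stored as four-digit numbers
# (for example, 2003 means 8:03 PM and 754 means 7:54 AM).
hour_breaks <- seq(0, 2400, by = 100)
hour_breaks

# 25 break points mark the edges of 24 hour-long intervals.
length(hour_breaks)
```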

Skip to 3 minutes and 33 seconds We might want to break them up into these categories, and when I say these categories, I don’t mean exactly at the time 1 AM, 2 AM, 3 AM, and so on, but in between those times. So let’s chop up the data, let’s cut the data using these as the breaks, okay? It takes a minute because R’s gotta go through all 7 million of those. And now R has gone and, for every single departure time, found which chunk it goes into. Like this flight here, the 985th flight departed between 8 and 9 AM. The 989th flight departed between 6 AM and 7 AM.
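As a rough sketch of the cutting step, here is `cut()` applied to a handful of invented departure times. The vector `dep_times` is hypothetical and stands in for the departure-time column of the real data.

```r
# A few made-up departure times in the same four-digit style.
dep_times <- c(2003, 754, 628, 926, 1829, 15)

# Break points at every hour, from midnight (0) to midnight (2400).
hour_breaks <- seq(0, 2400, by = 100)

# cut() assigns each time to the interval it falls in. With the defaults,
# each interval is open on the left and closed on the right.
dep_hour <- cut(dep_times, breaks = hour_breaks)
dep_hour

# The integer codes show which of the 24 intervals each time landed in.
hour_index <- as.integer(dep_hour)
hour_index

# dig.lab asks cut() to use up to 4 digits when writing the interval labels,
# which can make the later hours easier to read.
dep_hour_readable <- cut(dep_times, breaks = hour_breaks, dig.lab = 4)
dep_hour_readable
```

The result is a factor with one level per hour-long interval, which is what makes it possible to tabulate it afterwards.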

Skip to 4 minutes and 8 seconds This one’s a little hard to tell, the 988th flight departed between, let’s see, 2,000 and 2,100, like between 8 PM and 9 PM. So if we table these, that’s kind of neat. We can see that there were 20,000 flights that departed between midnight and 1 AM, 5,700 flights between 1 AM and 2 AM, only 1,700 flights between 2 AM and 3 AM. Okay, so this enumerates the number of flights that departed within each hour range during the course of the day. And then we could go plot that. Wrap the output into a plot, okay?
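Tabulating the cut departure times and plotting the result might look like the sketch below. The departure times are invented for illustration, so the counts bear no relation to the real 2008 figures quoted in the video.

```r
# Made-up departure times; the real data has about 7 million of them.
dep_times <- c(15, 628, 640, 754, 712, 733, 926, 1829, 2003)

# Cut into hour-long intervals, then count the flights in each interval.
hourly_counts <- table(cut(dep_times, breaks = seq(0, 2400, by = 100)))
hourly_counts

# Wrapping the table in plot() draws one vertical line per interval,
# with height equal to the number of flights in that hour.
plot(hourly_counts)
```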

Skip to 4 minutes and 45 seconds So there weren’t that many departing at 1 AM, 2 AM, 3 AM, 4 AM, and so on, but look, between 6 AM and 7 AM, the flights start to pick up, and wow, by 7 AM to 8 AM, there’s a ton of flights. It’s kind of hard to tell if that’s exactly the right one because you see our labels on our x axis aren’t done very well. The 6 AM to 7 AM might actually be this one. This might be 5 AM to 6 AM. We’ll learn more about setting the values of the x axis soon. In the meantime, we can go tell where the big jump occurs. And indeed, it does occur at actually 5 AM to 6 AM.
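In the video the jump is found by reading the table by eye. One possible way to locate it in code, which is not shown in the video, is to take differences between consecutive hourly counts with `diff()`. The data and variable names below are invented for illustration.

```r
# Made-up departure times with a visible pick-up in the early morning.
dep_times <- c(15, 520, 610, 628, 640, 655, 712, 733, 754, 926)
hourly_counts <- table(cut(dep_times, breaks = seq(0, 2400, by = 100)))

# Change in the count from each hour to the next.
jumps <- diff(as.vector(hourly_counts))

# Position of the largest increase; the jump lands in the following interval.
biggest <- which.max(jumps)
jump_interval <- names(hourly_counts)[biggest + 1]
jump_interval
```

Reading the interval label directly avoids relying on the plot's x axis labels, which the video notes are hard to line up with the bars.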

Skip to 5 minutes and 17 seconds You see we’re up to 164,000 flights here and then 445,000 flights here. So busy, busy, busy all day and then tapering down at the end of the day and so on. Lots and lots more to learn, but I’m just emphasizing the ways that we can kind of quickly and easily discern a lot of information about these airline flights. Okay, so let’s make ourselves a note here. Here’s the corresponding plot. We will improve the way the x axis looks later.

Introduction to Plotting in R

Feel free to pause the videos as needed as you follow along in R.
