This content is taken from the Purdue University & The Center for Science of Information's online course, Introduction to R for Data Science. Join the course to learn more.

Skip to 0 minutes and 12 seconds Now what I’d like to do is go beyond the Indianapolis Airport and some of the ones we’ve already studied and take a broader look at which airports are the most popular. Let’s see if we can figure that out. For instance, if we wanna look at the most popular airports for departures, we can look at the origin column of the myDF data frame. We can make a table of all of the entries there. They come in alphabetical order, naturally, but if we are concerned about the most popular ones, the thing we can do is sort them.
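As a rough sketch of that step, here is what tabulating and sorting might look like. The toy myDF below is made up so the example runs on its own (it is not the 2008 data), and the column names Origin and Dest are an assumption about how the course's data frame is labeled.

```r
# Toy stand-in for the course's myDF (hypothetical values, NOT the real 2008 data).
# The column names Origin and Dest are assumed.
myDF <- data.frame(
  Origin = rep(c("ATL", "ORD", "DFW", "IND"), times = c(4, 3, 2, 1)),
  Dest   = c("ORD", "DFW", "ORD", "IND", "ATL", "ATL", "DFW", "ATL", "ORD", "ATL"),
  stringsAsFactors = FALSE
)

# Count the flights per origin airport; the names come out in alphabetical order
originCounts <- table(myDF$Origin)
originCounts

# sort() orders by count, smallest first, so the busiest airport lands at the end
sortedCounts <- sort(originCounts)
sortedCounts
```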

Skip to 0 minutes and 45 seconds If we sort them, the most popular ones, the ones with the most flights, come at the end of the list. For instance, I’d like to take the top five or the top ten of them. It might be more natural to put them in decreasing order, and there’s a parameter for that. You can see if you look at the help for sort that there’s a decreasing parameter, which by default is FALSE. So what we can do is look at the origin airports in order, from most popular to least popular, by using the decreasing = TRUE parameter when we sort.

Skip to 1 minute and 26 seconds Okay, so I put that in there with my sort.

Skip to 1 minute and 30 seconds Now you notice the least popular ones are the ones at the end and the most popular ones are the ones at the beginning. Now, I can put a bracket here, for instance, and say, well, what’s the third one of those? And it should be DFW. Take a look. Or the second one, for instance, was O’Hare. In R you can make a whole vector of numbers by just using a colon.
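A small sketch of sorting in decreasing order and then picking out a single entry with brackets. The data here is a made-up toy (not the course's data), the Origin column name is assumed, and byPopularity is just an illustrative variable name.

```r
# Toy stand-in for myDF (hypothetical values; column name Origin is assumed)
myDF <- data.frame(
  Origin = rep(c("ATL", "ORD", "DFW", "IND"), times = c(4, 3, 2, 1)),
  stringsAsFactors = FALSE
)

# decreasing = TRUE puts the airport with the most flights first
byPopularity <- sort(table(myDF$Origin), decreasing = TRUE)
byPopularity

# Square brackets pull out one position of the sorted table
byPopularity[2]   # second most popular origin in the toy data
byPopularity[3]   # third most popular origin in the toy data
```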

Skip to 1 minute and 55 seconds For instance, if I put 1:10, I’m gonna get all the numbers from one up to ten. So let’s use that as our way of indexing into the vector. Pretty powerful stuff here. You can just go get the first ten most popular airports according to the origin of the flights in this way.
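Here is a sketch of the colon trick. The toy data only has four airports, so it takes the top three with 1:3 where the video uses 1:10; the data values and the names byPopularity and topThree are made up for illustration.

```r
# The colon operator builds a vector of consecutive integers
oneToTen <- 1:10
oneToTen

# Toy stand-in for myDF (hypothetical values; column name Origin is assumed)
myDF <- data.frame(
  Origin = rep(c("ATL", "ORD", "DFW", "IND"), times = c(4, 3, 2, 1)),
  stringsAsFactors = FALSE
)
byPopularity <- sort(table(myDF$Origin), decreasing = TRUE)

# Index with a whole vector of positions to get the first few entries at once
topThree <- byPopularity[1:3]
topThree
```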

Skip to 2 minutes and 16 seconds These are the ten most popular airports according to the number of flights originating there. Let’s do the same for the destination airports. It might be that they’re in exactly the same order or a pretty comparable order, because there are big gaps in between those numbers. So we might not expect them to be out of order if we just look at destinations. Look at that. They’re in exactly the same order. Remember, altogether in 2008, there were about 7 million flights. You can see that from dim(myDF): there are about 7 million rows and 29 columns.
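A sketch of repeating the ranking for destinations and checking the size of the data frame. Again the data is a small made-up example, so dim() reports 10 rows and 2 columns here, not the roughly 7 million rows and 29 columns of the real data; the column names Origin and Dest are assumed.

```r
# Toy stand-in for myDF (hypothetical values; column names are assumed)
myDF <- data.frame(
  Origin = rep(c("ATL", "ORD", "DFW", "IND"), times = c(4, 3, 2, 1)),
  Dest   = c("ORD", "DFW", "ORD", "IND", "ATL", "ATL", "DFW", "ATL", "ORD", "ATL"),
  stringsAsFactors = FALSE
)

# Rank airports by departures and by arrivals
originRank <- sort(table(myDF$Origin), decreasing = TRUE)
destRank   <- sort(table(myDF$Dest), decreasing = TRUE)
originRank
destRank

# Do the two rankings list the airports in the same order?
sameOrder <- identical(names(originRank), names(destRank))
sameOrder

# dim() gives the number of rows (flights) and the number of columns
dim(myDF)
```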

Skip to 3 minutes and 5 seconds Now let’s look at the origins of all the flights using this command, %in%. The %in% operator checks each element of the thing on the left and sees if it’s among the elements of the thing on the right. Let’s see whether each origin was among the most popular airports. I haven’t created this mostpopular vector yet, but we’ll do that in a second. We’re gonna go through each of the origin airports and see if it’s in a vector called mostpopular. And I’m gonna make a vector called mostpopular just so I don’t have to keep writing out what the ten most popular airports are.

Skip to 3 minutes and 44 seconds I wanna take those ten most popular airports and I don’t wanna get their counts, which we see here, I wanna get the names of the airports. I wanna get this information here. So that’s stored in the names of the table.

Skip to 3 minutes and 58 seconds Before we run the next command, let’s go look at what the most popular turn out to be. Indeed, those are character strings there that show you the ten most popular airports. So if I go check each of the origins of the flights and see if they’re in mostpopular, we’re gonna get a vector of trues and falses. Let’s make some notes here. These are the names of the ten most popular airports in 2008.
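Roughly, building mostpopular and testing membership could look like this. The toy data has only four airports, so this sketch keeps the top two where the video keeps the top ten; the values are made up, the Origin column name is assumed, and isPopularOrigin is an illustrative name.

```r
# Toy stand-in for myDF (hypothetical values; column name Origin is assumed)
myDF <- data.frame(
  Origin = rep(c("ATL", "ORD", "DFW", "IND"), times = c(4, 3, 2, 1)),
  stringsAsFactors = FALSE
)

# names() keeps the airport codes and drops the counts
# (top 2 here; the video uses 1:10 on the full data)
mostpopular <- names(sort(table(myDF$Origin), decreasing = TRUE))[1:2]
mostpopular

# %in% checks each origin against mostpopular, giving one TRUE/FALSE per flight
isPopularOrigin <- myDF$Origin %in% mostpopular
isPopularOrigin
```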

Skip to 4 minutes and 24 seconds Check each flight to see whether its origin was one of these ten most popular airports.

Skip to 4 minutes and 34 seconds And we’ll get trues and falses, so we could sum up all the trues as ones and the falses as zeros. Wow, 2.3 million of those flights originated at one of the ten most popular airports.
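A sketch of the counting step: sum() treats each TRUE as 1 and each FALSE as 0. The data is a made-up toy, so the counts below are for the toy flights and not the 2.3 million from the real data; column names and the mostpopular cutoff of two airports are assumptions for the example.

```r
# Toy stand-in for myDF (hypothetical values; column names are assumed)
myDF <- data.frame(
  Origin = rep(c("ATL", "ORD", "DFW", "IND"), times = c(4, 3, 2, 1)),
  Dest   = c("ORD", "DFW", "ORD", "IND", "ATL", "ATL", "DFW", "ATL", "ORD", "ATL"),
  stringsAsFactors = FALSE
)
mostpopular <- names(sort(table(myDF$Origin), decreasing = TRUE))[1:2]

# Number of flights that departed from one of the most popular airports
popularOrigins <- sum(myDF$Origin %in% mostpopular)
popularOrigins

# Same idea for arrivals: swap the origin column for the destination column
popularDests <- sum(myDF$Dest %in% mostpopular)
popularDests
```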

Skip to 4 minutes and 55 seconds You could do the same for the destinations. We would just change origin to destination. Again, it’s very close to being 2.3 million. Another thing you could do is see for how many flights the origin was among the most popular and, at the same time, the destination was among the most popular. Okay, this is gonna be a vector of trues and falses, and this is gonna be another vector of trues and falses. And the ampersand means and. It means that for an entry to be true, there has to be a true in the corresponding position of this vector and of this vector.

Skip to 5 minutes and 33 seconds Let’s find the flights for which the origin and the destination were among the ten most popular airports.

Skip to 5 minutes and 43 seconds And we can sum up how many trues we have in there.

Skip to 5 minutes and 48 seconds Half a million of the flights had both origin and destination at one of the ten most popular airports.
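Putting the two checks together with the ampersand might look like the sketch below. It uses the same kind of made-up toy data, so the result is a count for ten toy flights, not the half a million from the real data; the column names and the names bothPopular and mostpopular's cutoff of two airports are assumptions.

```r
# Toy stand-in for myDF (hypothetical values; column names are assumed)
myDF <- data.frame(
  Origin = rep(c("ATL", "ORD", "DFW", "IND"), times = c(4, 3, 2, 1)),
  Dest   = c("ORD", "DFW", "ORD", "IND", "ATL", "ATL", "DFW", "ATL", "ORD", "ATL"),
  stringsAsFactors = FALSE
)
mostpopular <- names(sort(table(myDF$Origin), decreasing = TRUE))[1:2]

# & compares the two logical vectors position by position:
# TRUE only where both the origin and the destination are in mostpopular
bothPopular <- myDF$Origin %in% mostpopular & myDF$Dest %in% mostpopular
bothPopular

# Count the flights with both ends at a most popular airport
sum(bothPopular)
```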

Identifying the Most Popular Airports
