This content is taken from the Purdue University & The Center for Science of Information's online course, Introduction to R for Data Science. Join the course to learn more.

Skip to 0 minutes and 12 seconds All right, let’s continue this example and make a data frame called longdelayedDF.

Skip to 0 minutes and 19 seconds Let’s get all of the flights from myDF that have a departure delay of more than 30 minutes. And I could go and look and see how many of them there are. There are more than 800,000 of them. So we make a data frame with all of the flights that are delayed more than 30 minutes when departing. There’s over 800,000 such flights. That’s pretty remarkable, all right. So now if I go double check here, for instance, suppose I look at longdelayedDF and look at the departure delays. If I look at the head of those, they had better all be more than 30 minutes.
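The subsetting step described here can be sketched roughly as follows. The transcript does not show the code, so this is only an illustration: the tiny `myDF` below is a made-up stand-in for the real flight data, and the column names `Origin`, `Month`, and `DepDelay` are assumptions.

```r
# Toy stand-in for myDF; the data and the column names are assumptions.
myDF <- data.frame(
  Origin   = c("IND", "ORD", "ORD", "IND", "ORD", "LAX"),
  Month    = c(1, 1, 2, 2, 2, 1),
  DepDelay = c(45, 10, 95, NA, 31, 30)
)

# Keep only the flights that departed more than 30 minutes late.
# subset() also drops rows where the condition is NA.
longdelayedDF <- subset(myDF, DepDelay > 30)

# How many such flights are there?
nrow(longdelayedDF)

# Double check: the first few delays should all exceed 30.
head(longdelayedDF$DepDelay)
```

Note that a delay of exactly 30 minutes is not kept, because the condition is strictly "more than 30".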

Skip to 1 minute and 3 seconds Indeed they are, they are all more than 30 minutes, right, because I’ve taken a subset of the whole data frame of all the flights and just restricted attention to the ones that have a departure delay of more than 30 minutes. So if you remember, what I had done before was I’d gone and found the counts of all of the flights from ORD or IND according to the month.

Skip to 1 minute and 30 seconds And there are those counts. I put it all on one line, so it looks a little nicer there.
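For reference, counts like these could be produced with `tapply`, all on one line. This is a sketch on made-up data, not the code from the video; the column names and the name `allCounts` are assumptions.

```r
# Toy stand-in for myDF; the data and the column names are assumptions.
myDF <- data.frame(
  Origin   = c("IND", "IND", "ORD", "ORD", "ORD", "LAX"),
  Month    = c(1, 2, 1, 1, 2, 1),
  DepDelay = c(5, 40, 50, 0, 35, 12)
)

# Split the delays by origin and month, count each group,
# then keep only the IND and ORD rows.
allCounts <- tapply(myDF$DepDelay, list(myDF$Origin, myDF$Month), length)[c("IND", "ORD"), ]

allCounts
```

The result is a matrix with one row per airport and one column per month.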

Skip to 1 minute and 38 seconds Let’s do the same thing, but now, for the flights with delays more than 30 minutes. So what do I do? I just take that idea I already have, but I apply it to the long delay data frame. I look at all the departure delays in the long delay data frame there,

Skip to 1 minute and 59 seconds break them up according to origin and month and take a length. Take the length of how many of them there are. In other words, take the departure delays, chop them up according to the origin and according to the month, see how many there are, and then I just grab the rows with IND and ORD in there. The counts again have to be smaller than before, right? Because the first result was for all of the flights broken up according to the city of origin and the month. And in the second result we’ve only looked at the really long delayed flights. Again, split up according to the city of origin and the month.
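Here is a small sketch of applying the same counting idea to the long-delay data frame and confirming that its counts never exceed the counts for all flights. The data, the column names, and the names `longCounts` and `allCounts` are assumptions made for illustration.

```r
# Toy stand-in for myDF; the data and the column names are assumptions.
myDF <- data.frame(
  Origin   = c("IND", "IND", "IND", "ORD", "ORD", "ORD", "ORD"),
  Month    = c(1, 1, 2, 1, 1, 2, 2),
  DepDelay = c(45, 5, 60, 50, 40, 35, 0)
)
longdelayedDF <- subset(myDF, DepDelay > 30)

# Counts of long-delayed flights by origin and month, IND and ORD only.
longCounts <- tapply(longdelayedDF$DepDelay,
                     list(longdelayedDF$Origin, longdelayedDF$Month),
                     length)[c("IND", "ORD"), ]

# Counts of all flights by origin and month, IND and ORD only.
allCounts <- tapply(myDF$DepDelay,
                    list(myDF$Origin, myDF$Month),
                    length)[c("IND", "ORD"), ]

# The long-delay counts can never exceed the overall counts.
all(longCounts <= allCounts)
```

One caveat: if some airport and month had no long-delayed flights at all, `tapply` would report `NA` for that cell rather than 0, so a check like this would need `na.rm` handling on real data.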

Skip to 2 minutes and 38 seconds The neat thing now is I can go through and divide the two.

Skip to 2 minutes and 43 seconds I can go take this new result of the flights that have really long delays, and divide by the total number of flights. Now this looks really long and kind of awful, so rather than do that I’m gonna save these things into some intermediate place so that I don’t have to write my division across three lines. We can divide entry by entry to get the percentage of flights that have really long delays, more than 30 minutes, from Indy or from Chicago, broken up month by month. Okay, so I’m gonna write this as matrix one and go write the second one as matrix two. Let’s just go double check what I did.

Skip to 3 minutes and 28 seconds There’s the counts of all the really long delayed flights. There’s all the flights, and I can divide entry by entry. Here the division yields the percentage of flights with long delays. Long delay again meaning more than 30 minutes. And the neat thing here is now I can put this into a dotchart, and if you want to go read about dotcharts you can go load the Help menu for those. I won’t go into all of the details here. This book by William Cleveland, The Elements of Graphing Data, is a great book to use. I use this with my students. It’s something students can understand pretty readily, pretty accessible, has lots of nice examples in there, highly recommended.

Skip to 4 minutes and 12 seconds I also like Paul Murrell’s book here, R Graphics. These are both worth checking out. But let’s go just directly make a dotchart of that result. After we divide these guys, we can go make a dotchart there. It’s a little bit overlapping because my window’s relatively small.
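The entry-by-entry division and the dotchart can be sketched like this. The numbers are invented for illustration, and only two months are shown; `M1` holds the long-delay counts and `M2` the counts of all flights, as in the video.

```r
# Made-up counts for two months; rows are airports, columns are months.
M1 <- matrix(c(30, 90, 20, 80), nrow = 2,
             dimnames = list(c("IND", "ORD"), c("1", "2")))   # long delays
M2 <- matrix(c(300, 600, 250, 500), nrow = 2,
             dimnames = list(c("IND", "ORD"), c("1", "2")))   # all flights

# Dividing two matrices of the same shape works entry by entry.
M3 <- M1 / M2
M3

# One group per month, one dot per airport within each group.
dotchart(M3, xlab = "Proportion of flights delayed more than 30 minutes")
```

Strictly speaking, the division gives a proportion between 0 and 1; multiply by 100 if you want a percentage.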

Skip to 4 minutes and 34 seconds So ORD is the first line and IND is the second line on each of these. And you see, if you just look line by line, the percentage of long delays out of O’Hare is always not just a little bit larger, but often quite a bit larger. In fact, if you wanted to check and see whether really and truly the ones from O’Hare are always more than from Indy, you could go and check. Take M1 divided by M2 and save that in another matrix, M3. M3 has the percentages of the flights with long delays. Again, from IND or ORD and month by month.

Skip to 5 minutes and 14 seconds So the same thing could have been achieved by just taking a dotchart of M3. But the neat thing about doing this is if I look in M3 and I take all of the flights from Indy... I forgot my comma. Yeah, there’s all the percentages month by month for the flights from Indy, and I can also do the same for ORD. I can go subtract now and make sure that there’s always a larger percentage from O’Hare, and indeed there is. I can go make sure these are always positive, and indeed they are. They’re all positive.
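A minimal sketch of that check, using an invented `M3` with three months. The values and the name `difference` are assumptions for illustration only.

```r
# Made-up proportions of long delays; rows are airports, columns are months.
M3 <- matrix(c(0.10, 0.18, 0.12, 0.20, 0.09, 0.15), nrow = 2,
             dimnames = list(c("IND", "ORD"), c("1", "2", "3")))

# Note the comma: it asks for every column of the named row.
M3["IND", ]
M3["ORD", ]

# Subtract row from row, month by month.
difference <- M3["ORD", ] - M3["IND", ]
difference

# TRUE when ORD has the larger share of long delays in every month.
all(difference > 0)
```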

Skip to 5 minutes and 46 seconds There’s always a larger percentage of flights from O’Hare that have this long delay of 30 minutes or more than there are of flights from the smaller airport Indianapolis.

Calculating Percentages of Flights with Long Delays
