This content is taken from the Purdue University & The Center for Science of Information's online course, Introduction to R for Data Science. Join the course to learn more.

Skip to 0 minutes and 11 seconds So here is another question we could answer with the tapply function. Which day of the week should we fly if we want to minimize the expected arrival delay of the flight? We can use the tapply function. The data that we wanna know about is the arrival delays. We wanna break the data up according to the day of the week. If we look at the different variables we have, we see that DayOfWeek is a variable that has Monday as the value 1, Tuesday as the value 2, and so on up through Sunday as the value 7.

Skip to 0 minutes and 44 seconds So we’ll break the data up according to the day of the week, and the function we might choose to take is again the mean, and again we’ll remove any NAs that we have. This isn’t gonna look at one specific airport, this is gonna look across all of the airports and see what the arrival delay is, and which day of the week has the smallest arrival delay. If you like, we can go plot the results. There’s not a huge difference between a 6-minute arrival delay on average and an 11-minute arrival delay on average. But there is some variation amongst the days of the week. We’ll make ourselves a note here that 1 denotes Monday, 2 denotes Tuesday, etc., and 7 denotes Sunday.
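The call described above can be sketched as follows. This is only an illustration, not the code shown in the video: the data frame name `flights` and the column name `ArrDelay` are assumptions (only `DayOfWeek` is named in the transcript), and the values are made up so the block runs on its own.

```r
# Toy stand-in for the flight data; the values are invented for illustration.
# `flights` and `ArrDelay` are assumed names; `DayOfWeek` comes from the video.
flights <- data.frame(
  DayOfWeek = c(1, 1, 2, 2, 3, 3, 7, 7),   # 1 = Monday, ..., 7 = Sunday
  ArrDelay  = c(10, 4, NA, 6, 12, 8, 3, 5) # minutes; NA marks a missing value
)

# Break the arrival delays up by day of the week and take the mean of each
# group. na.rm = TRUE is passed through to mean() so the NA is ignored.
avg_delay <- tapply(flights$ArrDelay, flights$DayOfWeek, mean, na.rm = TRUE)
print(avg_delay)

# The day with the smallest average arrival delay in this toy data.
best_day <- names(avg_delay)[which.min(avg_delay)]
print(best_day)
```

Passing `avg_delay` to `plot()` would draw the averages, which is one way to do the plotting step mentioned in the video.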

Skip to 1 minute and 26 seconds Now sometimes we want to further specify what data we’re working on. So we might, for instance, only wanna look at arrival delays for flights coming into Indianapolis, and only look at the days of the week for those same flights, and we can do that. What we do is we put a bracket after each one of them, and we’re gonna put the exact same condition in each of those brackets. What condition are we gonna put? Let’s suppose, for instance, we want to look at flights coming into Indianapolis, so where the destination is IND. To test that, I need to put a ==. If I look at the head of those, the first six of them are all false.

Skip to 2 minutes and 4 seconds If I look at the tail, the last six of them are all false. But altogether, there are 42,000 such flights. So I can use these trues and falses as indices, as ways of looking into these other two vectors here. Okay, so I’m gonna look at the arrival delays, only for flights that have IND as the destination. And I’ll write on another line here, so it’s a little more readable. Also look at the days of the week, only for flights that have IND as the destination.
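Here is a small sketch of using trues and falses as indices. The names `flights`, `Dest`, `ArrDelay`, and `to_ind` are assumptions made for this example, and the data is made up; the real data set has about 42,000 matching flights rather than five.

```r
# Toy stand-in for the flight data; names and values are assumptions.
flights <- data.frame(
  DayOfWeek = c(1, 1, 2, 2, 3, 3, 7, 7),
  Dest      = c("IND", "ORD", "IND", "IND", "ORD", "IND", "ORD", "IND"),
  ArrDelay  = c(10, 4, NA, 6, 12, 8, 3, 5)
)

# == compares every destination with "IND" and returns a logical vector.
to_ind <- flights$Dest == "IND"
print(head(to_ind))

# TRUE counts as 1, so sum() gives the number of flights into IND.
n_ind <- sum(to_ind)
print(n_ind)

# The same logical vector goes in the brackets after each of the two vectors.
ind_delays <- flights$ArrDelay[to_ind]
ind_days   <- flights$DayOfWeek[to_ind]
```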

Skip to 2 minutes and 31 seconds In general, if you’re wondering if you’ve got the tapply set up correctly, I always encourage my students to go take a look at the length of each of the vectors and make sure that they’re the same. There ought to be about 42,000 of those, and there ought to be about 42,000 of these. And if the lengths are different, if I don’t have the right lengths, if I’ve indexed one vector differently than another, my tapply function isn’t gonna work right. Okay, the tapply function has to have every element of the first vector corresponding to the element in the same position of the second vector. So what are we doing now?
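The length check can be sketched like this, again on made-up data with assumed names (`flights`, `Dest`, `ArrDelay`). The second half shows what happens if only one of the two vectors is indexed: `tapply` stops with an error, which is caught here with `tryCatch` so the script keeps running.

```r
# Toy stand-in for the flight data; names and values are assumptions.
flights <- data.frame(
  DayOfWeek = c(1, 1, 2, 2, 3, 3, 7, 7),
  Dest      = c("IND", "ORD", "IND", "IND", "ORD", "IND", "ORD", "IND"),
  ArrDelay  = c(10, 4, NA, 6, 12, 8, 3, 5)
)
to_ind <- flights$Dest == "IND"

# Both vectors are indexed the same way, so their lengths agree.
len_delays <- length(flights$ArrDelay[to_ind])
len_days   <- length(flights$DayOfWeek[to_ind])
print(c(len_delays, len_days))

# Forgetting the brackets on the second vector gives 5 values against 8
# groups labels, and tapply refuses to run. The error message is captured.
mismatch <- tryCatch(
  tapply(flights$ArrDelay[to_ind], flights$DayOfWeek, mean, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
print(mismatch)
```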

Skip to 3 minutes and 1 second We’re answering the same question but restricting attention to flights that have IND as the destination airport, right? And there we get the average arrival delay of the flights for each day of the week, restricting only to IND arrivals. And here again, this is just a way of checking, okay? Just double checking that we are working on two vectors that have the same lengths. Now we’re recognizing that, wow, we really know quite a bit about R. We’re using very powerful functions in R. Again, notice we haven’t had to use any loops. R is vectorizing all of these things for us.
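Putting the pieces together, the restricted version of the question might look like the sketch below. As before, `flights`, `Dest`, and `ArrDelay` are assumed names and the values are made up, so the averages printed here say nothing about the real data.

```r
# Toy stand-in for the flight data; names and values are assumptions.
flights <- data.frame(
  DayOfWeek = c(1, 1, 2, 2, 3, 3, 7, 7),
  Dest      = c("IND", "ORD", "IND", "IND", "ORD", "IND", "ORD", "IND"),
  ArrDelay  = c(10, 4, NA, 6, 12, 8, 3, 5)
)

# Average arrival delay by day of the week, only for flights into IND.
# The same condition appears in both sets of brackets, and no loop is needed.
ind_avg <- tapply(
  flights$ArrDelay[flights$Dest == "IND"],
  flights$DayOfWeek[flights$Dest == "IND"],
  mean, na.rm = TRUE
)
print(ind_avg)
```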

Skip to 3 minutes and 46 seconds R is applying these functions exactly the way we want it to. And breaking the data up in the ways that we want to, in really powerful ways. But it takes some practice. Once you get used to the tapply function, it is really powerful. I think you’ll really enjoy knowing how to use it. And I just wanna continue giving you examples of how to use the tapply function, so you get comfortable with it.

Arrival Delays by Day of the Week

Make a plot of the average departure delays for each airport of origin. Make a comment below to work on this with others.

When you are finished with your plot, you may mark this step complete!
