Skip to 0 minutes and 12 seconds One more thing we might wanna do is go through and look day to day and find out when there are delays. So, here’s a reality check. How many days should there be during 2006 to 2008? There should be 365 days in 2006, 365 days in 2007 and 366 days in 2008, because 2008 is a leap year. It’s always good to just go checking. So we should have 1,096 days in our 3 year dataset that we’ve been looking at. And one other piece of information we might want, so
Skip to 0 minutes and 49 seconds we don’t have to type these out, is handy: we have the month abbreviations, stored in R in something called month.abb. So if you go look at month.abb, there they all are. We can use numbers from 1 to 12 as indices into this vector. Okay, so for instance if I take month.abb, and I’d like to print out September, September, January, February, March, September, December, January, January, January, June, another June, and, let’s say, one July, I can do that, right? This crazy, almost random sequence is exactly what I had specified. I had asked for September, September, January, February, March, September, December, January, January, January, June, June, July. Okay? But you can use other things as the indices for the months.
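The reality check on the number of days and the lookup into month.abb can be sketched as follows. This is a minimal illustration, not the exact commands typed in the video; the names all_days, idx and picked are made up for the example.

```r
# Count every calendar day from 1 Jan 2006 through 31 Dec 2008 (inclusive).
all_days <- seq(as.Date("2006-01-01"), as.Date("2008-12-31"), by = "day")
length(all_days)   # 365 + 365 + 366 days

# month.abb is a built-in character vector of the 12 month abbreviations.
month.abb

# Any numbers from 1 to 12 can be used as indices, in any order, with repeats.
idx <- c(9, 9, 1, 2, 3, 9, 12, 1, 1, 1, 6, 6, 7)
picked <- month.abb[idx]
picked
```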
Skip to 1 minute and 44 seconds For instance, I could take month.abb and feed in there the column of months from our data frame. Okay, let’s go do that.
Skip to 1 minute and 56 seconds The first six flights in our data frame myDF are all from January. And where do you think the last six are gonna be from? Presumably from December. Let’s just go check. Yes, from December. The first six flights are from January 2006 and the last six flights are from December 2008, but we do not see the year information here, because here we’re just looking at the months.
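Feeding a whole column of month numbers into month.abb might look like the sketch below. The tiny data frame is a hypothetical stand-in for the real myDF, and the column names Year, Month, DayofMonth and DepDelay are assumptions, since the transcript does not spell them out.

```r
# Hypothetical stand-in for myDF; the real one has more than 21 million rows.
# Column names are assumed, and the values are made up for illustration.
myDF <- data.frame(
  Year       = c(2006, 2006, 2007, 2008, 2008),
  Month      = c(1, 1, 3, 12, 12),
  DayofMonth = c(11, 11, 5, 13, 13),
  DepDelay   = c(5, NA, -2, 40, 35)
)

# Each month number is replaced by the abbreviation at that position.
month_names <- month.abb[myDF$Month]
head(month_names)   # start of the data
tail(month_names)   # end of the data
```

Only the month shows up in this result; the year and the day are not part of it yet.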
Skip to 2 minutes and 30 seconds So now let’s go do something more savvy. Let’s go paste together the day of the month, the month, and the year. And instead of just using the month number itself, let’s go wrap that into month.abb. Okay, so this part here, you can go look at it on its own. We already did; we looked at the head and the tail of that. Instead of printing the month number, it’s using whatever month entry you come to and looking up which abbreviation that should be. If I come to an entry in myDF$Month which is a three, instead of printing a three, it’s gonna go into month.abb and print Mar for March.
Skip to 3 minutes and 9 seconds If I come to an entry of myDF$Month that’s a 12, instead of printing a 12, it’s gonna go into month.abb and look at the 12th entry, and that’ll be Dec, for December. So right here, instead of just printing the month number itself, at each individual entry, in a vectorized way, it’s gonna go in, grab the analogous abbreviation, and put that in instead. So let’s look at the head of the result. We’ll go look at the tail of the result too, while it’s thinking. Here are the dates of the flights, and they’re given in an international format, okay. In the United States, we usually write month, day, and year.
Skip to 3 minutes and 45 seconds But since there should be many international viewers here, I wanna be respectful of that. So for instance, the first flights there are on the 11th of January 2006. Okay, what about the last flights in our data set? They’re from the 13th of December in 2008. Okay, great. Just so I have these dates and I can use them more easily, I’m gonna take those and store them in a vector I’m gonna call mydates. So that’ll let me do things with those dates more easily, and not have to type all of that out again. It takes a minute to save it because there are [LAUGH] 21 million flights, right?
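Building the day, month, year strings and saving them could be sketched like this. Again the small data frame and its column names (Year, Month, DayofMonth, DepDelay) are assumptions for illustration, not the real flight data.

```r
# Hypothetical stand-in for myDF; column names and values are assumed.
myDF <- data.frame(
  Year       = c(2006, 2006, 2007, 2008, 2008),
  Month      = c(1, 1, 3, 12, 12),
  DayofMonth = c(11, 11, 5, 13, 13),
  DepDelay   = c(5, NA, -2, 40, 35)
)

# paste() works entry by entry: day number, month abbreviation, year,
# separated by a single space (the default separator).
mydates <- paste(myDF$DayofMonth, month.abb[myDF$Month], myDF$Year)

head(mydates)
tail(mydates)
length(mydates)   # one date string per flight (row)
```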
Skip to 4 minutes and 21 seconds If I go look at the length of mydates, there should be more than 21 million of these. We saved a vector of more than 21 million dates corresponding to the 21 million flights. Okay, now what can I do? I could for instance go into the departure delays, break things up according to the dates, and take an average within each date. And remember you might wanna throw away the ones that are NA so that they don’t mess up your average, okay, and you can do a tapply. So again, what are you gonna do?
Skip to 4 minutes and 57 seconds Use tapply with the departure delays as the data, split the data up according to the value of mydates, and the function we apply (within each day) to the data is the mean. Of course we throw away the NA values; that’s the fourth argument. You can always put extra arguments into the tapply, for instance if you wanna throw away the NA values. And there you go, on February 7th 2006, the average departure delay was 0.45 minutes, okay. It might make more sense to go sort the result there. Okay, I might wanna go sort those departure delays, and let’s take the head.
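The tapply call described above might be sketched as follows. The data frame, its column names (including DepDelay) and the result name daily_avg are hypothetical; the extra argument na.rm = TRUE is passed through tapply to mean so that NA delays are dropped.

```r
# Hypothetical stand-in for myDF; column names and values are assumed.
myDF <- data.frame(
  Year       = c(2006, 2006, 2007, 2008, 2008),
  Month      = c(1, 1, 3, 12, 12),
  DayofMonth = c(11, 11, 5, 13, 13),
  DepDelay   = c(5, NA, -2, 40, 35)
)
mydates <- paste(myDF$DayofMonth, month.abb[myDF$Month], myDF$Year)

# Data: departure delays. Groups: the date strings. Function: mean.
# The fourth argument, na.rm = TRUE, is handed on to mean().
daily_avg <- tapply(myDF$DepDelay, mydates, mean, na.rm = TRUE)
daily_avg
```

Without na.rm = TRUE, any day with even one missing delay would get NA as its average.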
Skip to 5 minutes and 41 seconds These are the days with the smallest average departure delays, and what about the ones with the longest average departure delays? They’re all in December and then the 2nd of January. Well, that makes sense, right? Because that’s when you have bad weather, and on those days the average departure delay was more than 30 minutes. There must have been some bad weather in various parts of the country on those days. But that’s pretty neat to know, for instance. Let’s say we wanna see if there are even more of them. Here are the worst 20 days in terms of the average departure delays. On the tail, I can specify that I want 20 entries. Almost all of them were in December.
Skip to 6 minutes and 29 seconds Okay, there was one bad day in June, and a couple in July and so on, but it’s pretty believable, because almost all of them were in December or on the second of January, with just a couple of exceptions. And there will be exceptions, I mean. Pretty interesting. We can do so much with the data analysis tools we know now. I hope you’re having a lot of fun with this, and really finding ways to ask questions and then dive into the data to be able to answer the questions effectively and with very little effort.
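Sorting the daily averages and pulling out the best and worst days could be sketched like this. The named vector daily_avg below is entirely made up for illustration and is not the real result from the flight data; in the video, the tail is taken with 20 entries rather than 2.

```r
# Made-up daily averages (minutes), named by date; NOT real results.
daily_avg <- c("1 Jun 2007"  = 12.0,
               "20 Dec 2008" = 31.5,
               "3 Sep 2006"  = 2.5,
               "2 Jan 2008"  = 30.2,
               "15 Apr 2007" = 6.0)

# sort() orders from smallest to largest and keeps the date names attached.
sorted_avg <- sort(daily_avg)

best  <- head(sorted_avg, 2)   # days with the smallest average delays
worst <- tail(sorted_avg, 2)   # days with the largest average delays
best
worst
```

The second argument to head or tail sets how many entries come back; if you leave it out, you get six.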