Now, I wanna go into our airline data and learn something about the airplanes themselves, okay? So, let’s go look at the distances that the airplanes actually fly, and break it up according to the tail number. And the thing I’d like to do is take a sum, okay? So, I’m gonna use tapply here. Okay, so, let’s consider the distances flown by the planes. Okay, that’s the first thing in our tapply. That data gets broken up, or split up, according to the tail number of the planes themselves. That’s the second thing in the tapply: the way to split.
And the function we use within each group of distances is the sum, okay? So, what we’re gonna do is we’re gonna go look at the distance data of how long these airplanes flew. Sum up that data. But, don’t just sum across all flights and all airplanes. Break things up by the tail number of the plane. So, what you see here is, for instance, this airplane which has its unique tail number, it’s painted on the fin of the airplane, has flown this many miles altogether, 2006 to 2008. This airplane has flown this many miles, and so on. Now, you notice again in R, you’re alternating between rows with tail numbers and rows with total miles.
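Here is a minimal sketch of that tapply call on a toy data frame. The name `myDF`, the column names `TailNum` and `Distance`, and all the values are assumptions made up for illustration; the real airline data frame would take their place.

```r
# Toy stand-in for the airline data (hypothetical names and values).
myDF <- data.frame(
  TailNum  = c("N101", "N202", "N101", "N303", "N202", "N101"),
  Distance = c(500, 1200, 700, 300, 800, 100)
)

# tapply(data, grouping, function):
#   1st argument: the distances flown,
#   2nd argument: the tail numbers used to split the distances into groups,
#   3rd argument: the function applied within each group.
milesByPlane <- tapply(myDF$Distance, myDF$TailNum, sum)

# One total per tail number, with the tail numbers as the names.
print(milesByPlane)
```

The result is a named vector: the names are the tail numbers and the values are the summed miles, which is why R prints a row of names above a row of totals.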
But, that’s only because my screen isn’t very wide. If my screen were as wide as we liked, we’d have one row with tail numbers and below it another row with the mileage for each of those airplanes, okay? And it would stretch out very, very, very long from left to right, but my screen is, of course, narrow. So, what could I do? I could sort that data. Uh-huh. It’s not gonna display everything, right? So, it might make sense to look at the very beginning. These are the planes that haven’t flown very much at all. And to look at the end. These are the planes that have flown a lot. And you notice there’s three erroneous ones.
It’s got 0 in there, blank, and 000000, okay? So, there’s sort of some mistakes coded in the data, and that definitely happens. Okay, let’s go plot this data to just get a sense of what the data looks like, of how much typical planes fly. It takes it a minute here because there’s several thousand planes. You can’t see the tail numbers very well here, kind of a horrible looking plot. And these three here are all the erroneous ones, okay, right? I mean, because you notice the first three of these aren’t the kind of data we wanna look at. But, from here downwards, we actually see, even in the millions of miles, there’s many, many planes flying millions of miles here.
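The sort, head, tail, and dot chart steps can be sketched like this. The vector `milesByPlane` and its values are hypothetical stand-ins for the tapply result, with one miscoded tail number `"0"` included to mimic the erroneous entries.

```r
# Hypothetical totals per tail number, standing in for the tapply result.
milesByPlane <- c(N101 = 1300, N202 = 2000, N303 = 300, N404 = 5200, "0" = 9000)

# sort() puts the totals in increasing order and keeps the names attached.
sortedMiles <- sort(milesByPlane)

# The beginning: planes that have flown the least.
print(head(sortedMiles, 2))

# The end: planes (and miscoded entries) with the largest totals.
print(tail(sortedMiles, 2))

# One dot per tail number; with several thousand planes the labels get unreadable.
dotchart(sortedMiles)
```

In this toy version the miscoded `"0"` entry lands at the very end after sorting, which is the same place the erroneous entries showed up in the real data.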
So, let’s dive a little more deeply into that, okay? Here’s an admittedly terrible plot of this, so we dive in further. I’m gonna take the same data there, everything except the dot chart, and store it in a vector v. So, head(v) shows those planes that don’t have very many miles. And tail(v) shows the ones that have lots of miles. For instance, if I go look at the tail, and I look at the last 23 planes, the 23 planes that have the most miles of all, these three are erroneous and I really wanna throw them out, okay?
So, I’m just gonna go look at the 20 planes that have flown the farthest, and let’s see what kind of airplanes those are. So, I’m gonna throw away all of my data here except the last 23 entries. I’m gonna save over that variable, okay? Not always the thing you wanna do, but if you do it intentionally, that’s fine. Okay, now v has the information about the 23 planes that flew the most miles. Okay, but the last three are erroneous, so let’s remove them. So, now within this vector v, which only has length 23, I just want the first 20 entries. These are the most traveled 20 airplanes. There you go.
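As a rough sketch of those two steps, here is a made-up sorted vector with 27 ordinary planes plus three miscoded tail numbers at the top. All names and mileages are hypothetical; only the `tail(v, 23)` and `v[1:20]` steps come from the walkthrough.

```r
# Hypothetical totals: 27 ordinary planes...
realMiles <- seq(1000, by = 1000, length.out = 27)
names(realMiles) <- sprintf("N%03d", 1:27)

# ...plus three miscoded tail numbers ("0", blank, "000000") with the largest totals.
badMiles <- setNames(c(50000, 60000, 70000), c("0", "", "000000"))

# Sorted in increasing order, as in the walkthrough.
v <- sort(c(realMiles, badMiles))

# Keep only the last 23 entries; this intentionally overwrites v.
v <- tail(v, 23)

# The last three of those are the miscoded ones, so keep the first 20.
v <- v[1:20]

print(v)
```

Overwriting `v` twice is fine here because it is deliberate; keeping the intermediate results under separate names would be the safer habit if the full vector were needed again.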
And I’ve thrown out the erroneous ones because I dumped the last three, okay? So, there’s data about the airplanes themselves here, again, in our supplemental data sources. I’m gonna right-click, or on a Mac, CTRL+Click, on plane-data.csv back in my Data Expo website here, okay? I’m gonna download it as plane-data.csv and put it in my downloads. All right, so I’m gonna call this plane DF and import it into R. Kind of getting familiar with this process now, I think. If we look at the head of the plane data frame, you’ll see there’s tail numbers. The head doesn’t show anything about the manufacturers. Let’s go further into that data set.
Yeah, it’s got information about what kind of plane it is and when it was made and when it was introduced and who bought it, many things like that. Let’s now go look in that data frame and find the ones for which the tail number is in v, okay? If I go back up and look at v, I don’t just want it to be in v, I want it to be in the names of v. Okay, the names of v are these guys. They’re the actual tail numbers themselves. If I just put v there, you’d be looking for tail numbers inside mileage numbers, which wouldn’t be right. You wanna compare apples to apples.
So, you go and look at which ones have the tail number in this list of tail numbers that we’re looking for. And indeed, it’s these. Okay, these are the airplanes that have flown the longest, and they’re all huge jets. There’s 767s and 757s. And there’s the 737 and some 777s. So, it’s totally believable. You don’t have any small planes in there. It looks like we probably did this right.
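The lookup can be sketched as follows. The data frame name `planeDF`, the column names `tailnum`, `manufacturer`, and `model`, and every value here are assumptions for illustration; with the real file, something like `read.csv("plane-data.csv")` (path assumed) would supply the data frame instead.

```r
# Hypothetical stand-in for the plane data; the real one would come from
# planeDF <- read.csv("plane-data.csv")   # file location is an assumption
planeDF <- data.frame(
  tailnum      = c("N101", "N202", "N303", "N404"),
  manufacturer = c("BOEING", "CESSNA", "BOEING", "AIRBUS"),
  model        = c("767-300", "172", "757-200", "A320"),
  stringsAsFactors = FALSE
)

# Hypothetical most-traveled planes: names are tail numbers, values are miles.
v <- c(N303 = 4100000, N101 = 4500000)

# Apples to apples: compare tail numbers with the NAMES of v.
topPlanes <- planeDF[planeDF$tailnum %in% names(v), ]
print(topPlanes)

# Comparing with v itself looks for tail numbers among mileage values,
# so nothing matches.
wrong <- planeDF[planeDF$tailnum %in% v, ]
print(nrow(wrong))
```

Note that the rows come back in the order they appear in the data frame, not in the order of `v`, so the output is not ranked by mileage.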