This content is taken from the Purdue University & The Center for Science of Information's online course, Introduction to R for Data Science. Join the course to learn more.

Skip to 0 minutes and 12 seconds Now that we’ve got data imported into R, it’s nice to ask some questions to find out about the data. In general, R is really good at helping us answer questions we might wanna ask about different data sets. I’m gonna save this file here as IndyFlights.R and find out a little bit about flights to and from Indianapolis, which is the airport closest to where I live. I’m gonna delete these other lines from my R code here and save my file. And to save us time, I’ve gone ahead and run the line to read the data back into a data frame called myDF. You see that the data’s been read in.

Skip to 1 minute and 3 seconds Again, I see I’ve got seven million lines of data stored in myDF, with 29 columns there. And if I go run head(myDF), indeed, I’ve got the data stored in there.
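The check described above can be tried without the full seven-million-row file. The following is a minimal sketch: the small data frame is a stand-in built by hand, with the first six rows taken from the origins and destinations read out later in the video and the remaining rows invented for illustration.

```r
# Toy stand-in for the 2008 flight data (only two of the 29 columns).
# Rows 1-6 follow the values read out in the video; rows 7-8 are invented.
myDF <- data.frame(
  Origin = c("IAD", "IAD", "IND", "IND", "IND", "IND", "ATL", "SAT"),
  Dest   = c("TPA", "TPA", "BWI", "BWI", "BWI", "JAX", "IND", "IAD"),
  stringsAsFactors = TRUE
)

dim(myDF)    # number of rows, then number of columns
head(myDF)   # the first six rows of the data frame
```

On the real data, dim(myDF) is where the "seven million lines, 29 columns" figures come from.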

Skip to 1 minute and 18 seconds So I don’t even know what all of these variables are. There’s a little bit more information on the website for the ASA Data Expo 2009. So I’m gonna open a browser here, go into Google, and search for the ASA Data Expo 2009 to get back to the website for the data. If you wanna go there directly, the address is stat-computing.org/dataexpo/2009. So inside here, we can look at the data. Again, this is where we downloaded the data from. But they’ve got a little bit of information for us here about the variables that are stored in there. Of course, we can also find out about the variables directly in R. But sometimes it’s nice to go look and see what information’s provided about the variables.

Skip to 2 minutes and 8 seconds And in particular, I see that the origin is the 17th column in our data and the destination is the 18th column in our data. Let’s go take advantage of that and take a peek at those origins and destinations in our data set here.
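The column positions can also be looked up inside R rather than on the website. This is a sketch only: the column names below are my understanding of the variable list for the 2008 file and are an assumption, so confirm them with names(myDF) on the real data. A zero-row data frame is enough to show the lookup.

```r
# Assumed column names for the 2008 airline file; verify with names(myDF).
col_names <- c(
  "Year", "Month", "DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime",
  "ArrTime", "CRSArrTime", "UniqueCarrier", "FlightNum", "TailNum",
  "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay",
  "Origin", "Dest", "Distance", "TaxiIn", "TaxiOut", "Cancelled",
  "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay",
  "NASDelay", "SecurityDelay", "LateAircraftDelay"
)

# Empty stand-in data frame that carries only the column names.
myDF <- data.frame(matrix(nrow = 0, ncol = length(col_names)))
names(myDF) <- col_names

ncol(myDF)                                            # how many columns
positions <- match(c("Origin", "Dest"), names(myDF))  # where the two columns sit
positions
```

Looking columns up by name is usually safer than relying on a position, since positions change if columns are added or dropped.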

Skip to 2 minutes and 26 seconds Okay, inside myDF, one way I could go look at the column of the origin airports is to write myDF$Origin. That is myDF and then a $, meaning I wanna look at the column with the name Origin.

Skip to 2 minutes and 41 seconds But if I hit Cmd + Return to run that line, I’m gonna get all seven million origin airports. R won’t actually print all seven million of them out. After 10,000 of them, it decides that’s enough and just stops the output. But it’s often helpful to go look at the first several of them. So very frequently, you’ll see me ask for the head or the tail of something, so let’s ask for the head of the Origin column. The first six entries in the Origin column happen to be IAD, IAD, and then four that are Indy. Okay, let’s go ask for the last six that are in that column.

Skip to 3 minutes and 23 seconds The last six happened to be SAV, ATL, I know that’s Atlanta, and then PBI, IAD, and SAT, okay?
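Here is a small sketch of asking for the head and tail of a single column. The data frame is a hand-built stand-in: the first six origins match the ones read out in the video, and the last four are placeholders rather than the real end of the file.

```r
# Toy stand-in for the Origin column; rows 7-10 are placeholders.
myDF <- data.frame(
  Origin = c("IAD", "IAD", "IND", "IND", "IND", "IND",
             "ATL", "PBI", "IAD", "SAT"),
  stringsAsFactors = TRUE
)

first_six <- head(myDF$Origin)   # first six entries of the column
last_six  <- tail(myDF$Origin)   # last six entries of the column

first_six
last_six

# Both functions take an n argument when six is not the number you want.
head(myDF$Origin, n = 3)
```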

Skip to 3 minutes and 33 seconds I think IAD is Dulles. Similarly, let’s go look at the destinations, okay? If I look at head(myDF$Dest), those are the first six destinations; tail(myDF$Dest), those are the last six destinations. So for instance, of the first six flights in our data set, the first two started at Dulles and the next four at Indy. And then the first two landed at TPA. The next three landed at BWI. And the sixth one at JAX. In between you see this levels thing here. Some columns are called factors. And the factors have different possible values that they can take on. Those are called the levels.

Skip to 4 minutes and 24 seconds In this case, we see that there’s 303 possible origin cities for the flights in 2008, and 304 possible destination cities for the flights in 2008. And R just gives you some of them to give you a sense of what those are.
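The levels line in the output can be explored directly. This sketch uses a six-element factor built by hand from the first six origins, so it has two levels rather than the 303 in the real column. One caveat worth knowing: in R 4.0.0 and later, read.csv leaves text columns as character vectors unless stringsAsFactors = TRUE is passed, so the levels line only appears when the column really is a factor.

```r
# The first six origins from the video, stored as a factor.
origin <- factor(c("IAD", "IAD", "IND", "IND", "IND", "IND"))

is.factor(origin)   # check that this really is a factor
levels(origin)      # the distinct values the factor can take
nlevels(origin)     # how many distinct values there are

# Taking the head of a factor keeps the full set of levels, which is why
# the printed output lists airports that are not among the six shown.
first_two <- head(origin, 2)
levels(first_two)
```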

Skip to 4 minutes and 45 seconds Again, so those first six flights all started in either Dulles or in Indy and went to these six locations. And you can verify that again by going back up to the head of myDF and, again, running Cmd + Return there. And we ought to see the same information. Let’s scroll back up here.

Skip to 5 minutes and 7 seconds Let’s see, there they are. There’s the 17th column, the origin, and the 18th column, the destination. And there are those same six origins and the same six destinations. In fact, if I looked at the last six lines of my data, with tail(myDF), I’d see something similar, right? I’d see those same six origin cities and same six destination cities that I’d seen when I was looking individually at these. And you notice that I can click around my code and run whichever lines I want. If I wanna go run this line next, I can just click on it and do Cmd + Return to run it.
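The visual comparison above can also be done in code. A minimal sketch, again on a hand-built stand-in data frame (first six rows from the video, the rest invented): it checks that the head or tail of the whole data frame carries the same values as the head or tail of each column on its own.

```r
# Toy stand-in; rows 7-10 are invented for illustration.
myDF <- data.frame(
  Origin = c("IAD", "IAD", "IND", "IND", "IND", "IND",
             "ATL", "PBI", "IAD", "SAT"),
  Dest   = c("TPA", "TPA", "BWI", "BWI", "BWI", "JAX",
             "IND", "ATL", "SAT", "IAD"),
  stringsAsFactors = TRUE
)

# Head of the data frame, then the column, versus head of the column alone.
same_head <- identical(head(myDF)$Origin, head(myDF$Origin))
same_tail <- identical(tail(myDF)$Dest, tail(myDF$Dest))

same_head
same_tail

# Showing just the two columns of interest side by side.
head(myDF[, c("Origin", "Dest")])
```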

Skip to 5 minutes and 43 seconds You can click around and run the commands inside your R code there in whatever order you like.

Extracting the Head and Tail of a Data Set

Note: Click on the icon on the lower-right corner of the video to enlarge the video. You will need to do this for many of the videos in order to view what the educator is typing.

Note: Feel free to pause the videos if necessary as you follow along in R.
