This content is taken from the Purdue University & The Center for Science of Information's online course, Introduction to R for Data Science.
4.3

Skip to 0 minutes and 12 seconds I want to take a look at another issue, one that involves the memory used by some of our analysis here. Let’s suppose that we go look at the origins of all the flights and at the destinations, and we make a list that’s got all the origins and the destinations. And then we build a table from that. Okay, I’m gonna store the result in a variable called mytable. So we’re used to just making a table from one column, where we keep track of all of the different types of entries in that column and see how many there are of each type. But now we’ve gone and looked at the origins and the destinations.

Skip to 0 minutes and 47 seconds So we should expect that the thing we get back has two dimensions. In fact it does. If you look at the head of mytable, it looks a little strange, because it’s only showing you the first three rows, but it’s showing you all of the columns. If my screen were wider, it would show you the first three rows with all of the columns stretched out left to right. But that would be much too long. In fact, let’s go look at the dimensions of mytable. Okay, mytable has 315 rows and 321 columns. I could take a look at that; there’s over 100,000 entries all together.
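The steps described so far can be sketched in R roughly as follows. This is a minimal sketch on a tiny made-up data set, not the course's actual code: the data frame name `flights` and its columns `Origin` and `Dest` are assumptions for illustration, and the real data would give 315 rows and 321 columns rather than the small table here.

```r
# Toy stand-in for the flight data (hypothetical values; the names
# `flights`, `Origin`, and `Dest` are assumptions for illustration).
flights <- data.frame(
  Origin = c("ATL", "ATL", "ABE", "ABE", "YUM", "ATL"),
  Dest   = c("PHX", "PHX", "ATL", "AVP", "LAX", "LAX"),
  stringsAsFactors = FALSE
)

# Two-way table: one row per origin, one column per destination.
mytable <- table(flights$Origin, flights$Dest)

# First rows of the table, with every column shown.
head(mytable, 3)

# Number of rows and columns; for this toy data that is 3 and 4.
dim(mytable)
```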

Skip to 1 minute and 23 seconds Okay, let’s make a note here: a table with 315 rows and 321 columns, so that each entry corresponds to a unique origin and destination pair. It’s a little wasteful, because a lot of these are going to be zeros, okay? So if, for instance, I go look at how many of the entries in mytable are just equal to 0. Wow, over 94,000 of the entries are 0s. Unbelievable. It’s not really unbelievable when you think about it, because you usually can’t fly between just any two cities. If you just pick two random cities, you’re quite unlikely to be able to fly between them. There’s only 6,266 pairs of origin and destination cities, or airports I should say, that you can fly between.
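Counting the zero entries comes down to summing a logical comparison over the whole table. Here is a small sketch of that idea, again on made-up data with the assumed names `flights`, `Origin`, and `Dest`; on the real data the counts would be the ones quoted above.

```r
# Toy stand-in for the flight data (hypothetical values and names).
flights <- data.frame(
  Origin = c("ATL", "ATL", "ABE", "ABE", "YUM", "ATL"),
  Dest   = c("PHX", "PHX", "ATL", "AVP", "LAX", "LAX"),
  stringsAsFactors = FALSE
)
mytable <- table(flights$Origin, flights$Dest)

# Total number of cells: rows times columns.
total_cells <- length(mytable)

# Cells with no flights at all for that origin/destination pair.
zero_cells <- sum(mytable == 0)

# Cells with at least one flight, i.e. pairs you can actually fly between.
nonzero_cells <- sum(mytable != 0)

c(total = total_cells, zeros = zero_cells, nonzero = nonzero_cells)
```

Even in this toy table, 7 of the 12 cells are zeros, which is the same kind of waste on a small scale.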

Skip to 2 minutes and 17 seconds So it’s not really efficient to have this two-dimensional matrix that’s got 315 rows, 321 columns, and over 100,000 entries all together. It’s more efficient to think about other ways to do things. For instance, if I paste together the origins and the destinations, let’s just go look at the head of what we get there. And we’ve done something like this before. We’re just gonna see six origin-to-destination pairs there, okay. This is probably Atlanta to Phoenix, I’m not completely sure, but I think so.

Skip to 2 minutes and 48 seconds And we’re gonna get all the entries like that, but just stored in one long vector. And then, instead of just taking the head, we could take a table of all those and save the result in something I’m gonna call mynewtable. You can call it anything you like; I just wanna contrast it with the thing I called mytable up above. And if there’s justice in the world, there ought to be the same number of entries in mynewtable as there were non-zero entries up here in mytable. There ought to be 6,266 of them. Let’s keep our fingers crossed. Great! 6,266 entries.
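The paste-then-table approach might look something like the sketch below. The toy data, the names `flights`, `Origin`, `Dest`, and `pairs`, and the `"to"` separator are all assumptions for illustration; the point is only that the one-dimensional table has exactly as many entries as the two-dimensional table has non-zero cells.

```r
# Toy stand-in for the flight data (hypothetical values and names).
flights <- data.frame(
  Origin = c("ATL", "ATL", "ABE", "ABE", "YUM", "ATL"),
  Dest   = c("PHX", "PHX", "ATL", "AVP", "LAX", "LAX"),
  stringsAsFactors = FALSE
)

# One string per flight, e.g. "ATL to PHX" (the separator is an assumption).
pairs <- paste(flights$Origin, "to", flights$Dest)
head(pairs)

# One-dimensional table: one entry per pair that actually occurs.
mynewtable <- table(pairs)

# Two-dimensional table for comparison.
mytable <- table(flights$Origin, flights$Dest)

# These two counts should agree.
length(mynewtable)
sum(mytable != 0)
```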

Skip to 3 minutes and 24 seconds So as opposed to up above, where I made a two-dimensional matrix with 315 rows and 321 columns in which over 94,000 of the entries were zeros, which was very wasteful (it’s kind of a sparse matrix), it’s more efficient to just make a vector of length 6,266 that’s got all of the analytics data that we want. And if we go take a look inside the head there, or the tail, you can see the count for every origin-to-destination pair. For instance, from ABE to ATL, those are two airports, there’s 3,007 flights between those two. There’s only two flights from ABE to AVP. Okay, there’s 3,061 flights from YUM to LAX.
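Looking inside the one-dimensional table can be sketched like this. As before, the data and the names `flights`, `Origin`, `Dest`, and `pairs` are made up for illustration, so the counts here are not the real flight counts quoted above.

```r
# Toy stand-in for the flight data (hypothetical values and names).
flights <- data.frame(
  Origin = c("ATL", "ATL", "ABE", "ABE", "YUM", "ATL"),
  Dest   = c("PHX", "PHX", "ATL", "AVP", "LAX", "LAX"),
  stringsAsFactors = FALSE
)
pairs <- paste(flights$Origin, "to", flights$Dest)
mynewtable <- table(pairs)

# First and last few pairs, each with its number of flights.
head(mynewtable, 2)
tail(mynewtable, 2)

# Look up a single pair by its name.
mynewtable["ATL to PHX"]
```

Because every entry is labeled with its pair, a single lookup by name replaces indexing a row and a column in the two-dimensional table.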

Skip to 4 minutes and 7 seconds All right, but there we’ve got the 6,266 possible flight paths in the years 2006 to 2008, with no wasted space at all. I think it’s worthwhile to think about efficiency like that: to think ahead of time about the most efficient way to store your data and about what you wanna be working with.