[0:09] Hello. Welcome back. So last session, we got as far as seeing that Wikipedia makes data available on how often people have looked at different pages. We can get this data in raw format, clicking on this page in JSON format. And we'd seen that although this looks a bit messy, full of punctuation symbols, we could get it into R and RStudio and make it look a bit more like a table. Now, since the last session, you may well have closed your RStudio and R, and you want to know how to get back to where you were last week. Now last week, we didn't look at saving your work so far, but we'll have a look at that this week.
[0:50] So if you go to RStudio, one of the first things you want to do is create a new project, which is going to be a place that you can save the work that you're doing. Now I've already got the new project screen up here, but you get that by going to File and then New Project. Okay, now I'm going to create a new project in a directory I've already got. The directory's already selected here, but just pick an area you'd like to create your project, and click on Create Project. Excellent. Now, on the course website, we've made an R file available called "LoadParsedData."
[1:30] And if you download that, put it in the folder that you've created your project in, and click on it like I just did, then you'll see it's got four commands in it. And those are the most important commands from the last session. We're assuming you did do the work last session, so you've installed the RCurl library, the RJSONIO library. If you have, you can just click on Source, and the code will work away. And you can see over here in environment, we've got now "raw data," which was the first variable we created last week, and "parsed data," which was the second variable we created last week. Okay, so where had we got up to?
[2:13] If we type in parsed data, this is what it looked like. So you've got, at the top there, under $daily_views, the table. And we saw that if we type in parsed data dollar daily underscore views, then we can get access to that table. Brilliant. So, we're back to where we were last week. Now how can we plot this data so we can have a better look at it ourselves? It's often easiest to work with data in R if your data is in a data frame. A data frame is basically a table of named columns containing your data. So let's try and move this data into a data frame.
[2:56] What we want is a table where we've got one column with the dates, and another column with the views. In "parsedData$daily_views"-- very snappy name-- the dates are actually the names of the data points. We can get the names out like this. Just type in names and then the variable we're interested in. There we go. There you go. So there, the dates come out. Now we wanted to put the names in one column and the views data in another. So try typing this in. Type in data frame date equals names parsed data daily views. Okay, that's going to set up a first column called "date," which has got the dates we just pulled out. We want another column called "views."
[3:54] And there we can just put in the variable we referred to before, parsed data dollar daily underscore views. Okay, enter. Good. All right. That's sort of looking okay, but you can see we've got the dates twice, which isn't really what we want. The problem is that R is using the dates as names as it was doing before. We don't really want that. Probably something like row numbers would be better. So let's try this instead. I'm going to press the Up key, which is going to save me a bit of typing. Brings back what I had before. And I'm going to add an extra bit. Row dot names equals null. Brilliant. Excellent.
[4:35] The dates-- well, the superfluous dates at least-- have gone away, and we've got row numbers instead. Right, so last week, we saw we don't really want to just print things to the screen. We want to save them somewhere so we can do something with them. So again, I'm going to press Up again. Go back to the beginning, type viewsData, put the arrow in that we had before, press Enter. Now if I type in viewsData again, brilliant. So our data frame has been saved to "viewsData." If we want to get columns of this table back out, that's actually quite straightforward.
[5:14] So in the same way we were using the dollar sign before, if we type in views data dollar date --the name of the date column-- then we get some dates out. If we type in views data dollar views then we get the views out. You can also-- so say you want the third row of the table, for example-- you can type in viewsData and then these square brackets, three, a comma, and then nothing after the comma before the closing square bracket, and you get the third row out. Okay. So it looks like our table works. However, you might have spotted that the dates are not really in the right order.
[5:54] We can see in the data set I've got here, you might have a slightly different data set, but the data set I've got here, on row seven we can see the first day of the month. And we'd probably prefer that to be on the first row. But right now, it's on row seven. Now the first thing we have to do to try and sort this out is we have to explain to R that the date column contains dates. So if you type in views data dollar date then we have to tell R to change that data to a date. There we go. We just use as.Date, press Enter.
[6:37] If you bring up the table again, you're not really going to see a difference. But under the covers, R now knows that that column is a date. Now, we now need to tell R to order all the rows according to the order of the dates. Now there's a command called order which will work out which is the earliest date in the list, which is the second earliest, and so on. We saw before that the first day of the month is currently on row seven in my data set. If I type in order and pass it that column, then you can see the first number that comes out is seven. That's telling me that on row seven is the earliest date.
[7:19] Now we've got a whole list of the rows there. It's telling us which row has the earliest date, which row has the second earliest date. And we can pass that information back to R, and tell it to reorder the table according to this information. This is how we do it. Where we typed in that three before, to get the-- only the third row out, we can type in this order command instead. There you go. And you can see that's now in order. So let's overwrite what we had saved in "viewsData" with what we just typed in there. So you could press Up. That would've been quicker if I'd done that, but I'm going to just type it in again.
[8:06] There you go. So we're going to Save. It's exactly the command we had before, and we're saving it into "viewsData." So now if I type in viewsData, our dates are in the right order. Brilliant. Now the row numbers are still in a strange order, which can be a bit distracting. So let's reset those. It's quite simple. We just tell it to set the row names to null, like that. If we look at the table again, there you go. It's in the right order. Okay, so after all that, we can finally get to plotting our data. Now there's a great package for plotting data in R called ggplot2. So let's try and install that.
[8:53] So we use the install.packages command again. Once again, you only have to do this once for the computer you're running R on. And now let's load that package using the library command again.
[9:11] Excellent. So to make a plot in ggplot, there is a function called ggplot where the first thing we want to do is tell it where the data that we'd like to plot is. So that's in "viewsData." We then have to give it some information about what we'd like to have on the x-axis and what we'd like to have on the y-axis. Now what we're trying to do here is plot a line graph of the views data. So let's have the dates on the x-axis, and the views on the y-axis. We also have to tell ggplot that all of this data belongs in one group together. So we just want one line. Let me just add to this. Add geom line.
[10:07] We press Enter. Excellent. There you go. We've got a line graph of the data that we downloaded. Now there's some clear peaks. So given that we're looking at the page about Friday, what might this be due to? We've only got less than a month's worth of data here, however. So next session, what we'll look at is how we can get R to help us get even more data. Now at the beginning of this session, we created a project which we've been doing this work in. And so that you get your work back next session-- you don't have to do it all again-- we need to Save this work in the project. This is actually quite straightforward.
[10:48] If you just quit RStudio, it'll ask you, do you want to Save your workspace image to ".RData"? Now if you just click on Save, all the work you've done this session is nice and safe. So we look forward to seeing you next session.
Visualise what people are looking for on Wikipedia
Note: In order to see the details of the commands and screen-based information it is best to view this video full screen and in HD. However, if you’d like to have a go yourself, we strongly recommend you follow the instructions in the text below, where we tell you what the commands in the video are. You’ll probably find this a lot easier than copying the commands from the video.
Welcome to the second exercise of the Big Data course!
The video above is a representation, a walkthrough, of the full exercise below. You can watch the video to gain an understanding of the use of data and R or, if you are interested in trying the processes for yourself, then follow the exercise text below that will guide you through the various steps.
Please refer back to Step 1.10 for guidance on installing and running R and RStudio.
Last session, we had got as far as downloading data on how often people had viewed the Friday page from the website http://stats.grok.se/ and had “parsed” this data, so it looked a bit more like a table.
First, let’s look at how we can get back to where we were last session. We are going to create a project to save all of our work in RStudio. To do this, go to the “File” menu, and select “New Project”. You can now create a project in a new directory by selecting “New Directory”, or in an existing directory by selecting “Existing Directory”, as you wish.
Let’s assume you pick “Existing Directory”. Select the directory you would like to use, and click “Create Project”.
We’ve made an R file available called LoadParsedData.R. Download this file, and save it in the directory you created your project in. You should now be able to see this file in the “Files” window in RStudio. Open this file now.
This file contains the four most important commands from the last session.
Assuming you already did the work last session, you will have already installed the “RCurl” and “RJSONIO” libraries. In this case, you can click on “Source” to run this code. We’re now back to where we were last session.
Last session, we saved the parsed data in parsedData:
parsedData
And we saw that the data on the page views was within parsedData under the label “$daily_views”:
parsedData$daily_views
How can we plot this data so we can have a better look?
It’s often easiest to work with data in R when your data is in a “data frame” - a table of named columns containing your data. So let’s move this data into a data frame.
We want one column with the date, and one column with the views.
In parsedData$daily_views, the dates are actually the “names” of the data points. We can get the names out like this:
names(parsedData$daily_views)
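If you would like to try this without downloading anything, here is a minimal sketch in which a small hand-made list stands in for parsedData. The dates and view counts are invented for illustration and are not real Wikipedia figures; your own parsedData will contain different values.

```r
# Hypothetical stand-in for the parsed data: a list whose daily_views
# element is a numeric vector with the dates as its names.
# All values below are made up for illustration.
parsedData <- list(
  daily_views = c("2015-01-03" = 1510,
                  "2015-01-01" = 980,
                  "2015-01-02" = 2240,
                  "2015-01-04" = 1105)
)

# names() returns the labels attached to each data point, as text
dates <- names(parsedData$daily_views)
print(dates)

# unname() returns the same data points with the labels removed
counts <- unname(parsedData$daily_views)
print(counts)
```

Note that the names come back as plain text, in whatever order they were stored, which is why the dates still need tidying later on.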
Now we want to put the names in one column, and the views data in another. Try typing this in:
data.frame(Date=names(parsedData$daily_views),   # get the names
           Views=parsedData$daily_views)         # get the data points
Note: Where the code consists of more than one line of code you must hit ‘Enter’ at the end of each line.
This looks sort of OK, but we have the dates twice! This is because R is still trying to use the dates as names, now for the rows.
Try this instead:
data.frame(Date=names(parsedData$daily_views),   # get the names
           Views=parsedData$daily_views,         # get the data points
           row.names=NULL)                       # tell R to stop using the dates as names
OK - this looks more like it. Let’s save this table in a variable called “viewsData”.
viewsData <- data.frame(Date=names(parsedData$daily_views),   # get the names
                        Views=parsedData$daily_views,         # get the data points
                        row.names=NULL)                       # stop using the dates as names
Have a look:
viewsData
We’ve got our table.
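The effect of row.names=NULL is easiest to see on a small example. The sketch below builds the table both ways from a made-up stand-in for parsedData$daily_views (the variable names dailyViews and withDateNames are only for illustration) and compares the row names.

```r
# Made-up stand-in for parsedData$daily_views (not real Wikipedia figures)
dailyViews <- c("2015-01-03" = 1510,
                "2015-01-01" = 980,
                "2015-01-02" = 2240,
                "2015-01-04" = 1105)

# Without row.names=NULL, R reuses the names of the views as row names,
# so the dates appear twice: once as row names, once in the Date column
withDateNames <- data.frame(Date = names(dailyViews),
                            Views = dailyViews)
print(withDateNames)

# With row.names=NULL, R numbers the rows instead
viewsData <- data.frame(Date = names(dailyViews),
                        Views = dailyViews,
                        row.names = NULL)
print(viewsData)
```

Both tables hold the same two columns; only the row labels differ.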
We can get the Date column back out like this:
viewsData$Date
… and the Views column back out like this:
viewsData$Views
You can get the third row of the table out like this:
viewsData[3,]
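As a rough sketch of how these two kinds of indexing behave, the example below uses a small made-up table (invented values, not real Wikipedia figures). Inside the square brackets, the part before the comma picks rows and the part after the comma picks columns; leaving either part empty means "all of them".

```r
# Made-up stand-in for viewsData
viewsData <- data.frame(
  Date  = c("2015-01-03", "2015-01-01", "2015-01-02", "2015-01-04"),
  Views = c(1510, 980, 2240, 1105)
)

# A whole column, by name
allViews <- viewsData$Views

# The third row, all columns (nothing after the comma)
thirdRow <- viewsData[3, ]

# A single cell: third row, Views column
thirdViews <- viewsData[3, "Views"]

print(allViews)
print(thirdRow)
print(thirdViews)
```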
Great - our table works. However, you might have spotted that the dates are in the wrong order. To sort this out, we have to tell R that this column contains dates:
viewsData$Date <- as.Date(viewsData$Date)
We now need to tell R to order all the rows according to the order of the dates.
There is a command called “order” which will work out which is the earliest date in the list, which is the second earliest and so on.
We can run it on the Date column like this:
order(viewsData$Date)
We can tell R to re-order the rows in the order specified by the order function, so that the first date is first, the second date is second, and so on:
viewsData[order(viewsData$Date),]
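It can help to see that order does not return the sorted dates themselves, but the positions at which they can be found. A minimal sketch with four made-up dates:

```r
# Four made-up dates, deliberately out of order
dates <- as.Date(c("2015-01-03", "2015-01-01", "2015-01-02", "2015-01-04"))

# order() gives the position of the earliest date, then the position of
# the second earliest, and so on
rowOrder <- order(dates)
print(rowOrder)   # 2 3 1 4: the earliest date is in position 2

# Using those positions as an index puts the dates in order
sortedDates <- dates[rowOrder]
print(sortedDates)
```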
This now looks a bit better… So let’s replace the old “viewsData” with this new table
viewsData <- viewsData[order(viewsData$Date),]
Let’s have a look:
viewsData
The row numbers being in a strange order can be distracting, so let’s reset them!
row.names(viewsData) <- NULL
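Put together, the three tidying steps can be sketched on a small made-up table as follows. The values are invented for illustration, and the variable rowNamesBeforeReset exists only to show what the row names look like before they are reset.

```r
# Made-up stand-in for viewsData, with the dates stored as text
# and deliberately out of order
viewsData <- data.frame(
  Date  = c("2015-01-03", "2015-01-01", "2015-01-02", "2015-01-04"),
  Views = c(1510, 980, 2240, 1105)
)

# Step 1: tell R that the Date column contains dates
viewsData$Date <- as.Date(viewsData$Date)

# Step 2: re-order the rows by date
viewsData <- viewsData[order(viewsData$Date), ]

# The rows have moved, but each row has kept its old number
rowNamesBeforeReset <- rownames(viewsData)
print(rowNamesBeforeReset)   # "2" "3" "1" "4"

# Step 3: reset the row numbers so they run 1, 2, 3, ...
row.names(viewsData) <- NULL
print(viewsData)
```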
Finally, we’ve tidied our data up! So let’s plot the data and see what it looks like…
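The ggplot2 package lets us draw plots. Install it first (you only need to do this once on each computer you run R on):
install.packages("ggplot2")
Then load it:
library(ggplot2)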
Here’s how to plot a line graph of our data with ggplot2:
ggplot(data=viewsData,     # Make a plot using our views data
       aes(x=Date,         # with Date on the x-axis
           y=Views,        # and Views on the y-axis
           group=1)) +     # Use all the data as one data series
  geom_line()              # and draw a line of this data
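If you want to check that ggplot2 is working before plotting your own data, here is a minimal sketch using a small made-up table (invented values, not real Wikipedia figures). It assumes ggplot2 is already installed. The variable name viewsPlot is only for illustration; storing the plot in a variable is optional, but it lets you draw the same plot again later.

```r
library(ggplot2)  # assumes ggplot2 has already been installed

# Made-up stand-in for the tidied viewsData table
viewsData <- data.frame(
  Date  = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04")),
  Views = c(980, 2240, 1510, 1105)
)

# Build the plot: Date on the x-axis, Views on the y-axis,
# all points treated as one series and joined by a line
viewsPlot <- ggplot(data = viewsData,
                    aes(x = Date, y = Views, group = 1)) +
  geom_line()

# Draw the plot
print(viewsPlot)
```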
There are some clear peaks…
Given that we’re looking at the page about “Friday”, what might this be due to?
We’ve still only got less than a month’s worth of data, however. Next session, we will look at how we can get R to help us get more data.
© Warwick Business School, The University of Warwick