Skip to 0 minutes and 9 seconds Welcome back. So last session, we saw that Wikipedia actually makes information available on how many people have viewed different pages. So for example, if you go to stats.grok.se, and you type in the name of the page, like Friday, then you get a nice graph which looks a little bit like this. Now the good news is that you can actually get the raw data if you want to analyse this information, not just plot graphs on the website. You can get that data by clicking on this page in JSON format down here. Unfortunately, what turns up is not a nice table as you might hope, of data which is really easy to analyse.
Skip to 0 minutes and 51 seconds Instead what you get is a lot of strange punctuation symbols, so funny brackets and colons and things. Now thankfully, it’s not actually that difficult to take this data and turn it into a table that you can work with. What we need to do is get this information out of the web browser and into a programme which can understand it. Now to do our data analysis here and throughout the rest of the course, we’re going to use R. Now, R is an industry leading package for analysing data. And conveniently, thanks to the generosity of the many programmers who’ve contributed to it, it’s free.
Skip to 1 minute and 30 seconds Here, I’m going to assume you’ve already followed the instructions that we’ve made available for installing R and RStudio. Go and have a look at the course website if not. So let’s open RStudio. I’ve opened it in the background already. Here we go. Now, lots of people have actually already written programmes which can help us solve parts of our problem. For example, there’s already a package, or in other words, a collection of programmes which will let us download data from the web and get it into R. This package is called RCurl. So let’s install it. We can use a command called install.packages.
Skip to 2 minutes and 15 seconds Now if you want to install a package, you only have to do this once. Once you’ve installed it on a computer, you don’t need to install it again. It will always be there when you load R again. So that’s worked. We’ve installed it. If you want to use that programme, you also have to load it. So we use the command library to do that. So RCurl, this time without quotes. There we go. Excellent. It’s loaded. So finally, let’s download the raw data from the website. We can use exactly the URL that you saw in your web browser.
Skip to 2 minutes and 52 seconds So if we go back to the web browser, copy the URL, take it back to RStudio, and then we can type in a function called getURL and just paste in that URL between quotes. If you hit Enter, there you go. Excellent. So good news is that R has successfully managed to get the data. The bad news is it pretty much looks exactly like what we had in the browser. And possibly even worse. We’ve got some further strange slash symbols in there. Okay. So let’s try and turn this information into a table which we can understand. Now in order to help us work with this data, we want to save it somewhere rather than just printing it to the screen.
Skip to 3 minutes and 40 seconds What we’re looking at here is what we call raw data. Data which we haven’t processed at all. So let’s label the place we’re going to save it “rawData.” So here we go. Let’s type in raw data and put some information in there using the arrow symbol. And then we type in getURL, as we did before, and paste in the URL. There we go. Excellent. So it seems to have done something. If we type in raw data, we can see what it’s done. Brilliant. It’s there, just like it was before. So let’s try and turn this into a table. Now the good news is this strange looking format is actually a very well-known format. It’s called JSON.
Skip to 4 minutes and 23 seconds And other people have written programmes which will read the information you can see on your screen and turn it into a table that you can work with in a much easier fashion. So first we need to install one of these programmes for parsing this data. So parsing is basically interpreting this format and turning it into a different format that we can work with more easily. So let’s install this package. Again, you only need to do this once on the computer you’re working on.
Skip to 4 minutes and 56 seconds And the package is called RJSONIO. There we go. Excellent. It’s installed. And now, again as before, we need to load it. This is a bit you have to do every time you want to use this package in R. Again, without quotes.
Skip to 5 minutes and 14 seconds Brilliant. Okay. So in order to turn that strange looking data into a table, in fact we just need one function called fromJSON, which we write like this. And we pass the information we saved before in “rawData” to this function. Like that, just by putting in the brackets afterwards. If we press Enter, then you can see that what comes out is looking a lot easier to understand. There’s still some strange complicated bits there, but we can see at the top, especially under “$daily_views,” there’s something which looks a bit like a table, which we might be able to work with. So as we did before, let’s try and save this data somewhere. Now, we’ve parsed the data, as we were discussing before.
Skip to 6 minutes and 2 seconds So rather than calling it “raw data,” let’s call it “parsed data” now. So type in parsedData. And if I just type in exactly what I typed in before, instead of printing the parsed data to the screen, it’ll save it in “parsedData,” another variable. Let’s check what’s there, make sure it’s worked. There you go. Excellent. Printed the same thing to the screen. So how do we get access to that bit of the data which is labelled “$daily_views”? Well, let’s try typing this in. The place we just saved it is called “parsedData.” And now let’s type dollar daily underscore views.
Skip to 6 minutes and 45 seconds There you go. There’s the table. So we’ve got something which looks like it might be a bit easier to work with. It’d be nice if we could make a graph of our own to see what’s going on. So next week, we’ll have a look at how to do this. See you then.
Finding out what people are looking for on Wikipedia using R
Note: In order to see the details of the commands and screen-based information it is best to view this video full screen and in HD. However, if you’d like to have a go yourself, we strongly recommend you follow the instructions in the text below. Here, we tell you what the commands in the video are, and we have also made a couple of small updates to the text since we recorded this video, which will help you get your code to work. You’ll find following the text a lot easier than copying the commands from the video.
Welcome to this first exercise of the Big Data course. The exercises in this course are designed for you to follow and run yourself in RStudio and R.
The video above is a representation, a walkthrough, of the full exercise below. You can watch the video to gain an understanding of the use of data and R or, if you are interested in trying the processes for yourself, then follow the exercise text below that will guide you through the various steps. If you would like to try the processes yourself, note that the text below has been updated a little since the video was recorded, so make sure that you do follow the instructions in the text.
Google is not the only Internet-based service that makes data on the usage of its service available. Another example is Wikipedia. Most people know that Wikipedia makes articles on a range of subjects available for free. However, it also provides free access to data on how many people have looked at these articles.
At the moment, https://tools.wmflabs.org/pageviews only provides data from July 2015 onwards, so we will begin by looking at how to retrieve data from http://stats.grok.se/, which provides years of data from December 2007 to January 2016. At the end of these exercises, we will provide some pointers for anyone who would like to extend their skills to retrieve data from https://tools.wmflabs.org/pageviews too.
Go to http://stats.grok.se/, select the options “English” (for English language Wikipedia), “201512” (for December 2015), and type in the article name “Christmas”. What do you see?
Views of Wikipedia pages also tend to reflect current events in the world. For example, go to http://stats.grok.se/, select the options “English” (for English language Wikipedia), “201509” (for September 2015), and type in the article name “Volkswagen”. What do you see? Where does this surge come from? Wikipedia’s list of events in 2015 might help you work this out, and give you some more ideas for exploration: https://en.wikipedia.org/wiki/2015.
Now, let’s try an example of a simple pattern we might expect to repeat throughout the year. On the http://stats.grok.se/ website, select “English”, “201410” (for October 2014) and type in “Friday”.
To get hold of this data to do our own analysis, we can click on the little link at the bottom of the page: “This page in json format”, which brings us to this page: http://stats.grok.se/json/en/201410/Friday
Unfortunately, we find that the data does not come in a nice table that we can work with easily! Instead, you get a page full of funny brackets, colons, and other punctuation signs.
The good news is that it’s not so hard to turn this strange looking information into a table.
We need to get this information out of the web browser and into a program, which is going to help us analyse it.
We are going to use R, which is an industry leading environment for analysing data, with the excellent editor program RStudio. Conveniently, thanks to the generosity of the many programmers who have contributed to R and RStudio, they are both free!
Let’s open RStudio. These exercises, one per week except in Week 5, have instructions, followed by commands you can type into the window labelled “Console”, or copy from here:
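As a first command to try, here is a minimal sketch. The exact text printed is an assumption chosen for illustration; any text between the quotes will do.

```r
# Print a short greeting in the Console.
# The wording of the greeting is an illustrative assumption.
cat("Hello, World!")
```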
You can ignore anything on a line written after a # sign – this is just a comment. There is no need to copy this across – although if you do, R will ignore it too. Try it out:
cat("Hello again, World!") # This is a comment
This exercise aims to explain a lot of the commands used. However, R has some great built in help too. If you would like to learn more about a command (for example, “install.packages”), try the R ? feature:
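A minimal sketch of asking for help on “install.packages”. Both forms below are part of base R and lead to the same help page; the second is simply the same request written as an ordinary function call.

```r
# Open the help page for install.packages
?install.packages

# The same request, written as a function call
help("install.packages")
```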
So, back to our original problem. How do we use R to find out what people are looking for on Wikipedia?
Now, lots of people have already written programs which can help us solve parts of our problem. In R, these are made available in “packages”. For example, there is already a package which will let us download data from the web, and get it into R. This package is called “RCurl”. Let’s install it.
This command installs the package for you - you only ever need to run it once on the computer you are using right now:
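As in the video, the command is install.packages, with the package name in quotes. The sketch below wraps it in a check so that the package is only downloaded if it is missing. The check and the “repos” setting (which names a download server so that R does not need to ask for one) are additions for convenience, not part of the original command.

```r
# Install RCurl only if it is not already on this computer.
# The requireNamespace() check and the repos argument are additions;
# the plain command is install.packages("RCurl").
if (!requireNamespace("RCurl", quietly = TRUE)) {
  install.packages("RCurl", repos = "https://cloud.r-project.org")
}
```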
This command loads the package for you. You need to run this command every time you open R and want to use this package:
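As in the video, the command is library, this time with the package name written without quotes:

```r
library(RCurl) # Load the RCurl package (assumes it has been installed)
```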
Now let’s download the raw data from the website. You can use exactly the URL you see in your web browser.
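As in the video, we pass the URL, between quotes, to the function getURL. The sketch below uses the URL for the “Friday” page in October 2014. The tryCatch wrapper is an addition: if the website cannot be reached, it prints a message rather than stopping with an error.

```r
library(RCurl) # assumes RCurl has been installed

# Download the page and print what comes back.
# tryCatch is an addition so that a failed download gives a message, not an error.
tryCatch(
  getURL("http://stats.grok.se/json/en/201410/Friday"),
  error = function(e) message("Could not download the data: ", conditionMessage(e))
)
```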
The good news is that we’ve got the data! The bad news is that it still looks like the strange format we saw in the web browser.
So that we can try to turn this information into a table which we can understand, let’s save it first.
This is what we would call “raw data”, data which we haven’t processed at all, so let’s label the place we are saving the information “rawData”.
Put the result of getURL in “rawData”:
rawData <- getURL("http://stats.grok.se/json/en/201410/Friday")
The place we are saving the information, “rawData”, is called a variable. It’s just like a box where you can keep information you’d like to do something with later.
Let’s have a look at what’s now in rawData, by typing the variable name into the Console:
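Typing the variable name on its own prints what is stored in it. In the sketch below, the first lines create a small stand-in with made-up view counts, but only if “rawData” does not exist yet, so that the snippet also runs on its own. The real data will be much longer.

```r
# Stand-in with made-up numbers, used only if rawData has not been created yet.
if (!exists("rawData")) {
  rawData <- '{"daily_views": {"2014-10-03": 100, "2014-10-04": 80}, "title": "Friday"}'
}

rawData # Print the contents of the variable
```

When R prints text like this, it puts a backslash in front of each quotation mark inside the text. These backslashes are only part of how R displays the text, not part of the data.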
OK – this looks like what we had before!
Now let’s try and turn this dataset into a table.
This format is actually a well known format called “JSON”. We can install a program which can read this data. This program is called a “parser”.
Again, there is already a “package” which will help R to read JSON data. This package is called “RJSONIO”. Let’s install it:
install.packages("RJSONIO") # Install the JSON parser
library(RJSONIO) # Load the JSON parser
Note: where the code consists of more than one line of code you must hit ‘Enter’ at the end of each line.
To parse the JSON, we can use a function called “fromJSON”:
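As in the video, we pass “rawData” to fromJSON by putting it in the brackets. In this sketch, a small stand-in with made-up view counts is used if “rawData” does not exist yet, so that the snippet also runs on its own.

```r
library(RJSONIO) # assumes RJSONIO has been installed

# Stand-in with made-up numbers, used only if rawData has not been created yet.
if (!exists("rawData")) {
  rawData <- '{"daily_views": {"2014-10-03": 100, "2014-10-04": 80}, "title": "Friday"}'
}

fromJSON(rawData) # Parse the JSON and print the result
```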
What comes out of this function looks a bit more understandable, particularly the bit under “$daily_views”. Let’s save the parsed data in a variable called “parsedData”:
parsedData <- fromJSON(rawData)
Let’s check what’s now saved in parsedData, by typing the variable name into the Console:
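As before, typing the variable name prints its contents. The stand-in in this sketch is built by hand with made-up numbers, and its shape (a list holding a named set of numbers) is an assumption about what fromJSON returns for this data. It is only used if “parsedData” does not exist yet.

```r
# Hand-built stand-in with made-up numbers, used only if parsedData does not exist yet.
if (!exists("parsedData")) {
  parsedData <- list(
    daily_views = c("2014-10-03" = 100, "2014-10-04" = 80),
    title = "Friday"
  )
}

parsedData # Print the contents of the variable
```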
… looks good.
How can we access the part of parsedData labelled “$daily_views”? Try typing this in:
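As in the video, we type the variable name, then a dollar sign, then daily_views. The stand-in in this sketch is built by hand with made-up numbers and is only used if “parsedData” does not exist yet.

```r
# Hand-built stand-in with made-up numbers, used only if parsedData does not exist yet.
if (!exists("parsedData")) {
  parsedData <- list(
    daily_views = c("2014-10-03" = 100, "2014-10-04" = 80),
    title = "Friday"
  )
}

parsedData$daily_views # Pick out the part labelled daily_views
```

The $ sign picks out one named part of the information stored in a variable.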
OK - it looks like we’ve got something which is a bit more like a table!
Excellent – you’ve downloaded data on how often people are looking at a Wikipedia page and loaded it into R!
How can we make a graph of this dataset so we can take a better look at what we’ve downloaded? This is what we’ll look at next week.
© Warwick Business School, The University of Warwick