
Understanding Wikipedia Data Using R

Note: In order to see the details of the commands and screen-based information it is best to view this video full screen and in HD. However, if you’d like to have a go yourself, we strongly recommend you follow the instructions in the text below. Here, we tell you what the commands in the video are, and we also have made a couple of small updates to the text since we recorded this video which will help you get your code to work. You’ll find following the text a lot easier than copying the commands from the video.
This exercise forms part of the Big Data: Measuring And Predicting Human Behaviour course. The exercises in this course are designed for you to follow and run yourself in RStudio and R.
The video above is a representation, a walkthrough, of the full exercise below. You can watch the video to gain an understanding of the use of data and R or, if you are interested in trying the processes for yourself, then follow the exercise text below that will guide you through the various steps. If you would like to try the processes yourself, note that the text below has been updated a little since the video was recorded, so make sure that you do follow the instructions in the text.

Wikipedia’s Available Data

It’s not just Google who makes data on usage of its service available – other internet-based services do too. An example is Wikipedia. Most people know that Wikipedia makes articles on a range of subjects available for free. However, it also provides free access to data on how many people have looked at these articles.
There are two main websites which provide access to this data: https://tools.wmflabs.org/pageviews and http://stats.grok.se/.
At the moment, https://tools.wmflabs.org/pageviews only provides data from July 2015 onwards, so we will begin by looking at how to retrieve data from http://stats.grok.se/, which provides data from December 2007 to January 2016. At the end of these exercises, we will provide some pointers for anyone who would like to extend their skills to retrieve data from https://tools.wmflabs.org/pageviews too.
Go to http://stats.grok.se/, select the options “English” (for English language Wikipedia), “201512” (for December 2015), and type in the article name “Christmas”. What do you see?
Views to Wikipedia pages also tend to reflect current events in the world. For example, go to http://stats.grok.se/, select the options “English” (for English language Wikipedia), “201509” (for September 2015), and type in the article name “Volkswagen”. What do you see? Where does this surge come from? Wikipedia’s list of events in 2015 might help you work this out, and give you some more ideas for exploration: https://en.wikipedia.org/wiki/2015.
Now, let’s try an example of a simple pattern we might expect to repeat throughout the year. On the http://stats.grok.se/ website, select “English”, “201410” (for October 2014) and type in “Friday”.
To get hold of this data to do our own analysis, we can click on the little link at the bottom of the page: “This page in json format”, which brings us to this page: http://stats.grok.se/json/en/201410/Friday


Unfortunately, we find that the data does not come in a nice table that we can work with easily! Instead, we get a page full of funny brackets, colons, and other punctuation signs.
The good news is that it’s not so hard to turn this strange looking information into a table.
We need to get this information out of the web browser and into a program, which is going to help us analyse it.

R and R Studio

We are going to use R, an industry-leading environment for analysing data, together with the excellent editor RStudio. Conveniently, thanks to the generosity of the many programmers who have contributed to R and RStudio, both are free!
Let’s open RStudio. These exercises, one per week except in Week 5, have instructions, followed by commands you can type into the window labelled “Console”, or copy from here:
 cat("Hello, World...")

 

You can ignore anything on a line written after a # sign – this is just a comment. There is no need to copy this across – although if you do, R will ignore it too. Try it out:

 

 cat("Hello again, World!") # This is a comment

Getting Help in R

This exercise aims to explain a lot of the commands used. However, R has some great built-in help too. If you would like to learn more about a command (for example, “install.packages”), try R’s ? feature:

 

 ?install.packages

 

So, back to our original problem:


 

Now, lots of people have already written programs which can help us solve parts of our problem. In R, these are made available in “packages”. For example, there is already a package which will let us download data from the web, and get it into R. This package is called “RCurl”. Let’s install it.

 

This command installs the package for you – you only ever need to run it once on the computer you are using right now:

 

 install.packages("RCurl")

 

This command loads the package for you. You need to run this command every time you open R and want to use this package:

 

 library(RCurl)

 

Now let’s download the raw data from the website. You can use exactly the URL you see in your web browser.

 

 getURL("http://stats.grok.se/json/en/201410/Friday")

 

The good news is that we’ve got the data! The bad news is that it still looks like the strange format we saw in the web browser.

 

So that we can try to turn this information into a table which we can understand, let’s save the information.

 

This is what we would call “raw data”, data which we haven’t processed at all, so let’s label the place we are saving the information “rawData”.

 

Put the result of getURL in “rawData”:

 

 rawData <- getURL("http://stats.grok.se/json/en/201410/Friday")

 

The place we are saving the information, “rawData”, is called a variable. It’s just like a box where you can keep information you’d like to do something with later.
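If variables are new to you, here is a tiny illustration, separate from the exercise, of putting a value into a “box” and reusing it later (the name “greeting” is just made up for this example):

```r
# Variables are labelled boxes: store a value now, reuse it later
greeting <- "Hello"        # put the text "Hello" in a box called greeting
cat(greeting)              # look at what is in the box
nchar(greeting)            # reuse it: count the characters stored inside
```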

 

Let’s have a look at what’s now in rawData, by typing the variable name into the Console:

 

 rawData

 

OK – this looks like what we had before!

 

Now let’s try and turn this dataset into a table.

 

This format is actually a well known format called “JSON”. We can install a program which can read this data. This program is called a “parser”.

 

Again, there is already a “package” which will help R to read JSON data. This package is called “RJSONIO”. Let’s install it:

 

 install.packages("RJSONIO") # Install the JSON parser
 library(RJSONIO) # Load the JSON parser

 

 

Note: where the code consists of more than one line of code you must hit ‘Enter’ at the end of each line.

 

 

To parse the JSON, we can use a function called “fromJSON”:

 

 fromJSON(rawData)

 

What comes out of this function looks a bit more understandable, particularly the bit under “$daily_views”. Let’s save the parsed data in a variable called “parsedData”:

 

 parsedData <- fromJSON(rawData)

 

Let’s check what’s now saved in parsedData, by typing the variable name into the Console:

 

 parsedData

 

… looks good.
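If you would like to see more clearly what fromJSON is doing, you can try it on a small JSON string typed in by hand. The field names below are just made up for illustration:

```r
library(RJSONIO)  # load the JSON parser, as installed above

# A tiny hand-written JSON string: two fields, one holding a list of numbers
tinyJSON <- '{"article": "Friday", "views": [10, 20, 30]}'

parsed <- fromJSON(tinyJSON)
parsed$article   # the text field
parsed$views     # the numbers, now a vector R can work with
```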

 

How can we access the information in the variable labelled “$daily_views”? Try typing this in:

 

 parsedData$daily_views

OK – it looks like we’ve got something which is a bit more like a table!
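If you want to go one step beyond the exercise, the daily views can be reshaped into a proper R data frame. Since the exact numbers depend on your download, the sketch below mimics the structure of parsedData$daily_views with a tiny made-up list of dates and counts; to run it on the real data, use parsedData$daily_views in place of dailyViews:

```r
# A tiny stand-in with the same shape as parsedData$daily_views:
# names are dates, values are view counts (made-up numbers)
dailyViews <- list("2014-10-03" = 12000, "2014-10-01" = 9500, "2014-10-02" = 9800)

# Build a data frame and sort it by date
viewTable <- data.frame(date = as.Date(names(dailyViews)),
                        views = unlist(dailyViews))
viewTable <- viewTable[order(viewTable$date), ]
viewTable
```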

Excellent – you’ve downloaded data on how often people are looking at a Wikipedia page and loaded it into R!

This article is from the free online course Big Data: Measuring And Predicting Human Behaviour.
