
[0:05] In this video we will demonstrate how to perform a simple big-data analysis using RHadoop.

[0:12] We assume that you have: run your virtual machine and logged in as hduser; started Hadoop and RStudio; set the system environment for Hadoop within RStudio; and loaded the rhdfs and rmr2 libraries for RHadoop. If you have not done this, please watch the video or read the article on this topic.
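
For reference, this assumed setup corresponds roughly to the following R commands; the Hadoop paths below are illustrative and depend on your virtual machine:

Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")   # path to the hadoop binary (illustrative)
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar")   # streaming jar (illustrative)
library(rhdfs); hdfs.init()   # HDFS access from R
library(rmr2)                 # map-reduce from R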

[0:40] First, we show how to perform basic data management with RHadoop. Let us create, step by step, a simple synthetic data frame called Data with two columns and 1000 rows. The first column, labelled ‘value’, contains random integers following a binomial distribution. The second column is called ‘group’ and contains random letters from ‘a’ to ‘e’. We store this data frame in the distributed file system by calling to.dfs with the name ‘SmallData’. We can see this DFS data file in the terminal window by typing hadoop fs -ls /. If we later want to retrieve this data file back into active memory, we simply call from.dfs.

[1:44] In this case we get a key-value pair with a void key and with a value equal to Data. When needed, we can also delete it from the distributed file system with the script dfs.rmr. Note that creating and storing big-data files is usually done outside R. Also, loading data into active memory by calling from.dfs can be done only for small data or for small blocks of data within map calls.

[2:23] We continue with a demonstration that shows how to perform a simple data analysis with map-reduce. We will consider a not-so-big data file, which is already stored in the distributed file system in the root directory under the name CEnetBig. You can see it in the terminal window by typing hadoop fs -ls /.

[2:52] This file is small rather than big: it is only approximately 34 megabytes. Its structure in DFS can be observed by running this command in the terminal window. It is not decomposed into blocks, since it is not big enough and since we have only one data node. Because it is so small, we can load this data file into RStudio and play with it by a simple call of the command from.dfs. The file CEnetBig contains data on about 1 million customers of the company CEnet. Here we can see the data for the first 10 customers. Each customer is represented by one row, with the customer’s id in the first column.

[3:45] The following twelve columns contain the values of the customer’s bills for the period January 2016 – December 2016. The last column is named ‘type’ and records which of five possible products the customer has. We can check that CEnetBig has 1 million rows and 14 columns with the command dim.

[4:14] Now we will perform a very simple task using map-reduce. We will count how many customers have a particular product.

[4:25] Actually, we can do this directly, without map-reduce, by a simple call of the script table:

[4:34] However, to perform this task using map-reduce, we need to apply the script table to every data block that is passed to the map function. The map function, therefore, counts the customers in a given data block that have a particular product and returns these frequencies as a key-value pair, where the key is always 1 and the value is a list whose only element is the table of frequencies. The reducer simply groups all the returned key-value pairs and performs a matrix sum over all the values, i.e., over all the frequency tables. The final map-reduce code for computing the distribution of customers by type of product is simple.

[5:37] The resulting BigTable is a key-value pair with only one key, which is 1, and whose value is the desired frequency distribution of products among all customers in our database.

First Big Data example with RHadoop

All code in the text below pertains to R unless stated otherwise.

Synthetic data

We create a synthetic data frame called Data with two columns and 1000 rows:

N = 1000                                              # number of rows
X = rbinom(N, 5, 0.5)                                 # random integers from a binomial distribution
Y = sample(c("a","b","c","d","e"), N, replace=TRUE)   # random letters from 'a' to 'e'
Data = data.frame(X, Y)
colnames(Data) = c("value", "group")
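
A quick look at the first rows confirms the structure, for example:

head(Data)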

This data frame can be stored in the distributed file system under the name ‘SmallData’ by calling to.dfs:

to.dfs(Data, "/SmallData", format="native")   # "/SmallData" matches the path used below
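
As in the video, you can check that the file is now stored by listing the HDFS root directory in the terminal window:

hadoop fs -ls /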

To retrieve this data file back into active memory, simply call from.dfs:

SmallData=from.dfs("/SmallData")
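
To check what came back, we can inspect the pair with the rmr2 helpers keys and values, for example:

keys(SmallData)           # NULL, i.e. the void key
head(values(SmallData))   # first rows of the original Data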

In this case we get a key-value pair with a void key and with a value equal to Data.
When needed, we can delete it from the distributed file system with the script dfs.rmr:

dfs.rmr("/SmallData")

CEnetBig example

We continue with a not-so-big data file, which is already stored in the distributed file system in the root directory under the name CEnetBig. To see it, type the following in the terminal window:

hadoop fs -ls /

This file is only approximately 34 megabytes in size. Its structure in DFS can be observed by running the following command in the terminal window:

hdfs fsck /CEnetBig

It is not decomposed into blocks, since it is not big enough and since we have only one data node (see the video).

The file CEnetBig contains data on about 1 million customers of the company CEnet. We load it into active memory with the R script:

CEnetBig<-from.dfs("/CEnetBig")

Each customer is represented by one row, with the customer’s id in the first column. The following twelve columns contain the values of the customer’s bills for the period January 2016 – December 2016. The last column is named ‘type’ and records which of five possible products the customer has. We can check that CEnetBig has 1 million rows and 14 columns with the R script:

dim(CEnetBig$val)
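
To see the data for the first 10 customers, as shown in the video, one can run, for example:

head(CEnetBig$val, 10)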

We will perform a very simple task using map-reduce: we will count how many customers have a particular product. Actually, we can do this directly, without map-reduce, by a simple call of the R script table:

T = table(CEnetBig$val[,"type"]); T   # frequency of each product type

However, to perform this task using map-reduce, we need to apply the script table to every data block that is passed to the map function. The map function, therefore, counts the customers in a given data block that have a particular product and returns these frequencies as a key-value pair, where the key is always 1 and the value is a list whose only element is the table of frequencies:

mapperTable = function (., X) {
  T = table(X[,"type"])   # frequencies of the product types in this data block
  keyval(1, list(T))      # emit them under the constant key 1
}

The reducer simply groups all the returned key-value pairs and performs a matrix sum over all the values, i.e., over all the frequency tables:

reducerTable = function(k, A) {
  keyval(k, list(Reduce('+', A)))   # elementwise sum of all frequency tables
}
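
The call Reduce('+', A) folds the list of per-block frequency tables into a single elementwise sum. A minimal base-R illustration of the same idea:

Reduce('+', list(c(a=1, b=2), c(a=3, b=4)))
# a b
# 4 6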

The final map-reduce code for computing the distribution of customers in terms of the type of product is simple:

BigTable<-from.dfs(
  mapreduce(
    input = "/CEnetBig",
    map = mapperTable,
    reduce = reducerTable
  )
)
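
As an aside, while developing such jobs it is convenient to test them without Hadoop: rmr2 can execute the same code in local memory if you switch its backend:

rmr.options(backend = "local")    # run map-reduce jobs in local memory, for testing
rmr.options(backend = "hadoop")   # switch back to the Hadoop backend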

The resulting BigTable is a key-value pair with only one key, which is 1, and whose value is the desired frequency distribution of products among all customers in our database:

BigTable$val
[[1]]

     1      2      3      4      5 
199857 250569  99595 299849 150130
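
As a sanity check, this map-reduce result should match the direct computation with table above (assuming T is still in scope):

all.equal(as.vector(BigTable$val[[1]]), as.vector(T))   # should be TRUE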
