
[0:05] In this video we will demonstrate how to perform a simple big-data analysis using RHadoop.

[0:12] We assume that you have: run your virtual machine and logged in as hduser; started Hadoop and RStudio; set the system environment for Hadoop within RStudio; and loaded the rhdfs and rmr2 libraries for RHadoop. If you have not done this, please watch the video or read the article on this topic.
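
For reference, this assumed setup corresponds roughly to the following R commands; the Hadoop paths below are illustrative and depend on your virtual machine:

Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")   # path to the hadoop binary (illustrative)
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar")   # streaming jar (illustrative)
library(rhdfs); hdfs.init()   # HDFS access from R
library(rmr2)                 # map-reduce from R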

[0:40] First, we show how to perform basic data management with RHadoop. Let us create, step by step, a simple synthetic data frame called Data with two columns and 1000 rows. The first column, labelled ‘value’, contains random integers following a binomial distribution. The second column is called ‘group’ and contains random letters from ‘a’ to ‘e’. We store this data frame in the distributed file system by calling to.dfs with the name ‘SmallData’. We can see this DFS data file in the terminal window by typing hadoop fs -ls /. If we later want to retrieve this data file back into active memory, we simply call from.dfs.

[1:44] In this case we get a key-value pair with a void key and with a value equal to Data. When needed, we can also delete it from the distributed file system with the script dfs.rmr. Note that creating and storing big-data files is usually done outside R. Also, loading data into active memory by calling from.dfs can be done only for small data or for small blocks of data within map calls.

[2:23] We continue with a demonstration that shows how to perform a simple data analysis with map-reduce. We will consider a not-so-big data file, which is already stored in the distributed file system in the root directory under the name CEnetBig. You can see it in the terminal window by typing hadoop fs -ls /.

[2:52] This file is small rather than big: it is only approximately 34 megabytes. Its structure in DFS can be observed by running this command in the terminal window. It is not decomposed into blocks, since it is not big enough and since we have only one data node. Because it is so small, we can load this data file into RStudio and play with it by a simple call of the command from.dfs. The file CEnetBig contains data on about 1 million customers of the company CEnet. Here we can see the data for the first 10 customers. Each customer is represented by one row, with the customer’s id in the first column.

[3:45] The following twelve columns contain the values of the customer’s bills for the period January 2016 – December 2016. The last column is named ‘type’ and records which of five possible products the customer has. We can check that CEnetBig has 1 million rows and 14 columns with the command dim.

[4:14] Now we will perform a very simple task using map-reduce. We will count how many customers have a particular product.

[4:25] Actually, we can do this directly, without map-reduce, by a simple call of the script table:

[4:34] However, to perform this task using map-reduce, we need to apply the script table to every data block that is passed to the map function. The map function, therefore, counts the customers in a given data block that have a particular product and returns these frequencies as a key-value pair, where the key is always 1 and the value is a list whose only element is the table of frequencies. The reducer simply groups all the returned key-value pairs and performs a matrix sum over all the values, i.e., over all the frequency tables. The final map-reduce code for computing the distribution of customers by type of product is simple.

[5:37] The resulting BigTable is a key-value pair with only one key, which is 1, and whose value is the desired frequency distribution of products among all customers in our database.

First Big Data example with RHadoop

All code in the text below pertains to R unless stated otherwise.

Synthetic data

We create a synthetic data frame called Data with two columns and 1000 rows:

N = 1000                                              # number of rows
X = rbinom(N, 5, 0.5)                                 # random integers from a binomial distribution
Y = sample(c("a","b","c","d","e"), N, replace=TRUE)   # random letters from 'a' to 'e'
Data = data.frame(X, Y)
colnames(Data) = c("value", "group")
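
A quick look at the first rows confirms the structure, for example:

head(Data)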

This data frame can be stored in the distributed file system under the name ‘SmallData’ by calling to.dfs:

to.dfs(Data, "/SmallData", format="native")   # "/SmallData" matches the path used below
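
As in the video, you can check that the file is now stored by listing the HDFS root directory in the terminal window:

hadoop fs -ls /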

To retrieve this data file back into active memory, simply call from.dfs:

SmallData=from.dfs("/SmallData")
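
To check what came back, we can inspect the pair with the rmr2 helpers keys and values, for example:

keys(SmallData)           # NULL, i.e. the void key
head(values(SmallData))   # first rows of the original Data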

In this case we get a key-value pair with a void key and with a value equal to Data.
When needed, we can delete it from the distributed file system with the script dfs.rmr:

dfs.rmr("/SmallData")

CEnetBig example

We continue with a not-so-big data file, which is already stored in the distributed file system in the root directory under the name CEnetBig. To see it, type the following in the terminal window:

hadoop fs -ls /

This file is only approximately 34 megabytes in size. Its structure in DFS can be observed by running the following command in the terminal window:

hdfs fsck /CEnetBig

It is not decomposed into blocks, since it is not big enough and since we have only one data node (see the video).

The file CEnetBig contains data on about 1 million customers of the company CEnet. We load it into active memory with the R script:

CEnetBig<-from.dfs("/CEnetBig")

Each customer is represented by one row, with the customer’s id in the first column. The following twelve columns contain the values of the customer’s bills for the period January 2016 – December 2016. The last column is named ‘type’ and records which of five possible products the customer has. We can check that CEnetBig has 1 million rows and 14 columns with the R script:

dim(CEnetBig$val)
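
To see the data for the first 10 customers, as shown in the video, one can run, for example:

head(CEnetBig$val, 10)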

We will perform a very simple task using map-reduce: we will count how many customers have a particular product. Actually, we can do this directly, without map-reduce, by a simple call of the R script table:

T = table(CEnetBig$val[,"type"]); T   # frequency of each product type

However, to perform this task using map-reduce, we need to apply the script table to every data block that is passed to the map function. The map function, therefore, counts the customers in a given data block that have a particular product and returns these frequencies as a key-value pair, where the key is always 1 and the value is a list whose only element is the table of frequencies:

mapperTable = function (., X) {
  T = table(X[,"type"])   # frequencies of the product types in this data block
  keyval(1, list(T))      # emit them under the constant key 1
}

The reducer simply groups all the returned key-value pairs and performs a matrix sum over all the values, i.e., over all the frequency tables:

reducerTable = function(k, A) {
  keyval(k, list(Reduce('+', A)))   # elementwise sum of all frequency tables
}
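
The call Reduce('+', A) folds the list of per-block frequency tables into a single elementwise sum. A minimal base-R illustration of the same idea:

Reduce('+', list(c(a=1, b=2), c(a=3, b=4)))
# a b
# 4 6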

The final map-reduce code for computing the distribution of customers in terms of the type of product is simple:

BigTable<-from.dfs(
  mapreduce(
    input = "/CEnetBig",
    map = mapperTable,
    reduce = reducerTable
  )
)
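
As an aside, while developing such jobs it is convenient to test them without Hadoop: rmr2 can execute the same code in local memory if you switch its backend:

rmr.options(backend = "local")    # run map-reduce jobs in local memory, for testing
rmr.options(backend = "hadoop")   # switch back to the Hadoop backend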

The resulting BigTable is a key-value pair with only one key, which is 1, and whose value is the desired frequency distribution of products among all customers in our database:

BigTable$val
[[1]]

     1      2      3      4      5 
199857 250569  99595 299849 150130
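
As a sanity check, this map-reduce result should match the direct computation with table above (assuming T is still in scope):

all.equal(as.vector(BigTable$val[[1]]), as.vector(T))   # should be TRUE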
