
First Big Data example with RHadoop

In this example we show how to perform a first big data analysis with RHadoop, using the virtual machine provided in Week 1.

All code in the text below is R code unless stated otherwise.

Synthetic data

We create a synthetic data frame called Data with two columns and 1000 rows:

N=1000
X=rbinom(N, 5, 0.5)                                    # N draws from a Binomial(5, 0.5) distribution
Y=sample(c("a","b","c","d","e"), N, replace=TRUE)      # N random group labels
Data=data.frame(X, Y)
colnames(Data)=c("value","group")

This data frame can be stored in the distributed file system under the name ‘/SmallData’ by calling to.dfs:

to.dfs(Data, "/SmallData", format="native")

To retrieve this file back into active memory, simply call from.dfs:

SmallData=from.dfs("/SmallData")

In this case we get a key-value pair with a NULL key and a value equal to Data.
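To see exactly what came back, we can inspect the key and value components of the returned object (the same $key/$val structure is used for CEnetBig below); this is just a quick optional check:

SmallData$key        # NULL, since no key was assigned when storing
head(SmallData$val)  # the first rows of the data frame Data
dim(SmallData$val)   # 1000 rows, 2 columns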

When it is no longer needed, we can delete it from the distributed file system with dfs.rmr:

dfs.rmr("/SmallData")

CEnetBig example

We continue with a not-so-big data file, which is already stored in the distributed file system in the root directory under the name CEnetBig. To see it, type the following in the terminal window:

$ hadoop fs -ls /

This file is only about 34 megabytes in size. Its structure in HDFS can be inspected by running the following command in the terminal window:

$ hdfs fsck /CEnetBig

or in the R console:

system("hdfs fsck /CEnetBig")

It is not split into multiple blocks, since it is smaller than the HDFS block size and since we have only one data node (see the video).
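If you want to check the configured block size yourself, the following terminal command prints it in bytes (the Hadoop default is 128 MB, but the value configured in the virtual machine may differ):

$ hdfs getconf -confKey dfs.blocksize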

The file CEnetBig contains data on about 1 million customers of the company CEnet.
We load it into active memory with the R call:

CEnetBig<-from.dfs("/CEnetBig")

Each customer is represented by one row. The first column contains the customer’s id, the following twelve columns contain the customer’s monthly bills for the period January 2016 – December 2016, and the last column, named ‘type’, records which of the five possible products the customer has.
We can check that CEnetBig has 1 million rows and 14 columns by typing:

dim(CEnetBig$val)
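To get a feeling for the structure described above, we can also peek at the first few rows and the column names; this is just an optional check:

head(CEnetBig$val)       # first rows: id, twelve monthly bills, type
colnames(CEnetBig$val)   # the 14 column names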

We will perform a very simple task using map-reduce: counting how many customers have each product. Since the data fits in memory here, we can also do it directly, without map-reduce, with a single call of the R function table:

T = table(CEnetBig$val[,"type"]);T

However, to perform this task with map-reduce we need to apply table to every data block that is passed to the map function. The map function therefore counts, for each product, the customers in the given data block and returns these frequencies as a key-value pair, where the key is always 1 and the value is a list whose only element is the table of frequencies:

mapperTable = function(., X) {
  T=table(X[,"type"])   # frequency table of product types in this block
  keyval(1, list(T))    # constant key 1; value is the table wrapped in a list
}

The map-reduce framework groups all returned key-value pairs by key (here there is only the key 1), and the reducer sums all the collected values, i.e., all the frequency tables, element-wise:

reducerTable = function(k, A) {
  keyval(k, list(Reduce('+', A)))   # element-wise sum of all frequency tables
}
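To see what Reduce('+', A) does with a list of frequency tables, here is a small standalone illustration with two hand-made tables (the letters and counts are made up just for this example). The element-wise addition works because both tables contain the same categories, which we can expect to hold in our case since every data block contains many thousands of customers and thus all five product types:

T1=table(c("a","b","b","c"))    # frequencies from one hypothetical block: a=1, b=2, c=1
T2=table(c("a","b","c","c"))    # frequencies from another block:          a=1, b=1, c=2
Reduce('+', list(T1, T2))       # element-wise sum:                        a=2, b=3, c=3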

The final map-reduce code for computing the distribution of customers over product types is simple:

BigTable<-from.dfs(
  mapreduce(
    input = "/CEnetBig",
    map = mapperTable,
    reduce = reducerTable
  )
)

The resulting BigTable is a key-value pair with a single key, 1, whose value is the desired frequency distribution of products among all customers in our database:

BigTable$val
[[1]]

     1      2      3      4      5
199857 250569  99595 299849 150130
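As a quick sanity check we can compare this result with the table T computed directly above; the two frequency tables should agree:

all(BigTable$val[[1]] == T)   # should return TRUE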

Alternative code

The code in the following example does essentially the same as the code above, but now the keys are the product types 1 to 5, and for each key the result is the total number of customers in /CEnetBig whose type equals that key.

mapperTable = function(., X) {
  T=as.list(table(X[,"type"]))   # frequencies in this block, as a list
  key=as.numeric(names(T))       # the product types become the keys
  keyval(key, T)                 # one key-value pair per product type
}

reducerTable = function(k, A) {
  S=Reduce('+', A)      # sum the counts collected for this key
  keyval(k, list(S))
}

BigTable1<-from.dfs(
  mapreduce(
    input = "/CEnetBig",
    map = mapperTable,
    reduce = reducerTable
  )
)

This example should clarify how the grouping works: the map-reduce framework collects all pairs (key, value) that share the same key and passes the list of their values to the reducer, which combines them with Reduce (here with the operator '+') into the pair (key, combined result). The function keyval then returns these pairs of key and value. The final result is again a key-value pair where:

  • key contains all keys returned by the reducer, and
  • value contains the corresponding values returned by the reducer.
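If you prefer a single named vector like the one produced by table, one possible way to reshape the result is to assemble it from the key and value components of BigTable1 (note that the order of the keys returned by the reducer is not guaranteed):

setNames(unlist(BigTable1$val), BigTable1$key)   # named vector of frequencies, one entry per product type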