This content is taken from the Partnership for Advanced Computing in Europe (PRACE)'s online course, Managing Big Data with R and Hadoop.

## Partnership for Advanced Computing in Europe (PRACE)

[0:05] In this video we will demonstrate how to perform a simple big-data analysis using RHadoop.

[0:12] We assume that you have:

- run your virtual machine and logged in as hduser;
- started Hadoop and RStudio;
- set the system environment for Hadoop within RStudio;
- loaded the libraries rhdfs and rmr2 for RHadoop.

If you have not done this, please watch the video or read the article on this topic.

[0:40] First, we show how to perform basic data management with RHadoop. Let us create, step by step, a simple synthetic data frame called Data with two columns and 1000 rows. The first column, labelled ‘value’, contains random integers following a binomial distribution. The second column is called ‘group’ and contains random letters from ‘a’ to ‘e’. We store this data frame in the distributed file system by calling to.dfs with the name ‘SmallData’. We can see this DFS data file in the terminal window by typing hadoop fs -ls /. If we later want to retrieve this data file back into active memory, we simply call from.dfs.

[1:44] In this case we get a key-value pair with a void key and with a value equal to Data. When needed, we can also delete it from the distributed file system with dfs.rmr. Note that creating and storing big-data files is usually done outside R. Likewise, loading data into active memory by calling from.dfs should only be done for small data, or for small blocks of data within map calls.

[2:23] We continue with a demonstration of how to perform a simple data analysis with map-reduce. We will consider a not-so-big data file, which is already stored in the distributed file system in the root directory under the name CEnetBig. You can see it in the terminal window by typing hadoop fs -ls /.

[2:52] This file is small rather than big: it is only approximately 34 megabytes. Its structure in DFS can be observed by running the hdfs fsck command in the terminal window. The file is not decomposed into blocks, since it is not big enough and since we have only one data node. Because of its small size, we can load this data file into RStudio and explore it with a simple call of from.dfs. The file CEnetBig contains data about 1 million customers of the company CEnet. Here we can see the data for the first 10 customers. Each customer is represented by one row, with the customer’s id in the first column.

[3:45] The following twelve columns contain the customer’s bills for the period January 2016 – December 2016. The last column, named ‘type’, indicates which of five possible products the customer has. We can check that CEnetBig has 1 million rows and 14 columns with the command dim.

[4:14] Now we will perform a very simple task using map-reduce: we will count how many customers have a particular product.

[4:25] Actually, we can do this directly, without map-reduce, by a simple call of the function table.

[4:34] However, to perform this task using map-reduce, we need to apply table to every data block that is passed to the map function. The map function therefore counts the customers in a given data block that have a particular product and returns these frequencies as a key-value pair, where the key is always 1 and the value is a list whose only element is the table of frequencies. The reducer simply groups all the returned key-value pairs and performs a matrix sum over all the values, i.e., over all the frequency tables. The final map-reduce code for computing the distribution of customers by type of product is simple.

[5:37] The resulting BigTable is a key-value pair with only one key, which is 1, and with a value which is the desired frequency distribution of products among all customers in our database.

# First Big Data example with RHadoop

All code in the text below pertains to R unless stated otherwise.

### Synthetic data

We create the synthetic data frame called Data with two columns and 1000 rows by:

```r
N = 1000
X = rbinom(N, 5, 0.5)                                  # 'value': binomial random integers in 0..5
Y = sample(c("a","b","c","d","e"), N, replace = TRUE)  # 'group': random letters a-e
Data = data.frame(X, Y)
colnames(Data) = c("value", "group")
```
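
As a quick sanity check, we can recreate Data and verify its shape and value ranges in plain R (a sketch; set.seed is added here for reproducibility and is not part of the original code):

```r
set.seed(42)  # added for reproducibility (not in the original)
N = 1000
X = rbinom(N, 5, 0.5)
Y = sample(c("a","b","c","d","e"), N, replace = TRUE)
Data = data.frame(X, Y)
colnames(Data) = c("value", "group")

dim(Data)          # 1000 2
range(Data$value)  # binomial draws lie between 0 and 5
table(Data$group)  # counts of the letters a-e, summing to 1000
```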


This data frame can be stored in the distributed file system under the name ‘SmallData’ by calling to.dfs:

```r
# store Data in HDFS at /SmallData (native rmr2 serialisation)
to.dfs(Data, "/SmallData", format = "native")
```


To retrieve this data file back into active memory, simply call from.dfs:

```r
SmallData = from.dfs("/SmallData")
```


In this case we get a key-value pair with a void key and with a value equal to Data.
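
To inspect the retrieved pair, rmr2 provides the keys() and values() helpers; a minimal sketch, assuming SmallData was retrieved as above (this requires a running Hadoop backend):

```r
library(rmr2)

SmallData = from.dfs("/SmallData")
keys(SmallData)          # NULL, since the key is void
head(values(SmallData))  # the first rows of the original Data frame
```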
When needed, we can delete it from the distributed file system with dfs.rmr:

```r
dfs.rmr("/SmallData")
```


### CEnetBig example

We continue with a not-so-big data file, which is already stored in the distributed file system in the root directory under the name CEnetBig. To see it, type in the terminal window:

```shell
$ hadoop fs -ls /
```

This file is only approximately 34 megabytes. Its structure in DFS can be observed by running the following command in the terminal window:

```shell
$ hdfs fsck /CEnetBig
```


or in the R console:

```r
system("hdfs fsck /CEnetBig")
```


It is not decomposed into blocks, since it is not big enough and since we have only one data node (see the video).

The file CEnetBig contains data about 1 million customers of the company CEnet. We load it into active memory with:

```r
CEnetBig <- from.dfs("/CEnetBig")
```


Each customer is represented by one row, with the customer’s id in the first column. The following twelve columns contain the customer’s bills for the period January 2016 – December 2016. The last column, named ‘type’, indicates which of five possible products the customer has. We can check that CEnetBig has 1 million rows and 14 columns with:

```r
dim(CEnetBig$val)
```

We will perform a very simple task using map-reduce: we will count how many customers have a particular product. Actually, we can do this directly, without map-reduce, by a simple call of table:

```r
T = table(CEnetBig$val[, "type"]); T
```


However, to perform this task using map-reduce, we need to apply table to every data block that is passed to the map function. The map function therefore counts the customers in a given data block that have a particular product and returns these frequencies as a key-value pair, where the key is always 1 and the value is a list whose only element is the table of frequencies:

```r
mapperTable = function(., X) {
  # X is one block of rows; count customers per product type
  T = table(X[, "type"])
  keyval(1, list(T))
}
```


The reducer simply groups all the returned key-value pairs and performs a matrix sum over all the values, i.e., over all the frequency tables:

```r
reducerTable = function(k, A) {
  # A is the list of frequency tables collected for key k; sum them element-wise
  keyval(k, list(Reduce('+', A)))
}
```
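
The element-wise summing performed by Reduce('+', A) can be seen in plain R on two hand-made frequency tables (a toy illustration; the factor levels are fixed so that both tables cover the same five types and therefore align):

```r
# two per-block frequency tables over the same five product types
t1 = table(factor(c(1, 2, 2, 5), levels = 1:5))
t2 = table(factor(c(1, 3, 5, 5), levels = 1:5))

Reduce('+', list(t1, t2))
# 1 2 3 4 5
# 2 2 1 0 3
```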


The final map-reduce code for computing the distribution of customers by type of product is simple:

```r
BigTable <- from.dfs(
  mapreduce(
    input  = "/CEnetBig",
    map    = mapperTable,
    reduce = reducerTable
  )
)
```


The resulting BigTable is a key-value pair with only one key, which is 1, and with a value which is the desired frequency distribution of products among all customers in our database:

```r
BigTable$val
[[1]]

     1      2      3      4      5
199857 250569  99595 299849 150130
```
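
As a quick plausibility check, the five frequencies should add up to the 1 million customers in CEnetBig; in plain R:

```r
counts = c(199857, 250569, 99595, 299849, 150130)
sum(counts)  # 1000000 - one entry per customer
```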


## Alternative code

The code in the following example does essentially the same as the code before, i.e., with keys 1:5 it returns the total number of customers with the corresponding type = key in /CEnetBig.

```r
mapperTable = function(., X) {
  # per-block frequency table, converted to a list so that each
  # product type becomes its own key-value pair
  T = as.list(table(X[, "type"]))
  key = as.numeric(names(T))
  keyval(key, T)
}

reducerTable = function(k, A) {
  # A holds the per-block counts for type k; sum them
  S = Reduce('+', A)
  keyval(k, list(S))
}

BigTable1 <- from.dfs(
  mapreduce(
    input  = "/CEnetBig",
    map    = mapperTable,
    reduce = reducerTable
  )
)
```


This example should clarify that the reducer is called once per key: Reduce applies the operator (here '+') to the list of all values belonging to the same key, and keyval then returns the pair (key, result). The final result is again a key-value pair where:

- key contains all keys returned by the reducer, and
- value contains the corresponding values returned by the reducer.
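
The grouping step can be mimicked in plain R (a toy simulation of the shuffle phase, using two hand-made blocks and fixed factor levels so every type appears in each table; in RHadoop the grouping is done by Hadoop itself):

```r
# two toy data blocks with product types 1..5
blocks = list(c(1, 2, 2, 3), c(1, 1, 3, 5))

# "map": one frequency list per block
mapped = lapply(blocks, function(X) as.list(table(factor(X, levels = 1:5))))

# "shuffle": collect, for each key (type), the per-block counts
keys = as.character(1:5)
grouped = lapply(keys, function(k) lapply(mapped, `[[`, k))

# "reduce": sum the counts for each key
totals = sapply(grouped, function(A) Reduce('+', A))
names(totals) = keys
totals
# 1 2 3 4 5
# 3 2 2 0 1
```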