Skip main navigation

Example 2: Computing the within the group sum-of-squares

Here we demonstrate how to compute the within the group sum-of-squares matrix using `rmr2` library and `mapreduce`.
© PRACE and University of Ljubljana

Data

We consider again the customer data CEnetBig data, which is already stored in the HDFS. You can load it in R using:

CEnetBig=from.dfs("/CEnetBig")

Its data format is as follows:

 id 2016_1 2016_2 2016_3 2016_4 2016_5 2016_6 2016_7 2016_8 2016_9 2016_10 2016_11 2016_12 type
[1,] 100373 137.66 141.57 128.83 133.00 97.39 116.62 123.97 156.83 90.50 98.62 118.61 152.34 4
[2,] 100194 98.32 119.40 120.30 105.67 90.26 80.13 80.62 108.63 104.30 123.31 101.93 140.85 2
[3,] 100565 127.60 133.79 90.15 62.33 87.96 92.20 72.04 113.69 65.95 82.69 85.72 121.81 2
[4,] 100553 154.60 175.10 94.64 123.41 116.96 94.57 124.25 138.89 72.57 121.03 122.09 106.79 4
[5,] 100902 162.26 157.10 114.03 145.30 144.44 73.91 131.93 142.66 125.98 92.90 104.70 161.60 5
[6,] 100883 119.66 148.39 144.38 105.61 66.66 70.84 110.15 114.50 75.60 85.22 125.67 90.76 2
[7,] 100352 147.50 110.85 95.61 77.76 98.78 54.88 104.35 53.52 73.09 101.75 77.65 58.19 1
[8,] 100863 108.84 75.53 105.55 82.24 119.41 49.98 94.74 136.62 101.14 71.08 29.29 131.81 2
[9,] 100626 109.20 107.59 96.95 88.14 94.12 80.71 68.83 87.45 66.52 95.28 83.21 82.38 1
[10,] 100867 114.71 88.94 88.45 75.03 74.58 55.55 126.48 42.78 88.01 124.90 137.59 152.55 2

The last column therefore defines the groups and the interesting data is in columns 2016_1,…,2016_12.

Goal

Compute the within the group sum-of-squares (SS) matrix W, which is defined as the sum of group SS matrices S_i, where the (p,q) entry in S_i is defined as the scalar product of the p-th and q-th columns on group i after subtracting the group mean values for these two columns.
For example, if X1 is the matrix of data corresponding to the group 1 of CEnetBig data (already stored in DFS), we can compute the group SS matrix S_1 directly (without RHadoop, using the textbook formula) as (this is possible since the data is still not too big):

CEnetBig=from.dfs("/CEnetBig");
X1=CEnetBig$val[which(CEnetBig$val[,14]==1),2:13];
m=colMeans(X1);
n=nrow(X1);
M=matrix(rep(m,each=n),nrow=n);
S1=t(X1-M)%*%(X1-M);

Note that we can also compute S1 by cov(X1)*(n-1). Another way to compute S1 is also as (outer stands for the outer product of vectors):

S1=t(X1)%*%X1-n*outer(m,m);

Method

Note that the data is divided into several chunks and each chunk has data from all 5 groups.
To compute Si we first recall a result that can be found in any statistical textbook:
Si=Xi^T *Xi-mi^T*mi, where Xi is the data block containing the data for group i and mi is its mean vector (centroid).
We must therefore compute Xi^T *Xi over all the chunks of data. If Xi can be decomposed into blocks Xi1,...,Xik (k is the number of chunks), then Xi^T *Xi=Xi1^T *Xi1+...+Xik^T *Xik.
Each of Xij^T *Xij is computed with the map function. Likewise, the map function also computes the column sums of Xij and the corresponding numbers of rows in Xij, which will finally (in the reduce part) yield Xi^T *Xi and mi.

Map

The map function computes for each data chunk i the sizes of the groups, the group row sums and the group matrix products Xi1^T *Xi1. We actually use the knowledge that we have 5 groups coded with numbers 1,2,3,4,5 in the 14th column. The map function returns key-value pairs containing as key the group index and the corresponding value is a list containing: the group size, the group row sums and the groups matrix products Xi1^T *Xi1.

mapperSS = function (., X) {
n=nrow(X);
N=5;
n_vec=matrix(nrow = 1,ncol = 5); # vector of group sizes
sum_mat=matrix(nrow = 5,ncol=12); # matrix of group row sums
SS_tensor=array(dim=c(5,12,12)); # tensor containing SS matrices
for (i in 1:N){
Xi=subset.matrix(X[,2:13],X[,14]==i);
si=colSums(Xi);
ni=nrow(Xi);
SSi=t(Xi)%*%Xi;
n_vec[i]=ni;
sum_mat[i,]=si;
SS_tensor[i,,]=SSi;
}
keyval(1:3,list(n_vec,sum_mat,SS_tensor));
}

Reduce

In the reduce part we simply add the key values over all the key-value pairs.

reducerSS = function(k, A) {
keyval(k,list(Reduce('+', A)))
}

Map-Reduce

In this part we perform map-reduce on the data CEnetBig.

GroupRes <- from.dfs(
mapreduce(
input = "/CEnetBig",
map = mapperSS,
reduce = reducerSS,
combine = T
)
)

Final code

Here we finally compute the group means (centroids) and the SS matrices.

N=5 # 5 groups in CEnetBig data
K=12 # 12 relevant data variables
GroupMeans=matrix(nrow=N,ncol=K) # matrix containing mean vectors as rows
GroupSS=vector("list", N); # list with group SS matrices
for (i in 1:N){
GroupMeans[i,] <- GroupRes$val[[2]][i,]/GroupRes$val[[1]][i]
GroupSS[[i]] <- GroupRes$val[[3]][i,,]-GroupRes$val[[1]][i]*outer(GroupMeans[i,],GroupMeans[i,])
}
© PRACE and University of Ljubljana
This article is from the free online

Managing Big Data with R and Hadoop

Created by
FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education