Example 2: computing the within-group sum-of-squares
Data
We again consider the customer data set CEnetBig, which is already stored in HDFS. You can load it into R using:
CEnetBig=from.dfs("/CEnetBig")
Its data format is the following:
id 2016_1 2016_2 2016_3 2016_4 2016_5 2016_6 2016_7 2016_8 2016_9 2016_10 2016_11 2016_12 type
[1,] 100373 137.66 141.57 128.83 133.00 97.39 116.62 123.97 156.83 90.50 98.62 118.61 152.34 4
[2,] 100194 98.32 119.40 120.30 105.67 90.26 80.13 80.62 108.63 104.30 123.31 101.93 140.85 2
[3,] 100565 127.60 133.79 90.15 62.33 87.96 92.20 72.04 113.69 65.95 82.69 85.72 121.81 2
[4,] 100553 154.60 175.10 94.64 123.41 116.96 94.57 124.25 138.89 72.57 121.03 122.09 106.79 4
[5,] 100902 162.26 157.10 114.03 145.30 144.44 73.91 131.93 142.66 125.98 92.90 104.70 161.60 5
[6,] 100883 119.66 148.39 144.38 105.61 66.66 70.84 110.15 114.50 75.60 85.22 125.67 90.76 2
[7,] 100352 147.50 110.85 95.61 77.76 98.78 54.88 104.35 53.52 73.09 101.75 77.65 58.19 1
[8,] 100863 108.84 75.53 105.55 82.24 119.41 49.98 94.74 136.62 101.14 71.08 29.29 131.81 2
[9,] 100626 109.20 107.59 96.95 88.14 94.12 80.71 68.83 87.45 66.52 95.28 83.21 82.38 1
[10,] 100867 114.71 88.94 88.45 75.03 74.58 55.55 126.48 42.78 88.01 124.90 137.59 152.55 2
The last column (type) therefore defines the groups, and the data of interest are in columns 2016_1,…,2016_12.
Goal
Compute the within-group sum-of-squares (SS) matrix W, which is defined as the sum of the group SS matrices S_i, where the (p,q) entry of S_i is the scalar product of the p-th and q-th columns restricted to group i, after subtracting the group means of these two columns.
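In symbols, the definition above can be written as follows. The notation is introduced here only for illustration: G_i is the set of rows belonging to group i, x_{rp} is the value of column p in row r, \bar{x}_{ip} is the mean of column p over group i, and g is the number of groups.

```latex
W = \sum_{i=1}^{g} S_i, \qquad
(S_i)_{pq} = \sum_{r \in G_i} (x_{rp} - \bar{x}_{ip})\,(x_{rq} - \bar{x}_{iq}).
```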
For example, if X1 is the data matrix for group 1 of the CEnetBig data (already stored in the DFS), we can compute the group SS matrix S_1 directly, without RHadoop, using the textbook formula (this is possible because the data are still not too big):
CEnetBig=from.dfs("/CEnetBig");
X1=CEnetBig$val[which(CEnetBig$val[,14]==1),2:13];
m=colMeans(X1);
n=nrow(X1);
M=matrix(rep(m,each=n),nrow=n);
S1=t(X1-M)%*%(X1-M);
Note that S1 can also be computed as cov(X1)*(n-1). Another way to compute S1 is the following (outer stands for the outer product of vectors):
S1=t(X1)%*%X1-n*outer(m,m);
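The three expressions for S1 can be compared on a small synthetic matrix, without any access to HDFS. This is only a sketch: the random matrix below is a made-up stand-in for X1, and the names S1_centered, S1_cov and S1_outer are introduced here for illustration.

```r
# Synthetic stand-in for X1 (assumption: 20 rows, 12 columns of made-up values)
set.seed(1)
X1 <- matrix(rnorm(20 * 12, mean = 100, sd = 25), nrow = 20)
m <- colMeans(X1)
n <- nrow(X1)
M <- matrix(rep(m, each = n), nrow = n)

S1_centered <- t(X1 - M) %*% (X1 - M)          # textbook formula
S1_cov      <- cov(X1) * (n - 1)               # via the sample covariance
S1_outer    <- t(X1) %*% X1 - n * outer(m, m)  # via the uncentered product

# Compare up to floating-point tolerance
agree <- isTRUE(all.equal(S1_centered, S1_cov)) &&
         isTRUE(all.equal(S1_centered, S1_outer))
print(agree)
```

The last form is the one that matters for map-reduce, because t(X1)%*%X1, the column sums and n can each be accumulated chunk by chunk.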
Method
Note that the data is divided into several chunks and each chunk has data from all 5 groups.
To compute Si we first recall a result that can be found in any statistics textbook: Si=Xi^T *Xi-ni*mi^T*mi, where Xi is the data block containing the rows of group i, ni is the number of rows in Xi and mi is its mean vector (centroid), written as a row vector so that mi^T*mi is the outer product.
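Expanding the centered product makes the factor n_i explicit and matches the R expression t(X1)%*%X1-n*outer(m,m) used earlier. A short derivation, under notation assumed here: n_i is the number of rows of X_i, \mathbf{1} is the column vector of n_i ones, and m_i = \frac{1}{n_i}\mathbf{1}^T X_i is the mean written as a row vector.

```latex
\begin{aligned}
S_i &= (X_i - \mathbf{1} m_i)^T (X_i - \mathbf{1} m_i) \\
    &= X_i^T X_i - X_i^T \mathbf{1}\, m_i - m_i^T \mathbf{1}^T X_i + m_i^T \mathbf{1}^T \mathbf{1}\, m_i \\
    &= X_i^T X_i - n_i\, m_i^T m_i - n_i\, m_i^T m_i + n_i\, m_i^T m_i \\
    &= X_i^T X_i - n_i\, m_i^T m_i .
\end{aligned}
```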
We must therefore compute Xi^T *Xi over all chunks of data. If Xi is split row-wise into blocks Xi1,...,Xik (k is the number of chunks), then Xi^T *Xi=Xi1^T *Xi1+...+Xik^T *Xik.
Each Xij^T *Xij is computed by the map function. The map function also computes the column sums of Xij and the number of rows in Xij, which finally (in the reduce part) yield Xi^T *Xi and mi.
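The chunk-wise accumulation can be simulated locally in plain R for a single group. This is a minimal sketch that does not use RHadoop: the data are random, the split into 3 chunks is hypothetical, and the names per_chunk, n_i, s_i and P_i are introduced only for this illustration.

```r
# Local simulation of the map and reduce steps for one group (no Hadoop involved)
set.seed(2)
Xi <- matrix(rnorm(30 * 12, mean = 100, sd = 25), nrow = 30)  # made-up group data
chunk_id <- rep(1:3, each = 10)                               # hypothetical split into k = 3 chunks

# "Map": per-chunk row count, column sums and uncentered product
per_chunk <- lapply(1:3, function(j) {
  Xij <- Xi[chunk_id == j, , drop = FALSE]
  list(n = nrow(Xij), s = colSums(Xij), P = t(Xij) %*% Xij)
})

# "Reduce": add the per-chunk statistics
n_i <- Reduce(`+`, lapply(per_chunk, `[[`, "n"))
s_i <- Reduce(`+`, lapply(per_chunk, `[[`, "s"))
P_i <- Reduce(`+`, lapply(per_chunk, `[[`, "P"))

m_i <- s_i / n_i                      # group mean vector
S_i <- P_i - n_i * outer(m_i, m_i)    # group SS matrix

# Compare with the direct computation on the undivided data
ok <- isTRUE(all.equal(S_i, cov(Xi) * (n_i - 1)))
print(ok)
```

Only sums are passed from the chunks to the final step, so the result does not depend on how the rows are distributed over the chunks.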
Map
The map function computes, for each data chunk and each group i, the group size, the group column sums and the group matrix product Xij^T *Xij (j is the index of the chunk). We use the knowledge that there are 5 groups, coded 1,2,…,5 in the 14th column. The map function returns three key-value pairs: the keys 1, 2 and 3 identify the kind of statistic, and the corresponding values are the vector of group sizes, the matrix of group column sums and the array of group matrix products Xij^T *Xij.
mapperSS = function (., X) {
  N=5;   # number of groups, coded 1,...,5 in column 14
  K=12;  # number of data columns (columns 2 to 13)
  n_vec=matrix(nrow = 1,ncol = N);    # vector of group sizes
  sum_mat=matrix(nrow = N,ncol = K);  # matrix of group column sums
  SS_tensor=array(dim=c(N,K,K));      # array of group matrix products t(Xi)%*%Xi
  for (i in 1:N){
    Xi=subset.matrix(X[,2:13],X[,14]==i);  # rows of group i in this chunk
    si=colSums(Xi);
    ni=nrow(Xi);
    SSi=t(Xi)%*%Xi;
    n_vec[i]=ni;
    sum_mat[i,]=si;
    SS_tensor[i,,]=SSi;
  }
  # keys 1,2,3 label the three kinds of statistics
  keyval(1:3,list(n_vec,sum_mat,SS_tensor));
}
Reduce
In the reduce part we simply add up, for each key, the values from all key-value pairs with that key.
reducerSS = function(k, A) {
  # A is the list of values collected for key k; add them element-wise
  keyval(k,list(Reduce('+', A)))
}
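The call Reduce('+', A) folds a list by repeated element-wise addition, so it works the same way for the vector of group sizes, the matrix of column sums and the array of matrix products, as long as all elements of the list have the same dimensions. A small sketch with made-up numbers (the list A below is hypothetical and stands for group sizes coming from two chunks):

```r
# Two made-up partial results for the same key, e.g. group sizes from two chunks
A <- list(matrix(c(3, 5, 2, 4, 6), nrow = 1),
          matrix(c(1, 0, 7, 2, 3), nrow = 1))

total <- Reduce('+', A)   # element-wise sum of all list elements
print(total)
```

Because addition is associative and commutative, the same function can presumably also serve as a combiner, which is what the combine option in the mapreduce call relies on.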
Map-Reduce
In this part we run map-reduce on the CEnetBig data:
GroupRes <- from.dfs(
  mapreduce(
    input = "/CEnetBig",
    map = mapperSS,
    reduce = reducerSS,
    combine = TRUE  # also apply the reducer as a combiner
  )
)
Final code
Here we finally compute the group means (centroids) and the group SS matrices:
N=5   # 5 groups in CEnetBig data
K=12  # 12 relevant data variables
GroupMeans=matrix(nrow=N,ncol=K)  # matrix containing mean vectors as rows
GroupSS=vector("list", N);        # list with group SS matrices
# GroupRes$val is assumed to be ordered by key: [[1]] sizes, [[2]] column sums, [[3]] matrix products
for (i in 1:N){
  GroupMeans[i,] <- GroupRes$val[[2]][i,]/GroupRes$val[[1]][i]
  GroupSS[[i]] <- GroupRes$val[[3]][i,,]-GroupRes$val[[1]][i]*outer(GroupMeans[i,],GroupMeans[i,])
}
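The stated goal is the within-group matrix W, which is the sum of the group SS matrices. With the GroupSS list from the loop above, a single call to Reduce should be enough. The sketch below uses a synthetic stand-in for GroupSS (made-up random data, not the CEnetBig results) so that it can be run without Hadoop.

```r
# Hypothetical stand-in for GroupSS: a list of N = 5 made-up 12 x 12 group SS matrices
set.seed(3)
N <- 5
K <- 12
GroupSS <- lapply(1:N, function(i) {
  Xi <- matrix(rnorm(15 * K), nrow = 15)  # made-up data for group i
  cov(Xi) * (nrow(Xi) - 1)                # its group SS matrix
})

# Within-group SS matrix: sum of the group SS matrices
W <- Reduce(`+`, GroupSS)
print(dim(W))
```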
© PRACE and University of Ljubljana