We use cookies to give you a better experience, if that’s ok you can close this message and carry on browsing. For more info read our cookies policy.
3.11

# Example 2: computing the within the group sum-of-squares

## Data

We consider again the customers data CEnetBig data, which is already stored in the HDFS. You can load it in R using.

CEnetBig=from.dfs("/CEnetBig")


Its data format is the following:

         id 2016_1 2016_2 2016_3 2016_4 2016_5 2016_6 2016_7 2016_8 2016_9 2016_10 2016_11 2016_12 type
[1,] 100373 137.66 141.57 128.83 133.00  97.39 116.62 123.97 156.83  90.50   98.62  118.61  152.34    4
[2,] 100194  98.32 119.40 120.30 105.67  90.26  80.13  80.62 108.63 104.30  123.31  101.93  140.85    2
[3,] 100565 127.60 133.79  90.15  62.33  87.96  92.20  72.04 113.69  65.95   82.69   85.72  121.81    2
[4,] 100553 154.60 175.10  94.64 123.41 116.96  94.57 124.25 138.89  72.57  121.03  122.09  106.79    4
[5,] 100902 162.26 157.10 114.03 145.30 144.44  73.91 131.93 142.66 125.98   92.90  104.70  161.60    5
[6,] 100883 119.66 148.39 144.38 105.61  66.66  70.84 110.15 114.50  75.60   85.22  125.67   90.76    2
[7,] 100352 147.50 110.85  95.61  77.76  98.78  54.88 104.35  53.52  73.09  101.75   77.65   58.19    1
[8,] 100863 108.84  75.53 105.55  82.24 119.41  49.98  94.74 136.62 101.14   71.08   29.29  131.81    2
[9,] 100626 109.20 107.59  96.95  88.14  94.12  80.71  68.83  87.45  66.52   95.28   83.21   82.38    1
[10,] 100867 114.71  88.94  88.45  75.03  74.58  55.55 126.48  42.78  88.01  124.90  137.59  152.55    2


The last column therefore defines the groups and the interesting data is in columns 2016_1,…,2016_12.

## Goal

Compute the within the group sum-of-squares (SS) matrix W, which is defined as the sum of group SS matrices S_i, where the (p,q) entry in S_i is defined as the scalar product of p-th and q-th column on group i after subtracting the group mean values for these two columns. For example, if X1 is the matrix of data corresponding to the group 1 of CEnetBig data (already stored in DFS), we can compute the group SS matrix S_1 directly (without RHadoop, using textbook formula) as (this is possible since the data is still not too big):

CEnetBig=from.dfs("/CEnetBig");
X1=CEnetBig$val[which(CEnetBig$val[,14]==1),2:13];
m=colMeans(X1);
n=nrow(X1);
M=matrix(rep(m,each=n),nrow=n);
S1=t(X1-M)%*%(X1-M);


Note that we can compute S1 also by cov(X1)*(n-1). Another alternative way to compute S1 is also as (outer stands for the outer product of vectors):

S1=t(X1)%*%X1-n*outer(m,m);


## Method

Note that the data is divided into several chunks and each chunk has data from all 5 groups. To compute Si we first recall a result which can be found in any statistical textbook: Si=Xi^T *Xi-mi^T*mi, where Xi is the data block containing data for group i and mi is its mean vector (centroid). We must therefore compute Xi^T *Xi over all chunks of data. If Xi can be decomposed into blocks Xi1,...,Xik (k is the number of chunks), then Xi^T *Xi=Xi1^T *Xi1+...+Xik^T *Xik. Each of Xij^T *Xij is computed with map function. Likewise map function computes also the column sums of Xij and the corresponding numbers of rows in Xij, which will finally (in the reduce part) yield Xi^T *Xi and mi.

### Map

The map function computes for each data chunk i the sizes of groups, the group row sums and the group matrix products Xi1^T *Xi1. We actually use the knowledge that we have 5 groups coded with number 1,2,…5 in the 14th column. The map function returns key-value pairs containing as key the group index and the corresponding value is list containing: the group size, the group row sums and the groups matrix products Xi1^T *Xi1.

mapperSS = function (., X) {
n=nrow(X);
N=5;
n_vec=matrix(nrow = 1,ncol = 5);       # vector of group sizes
sum_mat=matrix(nrow = 5,ncol=12);      # matrix of group row sums
SS_tensor=array(dim=c(5,12,12));       # tensor containing SS matrices
for (i in 1:N){
Xi=subset.matrix(X[,2:13],X[,14]==i);
si=colSums(Xi);
ni=nrow(Xi);
SSi=t(Xi)%*%Xi;
n_vec[i]=ni;
sum_mat[i,]=si;
SS_tensor[i,,]=SSi;
}
keyval(1:3,list(n_vec,sum_mat,SS_tensor));
}


### Reduce

In the reduce part we simply add the key values over all key-value pairs.

reducerSS = function(k, A) {
keyval(k,list(Reduce('+', A)))
}


### Map-Reduce

In this part we perform map-reduce on the data CEnetBig

GroupRes <-   from.dfs(
mapreduce(
input = "/CEnetBig",
combine = T
)
)


## Final code

Here we finally compute the group means (centroids) and the SS matrices

N=5     # 5 groups in CEnetBig data
K=12    # 12 relevant data variables
GroupMeans=matrix(nrow=N,ncol=K)  # matrix containing mean vectors as rows
GroupSS=vector("list", N);                   # list with group SS matrices
for (i in 1:N){
GroupMeans[i,] <- GroupRes$val[[2]][i,]/GroupRes$val[[1]][i]
GroupSS[[i]] <- GroupRes$val[[3]][i,,]-GroupRes$val[[1]][i]*outer(GroupMeans[i,],GroupMeans[i,])
}