# Example 2: Computing the within the group sum-of-squares

## Data

We consider again the customer data `CEnetBig`

data, which is already stored in the HDFS. You can load it in `R`

using:

```
CEnetBig=from.dfs("/CEnetBig")
```

Its data format is as follows:

```
id 2016_1 2016_2 2016_3 2016_4 2016_5 2016_6 2016_7 2016_8 2016_9 2016_10 2016_11 2016_12 type
[1,] 100373 137.66 141.57 128.83 133.00 97.39 116.62 123.97 156.83 90.50 98.62 118.61 152.34 4
[2,] 100194 98.32 119.40 120.30 105.67 90.26 80.13 80.62 108.63 104.30 123.31 101.93 140.85 2
[3,] 100565 127.60 133.79 90.15 62.33 87.96 92.20 72.04 113.69 65.95 82.69 85.72 121.81 2
[4,] 100553 154.60 175.10 94.64 123.41 116.96 94.57 124.25 138.89 72.57 121.03 122.09 106.79 4
[5,] 100902 162.26 157.10 114.03 145.30 144.44 73.91 131.93 142.66 125.98 92.90 104.70 161.60 5
[6,] 100883 119.66 148.39 144.38 105.61 66.66 70.84 110.15 114.50 75.60 85.22 125.67 90.76 2
[7,] 100352 147.50 110.85 95.61 77.76 98.78 54.88 104.35 53.52 73.09 101.75 77.65 58.19 1
[8,] 100863 108.84 75.53 105.55 82.24 119.41 49.98 94.74 136.62 101.14 71.08 29.29 131.81 2
[9,] 100626 109.20 107.59 96.95 88.14 94.12 80.71 68.83 87.45 66.52 95.28 83.21 82.38 1
[10,] 100867 114.71 88.94 88.45 75.03 74.58 55.55 126.48 42.78 88.01 124.90 137.59 152.55 2
```

The last column therefore defines the groups and the interesting data is in columns 2016_1,…,2016_12.

## Goal

Compute the within the group sum-of-squares (SS) matrix `W`

, which is defined as the sum of group SS matrices `S_i`

, where the `(p,q)`

entry in `S_i`

is defined as the scalar product of the `p`

-th and `q`

-th columns on group `i`

after subtracting the group mean values for these two columns.
For example, if `X1`

is the matrix of data corresponding to the group `1`

of `CEnetBig`

data (already stored in DFS), we can compute the group SS matrix `S_1`

directly (without `RHadoop`

, using the textbook formula) as (this is possible since the data is still not too big):

```
CEnetBig=from.dfs("/CEnetBig");
X1=CEnetBig$val[which(CEnetBig$val[,14]==1),2:13];
m=colMeans(X1);
n=nrow(X1);
M=matrix(rep(m,each=n),nrow=n);
S1=t(X1-M)%*%(X1-M);
```

Note that we can also compute `S1`

by `cov(X1)*(n-1)`

. Another way to compute `S1`

is also as (`outer`

stands for the outer product of vectors):

```
S1=t(X1)%*%X1-n*outer(m,m);
```

## Method

Note that the data is divided into several chunks and each chunk has data from all 5 groups.
To compute `Si`

we first recall a result that can be found in any statistical textbook:
`Si=Xi^T *Xi-mi^T*mi`

, where `Xi`

is the data block containing the data for group `i`

and `mi`

is its mean vector (centroid).
We must therefore compute `Xi^T *Xi`

over all the chunks of data. If `Xi`

can be decomposed into blocks `Xi1,...,Xik`

(`k`

is the number of chunks), then `Xi^T *Xi=Xi1^T *Xi1+...+Xik^T *Xik`

.
Each of `Xij^T *Xij`

is computed with the map function. Likewise, the map function also computes the column sums of `Xij`

and the corresponding numbers of rows in `Xij`

, which will finally (in the `reduce`

part) yield `Xi^T *Xi`

and `mi`

.

### Map

The `map`

function computes for each data chunk `i`

the sizes of the groups, the group row sums and the group matrix products `Xi1^T *Xi1`

. We actually use the knowledge that we have 5 groups coded with numbers 1,2,3,4,5 in the 14th column. The `map`

function returns key-value pairs containing as key the group index and the corresponding value is a list containing: the group size, the group row sums and the groups matrix products `Xi1^T *Xi1`

.

```
mapperSS = function (., X) {
n=nrow(X);
N=5;
n_vec=matrix(nrow = 1,ncol = 5); # vector of group sizes
sum_mat=matrix(nrow = 5,ncol=12); # matrix of group row sums
SS_tensor=array(dim=c(5,12,12)); # tensor containing SS matrices
for (i in 1:N){
Xi=subset.matrix(X[,2:13],X[,14]==i);
si=colSums(Xi);
ni=nrow(Xi);
SSi=t(Xi)%*%Xi;
n_vec[i]=ni;
sum_mat[i,]=si;
SS_tensor[i,,]=SSi;
}
keyval(1:3,list(n_vec,sum_mat,SS_tensor));
}
```

### Reduce

In the reduce part we simply add the key values over all the key-value pairs.

```
reducerSS = function(k, A) {
keyval(k,list(Reduce('+', A)))
}
```

### Map-Reduce

In this part we perform map-reduce on the data `CEnetBig`

.

```
GroupRes <- from.dfs(
mapreduce(
input = "/CEnetBig",
map = mapperSS,
reduce = reducerSS,
combine = T
)
)
```

## Final code

Here we finally compute the group means (centroids) and the SS matrices.

```
N=5 # 5 groups in CEnetBig data
K=12 # 12 relevant data variables
GroupMeans=matrix(nrow=N,ncol=K) # matrix containing mean vectors as rows
GroupSS=vector("list", N); # list with group SS matrices
for (i in 1:N){
GroupMeans[i,] <- GroupRes$val[[2]][i,]/GroupRes$val[[1]][i]
GroupSS[[i]] <- GroupRes$val[[3]][i,,]-GroupRes$val[[1]][i]*outer(GroupMeans[i,],GroupMeans[i,])
}
```

© PRACE and University of Ljubljana