Example 2: Computing the within-group sum of squares

Data

We again consider the customer data set CEnetBig, which is already stored in HDFS. You can load it into R with:

CEnetBig=from.dfs("/CEnetBig")

Its data format is as follows:

         id 2016_1 2016_2 2016_3 2016_4 2016_5 2016_6 2016_7 2016_8 2016_9 2016_10 2016_11 2016_12 type
 [1,] 100373 137.66 141.57 128.83 133.00  97.39 116.62 123.97 156.83  90.50   98.62  118.61  152.34    4
 [2,] 100194  98.32 119.40 120.30 105.67  90.26  80.13  80.62 108.63 104.30  123.31  101.93  140.85    2
 [3,] 100565 127.60 133.79  90.15  62.33  87.96  92.20  72.04 113.69  65.95   82.69   85.72  121.81    2
 [4,] 100553 154.60 175.10  94.64 123.41 116.96  94.57 124.25 138.89  72.57  121.03  122.09  106.79    4
 [5,] 100902 162.26 157.10 114.03 145.30 144.44  73.91 131.93 142.66 125.98   92.90  104.70  161.60    5
 [6,] 100883 119.66 148.39 144.38 105.61  66.66  70.84 110.15 114.50  75.60   85.22  125.67   90.76    2
 [7,] 100352 147.50 110.85  95.61  77.76  98.78  54.88 104.35  53.52  73.09  101.75   77.65   58.19    1
 [8,] 100863 108.84  75.53 105.55  82.24 119.41  49.98  94.74 136.62 101.14   71.08   29.29  131.81    2
 [9,] 100626 109.20 107.59  96.95  88.14  94.12  80.71  68.83  87.45  66.52   95.28   83.21   82.38    1
[10,] 100867 114.71  88.94  88.45  75.03  74.58  55.55 126.48  42.78  88.01  124.90  137.59  152.55    2

The last column (type) defines the groups, and the data of interest are in columns 2016_1,…,2016_12.

Goal

Compute the within-group sum-of-squares (SS) matrix W, defined as the sum of the group SS matrices S_i. The (p,q) entry of S_i is the scalar product of the p-th and q-th data columns restricted to group i, after subtracting the group means of these two columns. For example, if X1 is the matrix of data for group 1 of CEnetBig (already stored in the DFS), we can compute the group SS matrix S_1 directly with the textbook formula, without RHadoop, because the data is not yet too big:

CEnetBig=from.dfs("/CEnetBig");
X1=CEnetBig$val[which(CEnetBig$val[,14]==1),2:13];
m=colMeans(X1);
n=nrow(X1);
M=matrix(rep(m,each=n),nrow=n);
S1=t(X1-M)%*%(X1-M);

Note that S1 can also be computed as cov(X1)*(n-1). A third way is the following, where outer denotes the outer product of vectors:

S1=t(X1)%*%X1-n*outer(m,m);
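
The three expressions for S1 can be checked against each other on a small example. The sketch below is only an illustration: X1 here is random stand-in data (an assumption, not the CEnetBig data), and the seed and dimensions are arbitrary.

```r
# Synthetic stand-in for one group's data block (NOT the CEnetBig data)
set.seed(1)
X1 <- matrix(rnorm(20 * 3, mean = 100, sd = 20), nrow = 20, ncol = 3)

n <- nrow(X1)
m <- colMeans(X1)

# (a) textbook formula: centre the columns, then take the cross product
M <- matrix(rep(m, each = n), nrow = n)
S1_centred <- t(X1 - M) %*% (X1 - M)

# (b) via the sample covariance matrix
S1_cov <- cov(X1) * (n - 1)

# (c) uncentred cross product minus n times the outer product of the means
S1_outer <- t(X1) %*% X1 - n * outer(m, m)

# Compare the three results up to floating-point rounding
print(all.equal(S1_centred, S1_cov))
print(all.equal(S1_centred, S1_outer))
```

Form (c) is the one that matters for a distributed computation, because it needs only quantities that can be accumulated row by row: the cross product, the column sums and the row count.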

Method

Note that the data is divided into several chunks and each chunk contains data from all 5 groups. To compute Si we first recall a standard result from statistics textbooks: Si = Xi^T*Xi - ni*mi*mi^T, where Xi is the data block containing the data for group i, ni is its number of rows and mi is its mean vector (centroid), written as a column vector. We must therefore compute Xi^T*Xi over all the chunks of data. If Xi is decomposed into blocks Xi1,...,Xik (k is the number of chunks), then Xi^T*Xi = Xi1^T*Xi1 + ... + Xik^T*Xik. Each Xij^T*Xij is computed by the map function. The map function also computes the column sums of Xij and the number of rows of Xij. Summed over the chunks in the reduce part, these give Xi^T*Xi, the total column sums and ni; dividing the column sums by ni then gives mi.
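
The identity behind this method can be derived in a few lines. The notation is introduced only for this derivation: n_i is the number of rows of X_i, \mathbf{1} is the column vector of n_i ones, and m_i is the column vector of column means, so that X_i^T \mathbf{1} = n_i m_i.

```latex
\begin{aligned}
S_i &= (X_i - \mathbf{1} m_i^T)^T (X_i - \mathbf{1} m_i^T) \\
    &= X_i^T X_i - X_i^T \mathbf{1}\, m_i^T - m_i \mathbf{1}^T X_i + m_i \mathbf{1}^T \mathbf{1}\, m_i^T \\
    &= X_i^T X_i - n_i m_i m_i^T - n_i m_i m_i^T + n_i m_i m_i^T \\
    &= X_i^T X_i - n_i m_i m_i^T .
\end{aligned}
```

If the rows of X_i are spread over chunks X_{i1}, ..., X_{ik} with n_{ij} rows each, all three ingredients are plain sums over the chunks:

```latex
X_i^T X_i = \sum_{j=1}^{k} X_{ij}^T X_{ij}, \qquad
n_i = \sum_{j=1}^{k} n_{ij}, \qquad
m_i = \frac{1}{n_i} \sum_{j=1}^{k} X_{ij}^T \mathbf{1}_{n_{ij}} .
```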

Map

For each data chunk j, the map function computes, for every group i, the group size, the group column sums and the matrix product Xij^T*Xij. We use the knowledge that there are 5 groups, coded with the numbers 1,2,3,4,5 in the 14th column. The map function returns three key-value pairs, where the key identifies the statistic: key 1 holds the vector of group sizes, key 2 the matrix of group column sums (one row per group) and key 3 the array of group matrix products Xij^T*Xij.

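mapperSS = function (., X) {
  N=5;                                   # number of groups, coded 1..5 in column 14
  K=12;                                  # number of data columns (columns 2..13)
  n_vec=matrix(nrow = 1,ncol = N);       # vector of group sizes
  sum_mat=matrix(nrow = N,ncol = K);     # matrix of group column sums, one row per group
  SS_tensor=array(dim=c(N,K,K));         # array of uncentred products t(Xi)%*%Xi
  for (i in 1:N){
    Xi=subset.matrix(X[,2:13],X[,14]==i);  # rows of group i in this chunk
    si=colSums(Xi);
    ni=nrow(Xi);
    SSi=t(Xi)%*%Xi;
    n_vec[i]=ni;
    sum_mat[i,]=si;
    SS_tensor[i,,]=SSi;
  }
  # keys 1,2,3 label the three statistics so that each is summed separately
  keyval(1:3,list(n_vec,sum_mat,SS_tensor));
}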

Reduce

In the reduce part we simply sum, for each key, the values coming from all the chunks.

reducerSS = function(k, A) {
  keyval(k,list(Reduce('+', A)))
}
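
The map and reduce logic can be rehearsed in plain R before running it on Hadoop. The following sketch makes several assumptions: the data is synthetic, the chunking is done by hand, and chunk_stats is a hypothetical stand-in for the map step. It does not call rmr2 at all.

```r
# Synthetic data: 60 rows, K variables, group label in the last column
# (NOT the CEnetBig data)
set.seed(42)
N <- 2; K <- 3
X <- cbind(matrix(rnorm(60 * K), ncol = K), sample(rep(1:N, 30)))

# Hypothetical stand-in for the map step: per-chunk statistics for every group
chunk_stats <- function(chunk) {
  n_vec <- numeric(N)
  sum_mat <- matrix(0, N, K)
  SS_tensor <- array(0, dim = c(N, K, K))
  for (i in 1:N) {
    Xi <- chunk[chunk[, K + 1] == i, 1:K, drop = FALSE]
    n_vec[i] <- nrow(Xi)
    sum_mat[i, ] <- colSums(Xi)
    SS_tensor[i, , ] <- crossprod(Xi)   # same as t(Xi) %*% Xi
  }
  list(n_vec, sum_mat, SS_tensor)
}

# Split the rows into 3 chunks by hand and "map" each chunk
row_sets <- split(seq_len(nrow(X)), rep(1:3, each = 20))
chunks <- lapply(row_sets, function(idx) X[idx, , drop = FALSE])
mapped <- lapply(chunks, chunk_stats)

# "Reduce": for each of the three statistics, add the per-chunk results
reduced <- lapply(1:3, function(k) Reduce("+", lapply(mapped, "[[", k)))

# Compare with the same statistics computed on the full data in one pass
direct <- chunk_stats(X)
print(all.equal(reduced, direct))
```

Because every statistic is a plain sum over rows, the order and the size of the chunks do not matter.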

Map-Reduce

In this part we run map-reduce on the CEnetBig data:

GroupRes <-   from.dfs(
  mapreduce(
    input = "/CEnetBig",
    map = mapperSS,
    reduce = reducerSS,
    combine = TRUE
  )
)

Final code

Finally, we compute the group means (centroids) and the group SS matrices:

N=5     # 5 groups in CEnetBig data
K=12    # 12 relevant data variables 
GroupMeans=matrix(nrow=N,ncol=K)  # matrix containing mean vectors as rows
GroupSS=vector("list", N);                   # list with group SS matrices
for (i in 1:N){
  GroupMeans[i,] <- GroupRes$val[[2]][i,]/GroupRes$val[[1]][i]  
  GroupSS[[i]] <- GroupRes$val[[3]][i,,]-GroupRes$val[[1]][i]*outer(GroupMeans[i,],GroupMeans[i,])
}
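
The goal stated at the start was the within-group matrix W, the sum of the group SS matrices. Given a list such as GroupSS, one way to obtain it is a single Reduce("+", GroupSS) call. The sketch below illustrates this on synthetic data (an assumption; X and g are illustrative names, not the CEnetBig objects) and cross-checks the result against the residual cross-product matrix of a one-way linear model, which equals W.

```r
# Synthetic data: 3 groups of unequal size, 4 variables (NOT the CEnetBig data)
set.seed(7)
N <- 3; K <- 4
g <- rep(1:N, times = c(10, 15, 12))
X <- matrix(rnorm(length(g) * K, mean = 100, sd = 15), ncol = K)

# Group SS matrices via S_i = t(X_i) %*% X_i - n_i * outer(m_i, m_i)
GroupSS <- vector("list", N)
for (i in 1:N) {
  Xi <- X[g == i, , drop = FALSE]
  ni <- nrow(Xi)
  mi <- colMeans(Xi)
  GroupSS[[i]] <- crossprod(Xi) - ni * outer(mi, mi)
}

# Within-group SS matrix: elementwise sum of the group SS matrices
W <- Reduce("+", GroupSS)

# Cross-check: residuals of a one-way model are the group-centred data,
# so their cross product is the within-group SS matrix
W_lm <- crossprod(residuals(lm(X ~ factor(g))))
print(all.equal(W, W_lm, check.attributes = FALSE))
```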

This article is from the free online course:

Managing Big Data with R and Hadoop

Partnership for Advanced Computing in Europe (PRACE)
