[Figure: scatter plot of 3 groups and their centroids]
© PRACE and University of Ljubljana

Computing Centroids of Big Data

In this example we demonstrate how to compute group centroids using MapReduce from rmr2.
We consider data about customers of CEnet, stored in HDFS as /CEnetBig. The first 10 rows of the data are as follows:

 

Data

 

 id 2016_1 2016_2 2016_3 2016_4 2016_5 2016_6 2016_7 2016_8 2016_9 2016_10 2016_11 2016_12 type
 [1,] 100373 137.66 141.57 128.83 133.00 97.39 116.62 123.97 156.83 90.50 98.62 118.61 152.34 4
 [2,] 100194 98.32 119.40 120.30 105.67 90.26 80.13 80.62 108.63 104.30 123.31 101.93 140.85 2
 [3,] 100565 127.60 133.79 90.15 62.33 87.96 92.20 72.04 113.69 65.95 82.69 85.72 121.81 2
 [4,] 100553 154.60 175.10 94.64 123.41 116.96 94.57 124.25 138.89 72.57 121.03 122.09 106.79 4
 [5,] 100902 162.26 157.10 114.03 145.30 144.44 73.91 131.93 142.66 125.98 92.90 104.70 161.60 5
 [6,] 100883 119.66 148.39 144.38 105.61 66.66 70.84 110.15 114.50 75.60 85.22 125.67 90.76 2
 [7,] 100352 147.50 110.85 95.61 77.76 98.78 54.88 104.35 53.52 73.09 101.75 77.65 58.19 1
 [8,] 100863 108.84 75.53 105.55 82.24 119.41 49.98 94.74 136.62 101.14 71.08 29.29 131.81 2
 [9,] 100626 109.20 107.59 96.95 88.14 94.12 80.71 68.83 87.45 66.52 95.28 83.21 82.38 1
[10,] 100867 114.71 88.94 88.45 75.03 74.58 55.55 126.48 42.78 88.01 124.90 137.59 152.55 2

 

The first column holds the id of the customer, the next 12 columns hold the values of their bills for the months January-December 2016, and the last column (type) records which product (package) the customer has. This column defines the 5 groups for which we compute the centroids (the 12-dimensional vectors of mean values of columns 2016_1 to 2016_12).
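Before turning to MapReduce, it may help to see what a centroid is on a small in-memory sample. The following is a toy sketch in plain R with made-up values (two months instead of 12, not the real CEnetBig data): a group centroid is simply the vector of per-group column means.

```r
# Toy illustration of a centroid (made-up values, two months instead of 12):
# a group centroid is the vector of per-group column means.
X <- rbind(
  c(100373, 137.66, 141.57, 4),   # id, bill month 1, bill month 2, type
  c(100194,  98.32, 119.40, 2),
  c(100565, 127.60, 133.79, 2)
)
# mean of the bill columns within each type (last column)
centroids <- aggregate(X[, 2:3], by = list(type = X[, 4]), FUN = mean)
print(centroids)
```

This works only while the data fits in memory; for the full /CEnetBig file we need the map-reduce approach below.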

 

First we load our data. The data is already stored in HDFS (you got it with the virtual machine that you installed in Week 1). The full data file (/CEnetBig) can be loaded with:

 

CEnetBig <- from.dfs("/CEnetBig")

 

However, we avoid loading the full dataset into memory like this; instead we pass the HDFS path directly to the map-reduce function (see below).

 

Method

 

Note that each chunk of the distributed data contains rows from all 5 groups. We therefore compute the centroids of these groups with map-reduce as follows.

 

Map

 

For data chunk i the map function computes the group sums
s_1i, s_2i, ..., s_5i and the corresponding group sizes n_1i, n_2i, ..., n_5i.
The map function therefore returns key-value pairs (k, {n_ki, s_ki}), where k is the group label (1, 2, ..., 5), n_ki is the number of data rows of chunk i in the k-th group, and s_ki is the sum of all data rows of chunk i from the k-th group.

 

mapper = function(., X) {
  n = nrow(X)                                   # rows in this data chunk
  ones = matrix(rep(1, n), nrow = n, ncol = 1)  # counter column
  # per-group row counts and column sums, grouped by type (column 14)
  ag = aggregate(cbind(ones, X[, 2:13]), by = list(X[, 14]), FUN = "sum")
  key = factor(ag[, 1])                         # group labels 1..5
  keyval(key, split(ag[, -1], key))             # one (k, {n_ki, s_ki}) pair per group
}

 

Comments on the mapper code:

 

 

n=nrow(X)
  gives the number of rows in X, i.e., in the current chunk of CEnetBig (which has 1000000 rows in total)

ones=matrix(rep(1,n),nrow=n,ncol=1)
  creates an n x 1 matrix of ones, used as a counter during aggregation

ag=aggregate(cbind(ones,X[,2:13]),by=list(X[,14]),FUN="sum")
  sums the values of each column grouped by type (column 14);
  cbind(ones,X[,2:13]) combines the matrix ones with columns 2:13 of X;
  ag is a 5 x 14 data frame: (type, count, 2016_1, ..., 2016_12)

key=factor(ag[,1])
  the group labels 1 to 5 as a factor

keyval(key,split(ag[,-1],key))
  creates a key-value object (a collection of key-value pairs) from two R objects;
  ag[,-1] selects all columns of ag but the first;
  split(ag[,-1],key) splits ag into one row per group, keyed by group
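To see the shape of what each mapper emits, here is a standalone sketch of the aggregation step on simulated data with made-up values (no Hadoop or rmr2 needed):

```r
# Standalone sketch of the mapper's aggregation step on simulated data
# (made-up values; no Hadoop or rmr2 needed).
set.seed(1)
n <- 100
X <- cbind(100000 + 1:n,                              # column 1: customer id
           matrix(runif(12 * n, 50, 160), ncol = 12), # columns 2:13: bills
           sample(1:5, n, replace = TRUE))            # column 14: group (type)
ones <- matrix(rep(1, n), nrow = n, ncol = 1)         # counter column
ag <- aggregate(cbind(ones, X[, 2:13]), by = list(X[, 14]), FUN = "sum")
dim(ag)       # 5 rows (one per group) and 14 columns: type, count, 12 sums
sum(ag[, 2])  # the counts add up to n = 100
```

Each mapper in the real job performs exactly this aggregation on its own chunk of /CEnetBig before packaging the rows into key-value pairs.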

 

Reduce

 

The reduce part computes the final sums of the data rows for each group and returns the key-value pairs (k, {n_k, s_k}) for the whole dataset, where n_k and s_k are the totals of n_ki and s_ki over all chunks i.

 

reducer = function(k, A) {
  # element-wise sum of the partial (count, sums) rows collected for group k
  keyval(k, list(Reduce('+', A)))
}
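To make Reduce('+', A) concrete: A is the list of partial results for one group, and '+' adds them element-wise. A toy illustration with made-up numbers and only two months:

```r
# Toy illustration of Reduce('+', A): element-wise sum of the partial
# (count, sums) rows produced for one group by different chunks.
A <- list(
  c(3, 310.2, 295.7),   # chunk 1: 3 rows in the group, sums of two months
  c(2, 201.1, 190.4)    # chunk 2: 2 rows in the same group
)
total <- Reduce('+', A)
print(total)   # 5 rows in total, with the summed bills per month
```

In the actual job each element of A is a one-row data frame produced by split in the mapper, and '+' combines data frames element-wise in the same way.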

 

Map-Reduce

 

Once we have defined the map and reduce functions, we compute the group sums with mapreduce:

 

GroupSums <- from.dfs(
  mapreduce(
    input = "/CEnetBig",
    map = mapper,
    reduce = reducer,
    combine = TRUE
  )
)

 

Final Code

 

Finally, we compute the group centroids from the values of the key-value pairs. Note that the first entry of each value is the group size and the remaining entries form the row of group sums. To obtain the centroids we divide each row of group sums by the size of its group.

 

GroupSumsM <- matrix(unlist(GroupSums$val), ncol = 13, byrow = TRUE)
Centroids <- GroupSumsM[, -1] / GroupSumsM[, 1]
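The division in the last line relies on R's recycling rule: dividing a matrix by a vector divides each column by that vector, which here divides each row of group sums by its group size. A toy check with made-up numbers:

```r
# Toy check of the recycling division (made-up numbers).
GroupSumsM <- rbind(
  c(2, 20, 40),   # group of size 2 with column sums 20 and 40
  c(4, 40, 80)    # group of size 4 with column sums 40 and 80
)
Centroids <- GroupSumsM[, -1] / GroupSumsM[, 1]
print(Centroids)   # each row becomes (10, 20): sums divided by group size
```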

 

If you run the code you should obtain the following results:

 

> Centroids
 [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 109.9310 104.9678 94.98656 80.08108 80.00087 64.97981 84.93587 90.07869 70.00322 79.85767 89.94883 100.0320
[2,] 120.0249 115.0168 104.95868 89.96294 89.97693 74.99977 95.04289 100.05404 80.02519 90.05004 99.99359 109.9933
[3,] 130.0478 125.0739 115.14003 100.00783 99.90976 85.01050 105.03596 110.00461 90.08632 100.01509 110.03724 120.0029
[4,] 139.9501 135.0315 124.96959 110.03113 109.99999 94.94679 114.93834 120.03891 100.01716 110.02524 119.95535 129.9921
[5,] 149.9407 145.0123 135.06575 119.97880 120.06271 105.05000 124.97573 129.95944 109.92863 120.02247 130.02287 139.9866
This article is from the free online course Managing Big Data with R and Hadoop.
