# Example 1: computing group centroids

## Computing centroids of big data

In this example we demonstrate how to compute group centroids using `mapreduce` from `rmr2`. We consider the data about customers of CEnet, stored in dfs as `/CEnetBig`. The first 10 rows of the data are as follows:

## Data

```
id 2016_1 2016_2 2016_3 2016_4 2016_5 2016_6 2016_7 2016_8 2016_9 2016_10 2016_11 2016_12 type
[1,] 100373 137.66 141.57 128.83 133.00 97.39 116.62 123.97 156.83 90.50 98.62 118.61 152.34 4
[2,] 100194 98.32 119.40 120.30 105.67 90.26 80.13 80.62 108.63 104.30 123.31 101.93 140.85 2
[3,] 100565 127.60 133.79 90.15 62.33 87.96 92.20 72.04 113.69 65.95 82.69 85.72 121.81 2
[4,] 100553 154.60 175.10 94.64 123.41 116.96 94.57 124.25 138.89 72.57 121.03 122.09 106.79 4
[5,] 100902 162.26 157.10 114.03 145.30 144.44 73.91 131.93 142.66 125.98 92.90 104.70 161.60 5
[6,] 100883 119.66 148.39 144.38 105.61 66.66 70.84 110.15 114.50 75.60 85.22 125.67 90.76 2
[7,] 100352 147.50 110.85 95.61 77.76 98.78 54.88 104.35 53.52 73.09 101.75 77.65 58.19 1
[8,] 100863 108.84 75.53 105.55 82.24 119.41 49.98 94.74 136.62 101.14 71.08 29.29 131.81 2
[9,] 100626 109.20 107.59 96.95 88.14 94.12 80.71 68.83 87.45 66.52 95.28 83.21 82.38 1
[10,] 100867 114.71 88.94 88.45 75.03 74.58 55.55 126.48 42.78 88.01 124.90 137.59 152.55 2
```

In the first column we have the `id` of the customer, in the next 12 columns we have the values of their bills for the months January-December 2016, and the last column (`type`) contains the product (package) that the customer has. This column defines 5 groups for which we compute the centroids (the 12-dimensional vectors with the mean values of columns `2016_1` - `2016_12`).
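To make the notion of a group centroid concrete, here is a minimal in-memory sketch in base R. The toy matrix `X`, its column names and all values are made up for illustration, and it uses only 3 bill columns instead of 12; it is not the CEnet data.

```r
# Toy data: 6 customers, 3 monthly bills (instead of 12) and a group label.
# All values are made up for illustration.
X <- cbind(
  id   = 1:6,
  m1   = c(10, 20, 30, 40, 50, 60),
  m2   = c(1, 2, 3, 4, 5, 6),
  m3   = c(5, 5, 5, 7, 7, 7),
  type = c(1, 1, 2, 2, 2, 1)
)

# Centroid of a group = column means of the rows belonging to that group.
# rowsum() gives the per-group column sums, table() the per-group sizes.
bills <- X[, c("m1", "m2", "m3")]
centroids <- rowsum(bills, group = X[, "type"]) / as.vector(table(X[, "type"]))
print(centroids)
```

Each row of `centroids` is the mean bill vector of one group. The MapReduce version computes the same quantity, but accumulates the sums and sizes chunk by chunk.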

First we load our data. The data is already stored in HDFS (it came with the virtual machine that you have installed). We have the full data file (`MyBigData`), which we can load with

```
library(rmr2)  # provides from.dfs, mapreduce and keyval
CEnetBig <- from.dfs("/CEnetBig")
```

However, we avoid loading the whole data set into memory like this and instead pass the HDFS path directly to the `mapreduce` function (see below).

## Method

Note that each data chunk contains rows from all 5 groups. Therefore we compute the centroids of these groups with MAP-REDUCE as follows.

### Map

For data chunk `i`, the `MAP` function computes the group sums within that chunk, `s_1i, s_2i,...,s_5i`, and the corresponding group sizes, `n_1i,n_2i,...,n_5i`. The map function therefore returns key-value pairs `(k,{n_ki, s_ki})`, where `k` is the group label (1,2,…,5), `n_ki` is the number of data rows of the `k`-th group in chunk `i`, and `s_ki` is the sum of all data rows of the `k`-th group in chunk `i`.

```
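mapper = function (., X) {
  n = nrow(X)
  # column of ones: summing it per group gives the group sizes n_ki
  ones = matrix(rep(1, n), nrow = n, ncol = 1)
  # per-group sums of the ones column and the 12 bill columns (2..13);
  # column 14 holds the group label (type)
  ag = aggregate(cbind(ones, X[, 2:13]), by = list(X[, 14]), FUN = "sum")
  key = factor(ag[, 1])
  # one value per key: the row (n_ki, s_ki) for that group
  keyval(key, split(ag[, -1], key))
}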
```
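The aggregation step inside the map function can be tried locally, without Hadoop or `rmr2`. The following sketch runs the same `aggregate` and `split` calls on a toy chunk; the matrix `chunk`, its 3 bill columns and its values are assumptions made for illustration, and the `keyval` call is left out because it belongs to `rmr2`.

```r
# One toy "chunk": id, three bill columns, group label (made-up values).
chunk <- cbind(
  id   = 1:4,
  m1   = c(10, 20, 30, 40),
  m2   = c(1, 2, 3, 4),
  m3   = c(5, 5, 7, 7),
  type = c(1, 2, 1, 2)
)

n <- nrow(chunk)
ones <- matrix(1, nrow = n, ncol = 1)

# Per-group sums of the ones column (group size) and the bill columns.
ag <- aggregate(cbind(ones, chunk[, 2:4]), by = list(chunk[, 5]), FUN = "sum")
key <- factor(ag[, 1])

# One one-row data frame per group: (size, sum of m1, sum of m2, sum of m3).
vals <- split(ag[, -1], key)
print(vals)
```

The first entry of each value is the group size within the chunk; the remaining entries are the column sums for that group.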

### Reduce

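The `REDUCE` part adds up, for each group, the sizes and sums coming from all chunks and returns key-value pairs `(k,{n_k, s_k})` for the whole dataset, where `n_k` is the sum of `n_ki` over all chunks `i` and `s_k` is the sum of `s_ki` over all chunks `i`.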

```
reducer = function(k, A) {
  # A is the list of (n_ki, s_ki) rows for key k; add them element-wise
  keyval(k, list(Reduce('+', A)))
}
```
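To see what `Reduce('+', A)` does with the values of one key, here is a small base-R sketch. The list `partial` is a made-up stand-in for the one-row data frames that the map step would emit for a single group from three chunks.

```r
# Hypothetical per-chunk results for one group: size n and two column sums.
partial <- list(
  data.frame(n = 2, m1 = 40, m2 = 4),
  data.frame(n = 3, m1 = 90, m2 = 9),
  data.frame(n = 1, m1 = 15, m2 = 2)
)

# Element-wise addition of the one-row data frames gives the totals.
total <- Reduce(`+`, partial)
print(total)
```

The result is again a one-row data frame holding the total group size and the total column sums.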

### Map-reduce

Once we have defined the `map` and `reduce` functions, we compute the group sums with `mapreduce`.

```
GroupSums <- from.dfs(
  mapreduce(
    input = "/CEnetBig",
    map = mapper,
    reduce = reducer,
    combine = TRUE
  )
)
```
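Setting `combine` to true tells `rmr2` to also use the reduce function as a combiner on the map output. This is safe here because adding partial (size, sum) vectors gives the same total however the additions are grouped. The sketch below illustrates this with made-up vectors; `parts` is a hypothetical stand-in for the values of one key.

```r
# Hypothetical partial results (size, sum_m1, sum_m2) for one group from four chunks.
parts <- list(c(2, 40, 4), c(3, 90, 9), c(1, 15, 2), c(4, 100, 8))

# Single reduce over all partial results.
one_stage <- Reduce(`+`, parts)

# Two stages: combine pairs of partial results first, then reduce the combined values.
combined <- list(Reduce(`+`, parts[1:2]), Reduce(`+`, parts[3:4]))
two_stage <- Reduce(`+`, combined)

print(one_stage)
print(two_stage)
```

Both variants give the same totals, which is the property a combiner relies on.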

## Final code

Finally, we compute the group centroids from the values of the key-value pairs. Note that the first entry of each value is the group size and the remaining entries form the row of group sums. To obtain the centroids we divide each row of group sums by the size of the group.

```
GroupSumsM <- matrix(unlist(GroupSums$val), ncol = 13, byrow = TRUE)
Centroids <- GroupSumsM[, -1] / GroupSumsM[, 1]
```

If you run the code you should get the following results:

```
> Centroids
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 109.9310 104.9678 94.98656 80.08108 80.00087 64.97981 84.93587 90.07869 70.00322 79.85767 89.94883 100.0320
[2,] 120.0249 115.0168 104.95868 89.96294 89.97693 74.99977 95.04289 100.05404 80.02519 90.05004 99.99359 109.9933
[3,] 130.0478 125.0739 115.14003 100.00783 99.90976 85.01050 105.03596 110.00461 90.08632 100.01509 110.03724 120.0029
[4,] 139.9501 135.0315 124.96959 110.03113 109.99999 94.94679 114.93834 120.03891 100.01716 110.02524 119.95535 129.9921
[5,] 149.9407 145.0123 135.06575 119.97880 120.06271 105.05000 124.97573 129.95944 109.92863 120.02247 130.02287 139.9866
```
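The whole procedure can also be checked locally on a small random data set, without Hadoop. The sketch below is only a stand-in for the real job: `map_chunk` is a hypothetical helper that uses `rowsum` instead of the `aggregate`/`keyval` combination, the chunks are plain row blocks of an in-memory matrix, and the data are random numbers with the same column layout (id, 12 bills, type).

```r
set.seed(1)  # arbitrary seed, only to make the toy data reproducible

# Toy data: 200 rows; columns = id, 12 monthly bills, type (1..5).
n <- 200
X <- cbind(1:n, matrix(runif(n * 12, 50, 150), nrow = n), rep(1:5, length.out = n))

# Local stand-in for the map step: per-chunk group sizes and sums,
# one row per group (row names are the group labels).
map_chunk <- function(chunk) {
  rowsum(cbind(1, chunk[, 2:13]), group = chunk[, 14])
}

# Split the rows into 4 chunks of 50 rows; each chunk contains all 5 groups.
chunks <- lapply(split(seq_len(n), rep(1:4, each = 50)),
                 function(idx) X[idx, , drop = FALSE])

# Local stand-in for the reduce step: add the per-chunk matrices.
sums <- Reduce(`+`, lapply(chunks, map_chunk))

# First column = group sizes, remaining columns = group sums.
centroids <- sums[, -1] / sums[, 1]

# Direct in-memory computation for comparison.
direct <- rowsum(X[, 2:13], group = X[, 14]) / as.vector(table(X[, 14]))
print(all.equal(unname(centroids), unname(direct)))
```

The chunked computation and the direct computation agree, which is the reason the sums and sizes, rather than per-chunk means, are passed between the map and reduce steps.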

© PRACE and University of Ljubljana