# Basic data operations in R

## Introduction

This article gives a compact overview of how to get a quick summary of (normal-sized) data in R. Recall that you need to be working in RStudio.

Suppose we have a data frame containing three numerical (ratio-scale) variables and one categorical variable:

```
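library(plyr)  # provides mapvalues()
set.seed(1000)
# 50 x 3 matrix of standard normal draws
M1 <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(M1) <- c("X1", "X2", "X3")
# integer group codes 1..4
M2 <- matrix(round(runif(50, 1, 4), 0), ncol = 1)
group <- c('Group_A', 'Group_B', 'Group_D', 'Group_E')
# replace the codes with group labels
M3 <- mapvalues(M2, from = 1:4, to = group)
colnames(M3) <- c("group")
# data.frame() keeps X1..X3 numeric; cbind() would turn them into strings
M <- data.frame(M1, group = factor(M3[, 1]))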
```
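One detail of such a construction is easy to miss: `cbind` applied to a numeric and a character object returns a character matrix, so it is worth checking the column types after building a data frame. The sketch below uses small made-up objects (`num`, `lab`, `mixed`, and `df` are illustrative names, not part of the data above) to show the coercion and one base R way to keep numeric columns numeric.

```r
# Illustrative objects only; names and values are assumptions for this sketch.
num <- matrix(c(0.5, -1.2, 0.3, 2.1), ncol = 2,
              dimnames = list(NULL, c("X1", "X2")))
lab <- c("Group_A", "Group_B")

# cbind() must return a single matrix type, so the numbers become strings.
mixed <- cbind(num, group = lab)
is.character(mixed)   # TRUE

# Passing the pieces to data.frame() separately keeps each column's type.
df <- data.frame(num, group = factor(lab))
sapply(df, class)     # X1 and X2 numeric, group factor
str(df)               # compact overview of the columns
```

`str` is a convenient first look at any data frame, since it lists every column with its type and first few values.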

## Descriptive statistics

To see the distribution (frequencies) of the categorical values in `M$group`, use `table`, or `summary` applied to the variable as a factor:

```
table(M$group)
summary(factor(M$group))  # summary() counts levels only for factors
```

In both cases we obtain

```
Group_A Group_B Group_D Group_E
     10      14      20       6
```
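Relative frequencies are often easier to read than raw counts. A minimal sketch, assuming a label vector `g` rebuilt from the counts shown above (`g`, `counts`, and `props` are illustrative names; only the counts come from the article):

```r
# Rebuild a label vector with the counts reported above (10, 14, 20, 6).
g <- rep(c("Group_A", "Group_B", "Group_D", "Group_E"),
         times = c(10, 14, 20, 6))

counts <- table(g)            # absolute frequencies
props  <- prop.table(counts)  # relative frequencies, summing to 1
round(props, 2)
```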

We might also be interested in the column means of `M1` (the centroid) and in the group centroids of `M1`, where the groups are defined by the labels in `M3`.

```
centr <- colMeans(M1)           # centroid
aggregate(M1 ~ M3, FUN = mean)  # group centroids
#     group         X1           X2          X3
# 1 Group_A -0.1159164  0.187124494  0.51985248
# 2 Group_B -0.1770462  0.532357620 -0.16423459
# 3 Group_D -0.1651467  0.013811050  0.06511391
# 4 Group_E -0.1651271 -0.005173836 -0.12914429
```
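Group centroids can also be obtained with the `by` argument of `aggregate`, which does not rely on a formula with matrices on both sides. The sketch below is self-contained and uses its own simulated data (`X`, `g`, and the use of `sample` are assumptions for illustration, so the numbers differ from the output above). It also checks that the size-weighted average of the group centroids reproduces the overall centroid.

```r
set.seed(1)
# Simulated stand-ins for M1 and the group labels (not the article's data).
X <- matrix(rnorm(150), ncol = 3, dimnames = list(NULL, c("X1", "X2", "X3")))
g <- sample(c("Group_A", "Group_B", "Group_D", "Group_E"), 50, replace = TRUE)

centr  <- colMeans(X)                                     # overall centroid
gcentr <- aggregate(X, by = list(group = g), FUN = mean)  # one row per group

# Weighting each group centroid by its group size recovers the centroid.
w <- as.numeric(table(g)[gcentr$group])
weighted <- colSums(as.matrix(gcentr[, -1]) * w) / sum(w)
all.equal(weighted, centr)
```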

## Matrix operations

Suppose we want to multiply the transpose of `M1` by `M1`. This can be done by

```
t(M1)%*%M1
```
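Base R also provides `crossprod`, which computes the same product without forming the transpose explicitly. A minimal check on a small illustrative matrix (`A`, `P1`, and `P2` are made-up names, and `A` is not the article's `M1`):

```r
# Small illustrative matrix; any numeric matrix would do.
A <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)

P1 <- t(A) %*% A     # explicit transpose, then matrix product
P2 <- crossprod(A)   # same result in a single call

all.equal(P1, P2)    # TRUE
dim(P1)              # 2 2: one row and one column per column of A
```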

The covariance matrix of `M1` is computed directly as

```
cov(M1)
```

Likewise, we compute the sum-of-squares and cross-products matrix (SS matrix) by

```
n <- nrow(M1)  # number of rows in M1
SS <- (n - 1) * cov(M1)
SS
#           X1         X2         X3
# X1 42.485852 -6.7437071 -7.3797835
# X2 -6.743707 54.7612372 -0.8058014
# X3 -7.379783 -0.8058014 40.5334042
```

If we centre the data (subtract the centroid from each row):

```
M1s=scale(M1,scale=FALSE)
```

then the SS matrix can also be computed as

```
SS1=t(M1s)%*%M1s
```

Later we will also use the fact that the SS matrix can be computed as

```
SS2 = t(M1)%*%M1-n*outer(centr,centr)
```
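This formula follows from expanding the centred product: with n rows x and centroid c, the sum over rows of (x - c)(x - c)' equals the sum of x x' minus n c c'. The sketch below checks numerically that the three routes agree, using simulated data of the same shape as `M1` (`X` and `Xc` are illustrative stand-ins, so the values differ from the article's).

```r
set.seed(1)
X <- matrix(rnorm(150), ncol = 3)   # stand-in for M1: 50 rows, 3 columns
n <- nrow(X)
centr <- colMeans(X)

SS  <- (n - 1) * cov(X)                      # via the covariance matrix
Xc  <- scale(X, scale = FALSE)               # centred data
SS1 <- t(Xc) %*% Xc                          # via the centred cross-product
SS2 <- t(X) %*% X - n * outer(centr, centr)  # via the raw cross-product

all.equal(SS, SS1, check.attributes = FALSE)
all.equal(SS, SS2, check.attributes = FALSE)
```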

This is particularly useful for big-data computations, since `t(M1)%*%M1` (together with the column sums and the row count needed for the centroid) can be computed for each data chunk separately in a map step and then summed in a reduce step.
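A minimal sketch of this map/reduce idea in base R, assuming the data fit in memory and are split into chunks only for illustration. The chunk size, the names `chunks`, `stats`, and `total`, and the use of `lapply` and `Reduce` in place of a real distributed framework are all assumptions. Each chunk contributes its cross-product, its column sums, and its row count.

```r
set.seed(1)
X <- matrix(rnorm(150), ncol = 3)   # stand-in for M1

# Split the row indices into 5 chunks of 10 rows each.
chunks <- split(seq_len(nrow(X)), rep(1:5, each = 10))

# Map step: per-chunk summary statistics.
stats <- lapply(chunks, function(idx) {
  Xi <- X[idx, , drop = FALSE]
  list(xtx = crossprod(Xi), sums = colSums(Xi), n = nrow(Xi))
})

# Reduce step: add the statistics element-wise across chunks.
total <- Reduce(function(a, b) Map(`+`, a, b), stats)

centr <- total$sums / total$n
SS <- total$xtx - total$n * outer(centr, centr)

all.equal(SS, (nrow(X) - 1) * cov(X), check.attributes = FALSE)
```

Only three small objects per chunk have to be kept and combined, whatever the number of rows in the chunk.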

Note that `SS` is symmetric, hence it has 3 real eigenvalues and 3 corresponding eigenvectors. We can compute them by

```
ev <- eigen(SS)
ev$values
# [1] 58.08193 46.86827 32.83029
ev$vectors
#            [,1]       [,2]      [,3]
# [1,]  0.4511108 -0.5672369 0.6890147
# [2,] -0.8798904 -0.4118343 0.2370348
# [3,] -0.1493050  0.7131864 0.6848892
```
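A few sanity checks on an eigendecomposition are cheap to run. In the output above, for instance, the eigenvalues sum to about 137.78, which matches the trace of `SS`. The sketch below uses simulated data (`X` stands in for `M1`, and `V` and `lambda` are illustrative names, so the numbers differ from those above). It checks the trace identity, the orthonormality of the eigenvectors, and the defining equation. Passing `symmetric = TRUE` is optional, since `eigen` otherwise tests for symmetry itself.

```r
set.seed(1)
X  <- matrix(rnorm(150), ncol = 3)
SS <- (nrow(X) - 1) * cov(X)

ev <- eigen(SS, symmetric = TRUE)
V  <- ev$vectors
lambda <- ev$values

all.equal(sum(lambda), sum(diag(SS)))   # eigenvalues sum to the trace
all.equal(crossprod(V), diag(3))        # eigenvector columns are orthonormal
all.equal(SS %*% V, V %*% diag(lambda),
          check.attributes = FALSE)     # SS v = lambda v, column by column
```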

© PRACE and University of Ljubljana