
## Partnership for Advanced Computing in Europe (PRACE)

# Basic data operations in R

## Introduction

In this article we give a condensed presentation of how to get a quick overview of (normal-size) data. Recall that you have to be working within RStudio.

Suppose we have a data frame containing three numerical (ratio-scale) variables and one categorical variable:

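    library(plyr)   # provides mapvalues()
    set.seed(1000)  # make the random draws reproducible

    # three numerical variables: 50 rows of standard normal draws
    M1 <- matrix(rnorm(150,0,1), ncol=3)
    colnames(M1)<-c("X1","X2","X3")

    # one categorical variable: integer codes 1..4 mapped to group labels
    M2 <- matrix(round(runif(50,1,4),0),ncol=1)
    group=c('Group_A','Group_B','Group_D','Group_E')
    M3 <- mapvalues(M2, from = 1:4, to = group)
    colnames(M3)<-c("group")

    # data.frame() keeps X1..X3 numeric; cbind(M1,M3) would coerce them to character
    M<-data.frame(M1, group=factor(M3))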
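
Before computing statistics, it can help to confirm the shape and column types of the data. The following is a minimal, self-contained sketch that rebuilds an equivalent data set with base R only. Here `factor(..., labels = ...)` stands in for `mapvalues` purely for illustration, and the names `X`, `codes`, `g` and `D` are hypothetical, not part of the original example.

```r
set.seed(1000)

# Same generating calls as in the text, under illustrative names
# (X, codes, g and D are assumptions made for this sketch)
X <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(X) <- c("X1", "X2", "X3")
codes <- round(runif(50, 1, 4), 0)
g <- factor(codes, levels = 1:4,
            labels = c("Group_A", "Group_B", "Group_D", "Group_E"))

# Build the data frame column-wise so the numeric columns stay numeric
D <- data.frame(X, group = g)

dim(D)            # number of rows and columns
str(D)            # compact view of the column types
head(D, 3)        # first three rows
sapply(D, class)  # class of each column
```

Building the data frame directly from the numeric matrix and the factor avoids the type coercion that happens when numeric and character matrices are combined with `cbind`.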


## Descriptive statistics

If we want to see the distribution (frequencies) of the categorical values in M$group, we can use table or summary (summary reports frequencies when group is stored as a factor):

    table(M$group)
    summary(M$group)

In both cases we obtain

    Group_A Group_B Group_D Group_E
         10      14      20       6

We might also be interested in the mean values of the columns of M1 (the centroid) and in the group centroids of M1, where the groups are defined by M3 (the labelled version of M2):

    centr=colMeans(M1)         # centroid
    aggregate(M1~M3,FUN=mean)  # group centroids

The group centroids are

        group         X1           X2          X3
    1 Group_A -0.1159164  0.187124494  0.51985248
    2 Group_B -0.1770462  0.532357620 -0.16423459
    3 Group_D -0.1651467  0.013811050  0.06511391
    4 Group_E -0.1651271 -0.005173836 -0.12914429

## Matrix operations

Suppose we want to compute the product of the transpose of M1 with M1. This can be done by

    t(M1)%*%M1

The covariance matrix of M1 is computed directly as

    cov(M1)

Likewise, we compute the sum-of-squares and cross-products matrix (SS matrix) by

    n=nrow(M1)        # number of rows in M1
    SS=(n-1)*cov(M1)

    > SS
              X1         X2         X3
    X1 42.485852 -6.7437071 -7.3797835
    X2 -6.743707 54.7612372 -0.8058014
    X3 -7.379783 -0.8058014 40.5334042

If we centre the data (subtract the centroid from each row):

    M1s=scale(M1,scale=FALSE)

then the SS matrix can also be computed as

    SS1=t(M1s)%*%M1s

Later we will also use the fact that the SS matrix can be computed as

    SS2 = t(M1)%*%M1-n*outer(centr,centr)

This is of particular use for big data computation: the cross-product t(M1)%*%M1 can be computed separately for each chunk of rows in a map step, and the partial results are then summed in a reduce step.

Note that SS is symmetric, hence it has 3 real eigenvalues and 3 corresponding eigenvectors. We can compute them by

    ev = eigen(SS)
    ev$values
    [1] 58.08193 46.86827 32.83029
    ev$vectors
               [,1]       [,2]      [,3]
    [1,]  0.4511108 -0.5672369 0.6890147
    [2,] -0.8798904 -0.4118343 0.2370348
    [3,] -0.1493050  0.7131864 0.6848892
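
The chunk-wise (map and reduce) computation of the SS matrix mentioned above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are split into 5 chunks of 10 rows (an arbitrary choice), `lapply` and `Reduce` stand in for a real distributed map and reduce, and the names `X`, `chunks`, `partial`, `total`, `centr_chunked` and `SS_chunked` are hypothetical.

```r
set.seed(1000)

# Data with the same shape as M1 in the text (X is an illustrative name)
X <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(X) <- c("X1", "X2", "X3")

# Split the row indices into 5 chunks of 10 rows (arbitrary chunking)
chunks <- split(seq_len(nrow(X)), rep(1:5, each = 10))

# Map step: per-chunk cross-product, column sums and row count
partial <- lapply(chunks, function(idx) {
  Xc <- X[idx, , drop = FALSE]
  list(xtx = crossprod(Xc),  # same as t(Xc) %*% Xc
       s   = colSums(Xc),
       n   = nrow(Xc))
})

# Reduce step: add up the partial results
total <- Reduce(function(a, b) {
  list(xtx = a$xtx + b$xtx, s = a$s + b$s, n = a$n + b$n)
}, partial)

# Centroid and SS matrix from the accumulated quantities
centr_chunked <- total$s / total$n
SS_chunked <- total$xtx - total$n * outer(centr_chunked, centr_chunked)

# Agrees with the direct computation up to floating-point error
all.equal(SS_chunked, (nrow(X) - 1) * cov(X), check.attributes = FALSE)

# The result can be passed to eigen() as before; each column v of
# ev_chunked$vectors satisfies SS v = lambda v
ev_chunked <- eigen(SS_chunked, symmetric = TRUE)
all.equal(SS_chunked %*% ev_chunked$vectors,
          ev_chunked$vectors %*% diag(ev_chunked$values),
          check.attributes = FALSE)
```

Only the cross-product, the column sums and the row count need to be carried between the two steps, so no chunk has to see the full data set or the global centroid.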