Basic data operations in R

Introduction

In this article we give a dense overview of how to get a quick summary of (normal-sized) data. Recall that you need to be working in RStudio.

Suppose we have a data frame containing 3 numerical ratio variables and 1 categorical variable:

library(plyr) 
set.seed(1000)

M1 <- matrix(rnorm(150,0,1), ncol=3)              
colnames(M1)<-c("X1","X2","X3")

M2 <- matrix(round(runif(50,1,4),0),ncol=1)       
group=c('Group_A','Group_B','Group_D','Group_E')
M3 <- mapvalues(M2, from = 1:4, to = group)
colnames(M3)<-c("group")

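# Build the data frame column-wise: cbind(M1,M3) would coerce X1..X3 to character
M<-data.frame(M1, group=factor(as.vector(M3)))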

Descriptive statistics

To see the distribution (frequencies) of the categorical values in M$group, use table, or summary when M$group is stored as a factor.

table(M$group)
summary(M$group)

In both cases we obtain

Group_A Group_B Group_D Group_E 
     10      14      20       6 

We might also be interested in the column means of M1 (the centroid) and in the group centroids of M1, where the groups are given by the labels in M3.

centr=colMeans(M1)          # centroid
aggregate(M1~M3,FUN=mean)   # group centroids
    group         X1           X2          X3
1 Group_A -0.1159164  0.187124494  0.51985248
2 Group_B -0.1770462  0.532357620 -0.16423459
3 Group_D -0.1651467  0.013811050  0.06511391
4 Group_E -0.1651271 -0.005173836 -0.12914429
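
The group centroids are simply group-wise column sums divided by the group sizes. The following is a minimal sketch of that computation in base R, offered as a cross-check rather than as the course's method. It assumes the same seed and data as above, and it replaces mapvalues by plain indexing (group[M2]) so that no extra package is needed; the names g, sums, sizes and centroids are introduced here for illustration only.

```r
set.seed(1000)
M1 <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(M1) <- c("X1", "X2", "X3")
M2 <- round(runif(50, 1, 4), 0)
group <- c("Group_A", "Group_B", "Group_D", "Group_E")

# Assumption: indexing group[M2] stands in for plyr::mapvalues
g <- factor(group[M2], levels = group)

# Column sums within each group, then divide each row by its group size
sums <- rowsum(M1, g)
sizes <- table(g)[rownames(sums)]
centroids <- sums / as.vector(sizes)
centroids

# The size-weighted average of the group centroids gives back the overall centroid
overall <- colSums(centroids * as.vector(sizes)) / sum(sizes)
all.equal(overall, colMeans(M1))
```

The last line is a useful consistency check whenever group means are computed separately from the overall mean.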

Matrix operations

Suppose we want to compute the product of the transpose of M1 with M1. This can be done by

t(M1)%*%M1
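
Base R also provides crossprod, which computes the same product without forming the transpose explicitly. A small sketch, assuming the same seed and matrix as above (the names A and B are illustrative):

```r
set.seed(1000)
M1 <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(M1) <- c("X1", "X2", "X3")

A <- t(M1) %*% M1     # explicit transpose, then matrix product
B <- crossprod(M1)    # same product computed in one call

all.equal(A, B)       # the two results should agree
dim(B)                # a 3 x 3 matrix, one row and column per variable
```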

The covariance matrix of M1 can be computed directly as

cov(M1)

Likewise we compute the sums-of-squares and cross-products matrix (SS matrix) by

n=nrow(M1)         #number of rows in M1
SS=(n-1)*cov(M1)

> SS
          X1         X2         X3
X1 42.485852 -6.7437071 -7.3797835
X2 -6.743707 54.7612372 -0.8058014
X3 -7.379783 -0.8058014 40.5334042

If we centre the data (subtract the centroid from each row):

M1s=scale(M1,scale=FALSE)

then the SS matrix can also be computed as

SS1=t(M1s)%*%M1s

Later we will also use the fact that the SS matrix can be computed from the uncentred data as

SS2 = t(M1)%*%M1-n*outer(centr,centr)
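
As a quick sanity check, the three expressions for the SS matrix can be compared numerically. The sketch below assumes the same seed and matrix as above and only repeats the definitions already given; both comparisons should return TRUE up to floating-point tolerance.

```r
set.seed(1000)
M1 <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(M1) <- c("X1", "X2", "X3")

n <- nrow(M1)
centr <- colMeans(M1)

SS  <- (n - 1) * cov(M1)                        # from the covariance matrix
M1s <- scale(M1, scale = FALSE)                 # centred data
SS1 <- t(M1s) %*% M1s                           # from the centred data
SS2 <- t(M1) %*% M1 - n * outer(centr, centr)   # from the uncentred data

all.equal(SS, SS1, check.attributes = FALSE)
all.equal(SS, SS2, check.attributes = FALSE)
```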

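This is of particular use for big data computation, since t(M1)%*%M1 can be computed for each data chunk separately in a map step and the partial results summed in a reduce step. The column sums and row counts needed for the centroid can be accumulated in the same way.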
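
The chunk-wise idea can be sketched in plain R with lapply as the map step and Reduce as the reduce step. This is only an in-memory illustration, not a Hadoop job: the split into 5 chunks of 10 rows and the names chunk_id, chunks, partial, total, centr_chunked and SS_chunked are assumptions made for this example.

```r
set.seed(1000)
M1 <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(M1) <- c("X1", "X2", "X3")

# Hypothetical chunking: 5 chunks of 10 consecutive rows each
chunk_id <- rep(1:5, each = 10)
chunks <- lapply(split(seq_len(nrow(M1)), chunk_id),
                 function(idx) M1[idx, , drop = FALSE])

# Map step: each chunk reports its row count, column sums and t(X) %*% X
partial <- lapply(chunks, function(X) {
  list(n = nrow(X), sums = colSums(X), xtx = crossprod(X))
})

# Reduce step: add the partial results component by component
total <- Reduce(function(a, b) {
  list(n = a$n + b$n, sums = a$sums + b$sums, xtx = a$xtx + b$xtx)
}, partial)

# Combine the totals into the centroid and the SS matrix
centr_chunked <- total$sums / total$n
SS_chunked <- total$xtx - total$n * outer(centr_chunked, centr_chunked)

all.equal(SS_chunked, (nrow(M1) - 1) * cov(M1), check.attributes = FALSE)
```

Each chunk only has to return a 3 x 3 matrix, a vector of length 3 and a count, so the amount of data passed to the reduce step does not grow with the number of rows.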

Note that SS is a symmetric 3 x 3 matrix, hence it has 3 real eigenvalues (counted with multiplicity) and 3 corresponding eigenvectors, which can be chosen orthonormal. We can compute them by

ev = eigen(SS)
ev$values
[1] 58.08193 46.86827 32.83029
ev$vectors
            [,1]       [,2]      [,3]
[1,]  0.4511108 -0.5672369 0.6890147
[2,] -0.8798904 -0.4118343 0.2370348
[3,] -0.1493050  0.7131864 0.6848892
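
The result of eigen can be checked against the defining properties of an eigendecomposition. The following sketch assumes the same seed and SS matrix as above; V and lambda are illustrative names. All three comparisons should return TRUE up to floating-point tolerance.

```r
set.seed(1000)
M1 <- matrix(rnorm(150, 0, 1), ncol = 3)
SS <- (nrow(M1) - 1) * cov(M1)

ev <- eigen(SS)
V <- ev$vectors       # eigenvectors stored as columns
lambda <- ev$values   # eigenvalues in decreasing order

# Each column of V is scaled by its eigenvalue when multiplied by SS
all.equal(SS %*% V, V %*% diag(lambda), check.attributes = FALSE)

# For a symmetric matrix the eigenvectors returned are orthonormal
all.equal(crossprod(V), diag(3))

# The eigenvalues sum to the trace of SS
all.equal(sum(lambda), sum(diag(SS)))
```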

This article is from the free online course:

Managing Big Data with R and Hadoop

Partnership for Advanced Computing in Europe (PRACE)