Basic data operations in R


This article gives a condensed overview of how to get a quick summary of (normal-sized) data in R. Recall that you need to be working in RStudio.

Suppose we have a data frame containing 3 numerical (ratio-scale) variables and 1 categorical variable:


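library(plyr)                                         # provides mapvalues()
M1 <- matrix(rnorm(150,0,1), ncol=3)                  # 50 x 3 matrix of N(0,1) values
M2 <- matrix(round(runif(50,1,4),0),ncol=1)           # 50 group codes in 1..4
group <- c("Group_A","Group_B","Group_D","Group_E")   # labels as they appear in the output below
M3 <- mapvalues(M2, from = 1:4, to = group)           # replace codes by labels
M <- data.frame(M1, group = factor(M3))               # columns X1, X2, X3, group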


Descriptive statistics

To see the distribution (frequencies) of the categorical values in M$group, we can use table(M$group) or summary(M$group).
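
As a minimal sketch of these two calls, the block below uses a hypothetical factor in place of M$group (the seed, the sampling and the variable names are assumptions made for illustration). Note that summary returns the counts per level only when the variable is a factor; for a plain character vector it reports length and mode instead.

```r
set.seed(1)  # fixed seed, an assumption for reproducibility

# Hypothetical stand-in for M$group: 50 labels drawn from four groups
lab   <- c("Group_A", "Group_B", "Group_D", "Group_E")
group <- factor(sample(lab, 50, replace = TRUE), levels = lab)

freq_table   <- table(group)    # frequency of each level
freq_summary <- summary(group)  # for a factor: the same counts per level

print(freq_table)
```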


In both cases we obtain

Group_A Group_B Group_D Group_E 
     10      14      20       6 

We might also be interested in the column means of M1 (the centroid) and in the group centroids of M1, where the groups are defined by M3 (the labelled version of M2).

centr=colMeans(M1)          # centroid
aggregate(M1~M3,FUN=mean)   # group centroids
    group         X1           X2          X3
1 Group_A -0.1159164  0.187124494  0.51985248
2 Group_B -0.1770462  0.532357620 -0.16423459
3 Group_D -0.1651467  0.013811050  0.06511391
4 Group_E -0.1651271 -0.005173836 -0.12914429
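
The same two summaries can be sketched on a small self-contained data frame. The data, the two group labels and the names X, grp and group_centr are assumptions for illustration; the formula interface of aggregate is used here because it keeps the column names of the data frame.

```r
set.seed(2)  # fixed seed, an assumption for reproducibility

X   <- matrix(rnorm(150), ncol = 3)                 # hypothetical 50 x 3 data
grp <- factor(sample(c("Group_A", "Group_B"), 50, replace = TRUE))
M   <- data.frame(X, group = grp)                   # columns X1, X2, X3, group

centr       <- colMeans(X)                              # overall centroid
group_centr <- aggregate(. ~ group, data = M, FUN = mean)  # one row per group

print(group_centr)
```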

Matrix operations

Suppose we want to multiply the transpose of M1 by M1. This is done with t(M1)%*%M1.
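
A minimal sketch of this product, assuming a random 50 x 3 matrix A in place of M1: the explicit form and the base R shortcut crossprod give the same 3 x 3 symmetric matrix.

```r
set.seed(3)  # fixed seed, an assumption for reproducibility

A  <- matrix(rnorm(150), ncol = 3)  # hypothetical stand-in for M1
P1 <- t(A) %*% A                    # explicit transpose, then matrix product
P2 <- crossprod(A)                  # base R shortcut for the same product

print(dim(P1))
```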


The covariance matrix of M1 is computed directly with cov(M1).
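
As a small illustration, again on a hypothetical matrix A rather than the article's M1, cov returns the 3 x 3 sample covariance matrix (divisor n-1), whose diagonal holds the column variances:

```r
set.seed(4)  # fixed seed, an assumption for reproducibility

A <- matrix(rnorm(150), ncol = 3)  # hypothetical stand-in for M1
C <- cov(A)                        # sample covariance matrix, divisor n - 1
col_var <- apply(A, 2, var)        # column variances, for comparison

print(diag(C))
```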


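Likewise, we compute the sum-of-squares and cross-products matrix (SS matrix), which is the covariance matrix multiplied by n-1:

n=nrow(M1)         # number of rows in M1
SS=(n-1)*cov(M1)   # SS matrix: covariance matrix scaled by n-1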

> SS
          X1         X2         X3
X1 42.485852 -6.7437071 -7.3797835
X2 -6.743707 54.7612372 -0.8058014
X3 -7.379783 -0.8058014 40.5334042
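
To see what the entries of an SS matrix mean, the sketch below (hypothetical data, names chosen for illustration) compares two of its entries with the sums they stand for: a diagonal entry is a sum of squared deviations, an off-diagonal entry a sum of cross-products of deviations.

```r
set.seed(5)  # fixed seed, an assumption for reproducibility

A  <- matrix(rnorm(150), ncol = 3)  # hypothetical stand-in for M1
n  <- nrow(A)
SS <- (n - 1) * cov(A)              # SS matrix from the covariance matrix

d1 <- A[, 1] - mean(A[, 1])         # deviations of column 1 from its mean
d2 <- A[, 2] - mean(A[, 2])         # deviations of column 2 from its mean
ss_11 <- sum(d1^2)                  # sum of squares, matches SS[1, 1]
ss_12 <- sum(d1 * d2)               # sum of cross-products, matches SS[1, 2]
```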

If we centre the data (subtract the centroid from each row), the SS matrix can also be computed as the product of the transposed centred matrix with the centred matrix.
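
A minimal sketch of the centring step and the resulting SS matrix, assuming hypothetical data and the names Ac and SS1 (neither is taken from the article). sweep subtracts the column means from every row; scale(A, center = TRUE, scale = FALSE) would be an alternative.

```r
set.seed(6)  # fixed seed, an assumption for reproducibility

A     <- matrix(rnorm(150), ncol = 3)  # hypothetical stand-in for M1
centr <- colMeans(A)                   # centroid
Ac    <- sweep(A, 2, centr)            # subtract the centroid from each row
SS1   <- t(Ac) %*% Ac                  # SS matrix from the centred data

print(round(colMeans(Ac), 12))         # column means of the centred data
```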


Later we will also use the fact that the SS matrix can be computed as

SS2 = t(M1)%*%M1-n*outer(centr,centr)
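
The identity behind this formula can be derived in a few lines. In the sketch below, x_i denotes the i-th row of M1 written as a column vector and c the centroid, so that the sum of all x_i equals n c (this notation is introduced here for illustration):

```latex
\begin{aligned}
SS &= \sum_{i=1}^{n} (x_i - c)(x_i - c)^{\top} \\
   &= \sum_{i=1}^{n} x_i x_i^{\top}
      - \Big(\sum_{i=1}^{n} x_i\Big) c^{\top}
      - c \Big(\sum_{i=1}^{n} x_i\Big)^{\top}
      + n\, c\, c^{\top} \\
   &= M_1^{\top} M_1 - n\, c\, c^{\top} - n\, c\, c^{\top} + n\, c\, c^{\top} \\
   &= M_1^{\top} M_1 - n\, c\, c^{\top}
\end{aligned}
```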

This is particularly useful for big data computations, since t(M1)%*%M1 can be computed separately for each data chunk in a map step and the results then summed up in a reduce step. The column sums needed for centr can be accumulated in the same way.
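
The chunk-wise idea can be sketched in plain R with lapply as the map step and Reduce as the reduce step. This only imitates the pattern on one machine and is not Hadoop code; the data, the chunk size of 10 rows and all names are assumptions for illustration.

```r
set.seed(8)  # fixed seed, an assumption for reproducibility

A <- matrix(rnorm(150), ncol = 3)  # hypothetical stand-in for M1
n <- nrow(A)

# Split the row indices into 5 chunks of 10 rows each (arbitrary chunk size)
chunks <- split(seq_len(n), rep(1:5, each = 10))

# Map step: per-chunk cross-product matrix and column sums
partial <- lapply(chunks, function(idx) {
  B <- A[idx, , drop = FALSE]
  list(xtx = crossprod(B), s = colSums(B))
})

# Reduce step: add up the partial results
xtx <- Reduce(`+`, lapply(partial, `[[`, "xtx"))
s   <- Reduce(`+`, lapply(partial, `[[`, "s"))

centr <- s / n                            # centroid from the summed column sums
SS2   <- xtx - n * outer(centr, centr)    # SS matrix without centring the data
```

Only a 3 x 3 matrix and a vector of length 3 leave each chunk, so the amount of data passed to the reduce step does not grow with the number of rows.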

Note that SS is symmetric, hence has 3 real eigenvalues and 3 corresponding eigenvectors. We can compute them by

ev = eigen(SS)
> ev$values
[1] 58.08193 46.86827 32.83029
> ev$vectors
            [,1]       [,2]      [,3]
[1,]  0.4511108 -0.5672369 0.6890147
[2,] -0.8798904 -0.4118343 0.2370348
[3,] -0.1493050  0.7131864 0.6848892
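
The result of eigen can be checked against the defining properties. The sketch below uses hypothetical data (so the numbers differ from the output above) and verifies that SS v = lambda v holds for every pair, that the eigenvectors are orthonormal, and that the eigenvalues sum to the trace of SS.

```r
set.seed(9)  # fixed seed, an assumption for reproducibility

A  <- matrix(rnorm(150), ncol = 3)   # hypothetical stand-in for M1
SS <- (nrow(A) - 1) * cov(A)         # symmetric SS matrix

ev <- eigen(SS)                      # eigenvalues are returned in decreasing order
V  <- ev$vectors                     # eigenvectors as columns

resid <- SS %*% V - V %*% diag(ev$values)  # should be numerically zero
ortho <- crossprod(V)                      # should be the identity matrix

print(ev$values)
```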

This article is from the free online course:

Managing Big Data with R and Hadoop

Partnership for Advanced Computing in Europe (PRACE)