This content is taken from the Partnership for Advanced Computing in Europe (PRACE)'s online course, Managing Big Data with R and Hadoop.


Reading and storing data are unavoidable steps in data analysis.

Basic data operations in R

Introduction

This article shows how to get a quick overview of data of normal size. All examples assume that you have

• started RStudio (by executing rstudio & in the terminal) and
• opened a new R script file.

If you have not, press Ctrl+Shift+N to start a new script file, and save it to a local folder first. Once you have typed (or copied) the R code into the script file, run it by, e.g., selecting the part of the code you want to run and pressing Ctrl+Enter.

Suppose we have a data frame containing 3 numerical (ratio-scale) variables and 1 categorical variable:

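library(plyr)     # provides mapvalues()
set.seed(1000)    # make the random draws reproducible

# 50 x 3 matrix of standard normal draws
M1 <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(M1) <- c("X1", "X2", "X3")

# 50 integer codes in 1..4, recoded to group labels
M2 <- matrix(round(runif(50, 1, 4), 0), ncol = 1)
group <- c("Group_A", "Group_B", "Group_D", "Group_E")
M3 <- mapvalues(M2, from = 1:4, to = group)
colnames(M3) <- c("group")

# stringsAsFactors = TRUE keeps group a factor in R >= 4.0
M <- data.frame(M1, M3, stringsAsFactors = TRUE)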
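
Before computing any statistics, it often helps to check the size, column types and first rows of the data frame. The following is a minimal sketch of such a quick look. It rebuilds the same kind of data frame with base R indexing instead of plyr's mapvalues (an assumption made only so that the snippet runs without extra packages); the names X, labels and codes are introduced here for illustration.

```r
# Rebuild the example data with base R only (illustrative names).
set.seed(1000)
X <- matrix(rnorm(150, 0, 1), ncol = 3,
            dimnames = list(NULL, c("X1", "X2", "X3")))
labels <- c("Group_A", "Group_B", "Group_D", "Group_E")
codes <- round(runif(50, 1, 4), 0)          # integer codes in 1..4
M <- data.frame(X, group = factor(labels[codes]))

dim(M)       # number of rows and columns
str(M)       # type of each column and a preview of its values
head(M, 5)   # first five rows
summary(M)   # quartiles for numeric columns, counts for the factor
```

Here summary(M) reports quartiles and means for X1, X2 and X3 and counts per level for group, so a single call already gives a rough overview of every column.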


Descriptive statistics

If you want to see the distribution (frequencies) of the category values of the variable group (addressed as M$group), use table or summary (summary returns these counts when group is a factor).

table(M$group)
summary(M$group)

In both cases we obtain

Group_A Group_B Group_D Group_E
     10      14      20       6

We might also be interested in the mean values of the first three columns of M (the centroid) and in the group centroids for these columns, where the groups are defined by group. Here is the code and the results.

centr <- colMeans(M[, 1:3])   # CENTROID
centr
         X1          X2          X3
-0.15863016  0.19138859  0.06853306

aggregate(M[, 1:3], by = list(group = M$group), FUN = mean)   # GROUP CENTROIDS
    group         X1           X2          X3
1 Group_A -0.1159164  0.187124494  0.51985248
2 Group_B -0.1770462  0.532357620 -0.16423459
3 Group_D -0.1651467  0.013811050  0.06511391
4 Group_E -0.1651271 -0.005173836 -0.12914429
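
The overall centroid and the group centroids are linked: averaging the group centroids with the group proportions as weights gives back the overall centroid. The sketch below checks this and shows one alternative way to compute group centroids, using split and sapply instead of aggregate. It rebuilds the data with base R indexing rather than mapvalues (an assumption, so that the snippet is self-contained); the names props, centroids and recombined are introduced here for illustration.

```r
# Rebuild the example data with base R only (illustrative names).
set.seed(1000)
X <- matrix(rnorm(150, 0, 1), ncol = 3,
            dimnames = list(NULL, c("X1", "X2", "X3")))
labels <- c("Group_A", "Group_B", "Group_D", "Group_E")
codes <- round(runif(50, 1, 4), 0)
M <- data.frame(X, group = factor(labels[codes]))

counts <- table(M$group)      # absolute frequencies
props <- prop.table(counts)   # relative frequencies, summing to 1

# Group centroids: one row per group, one column per variable.
centroids <- t(sapply(split(M[, 1:3], M$group), colMeans))

# Weight each group's centroid by its share of the observations.
overall <- colMeans(M[, 1:3])
recombined <- colSums(centroids * as.vector(props))

all.equal(unname(overall), unname(recombined))   # TRUE
```

The matrix returned by split and sapply holds the same means as the aggregate call, but with the groups as row names, which is convenient for further matrix arithmetic.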