Reading and storing data are unavoidable steps in any data analysis.

Basic data operations in R

Introduction

This article shows how to get a quick overview of a normal-sized data set. All examples assume that you have

  • started RStudio and
  • opened a new R script file.

If you have not, press Ctrl+Shift+N to start a new script file and save it to a local folder first. Once you have typed (or copied) the R code into the script file, run it by, e.g., selecting the part of the code you want to run and pressing Ctrl+Enter.

Suppose we have a data frame containing three numerical ratio-scale variables and one categorical variable:

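library(plyr)        # provides mapvalues()
set.seed(1000)       # make the random numbers reproducible

# 50 x 3 matrix of standard normal values: the numerical variables
M1 <- matrix(rnorm(150, 0, 1), ncol = 3)
colnames(M1) <- c("X1", "X2", "X3")

# 50 integer codes in 1..4, recoded to group labels: the categorical variable
M2 <- matrix(round(runif(50, 1, 4), 0), ncol = 1)
group <- c('Group_A', 'Group_B', 'Group_D', 'Group_E')
M3 <- mapvalues(M2, from = 1:4, to = group)
colnames(M3) <- c("group")

# stringsAsFactors = TRUE stores group as a factor (since R 4.0 the default
# is FALSE), so that summary(M$group) returns counts
M <- data.frame(M1, M3, stringsAsFactors = TRUE)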
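
Before computing any statistics it can help to check what the data frame actually contains. The following is a minimal sketch using base R only; `M_demo` is a hypothetical stand-in with the same layout as M (three numeric columns and one factor), built without plyr so that the block runs on its own. Its values are not meant to reproduce those of M.

```r
# Hypothetical stand-in for M: three numeric columns and one factor column.
set.seed(1000)
M_demo <- data.frame(
  X1 = rnorm(50),
  X2 = rnorm(50),
  X3 = rnorm(50),
  group = factor(sample(c("Group_A", "Group_B", "Group_D", "Group_E"),
                        50, replace = TRUE))
)

dim(M_demo)       # number of rows and columns
str(M_demo)       # type of each column and its first few values
head(M_demo, 5)   # first five rows
summary(M_demo)   # quartiles for numeric columns, counts for the factor
```

The same four calls can be applied directly to M once it has been created.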

Descriptive statistics

To see the frequency distribution of the categories of the variable group (accessed as M$group), use table or summary. Note that summary returns counts only if group is stored as a factor.

table(M$group)
summary(M$group)

In both cases we obtain

Group_A Group_B Group_D Group_E 
     10      14      20       6 
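
Counts are often easier to compare as proportions. The sketch below rebuilds a vector with the group counts shown above (the names `g_chr` and `g_fac` are hypothetical, introduced only for this illustration) and also shows why summary needs a factor to report frequencies.

```r
# Rebuild a vector with the counts reported above: 10, 14, 20 and 6.
g_chr <- rep(c("Group_A", "Group_B", "Group_D", "Group_E"),
             times = c(10, 14, 20, 6))
g_fac <- factor(g_chr)

table(g_chr)      # table() counts values of a character vector as well
summary(g_fac)    # summary() of a factor gives the same counts
summary(g_chr)    # summary() of a character vector gives only Length/Class/Mode

# Relative frequencies and percentages
prop.table(table(g_fac))                   # 0.20 0.28 0.40 0.12
round(100 * prop.table(table(g_fac)), 1)   # 20 28 40 12
```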

We might also be interested in the mean values of the first three columns of M (the centroid) and in the group centroids of these columns, where the groups are defined by group. The code and its results follow.

centr <- colMeans(M[, 1:3])          # CENTROID
centr

         X1          X2          X3 
-0.15863016  0.19138859  0.06853306 

aggregate(M[, 1:3], by = list(group = M$group), FUN = mean)   # GROUP CENTROIDS
    group         X1           X2          X3
1 Group_A -0.1159164  0.187124494  0.51985248
2 Group_B -0.1770462  0.532357620 -0.16423459
3 Group_D -0.1651467  0.013811050  0.06511391
4 Group_E -0.1651271 -0.005173836 -0.12914429
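
The group centroids can be obtained in more than one way, and they are linked to the overall centroid: the overall centroid is the mean of the group centroids weighted by group size. The sketch below illustrates both points on a hypothetical stand-in data frame `D` (not the M from above, so the numbers differ), using base R only.

```r
# Hypothetical stand-in data with the same layout as M.
set.seed(1)
D <- data.frame(
  X1 = rnorm(50),
  X2 = rnorm(50),
  X3 = rnorm(50),
  group = factor(sample(c("Group_A", "Group_B", "Group_D", "Group_E"),
                        50, replace = TRUE))
)

# Group centroids via the formula interface of aggregate()
gc <- aggregate(cbind(X1, X2, X3) ~ group, data = D, FUN = mean)

# The same centroids via split() and colMeans(): one row per group
gc2 <- t(sapply(split(D[, 1:3], D$group), colMeans))

# Weight each group centroid by its group size and sum over the groups
n <- table(D$group)
weighted <- colSums(gc2 * as.numeric(n)) / sum(n)

# The weighted mean of the group centroids equals the overall centroid
all.equal(weighted, colMeans(D[, 1:3]))
```

The formula interface names the grouping column after the variable itself, which avoids the default Group.1 header that aggregate produces when by is an unnamed list.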

This article is from the free online course:

Managing Big Data with R and Hadoop

Partnership for Advanced Computing in Europe (PRACE)