
R code for numerical summaries

How do you use R code to create numerical summaries?

This step continues from the video Demonstration: Basics of RStudio for data summaries. In it, you will learn the R code for creating numerical summaries.

This step covers the sample mean, median and mode, the standard deviation and variance, sample percentiles, the summary table and box plots.

The R code blocks shown below are not interactive. You can copy and paste them into RStudio to execute them and see the results.

Download the zip folder ‘Using RStudio for graphical and numerical summaries’ (the R file version of the reading) from the ‘Downloads’ section below.

1. R code for numerical summaries

1.1 Sample mean

The R command to compute the sample mean is mean().

For instance, for the salary dataset (see Example 2 of the reading Graphical summaries of data) we get:

salary <- c(57:62,64,66,67,70) 
frequency <- c(4,1,3,5,8,10,5,2,3,1)
sal_list <- rep(salary, frequency) # replicate repetitions 
# Mean of the salary data: 
mean(sal_list) 
## 61.7619 
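Because the data are given as a frequency table, the same mean can also be obtained without expanding the data. The following is a minimal sketch of that idea; the variable names m_manual, m_weighted and m_expanded are illustrative and do not come from the reading.

```r
salary <- c(57:62, 64, 66, 67, 70)
frequency <- c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1)

# Mean computed directly from the frequency table:
# sum of (value * frequency) divided by the total number of observations
m_manual <- sum(salary * frequency) / sum(frequency)

# The same calculation using the built-in weighted mean
m_weighted <- weighted.mean(salary, frequency)

# Mean of the expanded data, for comparison
m_expanded <- mean(rep(salary, frequency))

print(c(m_manual, m_weighted, m_expanded))
```

All three values should agree, since each salary value is simply weighted by how often it occurs.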

1.2 Sample median

The R command to compute the sample median is  median() :

# Median of the salary data 
median(sal_list) 
## 61.5
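To see what median() does, the calculation can be sketched by hand: sort the data and take the middle value, or the average of the two middle values when the sample size is even. The names x, n and med below are illustrative only.

```r
sal_list <- rep(c(57:62, 64, 66, 67, 70), c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1))

x <- sort(sal_list)   # order the observations
n <- length(x)        # sample size (42 here)

if (n %% 2 == 0) {
  # even sample size: average the two middle observations
  med <- (x[n / 2] + x[n / 2 + 1]) / 2
} else {
  # odd sample size: take the single middle observation
  med <- x[(n + 1) / 2]
}
med
```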

1.3 Sample mode

R has no built-in function for the sample mode (the function mode() returns the storage type of an object instead), so it can be computed as follows:

salfr <- data.frame(salary, frequency) 
maxfr <- max(frequency) # maximum frequency 
# Mode of the salary data: 
salfr$salary[frequency==maxfr] 
## 62

The last command selects the values of the vector salary whose frequency equals maxfr; if several values share the maximum frequency, all of them are returned. In R, equality is tested with a double equals sign ==.
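When only the raw observations are available, rather than a frequency table, the frequencies can be tabulated first. This is a sketch of one possible approach using table(); the names counts and modes are illustrative.

```r
sal_list <- rep(c(57:62, 64, 66, 67, 70), c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1))

# Count how often each distinct value occurs
counts <- table(sal_list)

# Keep the value(s) with the highest count; table names are character
# strings, so convert them back to numbers
modes <- as.numeric(names(counts)[as.vector(counts) == max(counts)])
modes
```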

1.4 Standard deviation and variance

The R command to compute the sample standard deviation is  sd():

# Standard deviation of the salary data: 
s <- sd(sal_list) 
s
## 2.961621 
# Sample variance: 
s^2 
## 8.771196

In the last line, we calculated the sample variance as the square of the standard deviation.
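The sample variance can also be obtained with the built-in function var(), or from its definition with divisor n - 1. The sketch below compares the three routes; the names v_formula, v_builtin and s_builtin are illustrative.

```r
sal_list <- rep(c(57:62, 64, 66, 67, 70), c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1))
n <- length(sal_list)

# Sample variance from the definition: squared deviations from the mean,
# summed and divided by n - 1
v_formula <- sum((sal_list - mean(sal_list))^2) / (n - 1)

# Built-in sample variance and standard deviation
v_builtin <- var(sal_list)
s_builtin <- sd(sal_list)

print(c(v_formula, v_builtin, s_builtin^2))
```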

1.5 Sample percentiles

The R command to compute sample percentiles is  quantile():

# Sample quartiles of the salary data: 
quantile(sal_list, type=2)
## 0% 25% 50% 75% 100% 
## 57.0 60.0 61.5 64.0 70.0 
# Sample 30th and 60th percentiles of the salary data:
quantile(sal_list, c(0.30,0.60), type=2) 
## 30% 60% 
## 60 62 

Note that the percentiles are computed from sal_list, which holds all 42 observations; the vector salary contains only the distinct values. Here type=2 specifies that the average of two adjacent observations is used when the percentile position falls exactly between them; the default method (type=7) is based on linear interpolation:

quantile(sal_list) 
## 0% 25% 50% 75% 100% 
## 57.0 60.0 61.5 63.5 70.0
quantile(sal_list, probs=c(0.30,0.60)) 
## 30% 60% 
## 60.3 62.0 
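The linear interpolation behind the default type=7 can be sketched by hand. The helper q7() below is hypothetical (it is not part of R or of the reading) and is applied to the expanded sample sal_list: it locates the fractional position (n - 1)p + 1 in the sorted data and interpolates between the two neighbouring observations.

```r
# Hypothetical helper: type-7 sample quantile by linear interpolation
q7 <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * p + 1   # fractional position in the sorted data
  lo <- floor(h)         # index of the observation just below
  hi <- ceiling(h)       # index of the observation just above
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

sal_list <- rep(c(57:62, 64, 66, 67, 70), c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1))

# Compare the hand calculation with the built-in default
q7(sal_list, c(0.25, 0.50, 0.75))
quantile(sal_list, c(0.25, 0.50, 0.75), names = FALSE)
```

For the upper quartile, the position is 41 × 0.75 + 1 = 31.75, which lies between the 31st observation (62) and the 32nd (64), giving 62 + 0.75 × 2 = 63.5.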

1.6 Summary table

A useful command succinctly summarising the basic numerical metrics of the data is  summary():

# Summary of the salary data: 
summary(sal_list) 
## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 57.00 60.00 61.50 61.76 63.50 70.00 
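The result of summary() is a named vector, so individual entries can be extracted by name. The short sketch below assumes a reasonably recent version of R and uses the illustrative name sm for the stored summary; it also checks that the quartiles reported by summary() match the default quantile() method.

```r
sal_list <- rep(c(57:62, 64, 66, 67, 70), c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1))

sm <- summary(sal_list)   # store the summary instead of only printing it

sm[["Median"]]            # extract a single entry by its name
sm[["3rd Qu."]]

# The quartiles in summary() use the default (type=7) quantile method
quantile(sal_list, c(0.25, 0.75), names = FALSE)
```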

1.7 Box plots

A box plot provides a useful visual summary of the data, showing the minimum, lower quartile, median, upper quartile and maximum of the dataset.

The R command to produce a box plot is simply boxplot():

# Create box plot for the salary data (all 42 observations): 
boxplot(sal_list, horizontal=TRUE, frame=FALSE, ylim=c(56,70), xlab="Salary ($1000s)", main="Box plot of the salary data")
# Label positions are chosen by eye and may need adjusting
text(61.0, 0.7, "Lower\nquartile", pos=2, col="blue")
text(65.0, 0.7, "Upper\nquartile", pos=2, col="blue")
text(62.55, 1.26, "Median", pos=2, col="blue")
text(58.2, 1.15, "Minimum", pos=2, col="blue") 
text(70.8, 1.15, "Maximum", pos=2, col="blue") 

Box plot of salary data, illustrating median, minimum, maximum and quartiles using the R code provided in this reading.

The option  horizontal=TRUE  specifies the horizontal direction of the box plot (default  horizontal=FALSE  means vertical boxes).

Box plots are especially useful when comparing several samples.

Consider a dataset comprising the heights (in inches) of 262 female and 117 male students (see Example of the reading Numerical studies of data):

height <- read.csv("https://raw.githubusercontent.com/artofstat/data/master/Chapter2/heights.csv") 
attach(height) # to inform R of the column names
names(height)
## "HEIGHT" "GENDER"

Inspecting the dataset reveals one extremely large observation of 92 inches (234 cm), recorded for a female student:

max(HEIGHT) 
## 92 
GENDER[HEIGHT==92] 
## "Female" 

Exclude this observation from the analysis:

New <- subset(height, HEIGHT != 92)

Now, construct side-by-side box plots for male and female heights:

boxplot(New$HEIGHT~New$GENDER, frame=FALSE, horizontal=TRUE, ylim=c(55,80), 
xlab="Height (inches)", ylab=NULL, col=c("lightsalmon","lightblue"), 
outcex=1.3, outlwd=2, outcol=c("red","blue"), 
main="Box plots for males and females") 

Box plot illustrating two different data sets created using the R code in this reading.

The box plots reveal that both samples include unusually low or high heights (outliers), marked on the plots by small circles.

Recall that a value is classified as a potential outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile, where IQR is the interquartile range.
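The 1.5 × IQR rule can be sketched directly in R. The vector x below is a small made-up sample of heights used purely for illustration (it is not the course dataset), and the names q1, q3, lower, upper and outliers are likewise illustrative. Note that boxplot() computes its quartiles in a slightly different way from quantile(), so the points it flags may occasionally differ from this calculation.

```r
# Hypothetical sample of heights (inches), for illustration only
x <- c(60, 62, 63, 64, 64, 65, 66, 67, 68, 92)

q1 <- quantile(x, 0.25, names = FALSE)   # first quartile
q3 <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- IQR(x)                            # interquartile range, q3 - q1

# Fences: values outside [lower, upper] are potential outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

outliers <- x[x < lower | x > upper]
outliers
```

In this made-up sample the fences are 58 and 72, so only the value 92 is flagged.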

This article is from the free online

Statistical Methods

Created by
FutureLearn - Learning For Life
