R code for graphical summaries

How to use R code to create graphical summaries.

This step continues from the content of the video Demonstration: Basics of RStudio for data summaries. In it, you will learn the R code for creating graphical summaries.

This section covers bar plots, Pareto charts, pie charts, dot plots, cumulative plots, histograms and normal approximation.

The R code blocks shown below are not interactive. You can copy and paste them into RStudio to run them and see the results.

Download the zip folder ‘Using RStudio for graphical and numerical summaries’ (the R file version of the reading) from the ‘Downloads’ section of the previous step.

First, consider visualisation tools for categorical data, taking, as a primary example, the women’s clothing size dataset.

1 Bar plots

The R command to produce a bar plot is just barplot():
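
A minimal sketch of such a call, assuming a data frame f with columns size and freq (the names used in the later code blocks). The frequencies below are hypothetical placeholders for illustration, not the course data:

```r
# Hypothetical stand-in for the women's clothing size data:
# 11 sizes (6, 8, ..., 26); the frequencies are made up for illustration.
f <- data.frame(size = seq(6, 26, by = 2),
                freq = c(1, 4, 8, 20, 24, 18, 9, 5, 3, 2, 1))

# Bar plot of frequency against size; barplot() returns the bar midpoints
mids <- barplot(f$freq, names.arg = f$size, col = "blue",
                xlab = "Size", ylab = "Frequency",
                main = "Bar plot of women's clothing size data")
```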

Bar plot illustrating the frequency of women's clothing size created from the R code provided in this reading.

2 Pareto charts

Pareto charts are bar plots with bars arranged in decreasing order of height.

The following R code (based on the order() command) can be used to produce a Pareto chart:

# Create Pareto chart for women's clothing size data
# First, define colours of the size bars:
colour <- c(rainbow(11))
# Create a new data frame to include colours:
fdf <- data.frame(f, colour)
# Re-order in decreasing frequencies
freqed <- order(fdf$freq, decreasing=TRUE)
# Now plot Pareto chart:
barplot(fdf$freq[freqed], col=fdf$colour[freqed], names.arg=f$size[freqed], cex.names=0.95,
main="Pareto chart of women's clothing size data")

Pareto chart showing the women's clothing size data set created using the R code provided in this reading.
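
To see what order() contributes here, the following small sketch applies it to a made-up vector of three frequencies (not the course data). It returns the positions of the elements from largest to smallest, and indexing with those positions sorts the vector:

```r
# Small made-up frequency vector to show what order() returns
freq <- c(3, 10, 5)

# Positions of the elements, from largest to smallest
idx <- order(freq, decreasing = TRUE)
idx          # 2 3 1

# Indexing by idx re-arranges the values in decreasing order
freq[idx]    # 10 5 3
```

In the Pareto chart code, the same index vector is applied to the frequencies, the colours and the size labels, so each bar keeps its own colour and label after re-ordering.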

A simpler version uses the dplyr package.

To install dplyr, type:
install.packages("tidyverse") # installs the whole tidyverse, which includes dplyr

# Alternatively, install just dplyr:
install.packages("dplyr")

To find out more, see https://dplyr.tidyverse.org/, which has further instructions.

library(dplyr) # load package "dplyr"
fdfed <- arrange(fdf, desc(freq)) # re-order in descending frequencies
# Now plot Pareto chart:
barplot(fdfed$freq, col=c(fdfed$colour), names.arg=fdfed$size, cex.names=0.95,
main="Pareto chart of women's clothing size data")

Try it and you’ll see that this code produces the same Pareto chart as before!

3 Pie charts

The R command to produce a pie chart is simply  pie():

# Create pie chart for the women's clothing size data: 
pie(f$freq, labels=f$size, col=rainbow(11), radius=1.0, 
main="Pie chart of women's clothing size data")

Pie chart illustrating the same women's clothing size data as the bar chart above created using the R code provided.

The pie chart shows proportions; e.g., we can see that sizes 12, 14 and 16, shown in shades of green, account for more than half of this sample.
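
The proportions behind a pie chart can also be computed directly. The sketch below assumes a data frame f with columns size and freq, and uses hypothetical frequencies (not the course data), so the resulting share is only illustrative:

```r
# Hypothetical frequencies (not the course data), as a stand-in for f
f <- data.frame(size = seq(6, 26, by = 2),
                freq = c(1, 4, 8, 20, 24, 18, 9, 5, 3, 2, 1))

# Proportion of the sample in each size
prop <- f$freq / sum(f$freq)

# Combined share of sizes 12, 14 and 16
share <- sum(prop[f$size %in% c(12, 14, 16)])
round(share, 3)
```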

4 Dot plots

Dot plots are produced using the R command stripchart():

# Create dot plot for the women's clothing size data.
# f1 is assumed to hold the raw list of sizes, one entry per observation.
stripchart(f1, col="blue", method = "stack", pch=16, ylim=c(1,30), xlim=c(6,26), 
offset=0.5, main = "Dot plot of women's clothing size data",
xlab="Size", ylab="Frequency", axes=FALSE) 
# Add axes ticks and labels 
axis(1, at=2*c(3:13)) # to add new labels 
axis(2, at=5*c(0:7))

Dot plot representing the women's clothing size data set created using the R code provided in this reading.

Let us now consider R coding to produce graphical summaries for quantitative data.

Some of the previous tools are applicable here as well. For instance, for the salary dataset, we can construct the following graphical summaries:

5 Bar plot:

# Recall the data set:
salary <- c(57:62,64,66,67,70)
frequency <- c(4,1,3,5,8,10,5,2,3,1)
# Bar plot of the salary data:
barplot(frequency, names.arg=salary, col="blue", xlab="Salary ($1000s)", ylab="Frequency", main="Bar plot of the salary data")

6 Dot plot:

sal_list <- rep(salary, frequency)
# Create dot plot of the salary data: 
stripchart(sal_list, method="stack", pch=16, offset=1.45, xlim=c(56,70), ylim=c(1,11), col="blue", xlab="Salary ($1000s)", ylab="Frequency", main="Dot plot of the salary data") 
axis(2, at=2*c(0:5)) 

7 Cumulative frequency plots

Cumulative frequency plots (also called ogives) are produced using the command  cumsum() to calculate partial sums of frequencies, and the plotting commands  plot()  and  points():

# Create cumulative data: 
cumfreq <- cumsum(frequency) 
# Create cumulative frequency plot:
plot(salary, cumfreq, type="l", col="blue", xlim=c(56,70), xlab="Salary ($1000s)",
ylab="Frequency", main="Cumulative frequency plot of the salary data")
points(salary,cumfreq, pch=19, col ="blue")
# Drawing dashed lines:
f62 <- cumfreq[salary==62]
segments(x0=0, x1=62, y0=f62, y1=f62, lty=2, lwd=2, col="red")
segments(x0=62, x1=62, y0=0, y1=f62, lty=2, lwd=2, col="red") 

Cumulative frequency plot of the salary data set created using the R code provided in this reading.
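
As a quick check of what is being plotted, this short sketch prints the partial sums for the salary data and the value picked out by the dashed lines:

```r
salary <- c(57:62, 64, 66, 67, 70)
frequency <- c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1)

# Running totals of the frequencies
cumfreq <- cumsum(frequency)
cumfreq                      # 4 5 8 13 21 31 36 38 41 42

# Number of salaries of at most $62,000 (height of the horizontal dashed line)
f62 <- cumfreq[salary == 62]
f62                          # 31
```

The last cumulative value, 42, is the sample size.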

A similar plot for the relative frequencies (called a cumulative density plot) is produced as follows:

cumfreq <- cumsum(frequency)
total <- sum(frequency)
# Create cumulative density plot:
plot(salary, cumfreq/total, type="l", col="blue", xlim=c(56,70),
xlab="Salary ($1000s)", ylab="Density", 
main="Cumulative density plot of the salary data")
points(salary,cumfreq/total, pch=19, col ="blue") 
# Drawing dashed lines: 
segments(x0=0, x1=62, y0=f62/total, y1=f62/total, lty=2, lwd=2, col="red")
segments(x0=62, x1=62, y0=0, y1=f62/total, lty=2, lwd=2, col="red")

8 Histograms

To plot a histogram of a quantitative dataset, we use the R command  hist():

# Create histogram of the salary data:
hist(sal_list, xlab="Salary ($1000s)", ylab="Frequency", col="grey",
main="Histogram of the salary data")

Histogram of the salary data set created using the R code provided in this reading.

In this example, the bins have width 2 and run from 56 to 70.

By default, each bin includes its right boundary but not its left one, e.g., (56,58], (58,60], and so on. This can be reversed if desired by using the option right=FALSE:

hist(sal_list, col="grey", xlim=c(56,70), ylim=c(0,15), xlab="Salary ($1000s)", 
ylab="Frequency", right=FALSE, 
main="Histogram of the salary data with redefined bins") 

Histogram showing the salary data set with redefined bins created using the R code.
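
The effect of right=FALSE on the bin counts can be checked without drawing anything. This sketch sets the breaks explicitly to 56, 58, ..., 70 (the bins described above) and compares the two conventions:

```r
salary <- c(57:62, 64, 66, 67, 70)
frequency <- c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1)
sal_list <- rep(salary, frequency)
brk <- seq(56, 70, by = 2)

# plot=FALSE returns the histogram object without drawing it
right_closed <- hist(sal_list, breaks = brk, plot = FALSE)$counts
left_closed  <- hist(sal_list, breaks = brk, right = FALSE, plot = FALSE)$counts

right_closed   # bins (56,58], (58,60], ...: 5 8 18 5 2 3 1
left_closed    # bins [56,58), [58,60), ...: 4 4 13 10 5 5 1
```

For example, the ten salaries equal to 62 fall into (60,62] under the default convention but into [62,64) with right=FALSE.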

If a more granular representation is needed, the number of breaks separating the bins and their locations can be specified explicitly, for example:

# Plot histogram with more breaks: 
hist(sal_list, col="grey", breaks=10, xlim=c(56,70), xlab="Salary ($1000s)", 
ylab="Frequency", main="Refined histogram of the salary data") 

Refined histogram illustrating the salary data set created using the R code.

This histogram shows clearly the values observed (or not observed) in the sample.

On the other hand, it is a bit more fragmented, whereas the previous histogram (with fewer bins) seems to capture better the overall shape of the salary distribution.

Histograms may also be plotted on a density scale, where the height of each bar is the relative frequency divided by the bin width, so that the bar areas sum to 1; this is done by using the option freq=FALSE:

# Plot histogram with relative frequencies (density scale):
hist(sal_list, freq=FALSE, col="grey", # breaks=10,
xlab="Salary ($1000s)", ylab="Density", xlim=c(56,70), ylim=c(0,0.24),
main="Density histogram of the salary data")

Density histogram of the salary data set created using the R code.

The shape of the density histogram is exactly the same as before because the bins have equal width; only the scale of the y-axis has changed.
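
The relationship between counts and density values can be verified on the histogram object itself. A minimal sketch, with the breaks set explicitly to 56, 58, ..., 70:

```r
salary <- c(57:62, 64, 66, 67, 70)
frequency <- c(4, 1, 3, 5, 8, 10, 5, 2, 3, 1)
sal_list <- rep(salary, frequency)

h <- hist(sal_list, breaks = seq(56, 70, by = 2), plot = FALSE)

# Density = relative frequency / bin width
rel_freq <- h$counts / sum(h$counts)
width <- diff(h$breaks)
all.equal(h$density, rel_freq / width)   # TRUE

# Bar areas (density x width) add up to 1
sum(h$density * width)
```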

9 Normal approximation

Normal approximation of nearly symmetric, bell-shaped data is based on a theoretical curve, referred to as the ‘normal probability density function’ or just the ‘normal density’.

It has two parameters, mean and sd, and can be added on top of a histogram plot using the commands lines(x<-seq(), y=...) and dnorm(x,mean,sd).

To illustrate, recall the data about student heights already used previously. Let us focus on heights of females (again, excluding an outlier of 92):

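height <- read.csv("https://raw.githubusercontent.com/artofstat/data/master/Chapter2/heights.csv")
# Keep the female heights only, dropping the outlier of 92, as a numeric vector
# (column names HEIGHT and GENDER as used in this reading)
NewF <- subset(height, GENDER=="Female" & HEIGHT != 92)$HEIGHT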

This is the histogram of the data, together with a normal approximation (already plotted in ‘2.5 Shape of data distribution’):

hist(NewF, freq=FALSE, xlab="Height (inches)", 
main="Histogram of female student height data", breaks=17, xlim=c(55.5,79.1)) 
lines(x<-seq(55,80,0.01), y=dnorm(x, mean(NewF), sd(NewF)), col="red", lwd=2)
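
The same overlay can be tried without downloading any data. The sketch below uses simulated values as a stand-in (the sample size, mean and standard deviation are made-up assumptions, not the course data) and also checks that dnorm() agrees with the normal density formula:

```r
# Simulated stand-in for the height data (made-up parameters, not the course data)
set.seed(1)
x_sim <- rnorm(200, mean = 65, sd = 3)

# Density histogram with the fitted normal curve on top
hist(x_sim, freq = FALSE, xlab = "Simulated height (inches)",
     main = "Histogram with normal approximation")
xs <- seq(min(x_sim), max(x_sim), length.out = 200)
lines(xs, dnorm(xs, mean = mean(x_sim), sd = sd(x_sim)), col = "red", lwd = 2)

# dnorm() evaluates the normal density formula directly
m <- 65; s <- 3; x0 <- 68
manual <- exp(-(x0 - m)^2 / (2 * s^2)) / (s * sqrt(2 * pi))
all.equal(dnorm(x0, mean = m, sd = s), manual)   # TRUE
```

The curve is drawn on the density scale, which is why the histogram must be plotted with freq=FALSE for the two to be comparable.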

Next steps

This step gave you a detailed explanation of the R commands you need to create data summaries in RStudio. This completes Activity 2 in Week 2. When you are ready, move on to Activity 3 and read the overview to understand the activities you will complete.

This article is from the free online

Statistical Methods

Created by
FutureLearn - Learning For Life
