
Data visualization using ggplot2

grammar of graphics, ggplot2 basic principles for visualisation

Data can be explored at many different levels, and visualization is an essential piece of this puzzle.

This is because it:

  • helps interpretation of even complex high-throughput data
  • provides an easy-to-understand display and message
  • presents an intuitive means of data interpretation and decision-making for next steps
  • allows you to visually summarize your message to the audience when it comes to publishing

How to do it?

The good news is that there are plenty of creative and graphically powerful ways of doing that automatically. R is an open-source language that is among the most popular tools in the field of data visualization, and RStudio provides an integrated environment for working with it. R offers an incredible number of diverse packages, including those for data visualization, among many others. Just be creative!

Why ggplot2?

In the ever-expanding world of data visualization and plotting packages, ggplot2 still occupies a primary position. This is because of its ease of use and its versatility in producing a wide range of plots. It is based on the “Grammar of Graphics” principle of making plots, from which the “gg” in “ggplot2” comes. The power of this structure lies in the fact that layers of plotting information are added independently and can be combined flexibly to generate the final plot. A plot is a combination of layered components, which can be: the data itself, aesthetics, geometries, facets, statistics, coordinates, and/or themes.
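As a rough illustration of how these layered components can map onto code, here is a minimal sketch. The data frame `toy` and its columns `group`, `x` and `y` are invented for this example and are not part of the course data; the ggplot2 functions are standard ones, but the particular combination is only one of many possibilities.

```r
library(ggplot2)

# Toy data frame invented for this sketch (not the course data)
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x     = rep(1:5, times = 2),
  y     = c(2, 4, 5, 7, 9, 1, 3, 4, 4, 6)
)

# Each component is added as its own layer with the "+" operator
p <- ggplot(data = toy, aes(x = x, y = y)) +                 # data and aesthetics
  geom_point() +                                             # geometry
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistic
  facet_wrap(~ group) +                                      # facets
  coord_cartesian(ylim = c(0, 10)) +                         # coordinates
  theme_minimal()                                            # theme

# Draw the plot
print(p)
```

Removing any single line after the first still leaves a valid plot, which is what makes the layered approach so flexible.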

For an introduction to the principle of “Grammar of Graphics” and a detailed explanation of how “ggplot2” uses each component, please refer to the “Bioinformatics for Biologists” introductory course.

It’s important to remember that although “ggplot2” offers many different plots as output, each plotting option (aesthetics, type of plot, etc) might heavily depend on the type of data you want to plot (entire data set or specific columns), and the presence of either continuous or discrete variables in the data.

  • Continuous variables: used for measured values (height, weight, volume, etc.). Can take any numeric value within an interval, including fractional (decimal) values.
  • Discrete variables: used for counts or categories. Take only a limited set of distinct values, such as integers or labels (for example, sample names).
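Before plotting, it can help to check how R stores each column, since this influences how “ggplot2” treats it. The sketch below uses a small hypothetical stand-in for the `var_tb` table: the column names `SAMPLE` and `DP` follow the text, but the values and the extra column `SAMPLE_F` are invented for illustration.

```r
# Hypothetical stand-in for the course's var_tb table: the column names
# SAMPLE and DP follow the text, but the values are invented.
var_tb <- data.frame(
  SAMPLE = c("S1", "S1", "S2", "S2"),
  DP     = c(10, 12, 8, 15),
  stringsAsFactors = FALSE
)

# Numeric columns are handled as continuous variables
is.numeric(var_tb$DP)

# Character columns (and factors) are handled as discrete variables
is.character(var_tb$SAMPLE)

# A factor makes the discrete nature explicit and fixes the order of the groups
var_tb$SAMPLE_F <- factor(var_tb$SAMPLE, levels = c("S1", "S2"))
levels(var_tb$SAMPLE_F)

# Overview of all column types
str(var_tb)
```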

It is likely that you’ll have to deal with both of these types of data when analyzing and plotting. Both are heavily used in statistics and appear in many kinds of data, including genomic variants!

Basic ggplot2 requirements for plotting

When it comes to genomics, “ggplot2” is also of great support for publication-ready figures. Let’s follow the principles of layers that are used to build a first plot, iteratively, by defining the data, the variables to plot, and then the type of plot itself, step by step.

1. Define the data

We first need to relate “ggplot2” to a specific data frame using the data argument.

# Link ggplot2 to a specific data frame
> ggplot(data = var_tb)

Nothing will appear at this stage except a default gray background in the “Plots” tab. Remember that we need layers of information, and we have not yet chosen the aesthetics or plotting options.
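To see why the panel stays empty, one can store the plot in a variable and inspect it. This is only a sketch: the `var_tb` below is a hypothetical stand-in with invented values, and looking at the `layers` element of the plot object is an informal check rather than something the course itself does.

```r
library(ggplot2)

# Hypothetical stand-in for var_tb (values invented for illustration)
var_tb <- data.frame(
  SAMPLE = rep(c("S1", "S2", "S3"), each = 4),
  DP     = c(4, 7, 10, 12, 5, 8, 9, 14, 3, 6, 11, 13)
)

# Only the data is defined: no aesthetics and no geometry yet
p_base <- ggplot(data = var_tb)

# The plot object holds no layers, so there is nothing to draw
length(p_base$layers)  # 0

# Printing shows just the gray background
print(p_base)
```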

2. Define the aesthetics

Now that we have set the data frame, we have to decide which variables to plot. Mappings are defined with the aes() function. This function mainly allows you to select the variables to plot on the x-axis and y-axis, but it can also control plotting characteristics such as colors or shapes (which can also be defined later).

# Link ggplot2 to specific variables using aesthetics
> ggplot(data = var_tb, aes(x=SAMPLE, y=DP))

At this stage, the plot starts to show the x-axis and y-axis for the variables you called, but of course without plotting the data itself.

screenshot of empty plot
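The aes() call can carry more than the two axes. The following sketch additionally maps colour to the sample, again on a hypothetical stand-in for `var_tb` with invented values; inspecting `names()` of the stored mapping is just an informal way to see what has been declared so far.

```r
library(ggplot2)

# Hypothetical stand-in for var_tb (values invented for illustration)
var_tb <- data.frame(
  SAMPLE = rep(c("S1", "S2", "S3"), each = 4),
  DP     = c(4, 7, 10, 12, 5, 8, 9, 14, 3, 6, 11, 13)
)

# Axes only, as in the step above
p_aes <- ggplot(data = var_tb, aes(x = SAMPLE, y = DP))
names(p_aes$mapping)

# Axes plus colour mapped to the sample; this only takes visible effect
# once a geometry layer is added
p_col <- ggplot(data = var_tb, aes(x = SAMPLE, y = DP, colour = SAMPLE))
names(p_col$mapping)
```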

3. Define the plot type

Now that we have set the data frame and the variables to plot, we have to decide which type of plot should display the data. Do you want points? Histograms? Bars? Boxplots? Lines? This all depends on the message you want to convey, as was the case for the data types, but the good news is that “ggplot2” offers them all.

Here we have chosen to start with just one continuous variable on the y-axis (DP), so either points or a boxplot, for example, can be used. Note that these are geometries, each of which acts as a new layer and is added with the “+” operator.

# Points (left-hand plot)
> ggplot(data = var_tb, aes(x=SAMPLE, y=DP)) + geom_point()

# Boxplot (right-hand plot)
> ggplot(data = var_tb, aes(x=SAMPLE, y=DP)) + geom_boxplot()

screenshot of two plots one with points the other one with boxplots
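Because each geometry is a separate layer, the shared part of the plot can be stored once and reused, and several geometries can be stacked on the same plot. The sketch below assumes a hypothetical stand-in for `var_tb` with invented values; the variable names `p`, `p_points`, `p_box` and `p_both` are chosen only for this example.

```r
library(ggplot2)

# Hypothetical stand-in for var_tb (values invented for illustration)
var_tb <- data.frame(
  SAMPLE = rep(c("S1", "S2", "S3"), each = 4),
  DP     = c(4, 7, 10, 12, 5, 8, 9, 14, 3, 6, 11, 13)
)

# Shared base: data and aesthetics
p <- ggplot(data = var_tb, aes(x = SAMPLE, y = DP))

# One geometry per plot, as in the two commands above
p_points <- p + geom_point()
p_box    <- p + geom_boxplot()

# Two geometries on one plot: boxplots with the raw points drawn on top
p_both <- p + geom_boxplot() + geom_point()

print(p_both)
```

Layers are drawn in the order they are added, so the points appear above the boxes here.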

As you can see, these are the minimal requirements for building a plot. But there are many more functions in this package to improve the rendering!

© Wellcome Connecting Science
This article is from the free online course “Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets”.

Created by
FutureLearn - Learning For Life
