Querying and Manipulating Data from Files Using Dedicated Packages in R

© Wellcome Genome Campus Advanced Courses and Scientific Conferences

Querying and manipulating data from existing files

Introduction

Querying and manipulating data from files may require more advanced options than the basic R functions offer. In R, packages have been developed for specific purposes such as data manipulation, data analysis, or plotting. A package must first be installed and then loaded into R before its functions can be used.

In this step, we will see how to query and manipulate data from an existing file using a dedicated package called dplyr. Some of its functions make complex queries and manipulations of your data easier than the basic R functions do.

We will keep using the iris data set, which you have already used.

We encourage you to find more information on reading, querying and manipulating data frames from files at the following links: https://cran.r-project.org/doc/manuals/r-release/R-intro.html and https://rpubs.com/moeransm/intro-iris

Import and read data from files in R

Step 1. We recommend that you work in the same working sub-directory that you created previously, using one of the following options

Before launching R

$ cd exerciseR
$ pwd
/Users/imac/Desktop/exerciseR
$ R

After launching R

> setwd("/Users/imac/Desktop/exerciseR")
> getwd()
[1] "/Users/imac/Desktop/exerciseR"

Step 2. You should now be able to load the data set we want you to work on, the iris data set.

To read the file from your computer

> Iris <- read.table("iris.txt")
> head(Iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
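
If iris.txt is not yet in your working directory, the sketch below shows one way such a file could be produced and read back. It assumes the file was written with write.table() using its default settings, which may differ from how your own copy was created. The temporary file path and the names tmp.file and Iris.check are illustrative only.

```r
# Write the built-in iris data to a temporary text file (assumed format:
# write.table() defaults, i.e. a header line plus row names).
tmp.file <- tempfile(fileext = ".txt")
write.table(iris, tmp.file)

# read.table() detects the header automatically here, because the first
# line has one field fewer than the data lines (which start with a row name).
Iris.check <- read.table(tmp.file)

dim(Iris.check)     # number of rows and columns
names(Iris.check)   # column names recovered from the header
```

If your file has no row names or uses another separator, the "header" and "sep" arguments of read.table() would need to be set explicitly.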

To load the data set from the data sets available in R

> library(datasets)
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Using the dplyr package

Step 1. To use the dplyr package on the iris data set, we first need to install and load the package.

Because dplyr is not part of base R, we need to install it first using the “install.packages()” function

> install.packages("dplyr")
--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors

1: 0-Cloud [https]
2: Australia (Canberra) [https]
3: Australia (Melbourne 1) [https]
……
76: USA (TX 1) [https]
77: Uruguay [https]
78: (other mirrors)

Once you have selected a CRAN mirror (preferably the one closest to your location), as we did before, the installation will proceed. Once it has finished, you will be able to load the package using the “library()” function

> library(dplyr)
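
The interactive mirror menu can be avoided by naming the mirror in the call. The sketch below installs dplyr only when it is missing; the choice of the cloud mirror (listed as "0-Cloud" in the menu above) is an assumption, and any other CRAN mirror URL could be used instead.

```r
# Install dplyr only if it is not already available, naming the mirror
# explicitly so that no interactive menu is shown.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}

# Load the package and report which version is in use.
library(dplyr)
packageVersion("dplyr")
```

Installation is needed only once per R installation, whereas library(dplyr) must be run in every new R session.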

Note 1. A package comes with specific functions that would not otherwise be recognized in R. Examples of basic verbs for data manipulation available with the dplyr package are “filter()”, “select()”, “mutate()”, “arrange()”, “rename()”, “relocate()”, “slice()”, “summarise()”. We will see how to use the first 3 verbs, but if you want information on other dplyr functions, or more advanced options, please refer to https://dplyr.tidyverse.org/articles/dplyr.html

Note 2. For the dplyr verbs listed above, the first argument is always the data frame (or a tibble, the tidyverse variant of a data frame).
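
Because the data frame always comes first, dplyr calls can also be written with the pipe operator %>%, which becomes available once dplyr is loaded. The following is a minimal sketch comparing the two styles; the names direct and piped are illustrative and not part of the course material.

```r
library(dplyr)

# Standard call: the data frame is the first argument.
direct <- filter(iris, Species == "setosa")

# Piped call: iris is passed on as the first argument of filter().
piped <- iris %>% filter(Species == "setosa")

# Both styles are expected to give the same result.
identical(direct, piped)
```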

Step 2. Let’s see how you can now use the “filter()” function to keep specific rows from this data set, much as we did with the “subset()” function in base R. Let’s again keep only the data related to the Species “setosa”

Filtering the data after installing and loading the package

> setosa.filt <- filter(iris, Species == "setosa")
> head(setosa.filt)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

To check that the new variable setosa.filt contains only data related to the setosa species, as was also the case for the variable setosa generated with the “subset()” function in base R (previous Step 9), compare the number of rows in both

> nrow(setosa)
[1] 50
> nrow(setosa.filt)
[1] 50
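
Equal row counts are a good sign, but they do not show which species the rows belong to. The sketch below looks at the Species column directly. It recreates the setosa object with a subset() call that is assumed to match the one used in the earlier step.

```r
library(dplyr)

# Recreate both objects (the subset() call is an assumption about the
# earlier step; adapt it if yours was different).
setosa      <- subset(iris, Species == "setosa")
setosa.filt <- filter(iris, Species == "setosa")

# Count the rows per species in the filtered data; only setosa
# should have a non-zero count.
table(setosa.filt$Species)

# Confirm that every remaining row is labelled setosa.
all(setosa.filt$Species == "setosa")
```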

To filter on multiple conditions: here, setosa flowers with a Petal.Length smaller than 2, and then setosa flowers with a Petal.Length greater than 2

> setosa.filt.pl2 <- filter(iris, Species == "setosa", Petal.Length < 2)
> head(setosa.filt.pl2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(setosa.filt.pl2)
[1] 50
> setosa.filt.pl2 <- filter(iris, Species == "setosa", Petal.Length > 2)
> nrow(setosa.filt.pl2)
[1] 0

Note. Because both results are assigned to the same variable name (setosa.filt.pl2), the second assignment (> 2) replaces the first (< 2).
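
To keep both results, each can be given its own name. The sketch below also illustrates that conditions separated by commas are combined with a logical AND, and that "|" can be used for OR. All variable names here are illustrative and do not come from the course.

```r
library(dplyr)

# Distinct names keep both results available.
setosa.pl.short <- filter(iris, Species == "setosa", Petal.Length < 2)
setosa.pl.long  <- filter(iris, Species == "setosa", Petal.Length > 2)

# Conditions separated by commas are combined with AND,
# so writing "&" explicitly is equivalent.
setosa.pl.and <- filter(iris, Species == "setosa" & Petal.Length < 2)

# "|" keeps rows that satisfy at least one of the conditions.
setosa.or.short <- filter(iris, Species == "setosa" | Petal.Length < 2)

nrow(setosa.pl.short)
nrow(setosa.pl.long)
identical(setosa.pl.short, setosa.pl.and)
nrow(setosa.or.short)
```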

Step 3. To select specific columns, the “select()” function can be very helpful

To select specific columns that need not be adjacent

> Iris.select <- select(iris, Sepal.Length, Petal.Length)
> head(Iris.select)
  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7

To select a group of consecutive columns

> Iris.select.2 <- select(iris, Sepal.Length:Petal.Width)
> head(Iris.select.2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4
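
Columns can also be selected by exclusion or by a pattern in their names. The following is a minimal sketch using a minus sign and the starts_with() helper, both available once dplyr is loaded; the names Iris.no.species and Iris.petal are illustrative.

```r
library(dplyr)

# Drop a single column by prefixing its name with a minus sign.
Iris.no.species <- select(iris, -Species)

# Keep every column whose name starts with "Petal".
Iris.petal <- select(iris, starts_with("Petal"))

names(Iris.no.species)
names(Iris.petal)
```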

Step 4. Now imagine you would like to add new columns to an existing data frame. In the following example, the “mutate()” function is used to add a new column called Test, which records whether Sepal.Length is greater than twice Petal.Length (TRUE) or not (FALSE)

> test.col <- mutate(iris, Test = Sepal.Length > 2 * Petal.Length)
> head(test.col)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test
1          5.1         3.5          1.4         0.2  setosa TRUE
2          4.9         3.0          1.4         0.2  setosa TRUE
3          4.7         3.2          1.3         0.2  setosa TRUE
4          4.6         3.1          1.5         0.2  setosa TRUE
5          5.0         3.6          1.4         0.2  setosa TRUE
6          5.4         3.9          1.7         0.4  setosa TRUE
> tail(test.col)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  Test
145          6.7         3.3          5.7         2.5 virginica FALSE
146          6.7         3.0          5.2         2.3 virginica FALSE
147          6.3         2.5          5.0         1.9 virginica FALSE
148          6.5         3.0          5.2         2.0 virginica FALSE
149          6.2         3.4          5.4         2.3 virginica FALSE
150          5.9         3.0          5.1         1.8 virginica FALSE
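
The new column does not have to be logical. As a further sketch, mutate() can add several numeric columns in one call, and a later column may use one created earlier in the same call. The column names Sepal.Ratio and Sepal.Ratio.Rounded, and the object name iris.ratio, are made up for illustration.

```r
library(dplyr)

# Add two numeric columns in a single call; the second column
# is computed from the first one.
iris.ratio <- mutate(iris,
                     Sepal.Ratio = Sepal.Length / Sepal.Width,
                     Sepal.Ratio.Rounded = round(Sepal.Ratio, 1))

head(iris.ratio)
```

Note that mutate() returns a new data frame and leaves iris itself unchanged, which is why the result is assigned to a new name.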

Exercise

Question 1. Using this last example, how would you count the number of TRUE items in the newly created Test column?

Please try to answer the questions yourself first and then compare the results with other learners. Finally, you can find solutions in the download area.
This article is from the free online course Bioinformatics for Biologists: An Introduction to Linux, Bash Scripting, and R
