Querying and Manipulating Data from Files Using Dedicated Packages in R

© Wellcome Genome Campus Advanced Courses and Scientific Conferences

Querying and manipulating data from existing files

Introduction

Querying and manipulating data from files sometimes requires more than the basic R functions offer. In R, packages have been developed for specific purposes such as data manipulation, data analysis, or plotting. Each package comes with its own set of functions, and a package must first be installed and then loaded into R before those functions can be used.

Let’s see together in this step how to query and manipulate data from an existing file using a dedicated package called dplyr. We will see how some of its functions can be helpful for more complex queries and manipulation of your data than basic R functions.

We will keep using the iris dataset, which you have already used.

We encourage you to find more information on reading, querying and manipulating data frames from files in the following links: https://cran.r-project.org/doc/manuals/r-release/R-intro.html and https://rpubs.com/moeransm/intro-iris

Import and read data from files in R

Step 1. We recommend that you work in the same working sub-directory that you created previously. You can move into it using one of the following options

Before launching R

$ cd exerciseR
$ pwd
/Users/imac/Desktop/exerciseR
$ R

After launching R

> setwd("/Users/imac/Desktop/exerciseR")
> getwd()
[1] "/Users/imac/Desktop/exerciseR"
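
Before reading the file, it can help to confirm that R really is in the right directory and that iris.txt is visible from there. The lines below are only a sketch: the path is the one from the example above and will need adjusting for your own computer, and the variable names wd and iris.found are made up for illustration.

```r
# Path taken from the example above; change it to match your own computer.
wd <- "/Users/imac/Desktop/exerciseR"

# Only move into the directory if it actually exists.
if (dir.exists(wd)) {
  setwd(wd)
} else {
  message("Directory not found, staying in: ", getwd())
}

# TRUE if iris.txt is in the current working directory, FALSE otherwise.
iris.found <- file.exists("iris.txt")
iris.found

# List the text files that R can see from here.
list.files(pattern = "\\.txt$")
```

If iris.found is FALSE, the “read.table()” call in the next step will fail with a "cannot open file" error, so it is worth fixing the working directory first.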

Step 2. You should now be able to load again the data set we want you to work on, the iris data set.

To read the file from your computer

> Iris <- read.table("iris.txt")
> head(Iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

To load the same data from the data sets that ship with R

> library(datasets)
> data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
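
The two routes above should give you the same table. The sketch below illustrates this with a round trip through a temporary file. It assumes that iris.txt was created with “write.table()” and its default settings, which this step does not state; tmp.file and Iris.tmp are hypothetical names chosen so that your own file and variables are left untouched.

```r
library(datasets)

# Hypothetical temporary file, used so that your own iris.txt is not overwritten.
tmp.file <- tempfile(fileext = ".txt")

# Write the built-in data set with the default settings (header and row names).
write.table(iris, tmp.file)

# read.table() detects the header on its own because the first line has one
# field fewer than the other lines; stringsAsFactors = TRUE reads Species
# as a factor, as in the built-in data set.
Iris.tmp <- read.table(tmp.file, stringsAsFactors = TRUE)

dim(Iris.tmp)      # number of rows and columns
names(Iris.tmp)    # column names
str(Iris.tmp)      # type of each column

# Remove the temporary file.
unlink(tmp.file)
```

Comparing “dim()” and “names()” with those of the built-in iris is a quick way to check that a file was read the way you intended.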

Using the dplyr package

Step 1. To use the dplyr package on the iris data set, we will need to load the package.

Because dplyr is not part of the base R installation, we first need to install it using the “install.packages()” function

> install.packages("dplyr")
--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors

1: 0-Cloud [https]
2: Australia (Canberra) [https]
3: Australia (Melbourne 1) [https]
……
76: USA (TX 1) [https]
77: Uruguay [https]
78: (other mirrors)

Once you select a CRAN mirror (ideally the one closest to your location), as we did before, the installation will proceed. Once it has finished, you can load the package using the “library()” function

> library(dplyr)
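
Installation is only needed once per computer, whereas “library()” has to be called in every new R session. If you keep your commands in a script, a pattern like the following avoids reinstalling the package each time. It is a sketch rather than a required step; the repos value is the CRAN cloud mirror, given here so that R does not stop to ask for a mirror.

```r
# Install dplyr only if it is not already available on this computer.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}

# Load the package for the current session.
library(dplyr)

# Show which version was loaded.
packageVersion("dplyr")
```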

Note 1. A package comes with specific functions that would not otherwise be recognized in R. Examples of basic verbs for data manipulation available with the dplyr package are “filter()”, “select()”, “mutate()”, “arrange()”, “rename()”, “relocate()”, “slice()”, “summarise()”. We will see how to use the first 3 verbs, but if you want information on other dplyr functions, or more advanced options, please refer to https://dplyr.tidyverse.org/articles/dplyr.html

Note 2. For the dplyr verbs listed above, the first argument is always the data to work on: a data frame, or a tibble (the tidyverse variant of a data frame). Many other R packages follow the same convention, but not all of them.
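
Because the data always come first, dplyr verbs can also be written with the pipe operator “%>%”, which dplyr makes available and which passes the object on its left as the first argument of the function on its right. The short sketch below shows both ways of writing the same call; iris.tbl, direct and piped are names made up for this illustration.

```r
library(dplyr)

# The same data as a plain data frame and as a tibble.
iris.tbl <- as_tibble(iris)
class(iris)
class(iris.tbl)

# Two ways of writing the same call: data as first argument, or piped in.
direct <- filter(iris, Species == "setosa")
piped  <- iris %>% filter(Species == "setosa")

# Both forms return the same rows.
identical(direct, piped)

# The verbs accept a tibble in exactly the same way.
nrow(filter(iris.tbl, Species == "setosa"))
```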

Step 2. Let’s see how you can now use the “filter()” function to extract specific rows from this data set, as we did with the “subset()” function in base R. Let’s again keep only the rows for the Species “setosa”

Filtering the data after installing and loading the package

> setosa.filt <- filter(iris, Species == "setosa")
> head(setosa.filt)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

To check that the new variable setosa.filt contains only the data for the setosa species, compare its number of rows with that of the variable setosa generated with the “subset()” function in base R (previous Step 9)

> nrow(setosa)
[1] 50
> nrow(setosa.filt)
[1] 50
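
Counting rows shows that the two approaches keep the same number of observations, but one small difference is worth knowing about. In the sketch below, which uses the virginica species and the hypothetical names virginica.sub and virginica.filt, “subset()” keeps the original row labels while “filter()”, in the dplyr versions we are aware of, renumbers the rows from 1.

```r
library(dplyr)

virginica.sub  <- subset(iris, Species == "virginica")   # base R
virginica.filt <- filter(iris, Species == "virginica")   # dplyr

# Same number of rows with both approaches.
nrow(virginica.sub)
nrow(virginica.filt)

# Row labels: compare what each approach reports.
head(rownames(virginica.sub))
head(rownames(virginica.filt))

# The measurements themselves are the same.
identical(virginica.sub$Petal.Length, virginica.filt$Petal.Length)
```

This matters mainly if you later rely on row labels to trace observations back to the original table.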

To filter on multiple conditions: here we first keep the setosa rows with a Petal.Length smaller than 2, then those with a Petal.Length greater than 2

> setosa.filt.pl2 <- filter(iris, Species == "setosa", Petal.Length < 2)
> head(setosa.filt.pl2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> nrow(setosa.filt.pl2)
[1] 50
> setosa.filt.pl2 <- filter(iris, Species == "setosa", Petal.Length > 2)
> nrow(setosa.filt.pl2)
[1] 0

Note. Because both results are assigned to the same variable name (setosa.filt.pl2), the second assignment (Petal.Length > 2) replaces the first (Petal.Length < 2).
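
Two points from this step can be sketched together: conditions separated by a comma are combined with "and", and giving each result its own name keeps both available. The variable names below (setosa.short, setosa.short.2, setosa.long, setosa.or.long) are made up for this illustration, and the cut-off of 6 in the last call is an arbitrary value.

```r
library(dplyr)

# A comma between conditions means "and"; it is equivalent to using &.
setosa.short   <- filter(iris, Species == "setosa", Petal.Length < 2)
setosa.short.2 <- filter(iris, Species == "setosa" & Petal.Length < 2)
identical(setosa.short, setosa.short.2)

# A separate name for the second filter keeps the first result available.
setosa.long <- filter(iris, Species == "setosa", Petal.Length >= 2)

# | means "or": rows that are setosa, or have a long petal, or both.
setosa.or.long <- filter(iris, Species == "setosa" | Petal.Length > 6)

# The two setosa subsets together account for all setosa rows.
nrow(setosa.short) + nrow(setosa.long)
```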

Step 3. To select specific columns, the “select()” function can be very helpful

To select specific columns that are not necessarily next to each other

> Iris.select <- select(iris, Sepal.Length, Petal.Length) 
> head(Iris.select)
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7

To select a group of consecutive columns

> Iris.select.2 <- select(iris, Sepal.Length:Petal.Width) 
> head(Iris.select.2)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
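
Beyond naming columns one by one or as a range, “select()” also accepts helper functions and negative selections. The following sketch shows three common variants; petal.cols, no.species and species.first are hypothetical names used only here.

```r
library(dplyr)

# Keep every column whose name starts with "Petal".
petal.cols <- select(iris, starts_with("Petal"))
names(petal.cols)

# Drop a single column by putting a minus sign in front of its name.
no.species <- select(iris, -Species)
names(no.species)

# Move Species to the front and keep all remaining columns after it.
species.first <- select(iris, Species, everything())
names(species.first)
```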

Step 4. Now imagine you would like to add new columns to an existing data frame. In the following example, the “mutate()” function is used to add a new column called Test, which indicates whether Sepal.Length is greater than twice Petal.Length (TRUE) or not (FALSE)

> test.col <- mutate(iris, Test = Sepal.Length > 2 * Petal.Length)
> head(test.col)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test
1 5.1 3.5 1.4 0.2 setosa TRUE
2 4.9 3.0 1.4 0.2 setosa TRUE
3 4.7 3.2 1.3 0.2 setosa TRUE
4 4.6 3.1 1.5 0.2 setosa TRUE
5 5.0 3.6 1.4 0.2 setosa TRUE
6 5.4 3.9 1.7 0.4 setosa TRUE
> tail(test.col)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test
145 6.7 3.3 5.7 2.5 virginica FALSE
146 6.7 3.0 5.2 2.3 virginica FALSE
147 6.3 2.5 5.0 1.9 virginica FALSE
148 6.5 3.0 5.2 2.0 virginica FALSE
149 6.2 3.4 5.4 2.3 virginica FALSE
150 5.9 3.0 5.1 1.8 virginica FALSE
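
“mutate()” is not limited to TRUE/FALSE columns. As a further sketch, the call below adds a numeric column and then a second column that is computed from the first within the same call. The column names Sepal.Area and Large and the threshold of 20 are invented for this example and have no biological meaning.

```r
library(dplyr)

area.col <- mutate(
  iris,
  Sepal.Area = Sepal.Length * Sepal.Width,  # new numeric column
  Large      = Sepal.Area > 20              # uses the column created just above
)

# The original five columns plus the two new ones.
names(area.col)
head(area.col)
```

Note that the original iris data frame is left unchanged; the new columns exist only in area.col.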

Exercise

Question 1. Using this last example, how would you count the number of TRUE items in the newly created Test column?

Please try to answer the question yourself first and then compare your result with those of other learners. Finally, you can find solutions in the download area.
This article is from the free online course Bioinformatics for Biologists: An Introduction to Linux, Bash Scripting, and R

Created by
FutureLearn - Learning For Life
