
Programming: How to Read Data From Files in R

We look at how to read data from an existing file in R, and at how to query the resulting data set with basic R functions.
© Wellcome Genome Campus Advanced Courses and Scientific Conferences

Introduction

You will often want to use R’s built-in functions to manipulate large data sets. These data sets are organized as data frames, with information arranged in rows and columns, and they are not easy to create from scratch. Instead, R has built-in functions that allow us to import existing data contained in a file.

Let’s see how to read data from an existing file, and then how to run simple queries on that data set with basic functions in R.

We will use the iris dataset. It contains measurements of 4 features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for 3 plant species (setosa, virginica, versicolor).

Note that this information is organized in a particular form:

  • Each column is named after the feature it contains, or after the species
  • Each row contains a row label followed by the values of the features and the species

If you later want to use a file that you generated yourself, make sure that no missing value is left empty. Instead, mark it as NA (Not Available).
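As a minimal sketch of why the NA marker matters, the example below writes a small, made-up three-row file (the file content and the names demo_file and demo are assumptions for illustration, not part of the course material) and reads it back with “read.table()”, which is covered in the next section.

```r
# Hypothetical three-row file in which one measurement is missing and marked NA
lines <- c("Sepal.Length Species",
           "5.1 setosa",
           "NA setosa",
           "4.7 setosa")
demo_file <- tempfile(fileext = ".txt")
writeLines(lines, demo_file)

# read.table() recognises the string NA as a missing value by default
demo <- read.table(demo_file, header = TRUE)

is.na(demo$Sepal.Length)               # flags the missing entry
mean(demo$Sepal.Length)                # NA, because one value is missing
mean(demo$Sepal.Length, na.rm = TRUE)  # mean of the remaining values
```

Because the missing value is marked explicitly, the column is still read as numeric and functions such as “mean()” can be told to skip it with na.rm = TRUE.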

We encourage you to find more information on reading, querying and manipulating data frames from files in the following links: https://cran.r-project.org/doc/manuals/r-release/R-intro.html and https://rpubs.com/moeransm/intro-iris

How to import and read data from files in R

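Step 1. Set the working directory to the folder that contains your data file. You can do this either before or after launching R.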

Before launching R

$ cd exerciseR
$ pwd
/Users/imac/Desktop/exerciseR
$ R

 

After launching R

 

> setwd("/Users/imac/Desktop/exerciseR")
> getwd()
[1] "/Users/imac/Desktop/exerciseR"
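Once the working directory is set, it can help to confirm that the file you plan to read is actually there. The sketch below uses only base R functions; the file name "iris.txt" is the one used in this guide and should be replaced by your own.

```r
# Show the current working directory and the files it contains
getwd()
list.files()

# TRUE if a file with this name exists in the working directory;
# "iris.txt" is the file name used in this guide, adjust it for your own data
file.exists("iris.txt")
```

If “file.exists()” returns FALSE, check the path given to “setwd()” before trying to read the file.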

 

Step 2. There are 2 ways to use existing files and access the data sets they contain in R:

 

Option 1. If the file exists on your computer, use the “read.table()” function to read it into a data frame. To view part of its content, you can use functions such as “head()” or “tail()”, which work like the Unix commands of the same name.

 

> Iris <- read.table("iris.txt")
> head(Iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
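If you do not have an iris.txt file at hand, one way to try this step is to write the built-in data set to a text file and read it back. This is only a sketch: it assumes the course file has the same layout as the default output of “write.table()” (space-separated, a header line, row labels in the first column), and the names iris_file and Iris2 are made up for the example.

```r
# Write the built-in iris data to a temporary text file
# (assumed layout: space-separated, header line, row labels in first column)
iris_file <- tempfile(fileext = ".txt")
write.table(iris, iris_file)

# The header line has one field fewer than the data lines, so read.table()
# uses the first line as column names and the first column as row labels
Iris <- read.table(iris_file)

# Stating the options explicitly documents the intent;
# stringsAsFactors = TRUE turns the Species text column into a factor
Iris2 <- read.table(iris_file, header = TRUE, sep = "",
                    stringsAsFactors = TRUE)

dim(Iris)
names(Iris)
class(Iris2$Species)
```

In R 4.0 and later, “read.table()” reads text columns as character vectors unless stringsAsFactors = TRUE is given, so Species in Iris is text while Species in Iris2 is a factor, as in the built-in data set.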

 

Option 2. R also has pre-built data sets that can be used for exercise purposes. We will use the “iris” data set, which is available in the datasets package. To load a package in R, use the “library()” function. Once you have loaded the datasets package, load the data set you want using the “data()” function.

 

> library(datasets)
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

 

To review all the pre-built data sets available in R (type q to quit the listing):

 

> data()
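The listing can also be inspected from code rather than in the pager. A small sketch, assuming you only want the data sets of the datasets package (the variable name available is made up for the example):

```r
# data() returns an object whose "results" matrix has one row per data set
available <- data(package = "datasets")$results

# Show the name and description of the first few entries
head(available[, c("Item", "Title")])

# Check whether a given name is among the available data sets
"iris" %in% available[, "Item"]
```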

 

Note 1. We named the variable into which we imported the file “Iris”, whereas the data set loaded from R is named “iris”. R is case-sensitive, so these are two different objects. You are free to choose the names of your own variables, but when you load a data set from the datasets package you must use its exact name.

 

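Note 2. Once your data set of interest is loaded, the commands and functions that we will use apply to both “Iris” and “iris”. One difference to keep in mind: in R 4.0 and later, “read.table()” reads the Species column as text rather than as a factor unless stringsAsFactors = TRUE is set, so summaries of that column can look different. We will use “iris” in the examples in this guide.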

 

How to make simple queries and data manipulation in R

 

Step 1. To view summary statistics of the whole data set, use the “summary()” function. You can also view summary statistics of a single variable by selecting it with the “$” operator.

 

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
> summary(iris$Species)
    setosa versicolor  virginica 
        50         50         50 
> summary(iris$Petal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.600   4.350   3.758   5.100   6.900 
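The individual values reported by “summary()” can also be computed one at a time, which is handy when you need a single number in a later calculation. A short sketch using base R functions on the same column:

```r
data(iris)

mean(iris$Petal.Length)                     # arithmetic mean
median(iris$Petal.Length)                   # middle value
range(iris$Petal.Length)                    # minimum and maximum
quantile(iris$Petal.Length, c(0.25, 0.75))  # 1st and 3rd quartiles

# For a factor such as Species, table() counts the rows per level
table(iris$Species)
```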

 

Step 2. Let’s now query the column names using the “names()” function, then the size of the data set with “dim()”, “ncol()” and “nrow()”, and its structure with “sapply()” and “str()”.

 

> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" 
> dim(iris)
[1] 150 5
> ncol(iris)
[1] 5
> nrow(iris)
[1] 150
> sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor" 
> str(iris)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
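The structure information can be put to use directly. As an illustrative sketch (the variable name numeric_cols is made up for the example), the code below picks out the numeric columns, computes their means, and lists the levels of the factor column:

```r
data(iris)

# Names of the columns that hold numeric measurements
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
numeric_cols

# Mean of each numeric column
colMeans(iris[numeric_cols])

# The distinct values (levels) of the factor column
levels(iris$Species)
```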

 

Step 3. To query or manipulate this data set, you can use basic operators in R. For example, to select only the rows related to the setosa species:

 

> setosa1 <- iris[iris$Species == "setosa",]
> head(setosa1)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(setosa1)
[1] 50

 

Alternative option, using the “%in%” operator

 

> setosa2 <- iris[iris$Species %in% "setosa",]
> nrow(setosa2)
[1] 50
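For a single species the two operators give the same rows. The difference shows when several values are of interest: “==” compares against one value, while “%in%” tests membership in a set. A brief sketch (the variable names are made up for the example):

```r
data(iris)

# == compares each element of the column against one value
one_species <- iris[iris$Species == "setosa", ]

# %in% checks membership in a set, so several values can be matched at once
two_species <- iris[iris$Species %in% c("setosa", "virginica"), ]

nrow(one_species)
nrow(two_species)

# Count rows per species, dropping the level that no longer occurs
table(droplevels(two_species$Species))
```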

 

To select data related to the setosa species in which Sepal.Length > 5

 

> setosa3 <- iris[iris$Species %in% "setosa" & iris$Sepal.Length > 5, ]
> nrow(setosa3)
[1] 22
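It can be easier to read such a query if the condition is stored first. The sketch below (variable names are made up for the example) keeps the logical vector, counts the matches with “sum()”, and uses the second index inside the brackets to keep only some columns:

```r
data(iris)

# One TRUE/FALSE value per row: TRUE where both conditions hold
rows <- iris$Species == "setosa" & iris$Sepal.Length > 5

# TRUE counts as 1, so sum() gives the number of matching rows
sum(rows)

# First index selects rows, second index selects columns by name
setosa_long <- iris[rows, c("Sepal.Length", "Species")]
head(setosa_long)

# | means OR: rows with an unusually short or unusually long sepal
extreme <- iris[iris$Sepal.Length < 4.5 | iris$Sepal.Length > 7.5, ]
nrow(extreme)
```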

 

Step 4. Conditional subsetting is also possible with base functions in R that apply the same principles without bracket indexing, which can make the code easier to read. An example is the “subset()” function.

 

To select only data related to the setosa species

 

> setosa.sub1 <- subset(iris, Species == "setosa")
> head(setosa.sub1)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(setosa.sub1)
[1] 50

 

To select, once more, the data related to the setosa species in which Sepal.Length > 5

 

> setosa.sub2 <- subset(iris, Species == "setosa" & Sepal.Length > 5)
> nrow(setosa.sub2)
[1] 22
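As a quick check that the two approaches agree, the sketch below runs the same query with “subset()” and with bracket indexing and compares the results. It also shows the select argument of “subset()”, which keeps only the named columns. The variable names are made up for the example.

```r
data(iris)

# Same query written both ways
by_subset   <- subset(iris, Species == "setosa" & Sepal.Length > 5)
by_brackets <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5, ]

# TRUE when the two data frames hold the same rows and values
all.equal(by_subset, by_brackets)

# select keeps only the listed columns; names are written without quotes
small <- subset(iris, Species == "setosa" & Sepal.Length > 5,
                select = c(Sepal.Length, Species))
head(small)
```

Inside “subset()” the column names are used directly, without the iris$ prefix, which is the main reason it reads more easily than the bracket form.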
This article is from the free online

Bioinformatics for Biologists: An Introduction to Linux, Bash Scripting, and R

Created by
FutureLearn - Learning For Life
