
Programming: How to Read Data From Files in R

We look at how to read data from an existing file, and how to run simple queries on that data set with basic R functions.
© Wellcome Genome Campus Advanced Courses and Scientific Conferences


Introduction

You will often want to use R's built-in functions to manipulate large data sets. These data sets are organized as data frames, with information arranged in rows and columns, and a large data frame is impractical to create from scratch. Instead, R provides built-in functions that allow us to import existing data from a file.
Let’s see how to read data from an existing file, and how to run simple queries on the resulting data set with basic R functions.
We will use the iris data set. It contains measurements of 4 features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for plants of 3 species (setosa, versicolor, virginica).
Note that this information is organized in a particular form:
  • Each column is named to indicate the features and species to consider
  • Each row contains a row label and information corresponding to features and species
If you later want to use a file that you generated yourself, make sure that no missing value is left as an empty field. Instead, mark it as NA (Not Available).
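As a minimal sketch of why the NA marker matters, the example below writes a small file to a temporary location and reads it back. The file contents and the names demo_file and demo are invented for illustration and are not part of the course material.

```r
# Write a tiny, made-up table in the same layout as iris.txt:
# a header line with column names, then one row label per data row.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Sepal.Length Sepal.Width Species",
  "1 5.1 3.5 setosa",
  "2 NA 3.0 setosa",
  "3 4.7 NA virginica"
), demo_file)

# read.table() treats the string NA as a missing value by default.
demo <- read.table(demo_file, header = TRUE)

is.na(demo$Sepal.Length)               # which rows lack a Sepal.Length
colSums(is.na(demo))                   # missing values per column
mean(demo$Sepal.Length, na.rm = TRUE)  # skip missing values when averaging
```

Because the missing entries are marked as NA, the numeric columns stay numeric and functions such as “mean()” can skip them with na.rm = TRUE.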
We encourage you to find more information on reading, querying and manipulating data frames from files in the following links: https://cran.r-project.org/doc/manuals/r-release/R-intro.html and https://rpubs.com/moeransm/intro-iris

How to import and read data from files in R

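Step 1. Make sure R's working directory is the folder that contains your file. You can change into that folder before launching R, or set it with “setwd()” after launching R and confirm it with “getwd()”.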
Before launching R
$ cd exerciseR
$ pwd
/Users/imac/Desktop/exerciseR
$ R
 
After launching R
 
> setwd("/Users/imac/Desktop/exerciseR")
> getwd()
[1] "/Users/imac/Desktop/exerciseR"
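Before reading a file, it can help to check that R actually sees it in the working directory. The sketch below uses a temporary folder and a placeholder file named iris_demo.txt, both of which are assumptions made for the example, so that it runs on any machine.

```r
# Use a temporary folder as a stand-in for the exerciseR directory.
data_dir <- tempdir()
file.create(file.path(data_dir, "iris_demo.txt"))  # empty placeholder file

# setwd() returns the previous working directory, so it can be restored later.
old_dir <- setwd(data_dir)

found <- file.exists("iris_demo.txt")        # is the file visible from here?
txt_files <- list.files(pattern = "\\.txt$") # all .txt files in the folder
found
txt_files

setwd(old_dir)  # go back to where we started
```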
 
Step 2. There are two ways to access a data set in R:
 
Option 1. If the file exists on your computer, use the “read.table()” function to read it into a data frame. To view part of its content, you can use functions such as “head()” or “tail()”, which work like the Unix commands of the same name.
 
> Iris <- read.table("iris.txt")
> head(Iris)
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
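If you do not have an iris.txt file at hand, one way to produce a comparable file is to export the built-in data set with “write.table()” and read it back. This is only a sketch: the course does not say how its iris.txt was created, and the temporary path used here is an assumption. The header and stringsAsFactors arguments are spelled out so that the column names are read from the first line and Species comes back as a factor.

```r
# Export the built-in iris data set to a text file in a temporary folder.
iris_file <- file.path(tempdir(), "iris.txt")
write.table(iris, iris_file)

# Read it back, stating the options explicitly instead of relying on defaults.
Iris <- read.table(iris_file, header = TRUE, stringsAsFactors = TRUE)

dim(Iris)        # number of rows and columns
head(Iris, 3)    # first three rows
tail(Iris, 3)    # last three rows
```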
 
Option 2. R also has built-in data sets that can be used for practice. We will use the “iris” data set, which is available in the datasets package. To load a package in R, use the “library()” function. Once you have loaded the datasets package, load the data set you want using the “data()” function.
 
> library(datasets)
> data(iris)
> head(iris)
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
 
To review all the built-in data sets available in R (type q to quit the listing):
 
> data()
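The listing can also be captured as an object and searched, which is convenient in a script. This is a small sketch that relies on the results component of the value returned by “data()”; the variable name available is arbitrary.

```r
# Capture the listing for the datasets package instead of opening the pager.
available <- data(package = "datasets")$results

# Each row describes one data set; show the name and title of the first few.
head(available[, c("Item", "Title")])

# Check whether a given data set is on the list.
"iris" %in% available[, "Item"]
```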
 
Note 1. We named the variable into which we imported the file “Iris”, whereas the built-in data set is named “iris”. R is case-sensitive, so these are two different objects. You are free to choose the names of your own variables, but a data set from the datasets package must be called by its exact name.
 
Note 2. Once your data set of interest is loaded, the commands and functions used below apply equally to “Iris” and “iris”, provided the Species column is stored as a factor in both. Since R 4.0.0, “read.table()” reads text columns as character unless you pass stringsAsFactors = TRUE, so the output of “summary()” and “str()” for Species may differ. We use “iris” in the rest of this guide.
 

How to make simple queries and data manipulation in R

 
Step 1. To view summary statistics for the whole data set, use the “summary()” function. You can also view summary statistics for a single variable by selecting it with the “$” operator.
 
> summary(iris)
 Sepal.Length Sepal.Width Petal.Length Petal.Width 
 Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 
 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 
 Median :5.800 Median :3.000 Median :4.350 Median :1.300 
 Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 
 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 
 Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 
 Species 
 setosa :50 
 versicolor:50 
 virginica :50
> 
> summary(iris$Species)
 setosa versicolor virginica 
 50 50 50 
> 
> summary(iris$Petal.Length)
 Min. 1st Qu. Median Mean 3rd Qu. Max. 
 1.000 1.600 4.350 3.758 5.100 6.900 
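The individual values reported by “summary()” can also be computed one at a time, which is useful when you only need one of them. The sketch below does this for Petal.Length and then, as an extra illustration not covered in the text above, computes the mean per species with “tapply()”.

```r
petal <- iris$Petal.Length

# The same statistics that summary() reports, computed separately.
min(petal)
median(petal)
mean(petal)
max(petal)
quantile(petal, probs = c(0.25, 0.75))  # 1st and 3rd quartiles

# Mean petal length within each species.
species_means <- tapply(iris$Petal.Length, iris$Species, mean)
species_means
```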
 
Step 2. Let’s now query the column names using the “names()” function, then the dimensions of the data set (“dim()”, “ncol()”, “nrow()”) and its structure (“sapply()” with “class”, and “str()”).
 
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" 
> dim(iris)
[1] 150 5
> ncol(iris)
[1] 5
> nrow(iris)
[1] 150
>
> sapply(iris, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species 
 "numeric" "numeric" "numeric" "numeric" "factor" 
> str(iris)
'data.frame':	150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> 
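Since Species is a factor, its categories can be inspected directly, and the column classes can be used to pick out the numeric columns. A short sketch, where numeric_cols is a name chosen for the example:

```r
# The categories of a factor, and how many there are.
levels(iris$Species)
nlevels(iris$Species)

# Number of rows in each category.
table(iris$Species)

# Names of the columns that hold numeric measurements.
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
numeric_cols
```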
 
Step 3. To query or manipulate this data set, you can use R's basic operators: square brackets to select rows and columns, “==” or “%in%” to compare values, and “&” to combine conditions.
 
> setosa1 <- iris[iris$Species == "setosa",]
> head(setosa1)
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> nrow(setosa1)
[1] 50
 
Alternative option, using the “%in%” operator instead of “==”
 
> setosa2 <- iris[iris$Species %in% "setosa",]
> nrow(setosa2)
[1] 50
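On iris the two operators give the same result, but they behave differently when a value is missing. The sketch below uses a small made-up vector, not the iris data, to show the difference.

```r
# A made-up vector with one missing value.
species <- c("setosa", NA, "virginica")

eq  <- species == "setosa"    # comparison with NA gives NA
inn <- species %in% "setosa"  # %in% gives FALSE for NA

eq
inn
```

An NA in a logical row index produces a row filled with NA, so “%in%” can be the safer choice when the column may contain missing values.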
 
To select data related to the setosa species in which Sepal.Length > 5
 
> setosa3 <- iris[iris$Species %in% "setosa" & iris$Sepal.Length > 5,]
> nrow(setosa3)
[1] 22
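It may help to see what happens inside the square brackets. Each condition is a logical vector with one TRUE or FALSE per row, and only the rows marked TRUE are kept. The sketch below builds the conditions separately and also restricts the columns. The names is_setosa, is_long and long_setosa are chosen for the example.

```r
# One logical value per row of iris.
is_setosa <- iris$Species == "setosa"
is_long   <- iris$Sepal.Length > 5

sum(is_setosa)            # rows that are setosa
sum(is_setosa & is_long)  # rows that meet both conditions

# Rows go before the comma, columns after it.
long_setosa <- iris[is_setosa & is_long, c("Sepal.Length", "Species")]
head(long_setosa)
```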
 
Step 4. Conditional subsetting is also possible without explicit indexing, using base R functions that apply the same principles with less typing. An example is the “subset()” function.
 
To select only data related to the setosa species
 
> setosa.sub1 <- subset(iris, Species == "setosa")
> head(setosa.sub1)
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> nrow(setosa.sub1)
[1] 50
 
To select, as before, the data for the setosa species in which Sepal.Length > 5
 
> setosa.sub2 <- subset(iris, Species == "setosa" & Sepal.Length > 5)
> nrow(setosa.sub2)
[1] 22
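“subset()” can also restrict the columns through its select argument. The sketch below keeps two columns and checks that the result covers the same rows as the bracket-based query. The names by_subset and by_index are chosen for the example.

```r
# Rows chosen by condition, columns chosen with select.
by_subset <- subset(iris,
                    Species == "setosa" & Sepal.Length > 5,
                    select = c(Sepal.Length, Species))

# The same query written with square brackets.
by_index <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5,
                 c("Sepal.Length", "Species")]

nrow(by_subset)
identical(rownames(by_subset), rownames(by_index))  # same rows selected?
```

Inside “subset()” the column names are written without the iris$ prefix, which keeps longer conditions readable.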
This article is from the free online course Bioinformatics for Biologists: An Introduction to Linux, Bash Scripting, and R
