Skip main navigation

New offer! Get 30% off your first 2 months of Unlimited Monthly. Start your subscription for just £35.99 £24.99. New subscribers only T&Cs apply

# Subsetting the data

using functions to manipulate the data
© Wellcome Connecting Science

Inspecting the data is one thing, but manipulating it is another

Manipulating the data requires different functions than those required for inspecting the data. There are tens of different functions used to manipulate the data, which enables simple and complex operations in R. We are going to explore a few of them here.

### Selecting entire columns and rows

Why would we need to do that? Well, basically because you probably won’t be visualizing all the data you have in a table, especially if it contains many types of information, like the variants. But here’s another common reason: sometimes we have to use packages that require reshaping the data, or we might need to compute basic arithmetic operations on columns and rows, adding new ones or moving others. Moving back and forth between different data formats needs to be done very carefully, without modifying the basic information. Thankfully, “dplyr” and “tidyr” packages are here to help!

1. Filtering columns

Actually, even basic R options can allow for this. Let’s first quickly review the column names in our table, then select some meaningful information to exploit, using basic syntax in R: dataframe[rows,columns]

# Check columns name> colnames(var) [1] "SAMPLE" "CHROM" "POS" "REF"  [5] "ALT" "FILTER" "DP" "REF_DP"  [9] "ALT_DP" "AF" "GENE" "EFFECT" [13] "HGVS_C" "HGVS_P" "HGVS_P_1LETTER" "CALLER"# Select columns 1, 4, and 5> var[, c(1, 4, 5)] SAMPLE REF ALT1 ERR5181310 C T2 ERR5181310 A G3 ERR5181310 G A4 ERR5181310 C T5 ERR5181310 C T6 ERR5181310 T C

As you can see, the code is understandable, but not very easy to remember. Here, the “,” indicates by default all rows, while column numbers are specified with c(). Moreover, the whole result is by default displayed in the console, which makes it hard to preview.

The function select() from the “dplyr” package can help you extract specific columns in a new table. The function needs as arguments the data frame (var) and the names of the columns to select. Please make sure you use the exact names as they are written in your data frame.

# Select columns 1, 4, and 5 with default display> select(var, SAMPLE, REF, ALT) SAMPLE REF ALT1 ERR5181310 C T2 ERR5181310 A G3 ERR5181310 G A4 ERR5181310 C T5 ERR5181310 C T6 ERR5181310 T C# Select columns 1, 4, and 5 with selected display> select(var, SAMPLE, REF, ALT) %>% head(3) SAMPLE REF ALT1 ERR5181310 C T2 ERR5181310 A G3 ERR5181310 G A# Select all columns except the column “CALLER” with selected display> select(var, -CALLER) %>% head(3) SAMPLE CHROM POS REF ALT FILTER DP REF_DP ALT_DP AF GENE1 ERR5181310 MN908947.3 241 C T PASS 8524 6 8518 1 orf1ab2 ERR5181310 MN908947.3 1807 A G PASS 2890 1 2889 1 orf1ab3 ERR5181310 MN908947.3 2659 G A PASS 13621 4 13617 1 orf1ab EFFECT HGVS_C HGVS_P HGVS_P_1LETTER1 upstream_gene_variant c.-25C>T . .2 synonymous_variant c.1542A>G p.Gly514Gly p.G514G3 synonymous_variant c.2394G>A p.Lys798Lys p.K798K

Note how the code is much easier to understand and reuse here. The “%>%” operator is used here as would the “|” be in Unix, making it easy to combine both functions. To avoid outputting all the content of the selected column(s) in your console, the head() function can help reduce the number of output lines to the number specified with the function, here set at 3 lines. If no preference is given, head() will output 6 lines.

To make things even easier to display, the tibble format can offer somewhat friendlier options. It has better formatting principles, and as such constitutes one of the preferred data structures for the set of packages that comes along with “tidyverse”. Remember that the “tibble” package is an integral part of “tidyverse”.

# Transform the data frame into a tibble> var_tb <- as_tibble(var)> select(var_tb, SAMPLE, REF, ALT) %>% head(3)# A tibble: 3 × 3 SAMPLE REF ALT  <chr> <chr> <chr>1 ERR5181310 C T 2 ERR5181310 A G 3 ERR5181310 G A 

2. Filtering rows

Filtering rows is another option that base R code offers, “dplyr” can also facilitate this using the filter() function.

# Select rows with selected display using base R code> var_tb[var_tb\$SAMPLE == "SRR13500958",]# A tibble: 21 × 16 SAMPLE CHROM POS REF ALT FILTER DP REF_DP ALT_DP AF <chr> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <dbl> 1 SRR13… MN90… 100 C T PASS 904 1 903 1 2 SRR13… MN90… 241 C T PASS 1794 0 1794 1 3 SRR13… MN90… 3037 C T PASS 667 0 667 1 4 SRR13… MN90… 3766 T C PASS 1350 0 1350 1 5 SRR13… MN90… 6347 C T PASS 331 0 331 1 6 SRR13… MN90… 10667 T G PASS 324 1 323 1 7 SRR13… MN90… 11824 C T PASS 402 0 402 1 8 SRR13… MN90… 12053 C T PASS 1297 0 1297 1 9 SRR13… MN90… 12964 A G PASS 555 0 555 110 SRR13… MN90… 13647 C T PASS 922 1 920 1# ℹ 11 more rows# ℹ 6 more variables: GENE <chr>, EFFECT <chr>, HGVS_C <chr>,# HGVS_P <chr>, HGVS_P_1LETTER <chr>, CALLER <chr># ℹ Use print(n = ...) to see more rows# Select rows with selected display using dplyr functions> filter(var_tb, SAMPLE == "SRR13500958") %>% head(3)# A tibble: 3 × 16 SAMPLE CHROM POS REF ALT FILTER DP REF_DP ALT_DP AF <chr> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <dbl>1 SRR135… MN90… 100 C T PASS 904 1 903 12 SRR135… MN90… 241 C T PASS 1794 0 1794 13 SRR135… MN90… 3037 C T PASS 667 0 667 1# ℹ 6 more variables: GENE <chr>, EFFECT <chr>, HGVS_C <chr>,# HGVS_P <chr>, HGVS_P_1LETTER <chr>, CALLER <chr>

Note how the tibble format reduces the density of the display (even with base R) by allowing the user to visually inspect only the data that fits into the console, while summarizing the remaining rows and columns.

3. Filtering columns and rows

Selecting specific rows corresponding to one sample type, and viewing these results by selecting specific columns to display is also possible. Let’s now combine both selection functions.

# Select sample type (rows) and variables (columns) with selected display> var_tb %>% filter(SAMPLE == "SRR13500958") %>% select(CHROM, POS, REF, ALT) %>% head(3)# A tibble: 3 × 4 CHROM POS REF ALT  <chr> <int> <chr> <chr>1 MN908947.3 100 C T 2 MN908947.3 241 C T 3 MN908947.3 3037 C T

### Selecting specific values from columns and rows

It is also possible to select columns or rows with specific values. The examples below show options for filtering out definite values from the column “DP”. Owing to space constraints, the header output showing the size of the tibble after filtering is shown here, but the default output will display the column values as shown previously.

# To select all data related to the sample specified> var_tb %>% filter(SAMPLE == "SRR13500958") %>% select(CHROM, POS, REF, ALT, DP)# A tibble: 21 × 5CHROM POS REF ALT DP<chr> <int> <chr> <chr> <int># To select only values for which DP>=500 for the same sample> var_tb %>% filter(SAMPLE == "SRR13500958" & DP>=500) %>% select(CHROM, POS, REF, ALT, DP)# A tibble: 16 × 5CHROM POS REF ALT DP<chr> <int> <chr> <chr> <int># To select only values for which DP>=1000 for the same sample> var_tb %>% filter(SAMPLE == "SRR13500958" & DP>=1000) %>% select(CHROM, POS, REF, ALT, DP)# A tibble: 8 × 5CHROM POS REF ALT DP<chr> <int> <chr> <chr> <int>

Now that we have learned how to explore the content of the entire dataset and to select specific columns and rows to work on, let’s have a look at some very useful functions to further examine its content.

### Discuss

How do you think you could apply the learning so far in your own job or study?

© Wellcome Connecting Science
This article is from the free online

Created by

## Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now