
Summary statistics on the data

Other functions used for basic arithmetic on data

Before moving on to data visualization, there are a few more functions and tips worth knowing; they make routine data analysis much easier.

There are far too many functions to cover here (many would deserve a course of their own), but a handful of basic arithmetic functions are a useful addition when inspecting, cleaning or visualizing data.

Count specific columns and rows

To get an insight into the distribution of values in your dataset, use the count() function. The “%>%” operator plays the same role here as “|” does in Unix: it passes the output of one function to the next, making it easy to chain functions together.

# Count how many rows are associated with each sample in the data 

> var_tb %>% count(SAMPLE)

# A tibble: 5 × 2
  SAMPLE          n
  <chr>       <int>
1 ERR5181310     33
2 ERR5405022     36
3 ERR5556343     35
4 ERR5743893     28
5 SRR13500958    21
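
As a minimal sketch of what the pipe does, the example below compares a piped call with the equivalent ordinary call. The table toy_tb and its values are hypothetical stand-ins for var_tb, made up purely for illustration.

```r
library(dplyr)

# Hypothetical toy table standing in for var_tb (values are made up)
toy_tb <- tibble(
  SAMPLE = c("A", "A", "B", "B", "B"),
  DP     = c(10L, 20L, 30L, 40L, 50L)
)

# The pipe passes the left-hand side as the first argument of the
# function on the right, so these two calls do the same thing
piped  <- toy_tb %>% count(SAMPLE)
nested <- count(toy_tb, SAMPLE)

# Compare the two results column by column
identical(piped$SAMPLE, nested$SAMPLE)
identical(piped$n, nested$n)
```

The piped form tends to be easier to read once several functions are chained, because the steps appear in the order in which they are applied.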

count() also accepts additional arguments. For example, sort = TRUE orders the output from the largest count to the smallest.

# Sorting the counts 

> var_tb %>% count(SAMPLE, sort = TRUE)

# A tibble: 5 × 2
  SAMPLE          n
  <chr>       <int>
1 ERR5405022     36
2 ERR5556343     35
3 ERR5181310     33
4 ERR5743893     28
5 SRR13500958    21
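
The sort argument can be thought of as a shortcut for sorting the counts afterwards. The sketch below, on a hypothetical toy_tb table with made-up values, compares sort = TRUE with an explicit arrange(desc(n)) step from “dplyr”.

```r
library(dplyr)

# Hypothetical toy table (values are made up)
toy_tb <- tibble(SAMPLE = c("A", "B", "B", "C", "C", "C"))

# Sorting inside count()
sorted_arg <- toy_tb %>% count(SAMPLE, sort = TRUE)

# Counting first, then sorting by decreasing n in a separate step
sorted_arrange <- toy_tb %>% count(SAMPLE) %>% arrange(desc(n))

sorted_arg
identical(sorted_arg$SAMPLE, sorted_arrange$SAMPLE)
```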

You can also pass more than one column to count() to count each combination of values. Without head(), the output here would show 28 rows; head() is used to shorten it and, by default, returns the first 6 rows.

# Distribution of genes per sample and counts 

> var_tb %>% count(SAMPLE, GENE, sort = TRUE) %>% head()

# A tibble: 6 × 3
  SAMPLE      GENE       n
  <chr>       <chr>  <int>
1 ERR5405022  orf1ab    17
2 ERR5556343  orf1ab    15
3 ERR5181310  S         12
4 ERR5181310  orf1ab    12
5 ERR5556343  S         12
6 SRR13500958 orf1ab    12
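
To illustrate how the number of output rows relates to the combinations present in the data, here is a small sketch on a hypothetical toy_tb table (made-up values). It also uses the n argument of head() to change how many rows are shown.

```r
library(dplyr)

# Hypothetical toy table (values are made up)
toy_tb <- tibble(
  SAMPLE = c("A", "A", "A", "B", "B", "B"),
  GENE   = c("S", "S", "N", "S", "N", "N")
)

# One output row per SAMPLE/GENE combination present in the data
gene_counts <- toy_tb %>% count(SAMPLE, GENE, sort = TRUE)
nrow(gene_counts)

# head() takes an n argument to control how many rows are returned
gene_counts %>% head(n = 2)
```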

Basic Maths

Here are a few intuitive operations that are very helpful for data analysis. Each one operates on a single column, selected here with the $ operator.

# Maximum value of column DP

> max(var_tb$DP)

[1] 41836

# Minimum value of column DP

> min(var_tb$DP)

[1] 38

# Mean value of column DP

> mean(var_tb$DP)

[1] 2635.229
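
A few other base R functions work on a single column in the same way. The sketch below uses a hypothetical toy_tb table with made-up depth values; range(), median() and summary() are standard base R functions, and the na.rm argument shows one way to handle missing values, which would otherwise make these functions return NA.

```r
library(dplyr)

# Hypothetical toy table (values are made up)
toy_tb <- tibble(DP = c(38L, 120L, 2500L, 41836L))

# $ extracts a single column as a plain vector
depths <- toy_tb$DP

range(depths)     # minimum and maximum in one call
median(depths)    # middle value, less affected by extreme depths than the mean
summary(depths)   # minimum, quartiles, mean and maximum together

# A missing value makes mean() return NA unless na.rm = TRUE is set
with_missing <- c(depths, NA)
mean(with_missing)
mean(with_missing, na.rm = TRUE)
```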

Compute operations in new columns

You can perform operations on columns and store the results in a new column appended to your data table, using the mutate() function from the “dplyr” package. The functions we have used so far only explore the data and print output to the console, without modifying it. mutate() returns a modified copy of the table, so the result must be assigned to a variable to be kept.

Important note: It is recommended that you never modify your original data. Before you start making modifications, consider keeping the raw data in a separate folder of your working directory that always stays unchanged. Alternatively, you can simply create a new variable each time you want to store the output of modified data.

# Compute a LOG2 transformation on the DP values

> var_tb_log <- var_tb %>% mutate(DP_log2 = log2(DP))

# View the table columns with the DP_log2 new column appended at the end

> head(var_tb_log)

# A tibble: 6 × 17

# View a selected content including the new column

> select(var_tb_log, SAMPLE, REF, ALT, DP, DP_log2) %>% head()

# A tibble: 6 × 5
  SAMPLE     REF   ALT      DP DP_log2
  <chr>      <chr> <chr> <int>   <dbl>
1 ERR5181310 C     T      8524    13.1
2 ERR5181310 A     G      2890    11.5
3 ERR5181310 G     A     13621    13.7
4 ERR5181310 C     T      2718    11.4
5 ERR5181310 C     T     20212    14.3
6 ERR5181310 T     C      2414    11.2
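
The following sketch, on a hypothetical toy_tb table with made-up values, shows that mutate() leaves the original table untouched and that the new column only exists in the variable the result was assigned to. This is one way to follow the advice above about not modifying the original data.

```r
library(dplyr)

# Hypothetical toy table (values are made up)
toy_tb <- tibble(
  SAMPLE = c("A", "A", "B"),
  DP     = c(8L, 64L, 1024L)
)

# Store the modified table under a new name
toy_tb_log <- toy_tb %>% mutate(DP_log2 = log2(DP))

names(toy_tb)        # original columns only
names(toy_tb_log)    # original columns plus DP_log2
toy_tb_log$DP_log2   # log2 of 8, 64 and 1024
```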

Split-apply-combine approach for data analysis

The “split-apply-combine” approach operates on data by splitting it into groups, applying some analysis to each group, and then combining the results. The group_by() function splits the data into groups and takes column names as arguments. It is typically paired with the summarize() function, which collapses each group into a single-row summary. The count() function we used earlier follows the same pattern: it groups the data and counts the rows in each group.

# Show the maximum value of DP for each sample

> var_tb %>% group_by(SAMPLE) %>% summarize(max(DP))

# A tibble: 5 × 2
  SAMPLE      `max(DP)`
  <chr>           <int>
1 ERR5181310      41836
2 ERR5405022       2896
3 ERR5556343       9105
4 ERR5743893       7987
5 SRR13500958      2212

# Show the minimum value of DP for each sample

> var_tb %>% group_by(SAMPLE) %>% summarize(min(DP))

# A tibble: 5 × 2
  SAMPLE      `min(DP)`
  <chr>           <int>
1 ERR5181310        874
2 ERR5405022         43
3 ERR5556343         72
4 ERR5743893         38
5 SRR13500958       183
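
Several summaries can be computed in a single summarize() call, and naming them gives tidier column headers than `max(DP)`. The sketch below uses a hypothetical toy_tb table with made-up values; the column names n_rows, min_DP, max_DP and mean_DP are arbitrary choices for illustration. The n() helper from “dplyr” counts the rows in each group, which is what count() reports.

```r
library(dplyr)

# Hypothetical toy table (values are made up)
toy_tb <- tibble(
  SAMPLE = c("A", "A", "B", "B", "B"),
  DP     = c(10L, 30L, 5L, 50L, 20L)
)

# Split by SAMPLE, apply several summaries, combine into one row per group
dp_summary <- toy_tb %>%
  group_by(SAMPLE) %>%
  summarize(
    n_rows  = n(),
    min_DP  = min(DP),
    max_DP  = max(DP),
    mean_DP = mean(DP)
  )

dp_summary

# The n_rows column matches what count() returns for the same grouping
identical(dp_summary$n_rows, count(toy_tb, SAMPLE)$n)
```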

Great work so far! We have now seen a lot of functions that could help us work on data exploration, selective analysis, or subsetting. This will be of great help for any type of data analysis you would like to perform next.

A little help from additional resources

To give you access to the full list of functions in each package, we have compiled a list of useful resources, or ‘cheat sheets’, commonly used by the community, which will help you when using RStudio in the future.

We encourage you to take some time after this course to explore their content and make use of this gold mine of information.

For “dplyr” https://nyu-cdsc.github.io/learningr/assets/data-transformation.pdf

For “tidyr” https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf

For both “dplyr” and “tidyr” summarized https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

For “tidyverse” https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Tidyverse+Cheat+Sheet.pdf

© Wellcome Connecting Science
This article is from the free online

Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life
