# Summary statistics on the data

Other functions used for basic arithmetic on data

Before moving on to data visualization, there are a few other useful functions and tips that will make the day-to-day analysis of data much easier.

There are far more functions than we can cover here, and many would deserve a course of their own. The basic arithmetic functions below are nonetheless useful bonus material for inspecting, cleaning and/or visualizing data.

### Count specific columns and rows

To get an insight into the distribution of values in your dataset, use the count() function. The “%>%” operator works here much as the “|” pipe does in Unix: it passes the output of one function to the next, making it easy to combine two functions.

```r
# Count how many rows are associated with each sample in the data
> var_tb %>% count(SAMPLE)
# A tibble: 5 × 2
  SAMPLE          n
  <chr>       <int>
1 ERR5181310     33
2 ERR5405022     36
3 ERR5556343     35
4 ERR5743893     28
5 SRR13500958    21
```
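To make the pipe analogy concrete, here is a minimal sketch on a small made-up table. The name `toy_tb` and its values are assumptions for illustration, not the course data. It shows that the piped form and the direct call give the same counts.

```r
library(dplyr)

# Hypothetical stand-in for var_tb (values invented for illustration)
toy_tb <- tibble(
  SAMPLE = c("A", "A", "B", "B", "B"),
  GENE   = c("S", "orf1ab", "S", "S", "N")
)

# Piped form: the left-hand side becomes the first argument of count()
piped <- toy_tb %>% count(SAMPLE)

# Equivalent direct call without the pipe
direct <- count(toy_tb, SAMPLE)

# Both tables hold the same counts per sample
all.equal(piped, direct)
```

The piped form becomes more readable than nested calls as soon as several functions are chained one after the other.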

The count() function also accepts additional arguments. For example, sort = TRUE orders the output from the highest count to the lowest.

```r
# Sorting the counts
> var_tb %>% count(SAMPLE, sort = TRUE)
# A tibble: 5 × 2
  SAMPLE          n
  <chr>       <int>
1 ERR5405022     36
2 ERR5556343     35
3 ERR5181310     33
4 ERR5743893     28
5 SRR13500958    21
```

You can also pass more than one column to count() to count each combination of values. Without head(), the output here would show 28 rows; head() is used to shorten it. By default, head() returns the first 6 rows of the result.

```r
# Distribution of genes per sample and counts
> var_tb %>% count(SAMPLE, GENE, sort = TRUE) %>% head()
# A tibble: 6 × 3
  SAMPLE      GENE       n
  <chr>       <chr>  <int>
1 ERR5405022  orf1ab    17
2 ERR5556343  orf1ab    15
3 ERR5181310  S         12
4 ERR5181310  orf1ab    12
5 ERR5556343  S         12
6 SRR13500958 orf1ab    12
```
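As a small sketch of how counting several columns interacts with head(), the example below uses an invented table (`toy_tb` is hypothetical, not the course data). It checks the full number of combinations with nrow() and then asks head() for a specific number of rows through its `n` argument.

```r
library(dplyr)

# Hypothetical stand-in for var_tb (values invented for illustration)
toy_tb <- tibble(
  SAMPLE = c("A", "A", "A", "B", "B", "B"),
  GENE   = c("S", "S", "orf1ab", "S", "N", "N")
)

# One row per distinct SAMPLE/GENE combination, most frequent first
pair_counts <- toy_tb %>% count(SAMPLE, GENE, sort = TRUE)

# Total number of combinations, before any truncation
nrow(pair_counts)

# Keep only the two most frequent combinations
top_two <- head(pair_counts, n = 2)
top_two
```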

### Basic Maths

Here are a few operations that are very intuitive to understand and use, that can be very helpful for data analysis. They operate on individual columns.

```r
# Maximum value of column DP
> max(var_tb$DP)
[1] 41836
# Minimum value of column DP
> min(var_tb$DP)
[1] 38
# Mean value of column DP
> mean(var_tb$DP)
[1] 2635.229
```
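One practical caveat, sketched here on an invented vector (`dp` is hypothetical and not taken from the course data): these functions return NA when the column contains missing values, unless you set `na.rm = TRUE`.

```r
# Hypothetical depth values with one missing entry
dp <- c(38, 120, NA, 41836)

# The missing value propagates to the result
max(dp)

# Dropping missing values before computing each statistic
dp_max  <- max(dp, na.rm = TRUE)
dp_min  <- min(dp, na.rm = TRUE)
dp_mean <- mean(dp, na.rm = TRUE)  # (38 + 120 + 41836) / 3

c(max = dp_max, min = dp_min, mean = dp_mean)
```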

### Compute operations in new columns

You can compute operations on columns and store the results in a new column that is appended to your data table. For this we can use the mutate() function from the “dplyr” package. The functions we have used so far only explore the data and print their output to the console, without modifying anything.

Important note: It is recommended that you never modify your original data, and consider alternative options when you start performing modifications. You can ideally create a new folder in your directory to store the raw data that should always be kept unchanged. Alternatively, you can simply create new variables each time you want to store the output of modified data.

```r
# Compute a LOG2 transformation on the DP values
> var_tb_log <- var_tb %>% mutate(DP_log2 = log2(DP))
# View the table columns with the DP_log2 new column appended at the end
> head(var_tb_log)
# A tibble: 6 × 17
# View a selected content including the new column
> select(var_tb_log, SAMPLE, REF, ALT, DP, DP_log2) %>% head()
# A tibble: 6 × 5
  SAMPLE     REF   ALT      DP DP_log2
  <chr>      <chr> <chr> <int>   <dbl>
1 ERR5181310 C     T      8524    13.1
2 ERR5181310 A     G      2890    11.5
3 ERR5181310 G     A     13621    13.7
4 ERR5181310 C     T      2718    11.4
5 ERR5181310 C     T     20212    14.3
6 ERR5181310 T     C      2414    11.2
```
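The recommendation to leave the original data untouched can be checked directly. In this minimal sketch, `toy_tb` and `toy_log` are hypothetical names with invented values. mutate() returns a new table, and the original only changes if you assign the result back to it.

```r
library(dplyr)

# Hypothetical stand-in for var_tb (values invented for illustration)
toy_tb <- tibble(
  SAMPLE = c("A", "A", "B"),
  DP     = c(8L, 16L, 1024L)
)

# Store the modified table under a new name
toy_log <- toy_tb %>% mutate(DP_log2 = log2(DP))

names(toy_tb)    # the original keeps its two columns
names(toy_log)   # the copy has DP_log2 appended at the end
```

Giving each modified table its own name makes it easy to go back to the unmodified data at any point in the analysis.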

### Split-apply-combine approach for data analysis

The “split-apply-combine” approach operates on data by splitting it into groups, applying some analysis to each group, and then combining the results. The group_by() function splits the data into groups, taking column names as arguments. It is typically paired with the summarize() function, which collapses each group into a single-row summary. We have already relied on this pattern: count() groups the rows by the given columns and summarizes each group by its number of rows.

```r
# Show the maximum value of DP for each sample
> var_tb %>% group_by(SAMPLE) %>% summarize(max(DP))
# A tibble: 5 × 2
  SAMPLE      `max(DP)`
  <chr>           <int>
1 ERR5181310      41836
2 ERR5405022       2896
3 ERR5556343       9105
4 ERR5743893       7987
5 SRR13500958      2212
# Show the minimum value of DP for each sample
> var_tb %>% group_by(SAMPLE) %>% summarize(min(DP))
# A tibble: 5 × 2
  SAMPLE      `min(DP)`
  <chr>           <int>
1 ERR5181310        874
2 ERR5405022         43
3 ERR5556343         72
4 ERR5743893         38
5 SRR13500958       183
```
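Several summaries can be computed in one summarize() call, and naming them gives cleaner column headers than `max(DP)`. The sketch below uses an invented table; `toy_tb` and the column names `n`, `min_DP`, `max_DP` and `mean_DP` are illustrative choices, not part of the course data.

```r
library(dplyr)

# Hypothetical stand-in for var_tb (values invented for illustration)
toy_tb <- tibble(
  SAMPLE = c("A", "A", "B", "B", "B"),
  DP     = c(10L, 20L, 30L, 40L, 50L)
)

# Split by SAMPLE, apply four summaries, combine into one row per sample
dp_summary <- toy_tb %>%
  group_by(SAMPLE) %>%
  summarize(
    n       = n(),       # number of rows in the group
    min_DP  = min(DP),
    max_DP  = max(DP),
    mean_DP = mean(DP)
  )

dp_summary

# The n column matches what count() reports for the same grouping
sample_counts <- count(toy_tb, SAMPLE)
```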

Great work so far! We have now seen a lot of functions that could help us work on data exploration, selective analysis, or subsetting. This will be of great help for any type of data analysis you would like to perform next.

### A little help from additional resources

To give you access to the full list of functions in a package, we have compiled a list of useful resources, or ‘cheat sheets’, commonly used by the community. They will help you when using RStudio in the future.

We encourage you to take some time after this course to explore their content and make use of this gold mine of information.

A summary of both “dplyr” and “tidyr”: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf