Progressing in variants exploration
Variants data to explore
As a refresher, let’s first review the data we have in hand. We’ve previously seen how to do that for both static or interactive output.
# View the data
> View(var_tb)
As you can see it, there is plenty of data you can plot and/or check correlation for. The choice of which data to plot is of course entirely dependent on the message you would like to convey. The following examples are only given as use cases, and provide plotting options in parallel.
We are going to plot here:
- The distribution of DP values per chromosome and per sample
- The variants effect per sample
- The genes showing variants effects
- The read depth per position
Distribution of DP values per chromosome and per sample
This can be done using boxplots. To further highlight each species, we will use colours as we did previously, and to further emphasize independent chromosome output, we will introduce here the use of faceting. The faceting approach subdivides a plot into a matrix of plots representing smaller data subsets.
> ggplot(data = var_tb, aes(x=CHROM, y=DP, fill= SAMPLE)) + geom_boxplot() + ylim(0,10000) + scale_fill_brewer(palette="RdYlBu") + labs(title="DP_per_Chromosome") + facet_grid(. ~ SAMPLE)
Now imagine you want to explore different plotting options on the x and y variables we decided to use in this section. And you also probably noticed that a major part of the code above comes from previous decisions we made on plotting options (graph type, colour, axis limits, etc).
Well, a good practice in R is to create a variable to store the data you fixed, and then add to it what you want to test. This simplifies the code a lot.
It is common to use “p” as a variable related to plotting stages, but you can use any name you feel is more convenient. Knowing that in one single project, many variables of this type can be created, simply make sure to use names with the most intuitively indicative content.
# Define a variable with plotting options
> p_DP_CHROM <- ggplot(data = var_tb, aes(x=CHROM, y=DP, fill= SAMPLE)) + ylim(0,10000) + scale_fill_brewer(palette="RdYlBu") + labs(title="DP_per_Chromosome") + theme(legend.position="bottom")
# Test boxplots with faceting
> p_DP_CHROM + geom_boxplot() + facet_grid(. ~ SAMPLE)
# Combine violin plots and boxplots with faceting
> p_DP_CHROM + geom_violin(trim=FALSE) + facet_grid(. ~ SAMPLE) + geom_boxplot(width=0.1)
You are invited to test other plotting options (the different geometries will appear while typing your code (“geom_”) in the console or through the help section). The faceting also comes with the option to display the samples in columns (facet_grid(. ~ SAMPLE)) or rows (facet_grid(SAMPLE ~ .)). Try it!
Variants effects per sample
1. Plotting the variants effects
Different variant effects are reported for each sample. Let’s see what they are, and how many of these variants have been reported for each sample. Notice how the x-axis data is packed in the first output, showing that labels should be either oriented differently, or data could be flipped, as we will be doing in the second display.
# Count number of different effects per sample
> p_EFFECT <- ggplot(data = var_tb, aes(x=EFFECT, fill= SAMPLE)) + scale_fill_brewer(palette="RdBu") + labs(title="Effect_per_Sample") + theme(legend.position="bottom")
> p_EFFECT + geom_bar()
# Flip orientation
> p_EFFECT_flip <- ggplot(data = var_tb, aes(y=EFFECT, fill= SAMPLE)) + scale_fill_brewer(palette="RdBu") + labs(title="Effect_per_Sample") + theme(legend.position="bottom")
> p_EFFECT_flip + geom_bar()
2. Counting the variants effects
Do you want to have the exact number of effects related to each category we just plotted? Well, then it’s time to combine what we’ve learned with ggplot2 plotting options, with what we’ve previously covered for data manipulation!
# Count the number of different effects
> var_tb %>% count(EFFECT)
# A tibble: 8 × 2
EFFECT n
<chr> <int>
1 conservative_inframe_deletion 4
2 disruptive_inframe_deletion 5
3 downstream_gene_variant 4
4 missense_variant 89
5 stop_gained 1
6 stop_lost 3
7 synonymous_variant 38
8 upstream_gene_variant 9
# Count the number of different effects and link them to sample information
> var_tb %>% count(EFFECT, SAMPLE, sort = TRUE)
# A tibble: 27 × 3
EFFECT SAMPLE n
<chr> <chr> <int>
1 missense_variant ERR5405022 24
2 missense_variant ERR5556343 22
3 missense_variant ERR5743893 19
4 missense_variant ERR5181310 13
5 missense_variant SRR13500958 11
6 synonymous_variant ERR5181310 11
7 synonymous_variant ERR5556343 9
8 synonymous_variant SRR13500958 7
9 synonymous_variant ERR5405022 6
10 synonymous_variant ERR5743893 5
# ℹ 17 more rows
# ℹ Use `print(n = ...)` to see more rows
Excellent! Now you know how to count what is of interest for you, and you know how to plot it! Remember that we are dealing here with a simple output dataset, but the exact same commands would work for any type of data, provided it follows the tabular requirements of data frames and tibbles, and that you know which plotting options work with which variable (continuous or discrete).
The genes showing variants effects
1. Counting and extracting all effects for all genes
Here we would like to see which genes are related to which variants’ effects. This is important information you want to know when analyzing genomics data, as some of these genes might have critical roles in your study.
# Counting the effects per gene
> var_tb %>% count(EFFECT, GENE, sort = TRUE)
# A tibble: 23 × 3
EFFECT GENE n
<chr> <chr> <int>
1 missense_variant S 35
2 synonymous_variant orf1ab 29
3 missense_variant orf1ab 27
4 missense_variant N 16
5 synonymous_variant S 5
6 upstream_gene_variant orf1ab 5
7 disruptive_inframe_deletion S 4
8 downstream_gene_variant S 4
9 missense_variant ORF8 4
10 conservative_inframe_deletion orf1ab 3
# ℹ 13 more rows
# ℹ Use `print(n = ...)` to see more rows
2. Counting and extracting specific effects for all genes
Among the high impact effects of variants are those modifying the stop codons, as they might affect the integrity of the protein and its related isoforms. Let’s have a look specifically at variants affecting stop codons, reported in our table as either “stop_lost” or “stop_gained”.
Here are two filtering options using the filter() function we used already. These two options use different operators in R, but give the same output. The select() function is next used to help us output columns related to samples, chromosomes, genes and effects. A series of new operators are used here, where “|” allows to search for one value or another (equivalent of “OR”), and “%in%” filters specific values from a subset of the data (here “stop_lost” and “stop_gained” from the column “EFFECT”).
# Filtering option 1 to select for effect on stop
> filter(var_tb, EFFECT == "stop_lost" | EFFECT == "stop_gained")
# A tibble: 4 × 16
# Filtering option 2 to select for effect on stop
> filter(var_tb, EFFECT %in% c("stop_lost", "stop_gained"))
# A tibble: 4 × 16
# Filtering on effect and selected columns
> filter(var_tb, EFFECT %in% c("stop_lost", "stop_gained")) %>% select(SAMPLE, CHROM, GENE, EFFECT)
# A tibble: 4 × 4
SAMPLE CHROM GENE EFFECT
<chr> <chr> <chr> <chr>
1 ERR5181310 MN908947.3 orf1ab stop_lost
2 ERR5405022 MN908947.3 ORF8 stop_gained
3 ERR5743893 MN908947.3 orf1ab stop_lost
4 ERR5743893 MN908947.3 orf1ab stop_lost
Read depth per position
Another data you might want to visualize is at which positions do the highest or the lowest DP values occur. This usually allows us to identify certain spots for varying density of reads. Obviously, if one can also relate this information to each sample, that’s even better.
This time we will use geom_point() to represent the two continuous data we chose: DP and POS. This is what is called a scatter plot, a very helpful way of representing two variables for which you want to show a certain level of correlation, here the positions at which the lowest and highest DP occur. There are many more options you could use.
# Define your variable
> p_DP_POS <- ggplot(data = var_tb, aes(x=POS, y=DP, fill= SAMPLE)) + scale_fill_brewer(palette="RdBu") + labs(title="DP_per_Position") + theme(legend.position="bottom")
# Plot
> p_DP_POS + geom_point(shape = 21, size = 5)
# Plot with transparency options
> p_DP_POS + geom_point(shape = 21, size = 5, alpha = 0.7)
Again, these functions and options are given to you as examples. Many more options are available. The Help Tab in RStudio is a good resource to find help on the package you’re using, as well as online resources. This gives you a good sense of what types of analysis can be done, and what information can be viewed. However, we encourage you to consider other options for analyzing and plotting data.
Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets
Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets
Reach your personal and professional goals
Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.
Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.
Register to receive updates
-
Create an account to receive our newsletter, course recommendations and promotions.
Register for free