
What data to use from your NGS data output and what to do with it?

introduction to data visualisation, which data and tools will be used

Understand your data: the variants

Now that you have analysed your data, we are going to play around with the output you obtained! Together, we will explore what you can do with your data analysis output from Week 1 and Week 2.

What data are we going to use? Well, obviously, the variants! After all, your main goal in doing this type of NGS data analysis was to identify variants in your samples and study their effects.

Now, how are the variants presented, and where? In the previous two weeks you obtained several files containing information about variants. These are either a file containing variants in Variant Call Format (VCF) or, with your workflow, a modified variants table. Please refer back to step 2.10 for a reminder on the “ivar” output if needed.

As explained in Week 2, among the viralrecon outputs is the file “variants_long_table.csv”. Here the tool aggregates information such as individual variants, functional effect prediction and lineage analysis. This is done for each sample separately, with all outputs collated in one single output file. This is the file we are going to use here.
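As a preview of the steps that follow, here is a minimal sketch of how such a table could be loaded in R. It writes a tiny made-up table to a temporary file so that it runs on its own. The column names (SAMPLE, CHROM, POS, REF, ALT, AF, GENE, EFFECT, LINEAGE) and all values are illustrative assumptions, so check them against the header of your own “variants_long_table.csv”.

```r
# Toy stand-in for "variants_long_table.csv".
# Column names and values are illustrative assumptions, not real results.
toy_lines <- c(
  "SAMPLE,CHROM,POS,REF,ALT,AF,GENE,EFFECT,LINEAGE",
  "sample1,ref_genome,241,C,T,1.00,geneA,upstream_gene_variant,lineageX",
  "sample1,ref_genome,23403,A,G,0.99,geneB,missense_variant,lineageX",
  "sample2,ref_genome,23403,A,G,0.98,geneB,missense_variant,lineageY"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(toy_lines, csv_path)

# With real data, replace csv_path with the path to your own file.
variants <- read.csv(csv_path, stringsAsFactors = FALSE)

# Each row is one variant in one sample; all samples share the same table.
print(variants)
```

Because every sample is stored in the same table, the SAMPLE column is what lets you tell the rows of different samples apart.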

You can also use the file called “summary_variants_metrics_mqc.csv”, which contains a selection of read alignment and variant calling metrics, to explore the same set of R packages and functions that we will use in this course.

Now here is even better news: any text-based file containing high-throughput NGS output data can be explored using the same packages!

Understand your file

You will not be able to explore and visualize your data wisely if you do not first understand:

  • What is in your file? Your VCF output file contains information that follows the standard VCF format explained in Week 1. It has been used to create the file called “variants_long_table.csv”, which has a different structure. That is the structure we are going to explore and use here.
  • What is the structure of your file? You need first to understand the structure of your data, so that you know how to query it. The data is presented in columns and rows, where each sample is related to a set of variables.
  • What specific information do you want to use from it?
    Not all information contained in a file needs to be visually displayed. You need to think about which part of the data (which columns and which rows) conveys the message you want to share.
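The three questions above map onto a few base R commands. The sketch below is only an illustration under stated assumptions: the small data frame stands in for the real table, and its column names (SAMPLE, POS, GENE, EFFECT, AF) and values are made up for the example.

```r
# Toy data frame standing in for the variants table.
# Column names and values are illustrative assumptions.
variants <- data.frame(
  SAMPLE = c("sample1", "sample1", "sample2"),
  POS    = c(241L, 23403L, 23403L),
  GENE   = c("geneA", "geneB", "geneB"),
  EFFECT = c("upstream_gene_variant", "missense_variant", "missense_variant"),
  AF     = c(1.00, 0.99, 0.98),
  stringsAsFactors = FALSE
)

# 1. What is in the file? Look at the first rows.
print(head(variants))

# 2. What is its structure? Dimensions, column names and column types.
print(dim(variants))
print(names(variants))
str(variants)

# 3. Which part do you need? Keep only the missense variants (rows)
#    and three of the columns.
missense <- variants[variants$EFFECT == "missense_variant",
                     c("SAMPLE", "POS", "GENE")]
print(missense)

# A simple summary that could later be plotted: variants per sample.
variants_per_sample <- table(variants$SAMPLE)
print(variants_per_sample)
```

Selecting rows and columns before plotting keeps a figure focused on the one message you want it to convey.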

Explore it and visualize it in RStudio!

If you want to use an environment that allows you to explore the content of your file, to understand its structure, to query it, to extract specific information, and to visualize it, then RStudio is exactly what you need.

In the next few steps we will learn the basics of RStudio: how to install it and how to use it to explore the variants you have identified.

© Wellcome Connecting Science
This article is from the free online course Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life
