Let’s explore the data

Now that we have set up our environment and installed and loaded the required packages, we are ready to begin exploring the data.

As a reminder, we first want to explore the “variants_long_table.csv” table and the data in it.

Declare your data

1. Make sure the data is in your working folder

First, transfer your input files (i.e. the output from Week 2 of this course) into your working directory. To check that the input files are in your current working directory, you can either use the “Files” tab in the lower right-hand panel of RStudio, or run the list.files() command in R.

> list.files()
[1] "Script_B4B_Advanced.R"
[2] "summary_variants_metrics_mqc.csv"
[3] "variants_long_table.csv"

The output shows that the working directory now contains the variant files we are going to work with, along with the script file you previously created.
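The same check can also be done in code rather than by eye. The sketch below is only an illustration: it assumes nothing about your own folder and instead creates a temporary directory (the name `demo_dir` is hypothetical) holding an empty placeholder file named like the course input, so that it runs anywhere.

```r
# Hypothetical setup: a temporary folder standing in for the working directory
demo_dir <- tempfile("b4b_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, "variants_long_table.csv"))

# List the files in that folder, keeping only those ending in .csv
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
print(csv_files)

# Test for one specific file; the result is TRUE or FALSE
input_found <- file.exists(file.path(demo_dir, "variants_long_table.csv"))
print(input_found)
```

In your own session you would call file.exists("variants_long_table.csv") directly, and getwd() prints the folder that R currently uses as the working directory.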

2. Import the data into a variable in RStudio

For further processing, you now need to import your data into RStudio. There are several data structures you can import or create and then query in R: vector, matrix, array, list and data frame. This course will mainly focus on data frames.

We will create a variable called “var” into which we will import the data. The input file is a table in “.csv” format, which the read.csv() function imports as a data frame without further formatting.

> var <- read.csv("variants_long_table.csv")

You can check the imported data by displaying its first rows:

> head(var)
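To try the import step without the course file, here is a minimal sketch that writes a small toy table to a temporary “.csv” file and reads it back. Only the column names SAMPLE, CHROM and DP come from the course data; the POS column, the object names `toy` and `var_toy`, and all the values are invented for illustration.

```r
# Hypothetical toy table (invented values, not the course data)
toy <- data.frame(
  SAMPLE = c("S1", "S1", "S2", "S2", "S3", "S3", "S3", "S4"),
  CHROM  = rep("chr1", 8),
  POS    = c(100L, 200L, 300L, 400L, 500L, 600L, 700L, 800L),
  DP     = c(38L, 369L, 1014L, 2766L, 120L, 845L, 4120L, 560L)
)

# Write the table to a temporary .csv file, then import it as in the text
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
var_toy <- read.csv(csv_path)

# head() shows the first 6 rows by default; the n argument changes that number
print(head(var_toy))
print(head(var_toy, n = 3))

# tail() does the same for the last rows
print(tail(var_toy, n = 2))
```

The first row of the file is read as the column names because read.csv() uses header = TRUE by default.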

Exploring the file and data structure

To explore the data structure further, we are going to use other basic R functions. This step is essential: inspecting the data structure shows you what the data contains and how to use it, and helps you identify transformations that may be required at later stages.

Important: Remember that lines beginning with one or more “#” are comments that explain the code and are not meant to be run. The code you need to run is shown after the R prompt “>” (the default prompt in R, which you should not type). Results are shown after the command, without the “>” prompt.

1. The dim() function

# Check the dimension of the data
> dim(var)
[1] 153 16

The data contains 153 entries (rows) and 16 columns, each column being an individual variable.
Alternatively, you can query the number of rows and columns separately.

# Check number of rows
> nrow(var)
[1] 153
# Check number of columns
> ncol(var)
[1] 16
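The three functions are linked: dim() returns a vector whose first element is the number of rows and whose second is the number of columns. A minimal sketch on a hypothetical three-row toy data frame (invented values):

```r
# Hypothetical toy data frame with 3 rows and 2 columns
toy <- data.frame(
  SAMPLE = c("S1", "S2", "S3"),
  DP     = c(38L, 369L, 1014L)
)

# dim() returns both dimensions as an integer vector: rows first, then columns
d <- dim(toy)
print(d)

# Each element can be extracted by position
n_rows <- d[1]
n_cols <- d[2]

# These match the values returned by nrow() and ncol()
print(n_rows == nrow(toy))
print(n_cols == ncol(toy))
```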

2. The str() function

# Display the structure of your R object
> str(var)

[Screenshot: R displaying the data structure of the 16 variables]

The output is a compact display of the structure of an R object, here the variable “var” containing our input data. An alternative is the glimpse() function from the dplyr package.
Note how the class and dimensions of the object are indicated in the first line, and how the type of each column is indicated (chr: character, int: integer, num: numeric, etc.).
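If you need the column types as a value you can reuse, rather than as printed text, one option is to apply class() to every column with sapply(). This is a sketch on a hypothetical toy data frame; the AF column and all values are invented for illustration.

```r
# Hypothetical toy data frame with one character, one integer and one numeric column
toy <- data.frame(
  SAMPLE = c("S1", "S2", "S3"),
  DP     = c(38L, 369L, 1014L),
  AF     = c(0.25, 0.5, 1),
  stringsAsFactors = FALSE
)

# Compact display, as in the text
str(toy)

# Class of each column, returned as a named character vector
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns holding numbers (integer columns count as numeric here)
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
print(numeric_cols)
```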

3. The summary() function

When the column to inspect contains numerical data, this function returns the minimum, maximum, mean, median, and 1st and 3rd quartiles of the data. It can also be used to inspect all columns of the table at once. We will run the summary(var) command but, because of space constraints, will not show the whole output. Note that the “$” operator selects only part of the data frame, here a single column.

# Summary statistics of the whole data or specified columns

## For the whole table

> summary(var)

## For non-numerical data

> summary(var$SAMPLE)

   Length     Class      Mode
      153 character character

## For numerical data

> summary(var$DP)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     38     369    1014    2635    2766   41836
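The result of summary() on a numerical column can be stored and queried by name. For a character column, where summary() only reports the length, class and mode, table() is one way to count how often each value occurs. A minimal sketch on a hypothetical toy data frame (invented values):

```r
# Hypothetical toy data frame
toy <- data.frame(
  SAMPLE = c("S1", "S1", "S2", "S3", "S3", "S3"),
  DP     = c(10, 20, 30, 40, 50, 60)
)

# Store the summary statistics of the numerical column
dp_summary <- summary(toy$DP)
print(dp_summary)

# Extract individual statistics by name
dp_median <- dp_summary[["Median"]]
dp_mean   <- dp_summary[["Mean"]]

# Count the number of rows per sample in the character column
sample_counts <- table(toy$SAMPLE)
print(sample_counts)
```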

4. The class() function

You can also check the class of the whole table or of individual columns using the class() function. The typeof() function instead returns the internal storage type of an object, which can differ from its class. Remember that “var” stores the data, and that the “$” operator specifies individual columns:

# Check the class of your data

> class(var)

[1] "data.frame"

# Check the class of an individual column

> class(var$CHROM)

[1] "character"

> typeof(var$CHROM)

[1] "character"
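For a character column, class() and typeof() return the same value, but that is not always the case. The sketch below, on a hypothetical toy data frame with invented values, shows two cases where they differ: a data frame is stored internally as a list, and a factor is stored as integers.

```r
# Hypothetical toy data frame
toy <- data.frame(
  CHROM = c("chr1", "chr1", "chr2"),
  DP    = c(38L, 369L, 1014L),
  stringsAsFactors = FALSE
)

# A data frame has class "data.frame" but is stored as a list of columns
table_class <- class(toy)
table_type  <- typeof(toy)
print(c(table_class, table_type))

# A factor has class "factor" but is stored as integer codes
chrom_factor <- factor(toy$CHROM)
factor_class <- class(chrom_factor)
factor_type  <- typeof(chrom_factor)
print(c(factor_class, factor_type))
```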

5. The View() function

# Preview the data using a spreadsheet-style data viewer in RStudio
> View(var)

[Screenshot: spreadsheet-style view of the data, with one arrow pointing to the word “Filter” in the upper-left corner and another pointing to the number of entries and total columns in the lower-left corner]

A new tab will appear in the upper-left panel, displaying the data in a spreadsheet-style tabular format. This view provides:
– interactive control of the data, such as filtering columns (upper arrow)
– the number of entries and columns (lower arrow)

In summary

Here is a summary of the functions we covered that are useful to inspect data frames. Note that some can also be used for R objects other than data frames.

Data inspection   Function    Description
Content           head()      First rows, by default 6 rows are displayed
                  tail()      Last rows, by default 6 rows are displayed
                  View()      Interactive spreadsheet-style display
Dimension         dim()       Number of rows and number of columns
                  nrow()      Number of rows
                  ncol()      Number of columns
Structure         str()       Structure of the data, includes class of objects
                  class()     Class of the object specified in query
Statistics        summary()   Summary statistics (numerical data)

© Wellcome Connecting Science
This article is from the free online

Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life