Galaxy Tool Demonstration

Dr Michael Cornell, Clinical Bioinformatics Lecturer at the University of Manchester, provides an in-depth demonstration of the Galaxy tool.
Hello, and welcome to this introduction to NGS analysis software. We’re using the Galaxy platform to run these tools, and we’re using a publicly available dataset obtained by sequencing the BRCA1 gene using an Illumina sequencer. This is paired-end data, so there are two FASTQ files. And as we can see, the combined size of the two files is about 2.5 GB. Let’s begin by looking at one of the FASTQ files.
The FASTQ format consists of four lines per sequence. The first is the identifier line. This can include information about the machine, the run, the flow cell, and the position on the flow cell. The second line is the nucleotide sequence. The third line is the quality score identifier line, which in this case just contains a plus. The fourth line contains the quality scores. Each nucleotide in the sequence has a quality score. This is a measure of the level of confidence that the nucleotide has been correctly identified.
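The four-line layout described above can be sketched in a few lines of Python. This is only an illustration: the function name `read_fastq`, the example read, and the assumption that quality characters use the Phred+33 encoding (each character's ASCII code minus 33 gives the score) are not taken from the video.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str       # line 1, without the leading '@'
    sequence: str         # line 2, the nucleotide sequence
    qualities: List[int]  # line 4, decoded to one integer score per nucleotide


def read_fastq(handle, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes Phred+33 by default)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # line 3: the '+' quality score identifier line
        quality_line = handle.readline().rstrip("\n")
        qualities = [ord(char) - offset for char in quality_line]
        yield FastqRecord(header[1:], sequence, qualities)


# A made-up record, purely for illustration.
example = io.StringIO("@read1\nGATTACA\n+\nIIIII5#\n")
records = list(read_fastq(example))
for record in records:
    print(record.identifier, record.sequence, record.qualities)
```

Each nucleotide ends up paired with one integer score, which is the form that the quality control steps work with.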
The first tool we’re going to use is called FastQC. This will give us an overview of the quality of our data.
We’ll select each of our files in turn and run the software.
Let’s open one of the reports.
FastQC performs several analyses on the FASTQ data. For now, we’re going to focus on just one of these: the per-base sequence quality.
For each position along the sequence, FastQC calculates the mean quality score, shown by this blue line, the interquartile range, shown by these yellow boxes, and the 10th to 90th percentile range, shown by these black whiskers. The quality scores are shown on the y-axis. This is a logarithmic score. A value of 20 represents a 1 in 100 chance of a nucleotide being wrong, 30 is a 1 in 1,000 chance, and 40 is a 1 in 10,000 chance. We can see that the quality of the sequence deteriorates toward the 3 prime end. This is normal for this type of data.
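The relationship between a quality score and the chance of a wrong base call follows the standard Phred definition, Q = -10 log10(p). A minimal sketch, with function names chosen here for illustration:

```python
import math


def phred_to_error_probability(quality: float) -> float:
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability: float) -> float:
    """Phred quality score for a given probability of a wrong base call."""
    return -10 * math.log10(probability)


for quality in (20, 30, 40):
    probability = phred_to_error_probability(quality)
    print(f"Q{quality}: about 1 in {round(1 / probability)} chance of an error")
```

Because the scale is logarithmic, every additional 10 points means a tenfold drop in the chance of an error.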
The next step is to remove the poor quality data. Poor quality sequence will generate poor quality alignments and lead to incorrect variant calling. There are several tools we can choose. In this case, I’m going to use a tool called Trimmomatic, which has been designed for paired-end sequence. However, before I can use this, I need to use another tool called FASTQ Groomer.
Because there are different versions of FASTQ and different versions of the quality score, we use FASTQ Groomer to determine which version of FASTQ we’re looking at.
Now that we’ve run FASTQ Groomer, we can run Trimmomatic.
Trimmomatic starts at the 5 prime end of the sequence, and it looks at a window of four nucleotides.
And it calculates the mean quality score over those four nucleotides. If it’s above our threshold, which in this case is 20, then the tool moves up one nucleotide and calculates the next average, and so on. If it gets to a value below our threshold, it trims the sequence at that point. This could lead to us having very short sequences that could be as short as four nucleotides, which will cause problems during alignment, so we’re going to introduce another filter to remove any sequences that are less than 60 nucleotides.
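The sliding window idea can be sketched as follows. This is a simplified illustration of the description above, not Trimmomatic's own implementation: the function names are hypothetical, and the sketch assumes the read is cut at the start of the first window whose mean falls below the threshold.

```python
from typing import List, Optional, Tuple


def sliding_window_keep_length(qualities: List[int], window: int = 4,
                               threshold: float = 20.0) -> int:
    """Number of bases to keep, counting from the 5 prime end."""
    if len(qualities) < window:
        return 0  # assumption: reads too short to evaluate are dropped
    for start in range(len(qualities) - window + 1):
        mean_quality = sum(qualities[start:start + window]) / window
        if mean_quality < threshold:
            return start  # cut where the first failing window begins
    return len(qualities)  # no window failed, so keep the whole read


def trim_read(sequence: str, qualities: List[int], window: int = 4,
              threshold: float = 20.0,
              min_length: int = 60) -> Optional[Tuple[str, List[int]]]:
    """Trim a read, or return None if the result is shorter than min_length."""
    keep = sliding_window_keep_length(qualities, window, threshold)
    if keep < min_length:
        return None
    return sequence[:keep], qualities[:keep]


# Made-up reads: good quality at the 5 prime end, poor quality at the 3 prime end.
long_enough = trim_read("A" * 80, [30] * 70 + [10] * 10)
too_short = trim_read("A" * 80, [30] * 50 + [10] * 30)
print(len(long_enough[0]) if long_enough else None, too_short)
```

The second read is discarded entirely because fewer than 60 nucleotides survive trimming, which is the role of the minimum length filter.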
And let’s just adjust this so it selects our two files and execute.
As you can see, the output of Trimmomatic is four files. Two of the files contain sequences which are unpaired. The other member of the pair has been removed, because it didn’t meet our quality thresholds. We also have two files of paired sequences where both members of the pair have been retained. Let’s run FastQC again, and we’ll see how our quality control step has altered the output.
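How the four output files arise can be illustrated with a small sketch. The function `split_pairs` and the read names are hypothetical, and the sketch assumes each mate has already been trimmed, with `None` standing for a read that failed the quality thresholds.

```python
from typing import Dict, List, Optional


def split_pairs(forward: Dict[str, Optional[str]],
                reverse: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Sort read names into the four outputs, based on which mates survived."""
    outputs: Dict[str, List[str]] = {
        "forward_paired": [], "forward_unpaired": [],
        "reverse_paired": [], "reverse_unpaired": [],
    }
    for name, forward_read in forward.items():
        reverse_read = reverse.get(name)
        if forward_read is not None and reverse_read is not None:
            # Both mates passed: each goes to its paired output.
            outputs["forward_paired"].append(name)
            outputs["reverse_paired"].append(name)
        elif forward_read is not None:
            outputs["forward_unpaired"].append(name)
        elif reverse_read is not None:
            outputs["reverse_unpaired"].append(name)
        # If neither mate passed, the pair is dropped altogether.
    return outputs


forward_reads = {"pair1": "ACGT", "pair2": "ACGT", "pair3": None, "pair4": None}
reverse_reads = {"pair1": "TTGA", "pair2": None, "pair3": "TTGA", "pair4": None}
outputs = split_pairs(forward_reads, reverse_reads)
print(outputs)
```

The two paired outputs always contain the same read names, so they can be passed together to the aligner.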
As we can see from the FastQC output, the deterioration of the quality scores towards the 3 prime end is not as great as it was. The next stage of our analysis is to align our sequences to a reference sequence. We’re going to use a tool called BWA.
We’re going to align our paired sequences to the hg19 human reference sequence.
Let’s have a look at the outputs of BWA.
We can see that it’s generated a SAM file.
The lines at the start of the file that begin with this @ sign tell us about the reference sequence that has been used, and as we continue to scroll down, we can see the alignment of our sequences.
In column 1, we can see the name of the sequence from the FASTQ file. This next column is the flag field. We’ve got the information about where the sequence is aligned to. You can see everything is aligning to chromosome 17, because that’s where BRCA1 is. In this column, there’s the mapping quality. Again, this relates to the probability that the mapping is wrong. And here is the CIGAR string, which gives details of the matches, mismatches, gaps and insertions in the alignment between the two sequences. And we can scroll across and see the sequence that we’ve aligned, and here are the quality scores for the sequence.
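The columns described above can be pulled apart with a short sketch. It relies on the column order defined in the SAM specification (name, flag, reference name, position, mapping quality, CIGAR, three mate-related columns, sequence, qualities). The function names and the example alignment line are made up for illustration.

```python
import re
from typing import Dict, List, Tuple

# One CIGAR operation is a length followed by an operation letter, e.g. "5M".
CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '5M1I4M' into (length, operation) pairs."""
    if cigar == "*":
        return []  # '*' means no alignment information is available
    return [(int(length), op) for length, op in CIGAR_OPERATION.findall(cigar)]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Extract the columns discussed in the text from one alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "name": fields[0],
        "flag": int(fields[1]),
        "reference": fields[2],
        "position": int(fields[3]),
        "mapping_quality": int(fields[4]),
        "cigar": parse_cigar(fields[5]),
        "sequence": fields[9],
        "qualities": fields[10],
    }


sam_lines = [
    "@SQ\tSN:chr17\tLN:81195210",  # header line describing the reference
    "read1\t99\tchr17\t41197700\t60\t5M1I4M\t=\t41197900\t210\tGATTACAGAT\tIIIIIIIIII",
]
# Header lines start with '@'; everything else is an alignment.
alignments = [parse_sam_line(line) for line in sam_lines if not line.startswith("@")]
record = alignments[0]
is_paired = bool(record["flag"] & 0x1)  # bit 0x1 of the flag marks a paired read
print(record["reference"], record["position"], record["cigar"], is_paired)
```

In this example the CIGAR string reads as five aligned bases, one inserted base, then four more aligned bases, which together account for all ten nucleotides of the read.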
The next stage is to use our SAM file to call variants, and to do that, we’re going to use a tool called FreeBayes.
So it’s got our alignment file here. We need to make sure we select the hg19 reference sequence. And we also want to look at the coverage.
So let’s select for a coverage of at least 30.
Let’s look at the output of our variant calling.
This is a VCF file. There are two types of line in this file: the header lines that start with hashes, and the variant calls. If we scroll down further, as we can see, there are only two variants called, and that’s because we’ve only sequenced a single gene. Let’s just scroll across to show the rest of the entries.
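The two types of line can be told apart with a short sketch. In the video the coverage threshold is set inside the variant caller itself; here, purely for illustration, the same idea is applied after the fact by reading the `DP` (depth) entry of the INFO column. The function names and the variant lines are made up and are not the variants shown in the video.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_info(info: str) -> Dict[str, str]:
    """Turn an INFO column such as 'DP=120;AF=0.5' into a dictionary."""
    result = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def variants_with_min_depth(lines: Iterable[str], min_depth: int = 30
                            ) -> Iterator[Tuple[str, int, str, str, int]]:
    """Yield (chromosome, position, ref, alt, depth) for well-covered variants."""
    for line in lines:
        if line.startswith("#"):
            continue  # header lines start with hashes
        fields = line.rstrip("\n").split("\t")
        chromosome, position, _identifier, ref, alt = fields[:5]
        depth = int(parse_info(fields[7]).get("DP") or 0)
        if depth >= min_depth:
            yield chromosome, int(position), ref, alt, depth


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t41200000\t.\tA\tG\t500\t.\tDP=120;AF=0.5",
    "chr17\t41210000\t.\tC\tT\t40\t.\tDP=12;AF=0.5",
]
kept = list(variants_with_min_depth(vcf_lines, min_depth=30))
print(kept)
```

Only the first made-up variant is kept, because the second is supported by fewer than 30 reads.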
We can annotate these variants using the ANNOVAR tool.
We’re going to use information on gene structure from RefGene, and let’s also use dbSNP, then click Execute. So let’s look at our annotated VCF file.
If we scroll down again, we can see that there’s some extra annotation that’s been added. We can see there’s information here from RefGene.
And if we scroll across a bit further, here in this variant, we can see there’s an rs number from dbSNP. Obviously, if this was a whole exome or a large gene panel, there would be many more variants, and the next stage would be to pass the annotated variant list to a clinical scientist to determine whether any of the variants were likely to be deleterious. Finally, I’d just like to say that there are many alternatives to the bioinformatics tools that we’ve seen here. The tools you choose and the parameters you select can greatly affect the resulting set of variants.
As a clinical bioinformatician, you’ll need to make sure that you validate your bioinformatics pipeline and that you record the tools and parameters used. That’s it for now. Thanks for watching, and enjoy the rest of the course.

Dr Michael Cornell brings the workflow to life and shows how clinical bioinformaticians use tools such as FastQC, FASTQ Groomer, Trimmomatic, BWA and FreeBayes.

This article is from the free online

Clinical Bioinformatics: Unlocking Genomics in Healthcare
