Skip to 0 minutes and 7 seconds Hello, and welcome to this introduction to NGS Analysis software. We’re using the Galaxy platform to run these tools, and we’re using a publicly-available dataset obtained by sequencing the BRCA1 gene using an Illumina sequencer. This is paired-in data, so there are two FASTQ files. And as we can see, the combined size of the two files is about 2.5GB Let’s begin by looking at one of the FASTQ files.
Skip to 0 minutes and 39 seconds The FASTQ format consists of four lines. The first is the Identifier line. This can include information about the machine, the run, the flow cell, and the position on the flow cell. The second line is the nucleotide sequence. The third line is the quality score identifier line, which in this case, just contains a plus. The fourth line contains the quality scores. Each nucleotide in the sequence has a quality score. This is a measure of the level of competence that the nucleotide has been correctly identified.
Skip to 1 minute and 17 seconds The first tool we’re going to use is called FastQC. This will give us an overview of the quality of our data.
Skip to 1 minute and 33 seconds We’ll select each of our files in turn and run the software.
Skip to 1 minute and 50 seconds Let’s open one of the reports.
Skip to 1 minute and 56 seconds FastQC performs several analyses on the FASTQ data. For now, we’re going to focus on just one of these– the per-base sequence quality.
Skip to 2 minutes and 8 seconds For each position along the sequence, FASTQ calculates the mean quality score shown by this blue line, the interquartile range, shown by these yellow boxes, and the 10th and 90th percentile range, shown by these black whiskers. The quality scores are shown on the y-axis. This is a lock score. A value of 20 represents a 1 in 100 chance of a nucleotide being wrong. 30 is a 1 in 1,000 chance, and 40 is a 1 in 10,000 chance. We can see that the quality of the sequence deteriorates toward the 3 prime end. This is normal for this type of data.
Skip to 2 minutes and 54 seconds The next step is to remove the poor quality data. Poor quality sequence will generate poor quality alignments and lead to incorrect variant calling. There are several tools we can choose. In this case, I’m going to use a tool called Trimmomatic, which has been designed for paired-in sequence. However, before I can use this, I need to use another tool called FASTQ Groomer.
Skip to 3 minutes and 23 seconds Because there are different versions of FASTQ and different versions of the quality score, we use FASTQ Groomer to determine which version of FASTQ we’re looking at.
Skip to 3 minutes and 47 seconds Now that we’ve run FASTQ Groomer, we can run Trimmomatic.
Skip to 4 minutes and 1 second Trimmomatic starts at the 5 prime end of the sequence, and it looks at a window of four nucleotides.
Skip to 4 minutes and 12 seconds And it calculates the mean quality score over those four nucleotides. If it’s above our threshold, which in this case is 20, then the tool moves up one nucleotide and calculates the next average, and so on. If it gets to a value below our threshold, it trims the sequence at that point. This could lead to us having very short sequences that could be as short as four nucleotides, which will cause problems during alignment, so we’re going to introduce another filter to remove any sequences that are less than 60 nucleotides.
Skip to 4 minutes and 59 seconds And let’s just adjust this so it selects our two files and execute.
Skip to 5 minutes and 12 seconds As you can see, the output of Trimmomatic is four files. Two of the files contain sequences which are unpaired. The other member of the pair has been removed, because it didn’t meet our quality thresholds. We also have two files of paired sequences where both members of the pair have been retained. Let’s run FastQC again, and we’ll see how our quality control step has altered the output.
Skip to 6 minutes and 3 seconds As we can see from the FastQC output, the deterioration of the quality scores towards the 3 prime end is not as great as it was. The next stage of our analysis is to align our sequences to a reference sequence. We’re going to use a tool called BWA.
Skip to 6 minutes and 26 seconds We’re going to align our paired sequences to the HG19 human reference sequence.
Skip to 6 minutes and 41 seconds And it’s 9.
Skip to 6 minutes and 53 seconds Let’s have a look at the outputs of BWA.
Skip to 6 minutes and 59 seconds We can see that it’s generated a SAM file.
Skip to 7 minutes and 4 seconds So the lines at the start of the file that begin with this @ sign, these tell us about the reference sequence that has been used and as we continue to scroll down, we can see the alignment of our sequences.
Skip to 7 minutes and 23 seconds In column 1, we can see the name of the sequence from the FASTQ file. This next column is the flag fields. We’ve got the information about where the sequence is aligned to. You can see everything is aligning in chromosome 17, because that’s where BRCA1 is. In this column, there’s the mapping quality. Again, this relates to the probability that the mapping is wrong and here is the cigar string, which gives details of the matches, mismatches, gaps and insertions in the alignment between the two sequences. And we can scroll across and see the sequence that we’ve aligned and here is our quality scores for the sequence.
Skip to 8 minutes and 20 seconds The next stage is to use our SAM file to call variants and to do that, we’re going to use a tool called Freebayes.
Skip to 8 minutes and 39 seconds So it’s got our alignment file here. We need to make sure we align it to the HG19 reference sequence. And we also want to look at the coverage.
Skip to 9 minutes and 1 second So let’s select for a coverage of at least 30.
Skip to 9 minutes and 7 seconds Let’s look at the output of our variant calling.
Skip to 9 minutes and 15 seconds This is a VCF file. There are two types of line in this file– the header lines that start with hashes and the variant calls. If we scroll down further, as we can see, there are only two variants called, and that’s because we’ve only sequenced a single gene. Let’s just scroll across, show the rest of the entries.
Skip to 9 minutes and 48 seconds So we can annotate these variants using the Anavar tool.
Skip to 9 minutes and 59 seconds This allows us to annotate our variants. We’re going to use information on gene structure from ref gene and let’s use dbSNP, click Execute. So let’s look at our annotated VCF file.
Skip to 10 minutes and 28 seconds If we scroll down again, we can see that there’s some extra annotation that’s been added. We can see there are– there’s information here from ref gene.
Skip to 10 minutes and 47 seconds And if we scroll across a bit further, here in this variant, we can see there’s an RS number from dbSNP Obviously, if this was a whole exome or a large gene panel, there would be many more variants, and the next stage would be to pass the annotated variant list to a clinical scientist to determine whether any of the variants were likely to be deleterious. Finally, I’d just like to say that there are many alternatives to the bioinformatics tools that we’ve seen here. The tools you choose and the parameters you select can greatly affect the resulting set of variants.
Skip to 11 minutes and 21 seconds As a clinical bioinformatician, you’ll need to make sure that you validate your bioinformatics pipeline and that you record the tools and parameters used. That’s it for now. Thanks for watching, and enjoy the rest of the course.
Galaxy Tool Demonstration
Dr Michael Cornell, Clinical Bioinformatics Lecturer at University of Manchester, provides an in depth demonstration of the Galaxy tool.
He will bring the workflow to life and will show how clinical bioinformaticians use tools such as FastQC, FASTQ Groomer, Trimmomatic, BWA and FreeBayes.
© University of Manchester