(0.27) Within the activity of the DNA - the information for life - we have today an interview with Ferran Casals, head of the Genomics Core Facility at the Pompeu Fabra University in Barcelona. Ferran Casals has worked on evolutionary genomics at several institutions in Barcelona and in Canada. His current research interests include the development and application of new experimental methodologies for genetic analysis, the study of the genetic etiology of rare diseases, and the analysis of genetic variation across human populations, with a focus on functional variants that are important in relation to disease. At the Genomics Core Facility, he also develops projects covering a variety of applications of next-generation sequencing. Ferran, what does it mean to know a genome?

(1.22) What do we understand by sequencing a genome? I would say that knowing the genome is, in fact, understanding the phenotype. I imagine the phenotype as the output and the genome as the input. So the first step would be sequencing, and sequencing is writing the 3 billion characters of text that we have in each one of our cells. But this is just sequencing; it’s just writing the information. Then we have to understand how we process this information to have a phenotype. So understanding the genome is understanding the phenotype. But the first step is to obtain this sequence. When you are asked to obtain a sequence in your facility, what do you do, how do you begin?

(2.09) Well, the first step is always a quality control of the sample, but let’s assume it’s a sample with no quality problems. Then, if we want to sequence a whole genome, the first thing we do is fragment this DNA sample. We fragment it because the equipment that we have can read only short fragments of DNA. For now, we are not able to read a full chromosome, so we have to fragment the DNA, load these short fragments into the equipment, and then sequence them. Before that, where do you obtain the DNA from? Well, DNA can come from anywhere.

(2.54) Usually samples that come from patients come from saliva or from blood, but you can get DNA samples from any cell in your body. And then, when you have this DNA fragmented, how do you proceed after that? With the current technology - what is called second-generation or short-read sequencing technology - once we have fragmented the DNA, we have to link some adaptors to the ends of each fragment. These adaptors are known sequences, so we need to have known sequences at both ends of each fragment, because the sequencing reaction is based on a PCR, and as you know, in a PCR you need to start from some known sequences.

(3.43) So you have your sequencing primers that will anneal to these known sequences, and from there, you will start the sequencing reaction. And then, you do the sequencing reaction, obtaining many short sequences. What do you do with them? What is new with this technology is the parallelization. Before, we were able to have about 96 reactions in a run, and now we have millions of reactions in a single run. That means millions of reads. When we sequence, we take these fragments of DNA that we have talked about, we add adaptors, and by sequencing we transform these fragments into reads. We call them “reads” of sequence. We transform these fragments into something we can read.

(4.33) So fragments of, let’s say, about 200 or 300 characters of text that we can read with our computers. Characters that are the unit of information in the DNA itself. Exactly. A, T, C, and G, with some quality parameter. So what the equipment produces is what we call FASTQ files. These are standard sequence files that also carry some information about the quality of the sequencing. So at this step, what you have for the genome is lots of short sequences. And then? And then is where bioinformatics starts. I like to say that the bottleneck now is not sequencing; it is the analysis that comes after sequencing. The first step is what we call “mapping”.

(5.26) We need to know where in the genome these sequences come from. So what we do is map all these millions of sequences to the reference genome, the human one, for example. So at the end, we have something like a puzzle with different parts of the genome mapped to the proper place. With that, for example, in case of wanting to obtain the de novo genome, this is the way to proceed. We have the puzzle of different reads mapped to different parts of the genome. In humans - in DNA from a patient with a rare disease, for example - the next step would be getting all the positions that are different in this patient compared to the reference genome.

(6.12) So after mapping, the next step is what we call variant calling. We want to know which positions in this individual are different from the reference genome. This is to know whether you are able to pinpoint the places in the genome that cause the disease. Exactly. What we are assuming here, the logic, is that a disease will originate from a mutation, that some base is different in this patient compared to most of the people in the population. So in the general setting, the bioinformatic part is the part that really gets from these thousands or millions of fragments to a real, whole genome. How big is this genome?

(7.06) This genome in humans is 3 billion characters. So it’s impossible to process this information manually. We need bioinformatics here. But I would say that bioinformatics is not only transforming the FASTQ files into variant call files. It is also the interpretation that we make afterwards. For example, in the case of RNA sequencing, we can interrogate different tissues in an individual to see which genes are expressed in each tissue. In this case we sequence the RNA and map it to the reference genome; we would know which genes are expressed in each tissue, and which genes are differentially expressed in different tissues or in different patients, but then we still have some work to do.

(7.56) We need to see which networks are enriched in genes that are overexpressed or underexpressed in a set of patients. Again, you are interested in how the genome acts, the function of the genome. So you have been talking about obtaining the genome of a given tissue and obtaining the expression pattern of a given tissue. If I want to obtain my own genome, how much might it cost? Well, we like to say that we’re getting closer to the $1000 genome. I would say that we are almost there, but that figure refers mostly to experimental costs. I think it refers to the wet lab part. Then we should also count all the costs of bioinformatics.

(8.45) If we stay only with the raw sequence, if what you want is to go home with your 3 billion characters, that’s okay. But if you want some interpretation, we have much more work to do with this data. So that means that we are in a time in which obtaining the sequence is very easy compared to a few years ago. Yes. Obtaining a sequence is very fast. Sequencing the human genome took about 10 or 15 years, hundreds of labs, and 1 billion dollars, and now sequencing one genome takes only two days of work and about 1000 dollars. For the bioinformatic part, could you use a laptop or do you need a big computer? Yes, you need a big computer.

(9.29) You need to use clusters connected to the equipment to process all this information. So that means it’s not something that we can do at home. Yes, yes. One thing is to obtain the sequence, and the next step will be to understand it. How far along are we in this step of going beyond the single sequence and understanding the information in the genome? Well, I think it depends on the question, or on which level of information you need. For example, say we have been able to identify the causal genetic variant producing a rare disease. We could say that’s enough, we have the nature of all of this studied, we have identified the causal variant. But we always have more questions.

(10.13) How the information is processed from the genetic variant to the phenotype. How the genetic variant produces this particular phenotype. Which protein interactions we will have in the middle of this process. So we’re able to answer some basic questions, but we still have a long way to go. In the case of the Genomics Core Facility, what kind of samples do you get? Will someone come and ask for their own genome or a genome for specific diseases? What kind of questions do people come to you with? Well, regarding medical genomics, it’s quite diverse. We have some sets of patients with rare diseases, and then we sequence the exome, or the coding fraction of the genome.

(11.06) In some cases we are interested only in sequencing a set of candidate genes, so we know that there are 20, 30, 40 genes that are likely to be related to this disease, and then we sequence only these genes in a particular set of patients. In other cases, maybe we want to do the whole genome because we also want to discover some structural variation. And not everything is human. We also have samples from other organisms. For example, we have lots of users interested in metagenomics; that is, taking a sample of saliva or water and sequencing everything that is in this drop of water to see which microorganisms are there.

(11.48) Are we far from the time in which every newborn will have the whole genome sequenced? I don’t think we are far from that, at least in developed countries. Because the cost of the tests that are currently done will, I think, soon be close to the cost of sequencing the genome. You have the information that you need and other information that maybe you will need later.

(12.17) But the big question is: is it worth doing it? Are we going to gain in health and quality of life by sequencing our genome? I think that in health, probably yes: we will have the information that we had before from the few tests for a few rare diseases that we are currently doing, and we will have other information that can be useful in the life of this individual. But then, what we have to understand is that using the information has to be a decision of each individual. Okay, thank you very much. We have seen that we are close to the day in which all our genomes will be sequenced.

(12.59) We will have much more information with that, but that opens a lot of new questions. Not only ethical questions, but also scientific questions that we are not yet able to answer.

Conversation with Ferran Casals

Ferran Casals, Head of the Genomics Service, Pompeu Fabra University.

His current research interests include the development and application of new experimental methodologies for genetic analyses, the study of the genetic etiology of rare diseases, and the analysis of the genetic variation across human populations.

Important concepts from the conversation

1. Fragmenting a sample of DNA (2.16)

DNA is a very long double helix that in nature is organized into as many molecules as there are chromosomes. When we extract it in the lab, it breaks into pieces. Some sequencing techniques need the DNA to be in short, uniform pieces, and producing them is called fragmentation of the DNA. This may be done physically or enzymatically (for example, with restriction enzymes).

2. Adaptors (3.20)

Short fragments of DNA of known sequence, chemically synthesized, that can be ligated to the ends of other DNA or RNA molecules.

3. PCR (3.37)

Polymerase chain reaction (PCR) is a technique used in molecular biology to amplify a single copy or a few copies of a segment of DNA across several orders of magnitude, generating thousands to millions of copies of a particular DNA sequence. It is an easy, cheap, and reliable way to repeatedly replicate a focused segment of DNA, a technique which is applicable to numerous fields in modern biology.
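
The jump "across several orders of magnitude" follows from repeated doubling. As a rough sketch, assuming an idealized reaction in which a fixed fraction of templates is copied in every cycle (the function name `pcr_copies` and the `efficiency` parameter are illustrative, not part of the course material):

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of copies after a number of PCR cycles.

    efficiency = 1.0 is the ideal case in which every template is
    duplicated in every cycle; real reactions are less efficient.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Starting from a single molecule under the ideal-doubling assumption.
    for n in (10, 20, 30):
        print(n, f"{pcr_copies(1, n):.0f}")
```

Under this assumption, ten cycles already give about a thousand copies and twenty cycles about a million, which is where "thousands to millions of copies" comes from.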

4. Reads of sequence (4.20)

In a sequencing reaction of any technology, the run of consecutive nucleotides that is obtained is called a read. Traditional (Sanger) sequencing has a read length of 600-800 base pairs, the widely used Illumina sequencing by synthesis some 300 bp, and the techniques that read single molecules may achieve more than 20,000.

5. FastQ file (4.48)

FASTQ format is a text-based format for storing both a biological sequence (usually DNA) and its corresponding quality scores. The sequence letter and quality score are each encoded with a single ASCII character for brevity.
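
To make the "one character per base, one character per quality score" idea concrete, here is a minimal reading sketch. It assumes the common four-line record layout (header, sequence, `+` line, quality) and the Phred+33 encoding, in which the quality score is the ASCII code of the character minus 33; files from older instruments may use a different offset. The name `parse_fastq` is illustrative.

```python
from typing import Iterator, List, Tuple


def parse_fastq(text: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read name, sequence, quality scores) for each FASTQ record.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # One ASCII character per base: score = character code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nII5!\n"
    for name, sequence, scores in parse_fastq(example):
        print(name, sequence, scores)
```

In the example record, the quality string `II5!` decodes to the scores 40, 40, 20 and 0, so the last base is the least trustworthy.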

6. Mapping sequences (5.20)

Mapping a given DNA sequence to a genome means placing it in the correct (and, desirably, unique) position of the genome.
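
As a toy illustration of what "correct and unique position" means, the sketch below looks for exact occurrences of a read in a reference string. This is only an assumption-laden simplification: real mapping programs use indexed data structures and tolerate mismatches and gaps, which this sketch does not. The function name `map_read` and the example sequences are made up.

```python
from typing import List


def map_read(reference: str, read: str) -> List[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


if __name__ == "__main__":
    reference = "ACGTTGCAACGT"
    for read in ("TTGC", "ACGT", "GGGG"):
        hits = map_read(reference, read)
        if len(hits) == 1:
            status = "unique"
        elif hits:
            status = "ambiguous"
        else:
            status = "unmapped"
        print(read, hits, status)
```

A read that matches in one place is uniquely mapped; a read that matches in several places (as repeated sequence does) cannot be placed with confidence, which is why uniqueness is desirable.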

7. De novo genome (5.46)

Building a new genome from the reads obtained with a specific sequencing technology, using a computer program. De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome.
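
The idea of joining short sequences into longer ones without a reference can be sketched with a greedy overlap merge. This is a deliberately simple illustration, assuming error-free reads from one strand and no repeats; production assemblers use more robust graph-based methods. The names `overlap` and `greedy_assemble` and the reads are made up.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACA", "TACAGGA"]
    print(greedy_assemble(reads))
```

With these three overlapping reads the sketch rebuilds a single 13-base sequence; reads that share no sufficiently long overlap are simply left as separate pieces.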

8. Reference genome (6.00)

An already known genome of a given species that is used as a guide for assembling newly sequenced genomes.

9. Variant calling (7.18)

Computational methods used to identify variation in a genome (for example, a single nucleotide variant, SNV, or single nucleotide polymorphism, SNP) from the results of sequencing experiments. For a given individual, the result is typically a list of differences relative to the reference genome.
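
A minimal sketch of how such a list of differences could be produced: pile up the mapped reads over the reference and report positions where the reads consistently disagree with it. It assumes reads are already mapped without gaps, given as (start position, sequence) pairs that lie within the reference, and it only reports positions where almost all reads agree on a different base. The function name `call_variants` and the thresholds `min_depth` and `min_fraction` are illustrative choices, not a description of any particular variant caller.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple


def call_variants(
    reference: str,
    mapped_reads: List[Tuple[int, str]],
    min_depth: int = 2,
    min_fraction: float = 0.8,
) -> List[Tuple[int, str, str]]:
    """Return (0-based position, reference base, observed base) for each variant."""
    # Pileup: for every reference position, count the bases the reads show there.
    pileup: DefaultDict[int, Counter] = defaultdict(Counter)
    for start, sequence in mapped_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset][base] += 1

    variants = []
    for position in sorted(pileup):
        counts = pileup[position]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if (
            depth >= min_depth
            and base != reference[position]
            and support / depth >= min_fraction
        ):
            variants.append((position, reference[position], base))
    return variants


if __name__ == "__main__":
    reference = "ACGTTGCAAC"
    reads = [(0, "ACGAT"), (2, "GATGC"), (3, "ATGCA")]
    print(call_variants(reference, reads))
```

In the example, all three reads show an A where the reference has a T at position 3, so that single position is reported. Requiring a minimum depth guards against calling a variant from one possibly erroneous read.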

10. Computer cluster (9.26)

A computer cluster consists of a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. In most circumstances, all of the nodes use the same hardware and the same operating system.

11. Protein interactions (10.20)

In the chemical reactions that shape life, proteins play a main role through their physical contacts with other proteins or other molecules, including DNA, and the different outcomes these contacts produce.

12. Exome (10.58)

The exome is the part of the genome formed by exons, the sequences which, when transcribed, remain within the mature RNA after introns are removed by RNA splicing. It consists of all DNA that is transcribed into mature RNA in cells of any type. The human exome makes up about 1% of the total genome, or about 30 megabases of DNA. Though it comprises a very small fraction of the genome, the exome is thought to harbor about 85% of the mutations that have a large effect on disease.

13. Candidate genes (11.03)

In a genetic study, a candidate gene approach focuses on analyzing a pre-defined set of genes that are thought (by external evidence) to be related to the condition, trait or disease.

This video is from the free online course:

Why Biology Matters: Basic Concepts

Pompeu Fabra University Barcelona