Genomics and transcriptomics both involve the generation of huge data sets, but the data are meaningless unless they are analysed correctly. There are three billion base pairs in the human genome of each individual cell. Hundreds of cells are sequenced each time a sequencing array is performed. While not all the DNA from each cell is sequenced each time, the result is still a very, very big data file of up to one terabyte of data. Although known transcribed genes make up just 3% of the genome, transcriptome arrays still generate very large data sets containing over half a million exons, coding and non-coding. Furthermore, the genome and transcriptome are not sequenced in one continuous stretch from chromosome 1 to chromosome 23.
They are sequenced in many small segments, which then need to be put together into genes and chromosomes using computer programmes developed by bioinformaticians. But the bioinformatician's role does not end there. Generally speaking, bioinformaticians design and implement computer programmes or data analysis algorithms that can be used to extract biologically meaningful relations from the large amounts of data generated by modern high-throughput measurement techniques used in genomics and transcriptomics. Such algorithms can, for instance, be used to compare several sequences for the same sample in order to determine what are genuine mutations and what could be sequencing errors. Or to compare multiple samples to determine which mutations are commonly found in certain cancers, as not all mutations are involved in cancer.
Or to find correlations between particular mutations and response to therapy, to give just a few examples. Bioinformaticians have also developed algorithms to generate the cancer signatures which you've learned about. Importantly, bioinformaticians also develop software which can be used to visualise the data, enabling biologists to inspect it. Once the data set is presented in a meaningful way, a statistician must make sure the results are accurate and not due to chance. For example, in most statistical tests a p-value below a significance level of 0.05 is considered significant. In plain English, a p-value generally refers to the probability of getting a result at least as extreme as the one observed if there is in fact no difference.
So a significance level of 0.05 means that, when there is in fact no real difference, the result would still appear significant by chance about once out of 20 times. This is acceptable most of the time, and stricter experiments use a significance level of 0.01, which means a result would appear significant by chance only once out of 100 times. However, when high-throughput sequencing is used and half a million different sets of results are compared, then even with a significance level of 0.01, about 5,000 results (500,000 × 0.01) would be expected to appear significant when they are in fact not. This is called the multiple testing problem, and it can make it very difficult to tell which are real results and which are due to chance.
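The arithmetic behind the multiple testing problem can be checked with a short Python sketch. It is illustrative only and not part of the original talk: the function names are hypothetical, and it assumes that none of the comparisons reflects a real difference, so that every p-value is uniformly distributed between 0 and 1.

```python
import random


def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of tests that look significant purely by chance,
    assuming no test has a real underlying difference."""
    return num_tests * alpha


def simulate_false_positives(num_tests: int, alpha: float, seed: int = 0) -> int:
    """Draw one p-value per test under the 'no real difference' assumption
    (uniform on [0, 1)) and count how many fall below the threshold."""
    rng = random.Random(seed)
    return sum(1 for _ in range(num_tests) if rng.random() < alpha)


if __name__ == "__main__":
    # Half a million comparisons at a significance level of 0.01.
    print(expected_false_positives(500_000, 0.01))
    print(simulate_false_positives(500_000, 0.01))
```

The simulated count varies with the random seed, but it should land close to the expected value of 5,000.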
To correct for this, there are a number of different methods called multiple test corrections, such as the Bonferroni correction or the Benjamini-Hochberg method. So the bioinformatician is often also a biostatistician. The bioinformatician's role is therefore very important in new omics technologies. They are involved right from the start of the experimental loop, from experimental design and hypothesising, through to data analysis, rejecting or accepting hypotheses, and visual display of the results.
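As a rough illustration of the two corrections named above, the following Python sketch implements their textbook forms. The function names and the example p-values are made up for illustration, and real analyses would normally rely on an established statistics package rather than hand-written code. The Bonferroni correction divides the significance level by the number of tests; the Benjamini-Hochberg method sorts the p-values, finds the largest rank k whose p-value is at most (k / m) × alpha, and accepts as significant every result up to that rank.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag each test as significant only if its p-value is at most
    alpha divided by the total number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-up procedure controlling the false discovery rate:
    find the largest rank k with p_(k) <= (k / m) * alpha and flag
    the k smallest p-values as significant."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    significant = [False] * m
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant


if __name__ == "__main__":
    # Hypothetical p-values from eight comparisons.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(bonferroni(example))
    print(benjamini_hochberg(example))
```

On these example values Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the two smallest, which reflects the usual trade-off: Bonferroni is stricter, and Benjamini-Hochberg tolerates a controlled proportion of false discoveries in exchange for more findings.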
Dr Camille Huser describes the role of the bioinformatician in large-scale data collection and analysis.
© University of Glasgow