Machine Learning in Genomics

In this written interview, Dr Nicole Wheeler discusses how she designed and evaluated a short training course on Machine Learning applied in Genomics, addressed to an audience of Computer Scientists. She used online tools to develop course materials and for evaluation.

Nicole Wheeler is a Fellow at the University of Birmingham and a consultant for NTI bio. Dr Wheeler’s work focuses on the development of screening tools for identifying DNA from emerging biological threats, establishing genomic pathogen surveillance in resource-limited settings, One Health surveillance of antimicrobial resistance and the ethical development of artificial intelligence (AI) for health applications.

Question 1: Hi Nicole, how is Machine Learning applied in Genomics and why is it important to train others on this topic?

Machine learning is becoming essential for processing the enormous amount of genome sequencing data we’re producing. It’s a way of taking large amounts of data and searching for patterns or associations in the data.

A major area of application at the moment is using machine learning to link genotype to phenotype. This has a range of applications, from improving our ability to understand human genetic diseases, to better identifying the risk a new strain of an infectious disease might pose, to better informing animal breeding for agriculture. It’s important to train others in machine learning for genomics because there is an increasing amount of work to be done in linking genotype to phenotype, and bringing in highly skilled people to contribute to this work will improve the quality of the analysis we’re doing.

Another important point is that, over time, an increasing number of people will be working with the results of these analyses as part of public health or research. It is therefore important for the consumers of this information to understand how the methods used in this area today work, how much we can trust the inferences they make, and which pitfalls to look out for when deciding how much to trust a set of results.

Question 2: Machine Learning is usually a topic more familiar to Computer Scientists, while Genomics is a subject familiar to Biologists. Who was your target audience in this course and how did you make sure that they grasped the concepts outside of their comfort zone?

In the workshop I led, the target audience was computer scientists. I gave an introductory talk on basic concepts in genomics relevant to the exercise and placed a particular emphasis on the computational and analytical challenges that arise due to the biology. Where possible, I compared these biology-specific challenges to well-known and relatable applications of machine learning, such as movie recommendations on Netflix.

Question 3: How long did the course run for, and how did you adapt the learning outcomes for that particular duration?

The course consisted of a lecture followed by a two-hour workshop. Not knowing how quickly the students would pick up the biology concepts, I created a simple tutorial workbook for them to work through. This left a lot of time for questions and 1:1 discussions with participants, and also gave the students the opportunity to adapt the workbook with their own approaches to modelling the data. The students were able to apply their existing machine learning skills to undertake a self-directed exploration of the data and discuss their ideas with me as they generated them.

Question 4: What types of resources did you use to demonstrate the applications of ML in Genomics?

In my lecture, I was able to demonstrate some of the real-world problems with infectious diseases that we hope can be addressed with genomics and machine learning, including some early successes in using ML for surveillance and risk assessment. Then, in the tutorial, I pulled out pieces of DNA the ML model could use for predicting antimicrobial resistance in bacteria and had the students look into whether they showed evidence of being functionally involved in resistance. We discussed how these pieces of DNA could then be used in different technological applications to screen for resistant bacteria in real-world settings.
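
To make the idea of pulling out predictive pieces of DNA more concrete, here is a minimal Python sketch. It is not the workshop's code: the toy sequences, the resistance labels, the k-mer length and the function names `kmers` and `rank_kmers` are all assumptions made for illustration, and a simple difference in presence frequency stands in for the feature importances of a trained ML model.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) in a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_kmers(genomes, labels, k=4):
    """Rank k-mers by how much more often they occur in resistant genomes
    (label 1) than in susceptible genomes (label 0).

    Hypothetical helper: a crude stand-in for model feature importance.
    """
    n_res = sum(labels)
    n_sus = len(labels) - n_res
    if n_res == 0 or n_sus == 0:
        raise ValueError("need at least one resistant and one susceptible genome")

    res_counts, sus_counts = Counter(), Counter()
    for seq, label in zip(genomes, labels):
        # Count presence/absence per genome, not the number of occurrences.
        (res_counts if label else sus_counts).update(kmers(seq, k))

    scores = {
        km: res_counts[km] / n_res - sus_counts[km] / n_sus
        for km in set(res_counts) | set(sus_counts)
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up toy data: two "resistant" and two "susceptible" sequences.
genomes = ["ACGTTGCA", "TTACGTTG", "ACGAAGCA", "TTACGAAG"]
labels = [1, 1, 0, 0]

top = rank_kmers(genomes, labels, k=4)[:3]
for kmer, score in top:
    print(kmer, score)
```

The top-ranked k-mers are the ones a student would then investigate, for example by checking whether they fall inside a gene with a plausible role in resistance. Real analyses would use much longer k-mers, many more genomes and a proper model with held-out evaluation.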

Question 5: How did you evaluate the impact of your course? What feedback did you get?

I was able to track the students’ progress in the tutorial during the session and infer from their follow-up ideas how well they had grasped the key concepts. The students came up with clever ideas for adapting the analysis and were able to use the data to test whether their ideas worked.

The true impact of the course was actually realised over the longer term through the online materials I published. Publishing the tutorial on the web platform Kaggle allowed me to see how many people viewed and downloaded the data over time, so I was able to note that over 8,000 people have viewed the data and almost 700 have downloaded it to start building their own models. I also published a summary of the workshop on Medium, which has been viewed nearly 4,000 times. I have received emails from students and researchers around the world who have used the code and data to begin their own projects.

Question 6: What would you do differently next time?

My next tutorial will be aimed at clinicians and scientists interested in machine learning, so it will focus more on the fundamentals of machine learning for people who are less familiar with this aspect of the work. I think publishing the materials online for a broader audience worked well and I look forward to tracking levels of engagement in the two tutorials aimed at different audiences. As an improvement, I’d like to design more tests of students’ critical thinking and ability to identify biases and errors in the algorithm, to give the students skills and confidence to evaluate algorithms they come across outside of the course.

© Wellcome Connecting Science
This article is from the free online course Train the Trainer: Design Genomics and Bioinformatics Training.
