AI-based bioinformatics workflow

Hello, everyone. This is the third part of this course. In today's session, I will focus on feature engineering, that is, how you can extract different features in bioinformatics. Here is the outline for today's talk. First, I will explain the AI-based bioinformatics workflow, meaning the overall workflow and how you can perform bioinformatics studies using AI. Second, I will explain feature extraction in bioinformatics, that is, how you can build algorithms that can understand bioinformatics data. Third, I will show how you can extract bioinformatics features using some websites and some packages. And last, I will also explain some further feature sets.
The first section is the AI-based bioinformatics workflow. I will introduce some workflows in bioinformatics, just very simple ones, and show how you can apply AI and machine learning in bioinformatics. Mostly, we will use supervised learning, which is a branch of machine learning. I have drawn this figure to show you how you can apply machine learning in bioinformatics.
For example, you have an unknown sequence, and you don't know whether it belongs to function A or not (non-A). For this, you need to apply binary classification, right? If you take a protein sequence or DNA sequence and feed it into the model, the model will predict whether that protein has function A or not. Here is the process for the machine learning algorithm that you can build. You will have two training data sets: the first from class A and the second from class non-A. The training data are fed into a machine learning algorithm, which then generates a predictive model.
In the second step, you use that predictive model, meaning the model that has already been optimized. When you want to predict an unknown sequence, you provide that sequence to the model, and the model returns the prediction. That is the workflow of binary classification in bioinformatics using sequence data. The second case is multi-class classification, and this is an example of multi-class protein classification: you want to check which class, that is, which protein function, protein A belongs to. This is a more common problem in bioinformatics using sequence data.
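The two steps described above (train on A and non-A examples, then predict an unknown sequence) can be sketched in a few lines. This is only an illustration under assumptions that are not from the talk: amino-acid composition is used as the feature vector, a simple nearest-centroid rule stands in for "the machine learning algorithm", and the sequences are made-up toy strings rather than real proteins.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq):
    """Feature vector: fraction of each of the 20 amino acids in seq."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def train(sequences, labels):
    """Step 1: build the model by averaging the feature vectors of each class."""
    grouped = {}
    for seq, label in zip(sequences, labels):
        grouped.setdefault(label, []).append(composition(seq))
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(model, seq):
    """Step 2: return the label whose class centroid is closest to seq."""
    x = composition(seq)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Made-up toy training data (assumption): two sets, A and non-A.
train_seqs = ["KKRKRKKR", "RKKRRKKK", "LLVVAILL", "AVLLIVVA"]
train_labels = ["A", "A", "non-A", "non-A"]

model = train(train_seqs, train_labels)
print(predict(model, "KRKKRRKR"))  # A
print(predict(model, "VLLAIVLA"))  # non-A
```

In practice the feature extraction and the learning algorithm would both be more elaborate, but the shape of the workflow stays the same: labelled training data in, predictive model out, then unknown sequence in, predicted label out.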
In real cases, one protein can have many functions. It is not only binary classification, not only this function or not; you can also try to classify which functions a protein has. Here is the example: if you have a protein sequence and you want to perform multi-class classification into class one, two, three, or four, how can you do it? Similar to binary classification, I can provide a flowchart like this. You have training data, and now your training data contain four classes. You feed them into machine learning algorithms, very much as in binary classification.
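The four-class flow can be sketched the same way as the binary one. The example below is a hedged illustration, not the method from the talk: it assumes amino-acid composition features and a 1-nearest-neighbour rule (where the "model" is simply the stored training set), and the sequences and class names are made up.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq):
    """Feature vector: fraction of each of the 20 amino acids in seq."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def nearest_neighbour(training, seq):
    """Return the class label of the training sequence most similar to seq."""
    x = composition(seq)

    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(x, composition(item[0])))

    return min(training, key=squared_distance)[1]


# Made-up toy training data (assumption): one example per class, four classes.
training = [
    ("KKRKRKKR", "class 1"),
    ("LLVVAILL", "class 2"),
    ("DEEDDEED", "class 3"),
    ("STSSTTST", "class 4"),
]

for unknown in ["RKRKKRKK", "DDEEDEDE", "TSTSSTTS"]:
    print(unknown, "->", nearest_neighbour(training, unknown))
```

The only change from the binary case is the number of labels in the training data; the prediction is one of the four classes instead of A or non-A.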
You generate a predictive model, and you use the predictive model to predict unknown proteins. Now your model covers four classes, and the prediction it returns is one of the four classes for the protein sequence. As I mentioned in the artificial intelligence session, another branch is that you can also apply natural language processing (NLP) in bioinformatics. So what is the idea? Here is the difference between natural language and the language of bioinformatics. Here, I use a word like "medical". This is a very natural word in the medical field. If you want to use an NLP technique on this word, how can you do it? You separate the word into different sub-words.
If you use the 1-gram technique, each unit is a single letter, so you divide the word into M, E, D, I, and so on. If you use 2-grams, the NLP model treats two letters together as one word, like ME. For 3-grams it is MED, and so on, moving along until the end of the word or the sentence. Very similarly, if you think about applying NLP techniques to a biological sequence, such as DNA or protein, you can treat one nucleotide as one word, like here.
That is the 1-gram case: one nucleotide as one word. With 2-grams, two nucleotides are one word; with 3-grams, three nucleotides are one word; and so on. So you can use many n-gram levels to treat a DNA sequence as natural language and then apply an NLP model. In this course, I will also show you how you can apply NLP as well as machine learning in bioinformatics.
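The n-gram idea can be illustrated with a short function. This sketch assumes overlapping n-grams that slide one character at a time, which the talk does not specify, and the DNA string is a made-up example.

```python
def ngrams(text, n):
    """Split text into overlapping n-grams, sliding one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# A natural-language word
print(ngrams("medical", 1))  # ['m', 'e', 'd', 'i', 'c', 'a', 'l']
print(ngrams("medical", 2))  # ['me', 'ed', 'di', 'ic', 'ca', 'al']
print(ngrams("medical", 3))  # ['med', 'edi', 'dic', 'ica', 'cal']

# A made-up DNA sequence (assumption), treated the same way
dna = "ATGCGT"
print(ngrams(dna, 1))  # ['A', 'T', 'G', 'C', 'G', 'T']
print(ngrams(dna, 2))  # ['AT', 'TG', 'GC', 'CG', 'GT']
print(ngrams(dna, 3))  # ['ATG', 'TGC', 'GCG', 'CGT']
```

Each resulting n-gram plays the role of a "word", so the list can be passed to the same kinds of models that are used for text.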

Dr. Khanh first explains the AI-based bioinformatics workflow, using protein classification as an example. You can see the flowchart for multi-class classification.

If the training data contain four classes, they are fed into machine learning algorithms, much as in binary classification, to produce a predictive model that performs multi-class classification. Next, he explains natural language processing.

This article is from the free online course Artificial Intelligence in Bioinformatics.

Created by
FutureLearn - Learning For Life
