Bioinformatics feature extraction

Okay, there are many ways to extract features from a protein sequence. In this session I want to explain some of the simple features for bioinformatics sequences, and how you can convert a sequence into a vector. If we focus on the protein sequence first, there are some simple techniques. The first one we call amino acid occurrence. Amino acid occurrence is the number of amino acids of each type present in a protein. For example, here there are some residues of a protein, and the amino acid occurrence gives the information about each of the 20 amino acids in this protein.
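The amino acid occurrence described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the video: the function name, the alphabetical ordering of the 20 one-letter codes, and the example sequence are all assumptions made for the sketch.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_occurrence(sequence):
    """Return a 20-value list with the count of each amino acid type."""
    counts = Counter(sequence.upper())
    # Amino acids that never appear get a count of zero.
    return [counts[aa] for aa in AMINO_ACIDS]


# Hypothetical example sequence, used only for illustration.
occurrence = amino_acid_occurrence("AAARRNDCC")
print(dict(zip(AMINO_ACIDS, occurrence)))
```

The result is a fixed-length vector of 20 integers, whatever the length of the input sequence, which is what makes it usable as machine learning input.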
For example, you can check how many amino acids A there are, how many Asp, how many cysteine, and so on. After you calculate it, you get the occurrence of each amino acid, and then you can feed it into machine learning. Another simple way to extract features we call amino acid composition, AAC. How do you calculate the AAC? The AAC of an amino acid is equal to the total number of times that amino acid appears in the sequence, divided by the total length of the sequence. For example, we have a sequence here which contains four amino acids A, three amino acids R, two amino acids N, and one amino acid D.
After that, you calculate the frequency of each amino acid. You can convert A into 0.4, R into 0.3, N into 0.2, and D into 0.1. For the other amino acids, because they do not appear in the sequence, you set them to zero. From this, you can see that a sequence can be transformed into a vector which contains 20 values, one for each amino acid. The second one, similar to amino acid composition, is the dipeptide composition, DPC. It is just a little bit different from the AAC: in DPC, we use the number of occurrences of a pair of amino acids, and divide by the total length of the sequence.
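The amino acid composition calculation can be sketched as follows. This is only an illustration: the function name and the ten-residue example sequence are assumptions, chosen so that the result reproduces the frequencies quoted above (A 0.4, R 0.3, N 0.2, D 0.1).

```python
# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-value list: count of each amino acid / sequence length."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]


# Hypothetical sequence with four A, three R, two N and one D.
aac = amino_acid_composition("AAAARRRNND")
# Show only the amino acids that actually appear; all others are zero.
print({aa: value for aa, value in zip(AMINO_ACIDS, aac) if value > 0})
```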
Here you pair two amino acids together, like AA, AR, and also NN. After that, you have a vector which contains 400 values, and you feed the 400 values into machine learning algorithms so that they can learn from them. That feature we call DPC, which stands for dipeptide composition. Some traditional bioinformatics research also prefers to use motif features, because in a sequence there are motifs that define the function of the protein. So if we take the information from these motifs and feed it into machine learning as well, it should be efficient at classifying the functions of a protein sequence or DNA sequence. This works if you know the motif of that function.
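The 400-value dipeptide composition can be sketched like this. The function name, the `normalize_by_pairs` parameter, and the example sequence are assumptions made for illustration. By default the sketch divides by the sequence length, as described in the video; some implementations divide by the number of overlapping pairs (length minus one) instead, which the optional parameter switches on.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs: AA, AC, AD, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence, normalize_by_pairs=False):
    """Return a 400-value list of dipeptide frequencies."""
    sequence = sequence.upper()
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width 2 over the sequence (overlapping pairs).
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs with non-standard letters
            counts[pair] += 1
    denominator = len(sequence) - 1 if normalize_by_pairs else len(sequence)
    return [counts[dp] / denominator for dp in DIPEPTIDES]


# Hypothetical example sequence.
dpc = dipeptide_composition("AAAARRRNND")
print(len(dpc), dpc[DIPEPTIDES.index("AA")], dpc[DIPEPTIDES.index("AR")])
```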
You can then count how many times that motif appears in the sequence, and use this count as a feature. Nowadays, many people also prefer to use profile features like this one. This is a Position-Specific Scoring Matrix (PSSM) profile. A PSSM profile is a matrix whose dimensions are the 20 amino acids by the sequence length. If you want to use the PSSM profile, you need to sum up all of the rows that belong to the same amino acid.
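Counting motif occurrences, as mentioned above, can be sketched with regular expressions. The function names and the example sequence are assumptions. The pattern `N[^P][ST][^P]` is used only as an illustration; it corresponds to the well-known N-glycosylation site pattern and is not taken from the video.

```python
import re


def count_motif(sequence, pattern):
    """Count occurrences of a regex motif, including overlapping ones."""
    # The lookahead lets matches start at every position, so overlapping
    # occurrences are all counted.
    return len(re.findall(f"(?=({pattern}))", sequence.upper()))


def motif_features(sequence, patterns):
    """Return one count per motif pattern, as a feature vector."""
    return [count_motif(sequence, pattern) for pattern in patterns]


# Hypothetical sequence and motif list, for illustration only.
features = motif_features("ANASANGT", ["N[^P][ST][^P]", "AN"])
print(features)
```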
That converts the PSSM profile into a 20 by 20 matrix, or you can even use it as a vector of length 400. Then you use some machine learning algorithm to learn from that kind of feature. Moving on to DNA sequence features: DNA sequence features are very similar to protein sequence features, and you can also have different kinds of features. I just want to explain the simplest one for DNA sequences. Here I show the k-mer feature, which is the most common feature for DNA sequences. For the other features, I can show you the website.
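The PSSM step described above (summing the rows that belong to the same amino acid to get a 20 by 20 matrix, then flattening it to 400 values) can be sketched in plain Python. The function name, the column order, and the toy scores are assumptions: a real PSSM would come from a tool such as PSI-BLAST, and the scores here are made up purely to show the mechanics.

```python
# Column order assumed for the PSSM (an assumption for this sketch).
PSSM_ORDER = "ARNDCQEGHILKMFPSTWYV"


def pssm_to_400(sequence, pssm):
    """Sum PSSM rows per residue type and flatten the 20x20 result."""
    if len(sequence) != len(pssm):
        raise ValueError("the PSSM needs one row per residue")
    index = {aa: i for i, aa in enumerate(PSSM_ORDER)}
    summed = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence.upper(), pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row needs 20 scores")
        target = index.get(residue)
        if target is None:  # skip non-standard residues
            continue
        for column, score in enumerate(row):
            summed[target][column] += score
    # Flatten row by row into a single vector of 400 values.
    return [value for row in summed for value in row]


# Toy input: three residues with made-up scores, not real PSSM values.
toy_pssm = [[1] * 20, [2] * 20, [3] * 20]
features = pssm_to_400("AAR", toy_pssm)
print(len(features), features[0], features[20])
```

The output length is always 400, so proteins of different lengths end up with vectors of the same size.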
Then you can follow how to generate the features there. You can see that DNA sequences are represented as the occurrence frequencies of k neighboring nucleic acids. And you can see that N_rs is the number of pairs represented by type r and type s. If you want to generate the k-mer features, here is an example for the 2-mer feature. If you have a DNA sequence like this and you use 2-mers, which means you take two nucleotides together, then you can get the frequency of each pair of nucleotides.
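The k-mer frequencies can be sketched as below. The function name and the example sequence are assumptions, and the sketch normalizes by the number of overlapping windows (length minus k plus one), which is one common convention and may differ from the formula shown in the video.

```python
from itertools import product


def kmer_frequencies(sequence, k=2):
    """Return a dict mapping each of the 4**k k-mers to its frequency."""
    sequence = sequence.upper()
    total = len(sequence) - k + 1  # number of overlapping windows
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(total):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
    return {kmer: counts[kmer] / total for kmer in kmers}


# Hypothetical DNA sequence; with k=2 the vector has 16 values.
frequencies = kmer_frequencies("ACGTAC", k=2)
print(len(frequencies), frequencies["AC"])
```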
After that you can transform the sequence into a vector like this. In some cases, many bioinformatics studies prefer to use the reverse k-mer, which means that from this k-mer they transform to another version of the vector, as in the table below.
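A possible sketch of the reverse k-mer follows. It assumes that "reverse k-mer" refers to the reverse complement k-mer, where each k-mer is merged with its reverse complement so the vector becomes shorter (10 values instead of 16 for k = 2). That interpretation, the function names, and the example sequence are assumptions, not details confirmed by the video.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string, e.g. AC -> GT."""
    return kmer.translate(COMPLEMENT)[::-1]


def rc_kmer_frequencies(sequence, k=2):
    """Frequencies where each k-mer is merged with its reverse complement."""
    sequence = sequence.upper()
    total = len(sequence) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Keep one representative per pair: the alphabetically smaller one.
    canonical = sorted({min(km, reverse_complement(km)) for km in all_kmers})
    counts = dict.fromkeys(canonical, 0)
    for i in range(total):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip ambiguous bases
            counts[min(kmer, reverse_complement(kmer))] += 1
    return {km: counts[km] / total for km in canonical}


# Hypothetical DNA sequence.
frequencies = rc_kmer_frequencies("ACGTAC", k=2)
print(len(frequencies), frequencies["AC"])
```

In this sketch, AC and GT share one entry because GT is the reverse complement of AC.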

In this video, Dr. Khanh introduces bioinformatics feature extraction and DNA sequence features. The examples he uses are from different research papers. We also provide them in the link below for your reference.

This article is from the free online course Artificial Intelligence in Bioinformatics.
