
Bioinformatics data

In this next session, I want to show you the formats of bioinformatics data. What are the different kinds of data formats in bioinformatics? Let's first start with protein structure and go through it in detail. This is the structural organization of proteins. As you know, proteins have four different levels of structure. The first one we call the primary structure. Then come the secondary structure and the tertiary structure.
And the last one is the quaternary structure. If you study bioinformatics, you will mostly focus on the sequence, that is, the primary structure. The second focus is the tertiary structure, that is, the 3D structure, which here represents the other three levels. The primary structure, the sequence, contains a lot of amino acids, whereas the secondary structure contains the alpha helix and the beta sheet. Here is an example of the 3D structure of a protein. This is 6WTT, a protein structure related to SARS-CoV-2, the virus behind COVID-19. And here you can see it.
The protein structure looks like this figure, and you can see the helices, the beta strands and the beta sheets, as well as some binding information from the 3D structure of the protein. If you want to perform a bioinformatics study on this kind of data set, you use it in structural bioinformatics, for example in molecular docking or virtual screening. Currently, many people also apply AI to this kind of data set, because the 3D structure of a protein can be treated as an image. You can then apply an AI model to this image to generate the outcome. Next, as I mentioned, is the primary structure of proteins. We call it the protein sequence.
Protein sequences consist of 20 different kinds of chemical compounds. We call them amino acids, and they serve as the building blocks of proteins. Here is an example of a protein sequence. The sequence contains the 20 amino acids arranged in different orders. In a bioinformatics study, you try to understand the arrangement, the positions, and the motifs of the amino acids inside the sequence. Different proteins have different motifs and perform different functions. That's why we need to study them. And to understand a protein sequence, you need to work with the FASTA format.
Most sequence work deals with the FASTA format, so what does the FASTA format look like? Here I show you three sequences in FASTA format. The FASTA format has two components. The first one is the header. The header contains the information about the protein or the DNA. The second is the sequence, which is the data that we use. The sequence contains four nucleotides for a DNA sequence and 20 amino acids for a protein sequence. So those are the two components of the FASTA format. Here is a protein sequence, as you can see.
If you download a protein sequence or DNA sequence from a public resource, the data set will look like this one. The first line is the title. The title includes the ID of the protein and also its description. Then comes the sequence, which contains a lot of amino acids. Here I list about 22 amino acids in this table. A protein sequence is composed of these amino acids in different arrangements. To study the different functions of proteins, there are many, many studies that try to identify the motifs in the sequences, as I mentioned.
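The two-part FASTA layout described above (a title line, then the sequence) and the idea of searching a sequence for a motif can be sketched in Python. This is a minimal illustration and not part of the lecture: the function names `read_fasta` and `find_motif`, the two example records, and their IDs are all made up for the example. The pattern `N[^P][ST][^P]` is the commonly cited N-glycosylation site pattern, used here only as a sample motif.

```python
import io
import re


def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    A record starts at a line beginning with '>' (the title line);
    every following line up to the next '>' is part of the sequence.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_motif(sequence, pattern):
    """Return (1-based position, matched text) for every motif occurrence.

    The lookahead lets overlapping occurrences be reported as well.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Hypothetical records, invented for this sketch (not real database entries).
EXAMPLE = """>seq1 hypothetical protein A
MKNGTAYIAK
QRNPSAVLNV
>seq2 hypothetical protein B
GAVLIMC
"""

MOTIF = "N[^P][ST][^P]"  # sample motif only; swap in the motif you care about

records = list(read_fasta(io.StringIO(EXAMPLE)))
hits = {header.split()[0]: find_motif(seq, MOTIF) for header, seq in records}

for header, seq in records:
    seq_id = header.split()[0]
    print(seq_id, len(seq), hits[seq_id])
```

The parser keeps the title and the sequence separate, so later analysis can use only the sequence. For real projects, an established library such as Biopython is probably a better choice than a hand-written parser.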
That is a protein sequence, so what does a DNA sequence look like? Here is another example, a DNA sequence. Because it is also in FASTA format, the DNA sequence also contains a title. The second part is the sequence. For the study, we use the sequence information, not the title. We just use the sequence to understand the different functions in it. The next one is the gene expression level. What does gene expression data look like? Here is a table that shows the expression levels of different genes from different patients. Every column represents one patient, that is, one subject in the study.
And here you can see the genes. Each row represents one gene, and the values inside are the gene expression levels. Each value is the expression level of the corresponding gene, and to work with gene expression levels, you need to use this kind of data set. If you download data from the GEO website, it will look like this one. With a protein sequence, you focus on the text, the letters inside the sequence. Here, you apply your AI algorithms to the expression levels instead.
You look at whether some patients have different levels, and you can use those differences in level to classify the patients.
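The table layout described above (one row per gene, one column per patient) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the gene names, patient IDs, expression values, and group labels are invented, and the helper names `load_expression` and `group_mean_difference` are hypothetical. A real GEO download has extra metadata lines that this sketch does not handle.

```python
import csv
import io
from statistics import mean

# Invented tab-separated table: rows are genes, columns are patients.
TABLE = """gene\tP1\tP2\tP3\tP4
GENE_A\t2.0\t2.5\t8.0\t7.5
GENE_B\t5.0\t5.5\t5.0\t4.5
"""

# Invented group labels for the four patients.
GROUPS = {"P1": "control", "P2": "control", "P3": "case", "P4": "case"}


def load_expression(handle):
    """Read the table into {gene: {patient: expression level}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column holds the gene name
    matrix = {}
    for row in reader:
        if not row:
            continue
        matrix[row[0]] = dict(zip(samples, map(float, row[1:])))
    return samples, matrix


def group_mean_difference(matrix, groups, a="case", b="control"):
    """For each gene, return mean(level in group a) - mean(level in group b)."""
    diffs = {}
    for gene, values in matrix.items():
        mean_a = mean(v for s, v in values.items() if groups[s] == a)
        mean_b = mean(v for s, v in values.items() if groups[s] == b)
        diffs[gene] = mean_a - mean_b
    return diffs


samples, matrix = load_expression(io.StringIO(TABLE))
diffs = group_mean_difference(matrix, GROUPS)

for gene, diff in diffs.items():
    print(gene, diff)
```

A gene whose level differs strongly between the groups, like the made-up `GENE_A` here, is the kind of feature a classifier could use to separate patients. A difference of means is only a first look and is not a statistical test.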

In this video, Dr. Khanh Le explains the types of bioinformatics data. He starts with a chart on protein structure. Then, he explains the FASTA format for sequences.

FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. Next, Dr. Khanh Le explains gene expression levels.
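Because FASTA uses single-letter codes for both nucleotides and amino acids, a rough check of which kind of sequence a record holds can be sketched as below. This is a heuristic illustration only: the function name `guess_sequence_type` is hypothetical, and the alphabets cover just the four DNA bases plus `N` for an unknown base and the 20 standard amino acids, without the additional ambiguity codes.

```python
DNA_LETTERS = set("ACGTN")  # four bases plus N for an unknown base
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def guess_sequence_type(sequence):
    """Guess 'dna', 'protein', or 'unknown' from the letters used.

    Heuristic: a short protein made only of A, C, G, T, and N
    would be reported as DNA.
    """
    letters = set(sequence.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_LETTERS:
        return "dna"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


print(guess_sequence_type("ATGCGT"))
print(guess_sequence_type("MKTAYIAK"))
```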

This article is from the free online

Artificial Intelligence in Bioinformatics

Created by
FutureLearn - Learning For Life
