
## Wellcome Genome Campus Advanced Courses and Scientific Conferences

'RuvA protein bound to DNA - side view' by John Rafferty.

# From DNA to protein

In this article we will describe how a protein sequence is generated from a DNA sequence.

In biological systems, the process of transcription ‘transcribes’ (copies) DNA into RNA. It is this RNA molecule that serves as the actual template for the production of proteins. The process of making proteins from an RNA template is called ‘translation’ and it is carried out by the ribosome. The RNA molecule that carries the message from the DNA to the ribosome is called messenger RNA or mRNA.

The building blocks of proteins are amino acids. As with DNA, there is a convention that dictates how the string of amino acids in a protein is represented. Protein sequences are written from their amino- or N-terminus to their carboxy- or C-terminus. This is the direction in which they are read from the messenger RNA (mRNA) and synthesised by the ribosome (the amino group of each incoming amino acid is chemically attached to the C-terminus of the nascent protein). It also corresponds to the direction in which they appear in their DNA blueprint.

Each amino acid is encoded by a group of three nucleotides in the mRNA. Each three-letter word is called a codon and encodes one amino acid. This code, or correspondence between codons and amino acids, is known as the genetic code. Different species can have different genetic codes, but they all follow the same rule: within a given code, each codon always corresponds to the same amino acid; however, one amino acid can be encoded by more than one codon. A codon table can be used to decipher this code; these tables can depict either the DNA or RNA codons, the only difference being that in the RNA codon table “T” is replaced by “U”.
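As a minimal sketch of the codon-to-amino-acid correspondence described above, the following Python snippet builds a codon table as a dictionary and lists the codons that encode a given amino acid. It assumes the standard genetic code only; the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, `RNA_CODON_TABLE` and `codons_for` are illustrative choices, not part of the course material. STOP codons are written as `-`, following the notation used in this article.

```python
from itertools import product

# Standard genetic code (an assumption: other genetic codes need a different
# amino-acid string). Codons are enumerated in T, C, A, G order, i.e.
# TTT, TTC, TTA, TTG, TCT, ... and "-" marks a STOP codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY--CC-W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# DNA codon table: 64 codons, each mapped to exactly one amino acid or STOP.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# RNA codon table: identical, except that "T" is replaced by "U".
RNA_CODON_TABLE = {
    codon.replace("T", "U"): aa for codon, aa in CODON_TABLE.items()
}


def codons_for(amino_acid):
    """Return every DNA codon that encodes the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


print(CODON_TABLE["ATG"])      # M
print(RNA_CODON_TABLE["UAG"])  # -
print(codons_for("S"))         # ['AGC', 'AGT', 'TCA', 'TCC', 'TCG', 'TCT']
```

Looking up a codon always gives a single answer, while `codons_for("S")` returns six codons, which illustrates that one amino acid can be encoded by more than one codon.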

Although codon tables are often written with DNA codons, remember that DNA is transcribed into mRNA, which in turn is translated into the amino acids that form proteins. The prediction of an amino-acid sequence from a nucleotide sequence and the known genetic code is called a conceptual translation.

For the short example DNA sequence below, the amino-acid sequence would be as follows (the codon numbers are given only for reference):

Codon number		 1   2   3   4   5   6   7   8   9   10  11
Nucleotide sequence	ATG CGA TCG GAC AGT CGA GTC CAG TAG ACG ATC
Amino-acid sequence	 M   R   S   D   S   R   V   Q   -   T   I  

with the 9th codon (TAG) encoding a STOP signal. Notice that the 3rd and 5th codons are different, yet they both code for serine (S).
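The conceptual translation above can be reproduced with a short Python sketch. It assumes the standard genetic code and a sequence made up only of A, C, G and T; the function name `translate` and the variable `example` are illustrative. Like the table above, it keeps translating past the STOP codon and writes it as `-`.

```python
from itertools import product

# Standard genetic code (an assumption), codons enumerated in T, C, A, G
# order; "-" marks a STOP codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY--CC-W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Conceptually translate a DNA sequence, starting at its first base."""
    dna = dna.upper()
    # Split into consecutive codons; a trailing incomplete codon is ignored.
    usable_length = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable_length, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


example = "ATGCGATCGGACAGTCGAGTCCAGTAGACGATC"
print(translate(example))  # MRSDSRVQ-TI
```

A real protein would end at the first STOP codon; the sketch deliberately does not truncate there, so that its output can be compared letter by letter with the example.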

As mentioned previously, the genetic code is read in three-letter words called codons. This means that for a DNA sequence of known orientation, there are three possible conceptual translations: the first one starting on the first base, the second one starting on the second base and the third one starting on the third base. These are referred to as the three “reading frames”.

ATGCGATCGGACAGTCGAGTCCAGTAGACGATC	nucleotide sequence
M  R  S  D  S  R  V  Q  -  T  I	1st reading frame
C  D  R  T  V  E  S  S  R  R		2nd reading frame
A  I  G  Q  S  S  P  V  D  D		3rd reading frame

In our example, the first reading frame starts with a methionine (M) encoded by the ATG codon. If we were instead to consider the second reading frame, and therefore start “reading” the code from the second base of the nucleotide sequence, the first amino acid to be read would be cysteine (C), encoded by the TGC codon. Moreover, if we didn’t know the orientation of the nucleotide sequence, the conceptual translation could be read either from the given (forward) strand or from its reverse complement, in each case in the 5’->3’ direction. The reverse strand gives an additional three possible ways of reading the code, for a total of six reading frames.
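The reading frames can also be enumerated programmatically. The sketch below is one possible way to do it, assuming the standard genetic code and a sequence containing only A, C, G and T; `reverse_complement`, `translate`, `six_frames` and the `+1` to `-3` frame labels are illustrative names rather than an established convention of this course. The reverse frames are obtained by translating the reverse complement of the sequence.

```python
from itertools import product

# Standard genetic code (an assumption), codons enumerated in T, C, A, G
# order; "-" marks a STOP codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY--CC-W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Base-pairing rules used to build the complementary strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written in its own 5'->3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate starting `offset` bases in (0, 1 or 2); skip partial codons."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Return the three forward (+) and three reverse (-) translations."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


example = "ATGCGATCGGACAGTCGAGTCCAGTAGACGATC"
frames = six_frames(example)
print(frames["+1"])  # MRSDSRVQ-TI
print(frames["+2"])  # CDRTVESSRR
print(frames["+3"])  # AIGQSSPVDD
```

Printing `frames["-1"]`, `frames["-2"]` and `frames["-3"]` gives the three translations of the reverse strand.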

A useful tool for predicting the conceptual translation of a nucleotide sequence is the “ExPASy translate tool”. This server provides a quick and easy way of finding the amino acid sequence corresponding to a nucleotide sequence in all six possible reading frames. Why not give it a try, and check whether the three amino acid sequences offered as 1st, 2nd and 3rd reading frames in the figure above are correct? Can you find the amino acid sequences for the reverse strand?