Finding and annotating genes in a genome
In this course we will focus on the annotation of coding sequences, that is to say, those regions of the genome that form the template for making proteins. From now on, we will call each of these regions a CDS, which stands for CoDing Sequence.
Annotating a genome involves finding the locations of its genes. In our case, we will limit the task to finding the locations of protein-coding regions. How do we find these regions in a sea of letters (As, Cs, Gs and Ts)? All protein-coding regions share some characteristics that we can use to hunt them down.
All CDSs have a START codon and a STOP codon. These signal the beginning and end of a protein-coding region or CDS. Transcripts are synthesised from 5’ to 3’, and the ribosome reads them in the same direction, so the start codon lies at the 5’ end of the coding sequence and the stop codon at the 3’ end. Bacterial genes are encoded all in one go: the nucleotide sequence that carries the information to make a polypeptide is found in a single stretch of uninterrupted sequence. By contrast, eukaryotic genes usually have introns, which are non-coding regions interspersed with the coding regions. Luckily for us, we don’t have to worry about these, as bacterial genomes almost never have introns. Optional: Learn more about introns by following the link to a Wikipedia article, given below this article.
The total number of nucleotides in a CDS is a multiple of three. Proteins are made of amino acids, and each amino acid is encoded by three nucleotides. This group of three nucleotides is called a codon.
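To make the multiple-of-three rule concrete, here is a minimal Python sketch that cuts a sequence into codons and rejects any sequence whose length is not a multiple of three. The function name `split_into_codons` and the example sequence are made up for illustration; they are not part of the course material.

```python
def split_into_codons(cds):
    """Split a CDS string into a list of three-letter codons.

    Hypothetical helper for illustration only.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError(f"length {len(cds)} is not a multiple of three")
    # Step through the sequence three nucleotides at a time.
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


example_cds = "ATGGCTAAATGA"  # made-up 12-nucleotide sequence
print(split_into_codons(example_cds))  # ['ATG', 'GCT', 'AAA', 'TGA']
```

A sequence of 12 nucleotides gives 4 codons; a sequence of 11 or 13 nucleotides would raise an error, because it cannot be a complete CDS.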
START and STOP codons are well defined. Bacteria can use more than one codon to start the synthesis of a protein. These are ATG, which encodes methionine, and GTG and TTG, which encode valine and leucine respectively when they occur inside a CDS but are read as methionine when they act as the start codon. In eukaryotes, ATG is almost always the start codon. In the standard genetic code, the stop codons are the same in bacteria and eukaryotes: TAA, TAG and TGA. These codons do not encode an amino acid but signal the end of the protein sequence.
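The three characteristics described so far (a start codon at the beginning, a stop codon at the end, and a length that is a multiple of three) can be combined into a simple check. The sketch below is only an illustration: the function name `looks_like_bacterial_cds` is hypothetical, and the extra requirement that no stop codon appears in frame before the end is our own assumption, added because a stop codon ends the protein sequence.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # bacterial start codons listed above
STOP_CODONS = {"TAA", "TAG", "TGA"}   # stop codons listed above


def looks_like_bacterial_cds(seq):
    """Return True if seq has the basic shape of a bacterial CDS.

    Hypothetical helper for illustration; real gene finders use far
    more evidence than this.
    """
    seq = seq.upper()
    # A CDS needs at least a start and a stop codon, and whole codons only.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS:
        return False
    if codons[-1] not in STOP_CODONS:
        return False
    # Assumption: an in-frame stop before the last codon would end the
    # protein early, so we reject such sequences.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


print(looks_like_bacterial_cds("ATGGCTAAATGA"))  # True
print(looks_like_bacterial_cds("ATGTAAGCTTGA"))  # False, stop codon in the middle
```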
Caution! Start codons are not found only at the start of a CDS; the same codons also occur within it. This poses a challenge: how do we know which ATG, GTG or TTG marks the true start of the protein? To resolve this problem we need more information about the sequence itself, or at least about how it compares with other similar sequences (for example, using BLAST). For now, let’s take a safe position: if more than one start codon is available, we will choose the one that produces the longest possible CDS.
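The "longest possible CDS" rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a real gene finder: it scans only the strand it is given, in each of the three reading frames, and for every stop codon it keeps the most upstream start codon seen since the previous stop in that frame. The function name `find_longest_cds_candidates`, the `min_codons` parameter and the example sequence are all made up for this example.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_longest_cds_candidates(seq, min_codons=2):
    """Return (start, end) positions of candidate CDSs on the given strand.

    Positions are 0-based and end-exclusive, so seq[start:end] is the CDS
    including its stop codon. min_codons (a hypothetical parameter) counts
    the start and stop codons too.
    """
    seq = seq.upper()
    candidates = []
    for frame in range(3):
        first_start = None  # most upstream start since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in START_CODONS and first_start is None:
                # Keep only the first start: it gives the longest CDS.
                first_start = i
            elif codon in STOP_CODONS:
                if first_start is not None:
                    end = i + 3
                    if (end - first_start) // 3 >= min_codons:
                        candidates.append((first_start, end))
                first_start = None  # look for a new start after this stop
    return sorted(candidates)


example = "CCATGAAAGTGCCCTAGTT"  # made-up sequence
for start, end in find_longest_cds_candidates(example):
    print(start, end, example[start:end])  # 2 17 ATGAAAGTGCCCTAG
```

In the example, both ATG (position 2) and GTG (position 8) are in frame with the stop codon TAG, and the rule picks ATG because it gives the longer CDS.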
© Wellcome Genome Campus Advanced Courses and Scientific Conferences