Finding and Annotating Genes in a Genome

Learn more about finding and annotating genes in a genome.
© Wellcome Genome Campus Advanced Courses and Scientific Conferences

In this course we will focus on the annotation of coding sequences, that is to say, those regions of the genome that form the template for making proteins. From now on, we will call these regions CDS, which stands for CoDing Sequence.

Annotating a genome means finding the location of its genes. In our case, we will limit the task to finding the locations of protein-coding regions. How do we find these regions in a sea of letters (As, Cs, Gs and Ts)? All protein-coding regions share some characteristics that we can use to hunt them down.

  1. All CDSs have a START codon and a STOP codon. These signal the beginning and end of a protein-coding region, or CDS. Transcripts are synthesised, and then read by the ribosome, from 5’ to 3’; therefore the start codon will be at the 5’ end of the coding sequence, and the stop codon at the 3’ end. Bacterial genes are encoded all in one go, that is to say, the nucleotide sequence that carries the information to make a polypeptide is found in a single uninterrupted stretch. By contrast, eukaryotic genes usually have introns, which are non-coding regions interspersed with the coding regions. Luckily for us, we don’t have to worry about these, as bacterial genomes almost never have introns. Optional: learn more about introns by following the link to a Wikipedia article, given below this article.
  2. The total number of nucleotides in a CDS is a multiple of three. Proteins are made of amino acids, and each amino acid is encoded by a group of three nucleotides called a codon.
  3. START and STOP codons are well defined. Bacteria can use more than one codon to start the synthesis of a protein: ATG, which encodes Methionine, and less often GTG and TTG, which encode Valine and Leucine elsewhere in a CDS but are read as Methionine when they act as the start codon. In eukaryotes, ATG is the standard start codon. Stop codons are common to bacteria and eukaryotes: TAA, TAG and TGA. These codons do not encode an amino acid but signal the end of the protein sequence.
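The three characteristics above can be sketched as a simple check in Python. This is only a minimal illustration, not how Artemis or any gene finder works internally: the function names `codons` and `looks_like_cds` are hypothetical, the input is assumed to be a forward-strand DNA string, and the extra rule that no stop codon may appear before the last codon is our own reading of "a single uninterrupted stretch".

```python
# Codon sets taken from the list above (bacterial start codons).
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a sequence into consecutive, non-overlapping triplets.

    Any trailing one or two nucleotides that do not fill a codon are dropped.
    """
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def looks_like_cds(seq):
    """Return True if seq has the three CDS characteristics listed above."""
    seq = seq.upper()
    # Characteristic 2: length is a multiple of three
    # (and long enough to hold at least a start and a stop codon).
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    triplets = codons(seq)
    # Characteristics 1 and 3: a recognised start codon at the 5' end
    # and a recognised stop codon at the 3' end.
    if triplets[0] not in START_CODONS or triplets[-1] not in STOP_CODONS:
        return False
    # Assumption: an in-frame stop before the last codon would end the
    # protein early, so such a sequence is rejected.
    return not any(c in STOP_CODONS for c in triplets[:-1])
```

For example, `looks_like_cds("ATGAAATAA")` accepts a start codon, one internal codon and a stop codon, whereas a sequence whose length is not a multiple of three is rejected.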

Caution! Start codons are not found exclusively at the start of a CDS. The same triplets also occur within it. This poses a challenge: how do we know which ATG, GTG or TTG is the true start codon? To resolve this problem we need more information about the actual sequence, or at least about how it compares with other similar sequences (for example, using BLAST). For now, let’s take a safe position: if more than one start codon is available, we will choose the one that produces the longest possible CDS.
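The "longest possible CDS" rule can be sketched as a small scan over the three forward reading frames. This is a hedged illustration under stated assumptions: the function name `longest_cds_candidates` is hypothetical, only the forward strand is examined (a real annotation would also scan the reverse complement), and coordinates are 0-based with an exclusive end.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_cds_candidates(seq):
    """Return (start, end, frame) tuples for forward-strand CDS candidates.

    For every in-frame stop codon, only the most upstream start codon
    since the previous stop is kept, which gives the longest candidate.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        first_start = None  # most upstream start seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if first_start is not None:
                    found.append((first_start, i + 3, frame))
                first_start = None  # a stop closes the current candidate
            elif codon in START_CODONS and first_start is None:
                first_start = i  # later in-frame starts are ignored
    return found


example = "CCATGAAAATGCCCTAAGG"
for start, end, frame in longest_cds_candidates(example):
    print(frame, start, end, example[start:end])
```

In the example sequence there are two in-frame ATG codons before the TAA stop. The scan keeps the first one, so the reported candidate is the longer of the two possible CDSs.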

This article is from the free online course Bacterial Genomes II: Accessing and Analysing Microbial Genome Data Using Artemis.

Created by
FutureLearn - Learning For Life
