Rediscovering Biology
Online Textbook
Finding Genes

Imagine the genome as an encyclopedia with a volume for each chromosome. If you were to open a volume, you would find page after page containing only four letters (A, T, G, and C) without spaces or punctuation. How could you read such a book, or even identify possible words and sentences? The genome sequence does not directly mark where a gene is located, but clues embedded in the sequence can be found by computer programs.

Figure 2. Open Reading Frame
Most simple gene prediction programs use several pieces of sequence information to identify a potential gene in a DNA sequence. The programs look for sequences in the DNA that have the potential to encode a protein. These sequences are called open reading frames (ORFs). An ORF usually begins with the start codon AUG (ATG in the DNA sequence), followed by a long sequence of codons that specify the protein's amino acids. The ORF ends with a stop codon: UAA, UAG, or UGA (TAA, TAG, or TGA in DNA) (Fig. 2). Reading the sequence in overlapping frames of three nucleotides each, the computer program scans the sequence until it identifies an ORF. For example, the sequence "abcdefghijk" could be read in three-letter "words" as "abc-def-ghi," "bcd-efg-hij," or "cde-fgh-ijk." Computer programs can scan DNA sequences quickly, using these three reading frames on both the original strand and the complementary strand, for a total of six different reading frames for any sequence.
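The six-frame scan described above can be sketched in a few lines of Python. This is only an illustration, not how any particular gene prediction program works: the function names, the `min_codons` cutoff, and the choice to write codons in DNA letters (ATG for AUG, and so on) are assumptions made for the example.

```python
# Illustrative sketch only; all names here are hypothetical.
START_CODON = "ATG"                  # DNA spelling of AUG
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA


def split_into_codons(seq, frame):
    """Split seq into three-letter words, starting at offset frame (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def reverse_complement(dna):
    """Return the complementary strand, written in its own reading direction."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(dna))


def find_orfs(dna, min_codons=3):
    """Scan all six reading frames; return (strand, frame, orf_sequence) tuples.

    An ORF here runs from an ATG through the next in-frame stop codon.
    min_codons counts the start and stop codons as well.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            codons = split_into_codons(seq, frame)
            start = None  # index of the ATG that opened the current ORF, if any
            for index, codon in enumerate(codons):
                if start is None and codon == START_CODON:
                    start = index
                elif start is not None and codon in STOP_CODONS:
                    if index - start + 1 >= min_codons:
                        orfs.append((strand, frame, "".join(codons[start:index + 1])))
                    start = None
    return orfs
```

With these definitions, `split_into_codons("abcdefghijk", 1)` gives `['bcd', 'efg', 'hij']`, and `find_orfs("CCATGAAATGGTAGCC")` returns `[('+', 2, 'ATGAAATGGTAG')]`: one ORF, found in the third frame of the original strand. Real programs typically require much longer ORFs than this toy cutoff.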

Figure 3. RNA processing
Using these programs to find ORFs in bacterial genomes is relatively easy because bacterial genes generally lack introns, so the DNA sequence matches the mRNA. The situation is more complicated for eukaryotic genes, which often contain one or more noncoding regions (introns). In the cell, the introns are removed from the RNA transcript in a process called splicing (Fig. 3), which joins the coding regions of the gene (exons) together into the final mRNA. The final spliced mRNA, which encodes the protein product of the gene, is therefore shorter than the original RNA transcript that matches the genome. The problem is that a simple ORF-finding program cannot be used with genomic DNA that has introns, because the genomic sequence of those genes does not match the mRNA. While computer programs can identify eukaryotic genes with introns, they are not always accurate.
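A toy example may make the problem concrete. The sketch below uses a made-up 24-letter "gene" with one intron whose position is simply given; the sequence, the exon coordinates, and the function names are all invented for illustration, and DNA letters are used throughout for simplicity.

```python
# Illustrative sketch only; the sequence and coordinates are invented.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(transcript, exon_ranges):
    """Join the exons of a transcript; each range is a (start, end) half-open slice."""
    return "".join(transcript[start:end] for start, end in exon_ranges)


def read_orf(seq):
    """Read codons from the first ATG through the first in-frame stop codon.

    Returns the list of codons, or None if there is no ATG or no stop codon.
    """
    start = seq.find("ATG")
    if start == -1:
        return None
    codons = []
    for i in range(start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            return codons
    return None


exon_1 = "ATGGCC"
intron = "GTAAGTTAGCAG"   # noncoding; happens to contain an in-frame TAG
exon_2 = "AAATGA"
genomic = exon_1 + intron + exon_2
spliced = splice(genomic, [(0, 6), (18, 24)])

print(read_orf(genomic))  # reads into the intron and stops at its TAG
print(read_orf(spliced))  # reads exon 1 directly into exon 2
```

Reading the unspliced sequence yields the codons ATG, GCC, GTA, AGT, TAG, a short and incorrect ORF that ends inside the intron. Reading the spliced sequence yields ATG, GCC, AAA, TGA, the ORF that the mRNA actually carries. The hard part in practice is that the exon coordinates are not known in advance.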

An alternative approach to characterizing genes in eukaryotes is to first make a DNA copy of the mRNA encoded by the gene. To do this, one uses an enzyme called reverse transcriptase. The copy, called cDNA or complementary DNA, has the same sequence as the mRNA, except that each U is replaced by a T. Because the cDNA lacks introns, the sequence of the cloned cDNA can be used to find an ORF. In addition to simply identifying ORFs, many advanced sequence analysis programs use other information to help identify eukaryotic genes in the chromosome. (See the BLAST section below.)
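In sequence terms, the relationship between an mRNA and its cDNA is a simple letter substitution, after which the cDNA can be checked for an uninterrupted ORF. The sketch below assumes a hypothetical, very short mRNA consisting of nothing but the coding region; real mRNAs also carry untranslated sequence on either side, and the function names are invented for the example.

```python
# Illustrative sketch only; names and the example sequence are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def mrna_to_cdna(mrna):
    """Write an mRNA sequence in DNA letters: each U becomes a T."""
    return mrna.upper().replace("U", "T")


def is_complete_orf(cdna):
    """True if cdna starts with ATG, ends with a stop codon,
    and contains no earlier in-frame stop codon."""
    if len(cdna) % 3 != 0 or not cdna.startswith("ATG"):
        return False
    codons = [cdna[i:i + 3] for i in range(0, len(cdna), 3)]
    has_early_stop = any(codon in STOP_CODONS for codon in codons[:-1])
    return codons[-1] in STOP_CODONS and not has_early_stop


cdna = mrna_to_cdna("AUGGCCAAAUGA")
print(cdna, is_complete_orf(cdna))
```

Here the cDNA is `ATGGCCAAATGA`, and because no introns interrupt it, it reads as a single ORF from ATG to TGA.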


  © Annenberg Foundation 2013. All rights reserved.