Finding Genes
Imagine the genome as an encyclopedia with a volume for each chromosome. If you were to open a volume, you would find page after page containing only four letters - A, T, G, and C - without spaces or punctuation. How could you read such a book, or even identify possible words and sentences? The genome sequence itself does not provide direct information on the location of a gene, but there are clues embedded in the sequence that computer programs can find.
Most simple gene prediction programs use several pieces of sequence information to identify a potential gene in a DNA sequence. The programs look for sequences in the DNA that have the potential to encode a protein. These sequences are called open reading frames (ORFs). An ORF usually begins with the start codon AUG, followed by a long run of codons that specify the protein's amino acids, and ends with one of the stop codons UAA, UAG, or UGA (Fig. 2). (Codons are conventionally written as they appear in mRNA; in the DNA sequence the same codons read ATG, TAA, TAG, and TGA.) Reading the sequence in overlapping frames of three nucleotides each, the computer program scans it until it identifies an ORF. For example, the sequence "abcdefghijk" could be read in three-letter "words" as "abc-def-ghi," "bcd-efg-hij," or "cde-fgh-ijk." Computer programs can scan DNA sequences quickly, using these overlapping reading frames on both the original strand and the complementary strand, for a total of six different reading frames for any sequence.
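The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular gene prediction program: the function names and the `min_codons` length threshold are assumptions made for the example, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Codons are written as DNA here (ATG, TAA, TAG, TGA), i.e. the mRNA codons
# AUG, UAA, UAG, and UGA with U replaced by T.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, offset, min_codons):
    """Yield (start, end) index pairs of ORFs in one reading frame.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the start and stop codons (an arbitrary threshold).
    """
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == START:
            start = i                      # open a candidate ORF
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3         # close it at the stop codon
            start = None


def find_orfs(dna, min_codons=3):
    """Scan all six reading frames; return (strand, frame, orf_sequence)."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset, min_codons):
                results.append((strand, offset, seq[start:end]))
    return results


print(find_orfs("CCATGAAATTTTAGGG"))
# prints [('+', 2, 'ATGAAATTTTAG')]
```

In this toy input the only ORF lies in the third frame of the original strand. Real programs would typically require a much longer minimum length so that short chance ORFs are not reported.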
Using these programs to find ORFs in bacterial genomes is relatively easy because the DNA sequence matches the mRNA. The situation is more complicated for eukaryotic genes, which often contain one or more noncoding regions (introns). When such a gene is expressed, the introns are removed from the original RNA transcript and the coding regions (exons) are joined together in a process called splicing (Fig. 3). The final spliced mRNA, which encodes the protein product of the gene, is therefore shorter than the original transcript, which matches the genome. The problem is that a simple ORF-finding program cannot be used on genomic DNA that contains introns, because the genomic sequence of such a gene does not match its mRNA. Computer programs can identify eukaryotic genes with introns, but their predictions are not always accurate.
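A toy example may make the mismatch concrete. The sketch below assumes the exon coordinates are already known, which is exactly the information a simple ORF finder lacks; the sequence, the coordinates, and the helper names `splice` and `first_orf` are invented for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}

# Hypothetical gene: two exons separated by one 12-nucleotide intron.
GENOMIC = "ATGGCC" + "GTAAGTTAGCAG" + "AAATGA"
EXONS = [(0, 6), (18, 24)]   # half-open (start, end) index pairs


def splice(genomic, exons):
    """Join the exon segments, dropping everything between them."""
    return "".join(genomic[start:end] for start, end in exons)


def first_orf(dna):
    """Return the stretch from the first ATG to the next in-frame stop, or None."""
    begin = dna.find("ATG")
    if begin == -1:
        return None
    for i in range(begin, len(dna) - 2, 3):
        if dna[i:i + 3] in STOPS:
            return dna[begin:i + 3]
    return None


spliced = splice(GENOMIC, EXONS)
print(first_orf(GENOMIC))   # prints ATGGCCGTAAGTTAG (ends at a stop inside the intron)
print(first_orf(spliced))   # prints ATGGCCAAATGA (the ORF of the spliced message)
```

Read directly from the genomic sequence, the frame runs into a stop codon inside the intron, so the ORF reported there does not correspond to the one in the spliced message.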
An alternative approach to characterizing genes in eukaryotes is to first make a DNA copy of the mRNA encoded by the gene. To do this, one uses an enzyme called reverse transcriptase. The copy, called cDNA or complementary DNA, has the same sequence as the mRNA, except that each U is replaced by a T. Because the cDNA lacks introns, the sequence of the cloned cDNA can be used to find an ORF. In addition to simply identifying ORFs, many advanced sequence analysis programs use other information to help identify eukaryotic genes in the chromosome. (See the BLAST section below.)
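In sequence terms, the relationship between an mRNA and its cDNA is a simple string transformation. The sketch below is illustrative only: the function names and the example mRNA are made up. It also shows the strand that reverse transcriptase synthesizes first, which is complementary to the mRNA; the strand with the same sequence as the mRNA is its partner in the double-stranded cDNA.

```python
# Base pairing between an RNA template and the DNA strand copied from it.
RNA_TO_DNA_PAIRS = str.maketrans("ACGU", "TGCA")


def mrna_to_cdna(mrna):
    """cDNA strand with the same sequence as the mRNA, with U written as T."""
    return mrna.upper().replace("U", "T")


def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, reported 5' to 3'."""
    return mrna.upper().translate(RNA_TO_DNA_PAIRS)[::-1]


mrna = "AUGGCCAAAUGA"               # hypothetical spliced message
print(mrna_to_cdna(mrna))           # prints ATGGCCAAATGA
print(first_strand_cdna(mrna))      # prints TCATTTGGCCAT
```

Because the cDNA sequence contains no introns, it can be passed directly to a simple ORF-finding routine.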