| Determining Gene Function from Sequence Information |
Researchers have produced an enormous number of genome sequences from a variety of organisms. Publicly available databases, such as GenBank at the NCBI (National Center for Biotechnology Information), store many of these sequences. The databases have been a tremendous boon for comparative biology. The NCBI database stores not only the genome sequences, but also information about the function (if it is known) of the genes.
The NCBI also provides tools for identifying unknown genes by comparing them with known genes in the database. One program commonly used for this purpose is BLAST (Basic Local Alignment Search Tool). Sequence similarity searching algorithms like BLAST are based on the premise that if two sequences are similar, then they are likely to be homologous (that is, they share a common evolutionary ancestor). (See the Evolution and Phylogenetics unit.) Using this database, one can infer the function of an unknown gene by finding similar sequences among known genes and proteins. For example, suppose you use BLAST to search for sequences similar to a new gene. On viewing your results, you notice that all the sequences with a high degree of similarity to the new gene belong to a family of genes known to break down hydrogen peroxide. You could reasonably infer, then, that this new gene encodes a protein with a similar function.
BLAST searches can be done at the nucleotide level; however, comparisons at the amino acid level provide much greater sensitivity. Because several different codons can specify the same amino acid, two DNA sequences can diverge considerably while the proteins they encode remain similar. Therefore, unless one is particularly interested in the DNA sequence itself, it is better to search for genes using protein sequences. If you have only raw nucleotide sequence data, computer programs can automatically translate the DNA into amino acids using all six reading frames (three frames from one strand and three frames from the complementary strand) before searching the protein database.
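The six-frame translation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code that BLAST itself uses. The function names (`reverse_complement`, `translate`, `six_frame_translations`) and the example sequence are made up for this sketch, and it assumes the standard genetic code and an input containing only A, C, G, and T.

```python
# Minimal sketch of six-frame translation (illustrative names, standard genetic code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Build the codon table: codons are enumerated in TCAG order, "*" marks a stop codon.
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    # Hypothetical 9-base example sequence.
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

For the example sequence, frame +1 reads ATG GCC TAA and gives `MA*` (methionine, alanine, stop), while frame -1 reads the reverse complement TTA GGC CAT and gives `LGH`. Each of the six protein strings could then be compared against a protein database.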
In addition to whole proteins, similarity searches can identify protein motifs. A motif is a distinctive pattern of amino acids, conserved across many proteins, that is associated with a particular function. For example, the presence of one particular motif in a protein indicates that this protein probably binds ATP and may therefore require ATP for its action.
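A simple motif can be written as a pattern and searched for directly. The sketch below uses the Walker A (P-loop) pattern, [AG]-x(4)-G-K-[ST], as an example of an ATP/GTP-binding motif; the text above does not say which motif it has in mind, so this choice is an assumption made for illustration. The function name `find_motif` and the protein sequence are also hypothetical.

```python
import re

# Walker A (P-loop) pattern: A or G, any four residues, then G, K, and S or T.
# Chosen here only as an example of an ATP/GTP-binding motif.
WALKER_A = re.compile(r"[AG].{4}GK[ST]")


def find_motif(protein, pattern=WALKER_A):
    """Return (start_index, matched_text) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein.upper())]


if __name__ == "__main__":
    # Made-up protein fragment containing one P-loop-like stretch.
    print(find_motif("MKLGPSGSGKSTLLA"))
```

A match to a short pattern like this one is only a hint. Short motifs can occur by chance, so a hit is best treated as a hypothesis to check against other evidence.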
The result of a database search is a list of matches, ranked from most to least significant (Fig. 4). Each reported alignment score is given an "expectation value" (E), which is the number of matches with a score at least that high that would be expected to occur by random chance in a database of that size. For very small values, E is approximately the probability of finding such a match by chance. The smaller the E-value, the higher the assigned score and the less likely that the match is a coincidence. Some of the easiest results to interpret are very high scores (small E-values), which usually result from two very similar proteins. Other easily identifiable results are very low scores (large E-values), which indicate that the match is probably the result of chance similarity.
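The link between score and E-value can be made concrete with the standard relation for normalized (bit) scores, E = m × n × 2^(-S), where S is the bit score, m is the query length, and n is the total length of the database. The sketch below applies that formula. The function names and the numbers are illustrative, and real BLAST output uses adjusted "effective" lengths, so the values it reports will differ somewhat.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance matches scoring at least bit_score: m * n * 2^-S."""
    return query_len * db_len * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance match, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Hypothetical search: 300-residue query against a 100-million-residue database.
    for score in (30, 50, 100):
        e = expect_value(score, 300, 100_000_000)
        print(score, e, chance_probability(e))
```

Each additional bit of score halves the E-value, which is why high-scoring matches have very small E-values. The same score gives a larger E-value in a larger database, because there are more opportunities for a chance match.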
Search results also provide links (in blue) to a database page with information on each sequence similar to the query sequence. This page gives extensive information on the matching sequence, including the organism it came from, the function of the gene product (if it is known), and references to journal articles concerning the sequence. BLAST results also show the actual nucleotide or amino acid alignments between the query sequence and each matching sequence.