Physics for the 21st Century

Section 3: The Emergent Genome

The challenge of biological physics is to find a set of organizing principles or physical laws that governs biological systems. It is natural to start by thinking about DNA, the master molecule of life. This super-molecule that apparently has the code for the enormous complexity seen in living systems is a rather simple molecule, at least in principle. It consists of two strands that wrap around each other in the famous double helix first clearly described by physicist Francis Crick and his biologist colleague James Watson. While the structure of DNA may be simple, understanding how its structure leads to a living organism is not.

Figure 9: The double helix.

Source: © Wikimedia Commons, Public Domain. Author: brian0918, 22 November 2009.

We will use the word "emergent" here to discuss the genome in the following sense: If DNA simply had the codes for genes that are expressed in the organism, it would be a rather boring large table of data. But there is much more to the story than this: Simply knowing the list of genes does not explain the implicit emergence of the organism from this list. Not all the genes are expressed at one time. There is an intricate program that expresses genes as a function of time and space as the organism develops. How this is controlled and manipulated still remains a great mystery.

As Figure 9 shows, the DNA molecule has a helicity, or twist, which arises from the fundamental handedness, or chirality, of biologically derived molecules. This handedness is preserved by the fact that the proteins that catalyze the chemical reactions are themselves handed and highly specific in preserving the symmetry of the molecules upon which they act. The ultimate origin of this handedness is a controversial issue. But we assume that a right-handed or left-handed world would work equally well, and that chiral symmetry breaking such as what we encountered in Unit 2 on the scale of fundamental particles is not present in these macroscopic biological molecules.

It is, however, a mistake to think that biological molecules have only one possible structure, or that somehow the right-handed form of the DNA double helix is the only kind of helix that DNA can form. It turns out that under certain salt conditions, DNA can form a left-handed double helix, as shown in Figure 10.

In general, proteins are built out of molecules called "amino acids." DNA contains the instructions for constructing many different proteins that are built from approximately 20 different amino acids. We will learn more about this later, when we discuss proteins. For now, we will stick to DNA, which is made of only four building blocks: the nitrogenous bases adenine (A), guanine (G), cytosine (C), and thymine (T). Adenine and guanine have a two-ring structure, and are classified as purines, while cytosine and thymine have a one-ring structure and are classified as pyrimidines. It was the genius of Watson and Crick to understand that the basic rules of stereochemistry enabled a structure in which the adenine (purine) interacts electrostatically with thymine (pyrimidine), and guanine (purine) interacts with cytosine (pyrimidine) under the salt and pH conditions that exist in most biological systems.
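The pairing rule just described is simple enough to write down as a lookup table. The following is a minimal sketch in Python, not a description of any particular bioinformatics library: the names `PAIRING`, `complement`, and `pairs_purine_with_pyrimidine` are hypothetical, and the example sequence is made up for illustration.

```python
# Watson-Crick pairing as stated in the text: A pairs with T, G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}       # two-ring bases
PYRIMIDINES = {"C", "T"}   # one-ring bases


def complement(strand: str) -> str:
    """Return the partner base for every position of the given strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def pairs_purine_with_pyrimidine(strand: str) -> bool:
    """Check that each base pair joins exactly one purine to one pyrimidine."""
    return all(
        (base in PURINES) != (PAIRING[base] in PURINES)
        for base in strand.upper()
    )


if __name__ == "__main__":
    example = "ATGCGT"  # hypothetical sequence, chosen only for illustration
    print(complement(example))
    print(pairs_purine_with_pyrimidine(example))
```

Because every pair joins one two-ring base to one one-ring base, each rung of the ladder has roughly the same width, which is one way to see why the rule fits so naturally into a regular helix.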

Figure 10: The DNA double helix, in three of its possible configurations.

Source: © Wikimedia Commons, GNU Free Documentation License, Version 1.2. Author: Zephyris (Richard Wheeler), 4 February 2007.

Not only does the single-stranded DNA (ssDNA) molecule like to form a double-stranded (dsDNA) complex, but the forces that bring the two strands together result in remarkably specific pairings of the bases: A with T, and G with C. The pyrimidine thymine can form hydrogen bonds with the purine adenine at two locations, while the (somewhat stronger) guanine-cytosine pair relies on three possible hydrogen bonds. The sequence of base pairs codes for the construction of the organism. Since there are only four bases in the DNA molecule, and there are about 20 different amino acids, the minimum number of bases that can uniquely code for an amino acid is three: two bases give only 4 × 4 = 16 combinations, which is too few, while three give 4 × 4 × 4 = 64. This three-base unit is called the triplet codon.
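The counting argument behind the triplet codon, and the bond counts quoted for each pair, can be checked in a few lines. This is a sketch under the numbers given in the text (four bases, about 20 amino acids, two bonds for A-T and three for G-C); the function names are hypothetical.

```python
# Hydrogen bonds contributed by each base's pair, as quoted in the text:
# two for an A-T pair, three for a G-C pair.
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def min_codon_length(n_bases: int = 4, n_amino_acids: int = 20) -> int:
    """Smallest word length whose number of distinct words covers all amino acids."""
    if n_bases < 2:
        raise ValueError("need at least two distinct bases")
    length = 1
    while n_bases ** length < n_amino_acids:
        length += 1
    return length


def hydrogen_bond_count(strand: str) -> int:
    """Total hydrogen bonds holding this strand to its complementary partner."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


if __name__ == "__main__":
    print(min_codon_length())           # word length needed for 20 amino acids
    print(4 ** 2, 4 ** 3)               # combinations with two and three bases
    print(hydrogen_bond_count("ATGC"))  # hypothetical four-base example
```

A three-letter word over a four-letter alphabet leaves 64 − 20 = 44 combinations to spare, so the code has room for more than one codon per amino acid.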

The remarkable specificity of molecular interactions in biology is a common and all-important theme. It is also a physics problem: How well do we have to understand the potentials of molecular interactions before we can begin to predict the structures that form? We will discuss this vexing problem a bit more in the protein section, but it remains a huge problem in biological physics. At present, we really cannot predict three-dimensional structures for biological molecules, and it isn't clear if we ever will be able to, given how sensitive the structures are to interaction energies and how complex they are.

An example of this extreme sensitivity to the potential functions and the composition of the polymer can be found in the difference between ribonucleic acids (RNA) and deoxyribonucleic acids (DNA). Structurally, the only difference between the sugars of RNA and DNA is that at the 2' position of the ribose sugar, RNA has a hydroxyl (OH) group, made of one oxygen and one hydrogen atom, while DNA just has a hydrogen atom. Figure 11 shows what looks like a completely innocuous difference between the two fundamental units. From a physicist's bottom-up approach, and lacking much knowledge of physical chemistry, how much can that lone oxygen atom matter?

Figure 11: The chemical structure of RNA (left), and the form the folded molecule takes (right).

Source: © Left: Wikimedia Commons, Public Domain. Author: Narayanese, 27 December 2007. Right: Wikimedia Commons, Public Domain. Author: MirankerAD, 18 December 2009.

Unfortunately for the bottom-up physicist, the news is very bad. RNA molecules fold into far more complex structures than DNA molecules do, even though the "alphabet" for the structures is just four letters: A, C, G, and, bizarrely, U, a uracil group that Nature for some reason has favored over the thymine group of DNA. An example of the complex structures that RNA molecules can form is shown in Figure 11. Although the folding rules for RNA are vastly simpler than those for proteins, we still cannot predict with certainty the three-dimensional structure an RNA molecule will form if we are given the sequence of bases as a starting point.

The puzzle of packing DNA: chromosomes

Let's consider a simpler problem than RNA folding: packaging DNA in the cell. A gene is the section of DNA that codes for a particular protein. Since an organism like the bacterium Escherichia coli contains roughly 4,000 different proteins, each protein is roughly 100 amino acids long, and each amino acid takes three bases to specify, we would estimate that the DNA in E. coli must be about 1.2 million base pairs long. In fact, sequencing shows that the E. coli genome actually consists of 4,639,221 base pairs, so we are off by about a factor of four, not too bad for such a crude estimate. Still, this is an extraordinarily long molecule. If stretched out, it would be about 1.6 mm in length (at roughly 0.34 nm per base pair), while the organism itself is only about 1 micron long.

The mathematics of how DNA actually gets packaged into small places, and how this highly packaged polymer gets read by proteins such as RNA polymerases or copied by DNA polymerases, is a fascinating exercise in topology. Those of you who are fishermen and have ever confronted a highly tangled fishing line can appreciate that the packaging of DNA in the cell is a very nontrivial problem.

The physics aspect to this problem is the stiffness of the double helix, and how the topology of the twisted and folded molecule affects its biological function. How much energy does it take to bend or twist the polymer into the complex shapes necessary for efficient packaging of DNA in a cell? And how does the intrinsic twist of the double helix translate into the necessity to break the double helix and reconnect it when the code is read by proteins? In other words, biological physics is concerned with the energetics of bending DNA and the topological issues of how the DNA wraps around in space.
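To get a feel for the energy scale in the bending question, we can treat the double helix as a uniform elastic rod. This is an assumption that goes beyond the text: it is the standard worm-like chain picture, in which bending a length L into an arc of radius R costs E = (k_B T) × L_p × L / (2 R²), with a persistence length L_p of roughly 50 nm for double-stranded DNA. The sketch below ignores twist, sequence dependence, and proteins entirely, and all names in it are hypothetical.

```python
import math

# Assumption (not from the text): persistence length of double-stranded DNA
# at ordinary salt conditions, a commonly quoted value of about 50 nm.
PERSISTENCE_LENGTH_NM = 50.0


def bending_energy_kT(arc_length_nm: float, radius_nm: float,
                      persistence_nm: float = PERSISTENCE_LENGTH_NM) -> float:
    """Elastic-rod bending energy, in units of k_B T, for a uniform arc."""
    if radius_nm <= 0:
        raise ValueError("radius must be positive")
    return 0.5 * persistence_nm * arc_length_nm / radius_nm ** 2


def ring_energy_kT(radius_nm: float) -> float:
    """Energy to bend DNA into one closed circle of the given radius."""
    return bending_energy_kT(2.0 * math.pi * radius_nm, radius_nm)


if __name__ == "__main__":
    # One loop the size of a bacterium (radius about 500 nm) is cheap...
    print(f"{ring_energy_kT(500.0):.2f} kT")
    # ...while a loop only 5 nm in radius costs many times the thermal energy.
    print(f"{ring_energy_kT(5.0):.1f} kT")
```

Because the cost of a loop scales as 1/R, gentle bends on the scale of the cell are nearly free compared with thermal energy, while tight bends on the scale of a few nanometers are not. That is one reason the stiffness of the double helix matters for packaging.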

The incredible length of a DNA molecule, already bad enough for bacteria, gets more outrageous for higher organisms. Most mammals have roughly 3 × 10^9 base pairs wrapped up into chromosomes, which are very complex structures consisting of proteins and nucleic acids. However, although we view ourselves as being at the peak of the evolutionary ladder, there seems to be much more DNA in organisms we view as our intellectual inferiors: Some plants and amphibians have up to 10^11 base pairs! If we laid out the DNA from our chromosomes in a line, it would have a length of approximately 1 meter; that of amphibians would stretch over 30 meters!
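These stretched-out lengths follow from multiplying the number of base pairs by the spacing between them. The sketch below assumes the textbook spacing of about 0.34 nm per base pair for the ordinary right-handed helix, a value the text does not state, and uses the genome sizes quoted in this section; the names are hypothetical.

```python
# Assumption (not from the text): rise per base pair of ordinary B-form DNA.
RISE_PER_BASE_PAIR_M = 0.34e-9

# Genome sizes as quoted in this section, in base pairs.
GENOME_SIZES_BP = {
    "E. coli": 4_639_221,
    "typical mammal": 3e9,
    "largest plants and amphibians": 1e11,
}


def contour_length_m(base_pairs: float) -> float:
    """End-to-end length of the fully stretched double helix, in meters."""
    return base_pairs * RISE_PER_BASE_PAIR_M


if __name__ == "__main__":
    for name, base_pairs in GENOME_SIZES_BP.items():
        print(f"{name}: {contour_length_m(base_pairs):.3g} m")
```

With this spacing the bacterial genome comes out at a millimeter or two, the mammalian one at about a meter, and the largest genomes at a few tens of meters, all packed into cells that are microns across.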

Dark matter in the genome

Why is the human DNA genome so long, and other genomes even longer still? We don't know exactly how many genes the human genome contains, but a reasonable guess seems to indicate about 30,000. If we imagine that each gene codes for a protein that has about 100 amino acids, and that three base pairs are required to specify each amino acid, the minimal size of the human genome would be about 10^7 base pairs. It would seem that we have a few hundred times as much DNA as is necessary to code for our genes. Clearly, the amount of "excess" DNA must be much higher for plants and amphibians. Apparently, the DNA is not efficiently coded in the cell, in the sense that lots of so-called "junk" DNA floats around in a chromosome. In fact, a large amount of noncoding DNA has a repeating motif. Despite some guesses about what role this DNA plays, its function remains a substantial puzzle. Perhaps the information content of the genome is not measured just by the number of base pairs, and there is much "hidden" information contained in this dark genome.
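The back-of-the-envelope estimate here is the same one used for E. coli earlier in the section: genes × amino acids per protein × three bases per amino acid. The sketch below simply carries out that multiplication with the round numbers quoted in the text and compares the result with the actual genome size. The function names are hypothetical, and the 100-residue protein length is the text's rough guess, not a measured average.

```python
BASES_PER_CODON = 3  # three bases specify one amino acid


def minimal_coding_bp(n_genes: int, residues_per_protein: int = 100) -> int:
    """Base pairs needed if the genome held nothing but protein-coding sequence."""
    return n_genes * residues_per_protein * BASES_PER_CODON


def excess_factor(genome_bp: float, n_genes: int,
                  residues_per_protein: int = 100) -> float:
    """Ratio of the actual genome size to the minimal coding estimate."""
    return genome_bp / minimal_coding_bp(n_genes, residues_per_protein)


if __name__ == "__main__":
    # Round numbers quoted in the text.
    print(minimal_coding_bp(30_000))                  # human estimate
    print(f"{excess_factor(3e9, 30_000):.0f}")        # human genome vs. estimate
    print(f"{excess_factor(4_639_221, 4_000):.1f}")   # E. coli genome vs. estimate
```

Under these round numbers the bacterial genome exceeds the estimate by a factor of a few, while the human genome exceeds it by a factor in the hundreds. The contrast between those two ratios is the puzzle this subsection calls the dark genome.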

We have succeeded in sequencing the coding part of the human genome, but not the dark part. Are we done now that we know the coding sequence of one given individual? Hardly. We don't know how to extract the information content of the genome at many levels, or even how to define the genome's information quantitatively. The concept of "information" is not only a tricky concept, but also of immense importance in biological physics. Information is itself an emergent property in biology, and it is contextual: The environment gives meaning to the information, and the information itself means little without the context of the environment.

One problem is that we don't know how to measure information in the genome. Paradoxically, information to a physicist is related to entropy, which is a quantitative measure of disorder. The lower the entropy, the higher the information content. We do, however, need to be careful how we define entropy, because the standard equation in undergraduate physics courses does not apply to a string of base pairs.

Different meanings of information

Figure 12: This sequence logo is a compact way of displaying information contained in a piece of genetic material.

Source: © P. P. Papp, D. K. Chattoraj, and T. D. Schneider, Information Analysis of Sequences that Bind the Replication Initiator RepA, J. Mol. Biol., 233, 219-230, 1993.

The introduction of entropy emphasizes a critical point: Information has a different meaning to a biologist than it does to a physicist. Suppose you look at some stretch of the genome, and you find that all four of the bases are present in roughly equal numbers; that is, each base has a 25 percent chance of appearing at a given position in the chain. To a biologist, this implies the sequence is coding for a protein and is information-rich. But a physicist would say it has high entropy and low information, somewhat like saying that it may or may not rain tomorrow. If you say it will rain tomorrow, you convey a lot of information and very little entropy. The opposite is true in gene sequences. To a biologist, a long string of adenines, AAAAAAAAAAAA, is useless and conveys very little information; but in the physics definition of entropy, this is a very low entropy state. Obviously, the statistical concept of entropy and the biological concept of information density are rather different.
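The two cases in this paragraph can be made quantitative with the Shannon entropy of the base frequencies, H = Σ p log2(1/p), measured in bits per base. The sketch below is one simple choice, not the only possible definition: it looks only at how often each base occurs and ignores any correlations between neighboring bases. The function names and the example strings are hypothetical.

```python
import math
from collections import Counter

MAX_ENTROPY_BITS = math.log2(4)  # four equally likely bases: 2 bits per base


def shannon_entropy_bits(sequence: str) -> float:
    """Shannon entropy of the single-base frequencies, in bits per base."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_deficit_bits(sequence: str) -> float:
    """How far the sequence falls below the maximum entropy of 2 bits per base.

    In the physicist's usage described in the text, a larger deficit means
    lower entropy and therefore more information.
    """
    return MAX_ENTROPY_BITS - shannon_entropy_bits(sequence)


if __name__ == "__main__":
    balanced = "ACGT" * 3       # all four bases equally common
    poly_a = "AAAAAAAAAAAA"     # the string of adenines from the text
    for name, seq in [("balanced", balanced), ("poly-A", poly_a)]:
        print(name, shannon_entropy_bits(seq), entropy_deficit_bits(seq))
```

By this measure the balanced stretch sits at the maximum of 2 bits per base and the run of adenines at zero, which is exactly the reverse of the ranking a biologist would give the two sequences.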

The dark matter that makes up a huge fraction of the total genome is still very much terra incognita. Entropy density maps indicate that it has a lower information density (in the biologist's use of the word) than the "bright matter" coding DNA. Unraveling the mysteries of the dark matter in the genome will challenge biologists just as much as exploring the cosmological variety challenges astrophysicists.