 Teacher resources and professional development across the curriculum

Teacher professional development and classroom resources across the curriculum  # 2.7 DNA Sequencing

## DE BRUIJN SEQUENCES

• A de Bruijn sequence is the shortest string that contains all possible permutations (order matters) of a particular length from a given set.
• We can construct de Bruijn sequences from a given set by finding a Hamilton cycle on a directed graph.

Ramsey Theory says that patterns must exist in certain sets of data, whether they be the connections between people, points of light in the sky, or sequences of numbers. Remember, however, that Ramsey Theory does not specify what that pattern is, just that it exists. If we need more-specific information, we will need more-specific tools.

Consider, for instance, a hypothetical keyless-entry keypad on a car that requires a 5-digit access code for entry. If you forgot your code, how could you get into your car? One approach would be to try every possible combination in succession, starting with 11111 and continuing on to 99999. How many combinations would you have to try in a worst-case scenario (i.e., if the correct combination is the very last option you try)? There are nine choices for the first digit, nine for the second one also, and so on. The total number of possible sequences would be 95, which is about 60,000—a daunting task! Perhaps we can refine our strategy to speed things up a bit.

An interesting feature of these keypads is that they do not require an "enter" key. This means that they take an unbroken stream of numbers until the correct five digits are entered in sequence. So, we could arrange all 60,000 possible codes into one long string, 300,000 digits in length, which would look like this: 11111 11112 11113…99998 99999. Is this the best strategy to apply? Of course not. We can see that there are many overlapping sections of the different codes, the sequence 1111, for example. Entering this pattern more times than we need to would be redundant and would be quite a waste of time. Might we, instead, look for a shorter sequence that takes advantage of these overlaps and still contains all the possible combinations?

Such a sequence, called a de Bruijn sequence, is the shortest sequence that contains every given k-length ordering of an n-sized set of elements. To see how one is constructed, let's look at a somewhat simpler example than our car keypad above. Let's pretend that our keypad requires only a two-digit code and accepts only 1, 2, or 3 as values for those digits. If we were simply to try every possible combination, we would be trying nine (3 × 3) two-digit orderings, or a compiled sequence of eighteen digits. We know there are overlaps, so can we find a de Bruijn sequence for two-digit strings in a set of three elements?

Our combinatorial tool of choice will be a directed graph—that is, a graph in which the edges can be traversed in only one direction. The graph we will use to construct our de Bruijn sequence will have as its nodes all the possible two-digit orderings:

11
12
13
21
22
23
31
32
33

We will connect these nodes to each other in such a way that a directed edge from an initial node to a terminal node exists (and is included in the graph) only if the last digit of the initial node is the same as the first digit of the terminal node. So, node 11 could connect to nodes 12 and 13 only; node 13 could connect to nodes 31, 32, and 33 only. The entire web of allowable connections is shown below. We can define a de Bruijn sequence by finding a path on this graph that connects all the nodes, returning to where we started. This is a Hamilton cycle, similar to the one we used in the circular permutation example discussed earlier. A Hamilton cycle on our de Bruijn graph is defined by this nodal path:

11 → 12 → 21 → 13 → 32 → 22 → 23 → 33 → 31 → 11

This gives us the de Bruijn sequence: 1121322331

So, we can see that instead of entering an 18-digit sequence, we could enter the 10-digit sequence shown above, thereby saving us 44% of our effort. To find the effort saved, by the way, we just compare the amount of change, 8 digits, to the original amount, 18 digits— ~ 44%

The results are even more remarkable for our original example. Recall that our brute-force sequence was 300,000 digits long. A de Bruijn sequence would shrink this string to around 60,000 digits, saving us 80% of our time and effort.

## SHOTGUN SEQUENCING

• Modern DNA sequencing involves breaking up a large DNA molecule into many pieces that can be quickly sequenced simultaneously, and reassembling the parts based on overlaps in a manner similar to constructing a de Bruijn sequence.
• Shotgun sequencing is a faster, though less–reliable, method of sequencing DNA.

This idea of finding the shortest possible string that contains all given sequences has broader application in the field of genomics. Here, geneticists wish to find the specific sequence of nucleotides that make up human DNA. Each strand of our DNA is basically a string of billions of occurrences of the nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T) in some specific sequence. Current techniques of reading this sequence cannot handle such immense lengths. The standard method of reading strands of DNA, the socalled Chain-Termination method, requires much shorter lengths.

Biologists are faced with the task of taking a given DNA molecule, breaking it into manageable chunks, reading each chunk, and putting these chunks back together to construct the original sequence. This is done by randomly fragmenting the original strand into numerous small segments of many nucleotides, sequencing these segments via Chain Termination to obtain "reads," and then looking at the overlaps in the "reads" to find the shortest sequence that contains all of the reads.

In doing this, scientists have to assume that nature seeks efficiency. This means that the chunks should be reassembled in such a way as to minimize the length of the resultant DNA strand.

Let's look at a simplified example. Suppose that a DNA strand gave the following fragments, or "reads," after multiple rounds:

GGA ATT GAT TGC TTG

From what we learned before, there will be 120 (5!) possible chains that can be constructed from these reads. Furthermore, because of overlap, not all will be the same length.

For example, the ordering GGA ATT GAT TGC TTG and removing the overlaps gives GGATTGATGCTTG, which is thirteen nucleotides long.

A different sequence, GGA GAT TGC ATT TTG, reduces to GGATGCATTG, which is ten nucleotides long.

We want to find the shortest possible segment. To do this, we can construct a directed graph, as we did with our de Bruijn sequence, using the rule that a node is connected to another node only if the first can be turned into the second by dropping the initial nucleotide and adding one to the end. In real life, overlaps are much longer than one nucleotide, and reads are not all of uniform length. We are examining an ideal, standardized case here to get a sense for the method that is used.

Applying the chosen rule, we end up with the following graph: We are lucky in this case because there is only one possible sequence. Normally, there are multiple viable candidates. Determining which is the actual sequence requires different types of lab work unrelated to our purposes here. Nevertheless, using this method greatly reduces the number of candidate sequences.

In reality, reads and overlaps are much longer. Consequently, sequencing them requires fast computers running efficient, clever algorithms. Combinatorics has many connections and applications to computing in general, and it is this realm that we will now explore.