This is an excel worksheet that can be used to analyze strands of DNA based on their relative
quantities of base pairs. Given a string of DNA, the worksheet will iterate over each value in the
string and record the number of each base pair. Parameters in this model - Target and Target2 -
determine which base pairs are being counted. The user can also set the inputs in the form of
sequences of DNA base pairs to be analyzed. Multiple strings can be analyzed at once, allowing for
easy comparison.
Background Information
Understanding the human genome is one the most important projects in medical and biological
science today. Although the human genome was first sequenced in its entirety back in 2003,
scientists today still struggle to understand the function of the vast majority of genes. The
benefits of achieving such understanding would be legion. A completely mapped human genome would
allow doctors to quickly identify and cure genetic disorders, produce medicines specifically
tailored to the individual genome of the patient, and even modify human genes to select for
desirable qualities.
One of the most basic methods of understanding this intricate double helix is to look for patterns
in base pairs. The patterns that are found can then be correlated with other observable
characteristics of individuals to get an idea of what function those specific genes might have.
Information about the function of genes can be used to help patients make informed decisions about
their health. For instance, many people are carriers of a single gene for Sickle-Cell disease. On
its own, this gene is harmless and even beneficial, as it confers some resistance to malaria.
However, if two carriers of the sickle cell gene have a child, there is a chance that that child
will inherit both genes and suffer from Sickle-Cell disease.
Science/Math
The fundamental principle behind this model is that the odds of an event occurring are equal to
the number of different ways that the event could occur relative to the number of ways it could
not occur. If the arrangement of DNA base pairs were completely random, then we could easily
calculate the expected frequency of any given pattern or patterns of base pairs. When the actual
frequency of such a pattern differs significantly from its predicted frequency that is evidence
that the pattern is a gene with some specific function.
The excel worksheet works on very simple properties. During every iteration, the following things
happen:
The current position increases by 1
For each string, if the value at the current position is equal to one of the values that the
user is searching the string for, then the counter for that value is increased by 1
If the position is higher than the length of a string, then the counter for the string is not
updated
Teaching Strategies
An effective way of introducing this model is to start with "pure" statistics first to ensure that
students understand the mathematics behind it. Ask students to enumerate all potential results
that can occur in the following random simulations (order matters):
Flip 1 coin
Flip 2 coins
Flip 3 coins
Flip 4coins
Roll one 4-sided number cube
Roll two 4-sided number cubes
Roll three 4-sided number cubes
Roll one 6-sided number cube
Roll two 6-side number cubes
Next, ask the following questions.
How many different pairs of results did you get for each simulation above?
Is there a relationship between the number of possible outcomes for running a random simulation
once, twice, or more times? If so, what might that relationship be?
If I asked you to tell me the number of possible pairs of results of rolling 5 4-sided number
cubes, what would that be? You shouldn't need to enumerate them one by one.
If I increased the number of rolls to 6, what would happen to the number of possible pairs? How
do you know?
Implementation
How to use the Model
There are three main parameters that can be manipulated to change the way this model runs:
The Target and Target2 parameters determine which base pairs to search for as the worksheet
iterates through the DNA strands
Users can also input an unlimited number and length of custom DNA strands to be analyzed
All of these parameters can be manipulated simply by typing the new value into the relevant box.
Once the parameters are set to your satisfaction the model can be run via the following procedure:
Ensure that Excel is set to iterate circular functions only once (Preferences->Calculation->Set
to Calculate "Manually" and to check the box "Limit Iterations" and set maximum iterations to 1)
Change the value of "Start" to 1. The position value should automatically update to 1 as well
Press [Ctrl]+[=] to step forward one iteration. The position value should now be 2
Continue iterating until the position value is larger than any of the lengths of the DNA strands
Modify the excel spreadsheet to search for strings of any length, rather than just single
letters
Understand how to calculate the probability of different types of sequences occurring in a
string of a fixed length
Objective
To accomplish this objective, have students try to understand how the "C" and "G" counters work by
looking at the equation in the cell. Students should pay close attention to the cell references in
this function, as they provide an easy way to trace back the chain of events that determines how
the counters move. Ask the following questions:
What cells do the base counters depend on? How do you know?
What is the meaning of each of the constants in this function? How do you know?
For your target, in place of a "C" type in "CA" and run the spreadsheet again. What happens?
Why?
The counters only ever look at one base at a time, so they aren't ever detecting a two-base
sequence like "CA". How can you modify the counters to look for two-base sequences?
5) Changing the constant in the MID formula will change how the counters analyze the DNA, but
it's a lot of effort to change it each and every time you search for a new string. How can you
change the constant to a variable that automatically updates with the length of the string?
Objective
To accomplish this objective, have students look back at the analysis they did with the rolling
dice earlier, and then ask the following questions:
When you roll two number cubes, how many different ways are there to have the two number cubes
add up to seven? How many total pairs are possible? Based on that, what is the probability of
getting a seven?
How many ways are there to get exactly 3 heads and 1 tail when flipping four coins? What is the
probability of that occurring?
If you assume that DNA base pairs have a random distribution, what is the probability of getting
the string GATC? On average, how many different groups of 4 base pairs would you need to look at
to find that string once?
There are 44 different groups of base pairs in the DNA strings provided in the Excel worksheet.
How many times would you expect to find GATC in these strings? How many times do you actually
find GATC in those strings when you run the simulation? What might this imply?
Extensions
Use statistical tools to analyze observed v. expected frequencies
Expand the spreadsheet to be able to analyze other types of strings
Extension 1
Have students choose several different combinations of base pairs to search for and run the
simulation on each, recording the total number of that combination that they received. Then, have
them analyze the results to see if they have significant evidence that DNA is non-random. Ask the
following questions
How many of each string did you find in the DNA strand? How many did you expect to find, on
average? The average can be calculated by taking the probability of getting the string and
multiplying it by the number of substrings of that size (59 for size 1, 54 for size 2, 49 for
size 3, 44 for size 4).
2) The standard deviation of your results is equal to the square root of the number of
substrings, times the probability of getting that string, times one minus the probability of
getting that string, or sqrt(np(1 - p)) where n is the number of substrings and p is the
probability that an arbitrary substring will match your requirements. What were the standard
deviations for each of your tests?
3) If an experimental result is more than 2 standard deviations away from the average, it is
called a significant result because there is less than a 5% probability that it occurred by
chance alone. Are any of your results significant? How do you know? What does that imply about
these DNA strands?
This applet introduces the concept of numbers that seem random but are actually generated by
determinate processes. In this model, various types of complicated mathematical functions generate
"random" numbers starting from a certain "seed", which helps to ensure that you get different
random numbers each time you run the model. Oddly enough, despite the fact that random numbers are
inherently unpredictable, it is actually possible to analyze and rate the "randomness" of
different functions. One of the most common ways to do so is through frequency analysis: in true
random numbers, the quantity of each digit and string of digits should be exactly what is
predicted by statistics. This model is a great way to introduce the fact that, while DNA can seem
random at first glance, deeper analysis clearly reveals how it differs from "random" data.
This applet challenges users to separate the signal from the noise in partially random data.
Although the data has a component that is completely random and unpredictable, it also has a
determinate component that can be isolated and modeled through regression equations. There are a
number of important parallels between this model and DNA sequence analysis. Like the regression
data, DNA contains random or partially random components due to mutations, measurement errors, and
other environmental disturbances. However, it is important not to let the noise obscure the signal
beneath, which is highly meaningful and significant. Though it is not possible to use a regression
equation directly on DNA, many of the same principles apply, so this model is a great way to
encourage students to start thinking about separating partially random data into a random and a
non-random part.