Sequence Similarity
Tests two sequences to find similarities. Used to:
- Phylogenetic tree reconstruction
- Similar sequences => similar structure => similar function
Global Pairwise Sequence Alignment Problem
Given strings x, y : x = x1 x2 ... xn and
y = y1 y2 ... yn find an optimal alignment of x
and y.
Alignment
Based on changes that happen to molecules as they evolve. (i.e. substitutions,
gaps).
For example,
HEAGAWGHE-E
--P-AW-HEAE
Formally, an alignment of x and y is a pair x', y' where
- x' with gaps removed is
x, similarly for y', y
- |x'| = |y'|
- xi' = yi' = gap never happens
We assign a score to an alignment (x', y') additively: sum of scores
of non-gap (xi', yi') pairs + scores for regions containing gaps.
Score matrices are used to assign scores to non-gapped pairs.
Developing Score Matrices
Matrices are derived according to the following probabilistic interpretation:
- Assume no gaps
- Want score assigned to x',y' to be measure of
likelyhood that x', y' are related
- We consider a score for x', y' relative to a random
model and a match model
- Random model: assume each symbol a occurs with probability
q(a).
P(x', y'|R) = product q(xi') * product q(yi')
- Match model: assigns probablity to pairs (a, b) of symbols.
P(x', y')|M) = product p(xi', yi')
- Take the odds ratio:
P(x', y')|M)/P(x', y')|R)
- Log odds ratio:
sum s(xi', yi')
Assigning Scores to Gaps
Linear gap scoring system with a gap length g: gamma(g) = -dg.
Affine gap scoring system with a gap length
g: gamma(g) = -d - e(g-1).
Algorithms for Sequence Similarity
Given x, y both of length n, how many alignments are there?
The number grows exponentially!
(actually, = 2n choose n ~= 2^2n / sqrt(2 pi n)).
Dynamic Programming Approach
This method is O(n^2).
The optimal alignment of x, y up to the ith and
jth position, respectively, looks like one of the following:
- (optimal alignment of x1 ... xi-1, y1 ... yj-1) and
xi matched with yj
- (optimal alignment of x1 ... xi-1, y1 ... yj ) and
xi matched with a gap
- (optimal alignment of x1 ... xi , y1 ... yj-1) and
yj matched with a gap
So the optimal score is defined as:
F(i, j) = max { F(i-1, j-1) + s(xi, yi), F(i-1, j ) - d, F( i , j-1) - d }
F is the optimal score for any prefix
x1 x2 ... xi, y1 y2 ... yj.
The base cases occur at F(i, 0) = -di (since we must match with
i gaps) and F(0, j) = -dj.
See Biological Sequence Analysis by Durbin, Eddy, Krogh and
Mitchison, page 21 for diagram of how to fill in the table of F
values and how to retrieve the optimal sequence. See also Anne's pseudo-code
for the method.