Sequence Similarity

Compares two sequences to find similarities. Used for:

Phylogenetic tree reconstruction
Function prediction: similar sequences => similar structure => similar function

Global Pairwise Sequence Alignment Problem

Given strings x = x1 x2 ... xn and y = y1 y2 ... ym, find an optimal alignment of x and y.

Alignment

Based on changes that happen to molecules as they evolve (i.e., substitutions and gaps). For example:

HEAGAWGHE-E
--P-AW-HEAE

Formally, an alignment of x and y is a pair x', y' where

x' with gaps removed is x, similarly for y', y
|x'| = |y'|
xi' = yi' = gap never happens
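
The three conditions above can be checked mechanically. The following is a minimal sketch, assuming gaps are written as "-" (as in the example alignment); the function name is_alignment is made up for illustration.

```python
GAP = "-"  # assumed gap symbol, matching the example alignment above


def is_alignment(x_aligned, y_aligned, x, y):
    """Return True if (x_aligned, y_aligned) is a valid alignment of x and y."""
    # Condition 1: removing the gaps recovers the original sequences.
    if x_aligned.replace(GAP, "") != x or y_aligned.replace(GAP, "") != y:
        return False
    # Condition 2: both rows have the same length.
    if len(x_aligned) != len(y_aligned):
        return False
    # Condition 3: no column pairs a gap with a gap.
    return all(not (a == GAP and b == GAP)
               for a, b in zip(x_aligned, y_aligned))


print(is_alignment("HEAGAWGHE-E", "--P-AW-HEAE", "HEAGAWGHEE", "PAWHEAE"))
```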

We assign a score to an alignment (x', y') additively: sum of scores of non-gap (xi', yi') pairs + scores for regions containing gaps. Score matrices are used to assign scores to non-gapped pairs.

Developing Score Matrices

Matrices are derived according to the following probabilistic interpretation:

Assume no gaps
Want the score assigned to x', y' to be a measure of the likelihood that x', y' are related
We consider a score for x', y' relative to a random model R and a match model M
Random model: assume each symbol a occurs independently with probability q(a).
P(x', y' | R) = product_i q(xi') * product_i q(yi')
Match model: assigns a joint probability p(a, b) to each pair (a, b) of aligned symbols.
P(x', y' | M) = product_i p(xi', yi')
Take the odds ratio:
P(x', y' | M) / P(x', y' | R) = product_i [ p(xi', yi') / (q(xi') q(yi')) ]
Log odds ratio:
sum_i s(xi', yi'), where s(a, b) = log( p(a, b) / (q(a) q(b)) )
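
To make the log odds construction concrete, here is a small sketch that builds a score table s(a, b) = log( p(a, b) / (q(a) q(b)) ) from the two models. The two-letter alphabet and all probabilities below are made-up toy values, not taken from any real score matrix, and the function name build_score_matrix is hypothetical.

```python
import math


def build_score_matrix(p, q):
    """Log odds scores s(a, b) = log(p(a, b) / (q(a) * q(b))).

    p maps symbol pairs (a, b) to match-model probabilities;
    q maps symbols to random-model (background) probabilities.
    """
    return {(a, b): math.log(p_ab / (q[a] * q[b]))
            for (a, b), p_ab in p.items()}


# Toy values chosen only for illustration.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
s = build_score_matrix(p, q)

# The additive score of an ungapped alignment equals the log of its odds ratio.
x_aligned, y_aligned = "AB", "AA"
total = sum(s[(a, b)] for a, b in zip(x_aligned, y_aligned))
print(total)
```

Pairs that are more likely under the match model than under the random model get positive scores; pairs that are less likely get negative scores.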

Assigning Scores to Gaps

Linear gap scoring for a gap of length g: gamma(g) = -dg, where d is the per-position gap penalty.

Affine gap scoring for a gap of length g: gamma(g) = -d - e(g-1), where d is the gap-open penalty and e is the gap-extension penalty.
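
As an illustration of additive scoring with either gap scheme, the sketch below scores a gapped alignment by summing pair scores over non-gap columns and applying gamma to each maximal run of gaps. It assumes gaps are written as "-"; the +1/-1 substitution score and the penalty values are toy choices, and the function names are made up.

```python
from itertools import groupby


def linear_gap(g, d):
    """gamma(g) = -d*g"""
    return -d * g


def affine_gap(g, d, e):
    """gamma(g) = -d - e*(g-1)"""
    return -d - e * (g - 1)


def score_alignment(x_aligned, y_aligned, s, gamma):
    """Sum s over non-gap columns plus gamma(length) for each gap run."""
    def kind(col):
        a, b = col
        # "x": gap in the x row, "y": gap in the y row, "m": aligned pair.
        # (A valid alignment never has a gap in both rows.)
        return "x" if a == "-" else "y" if b == "-" else "m"

    total = 0
    for k, run in groupby(zip(x_aligned, y_aligned), key=kind):
        run = list(run)
        if k == "m":
            total += sum(s(a, b) for a, b in run)
        else:
            total += gamma(len(run))
    return total


def toy_s(a, b):
    # Toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


xa, ya = "HEAGAWGHE-E", "--P-AW-HEAE"
print(score_alignment(xa, ya, toy_s, lambda g: linear_gap(g, 2)))
print(score_alignment(xa, ya, toy_s, lambda g: affine_gap(g, 3, 1)))
```

Under the affine scheme a single long gap costs less than several short gaps of the same total length whenever e < d.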

Algorithms for Sequence Similarity

Given x, y both of length n, how many alignments are there? The number grows exponentially: it equals (2n choose n) = (2n)! / (n!)^2 ~= 2^(2n) / sqrt(pi n).
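
A quick numerical check shows how fast this count grows. The sketch below compares the exact binomial coefficient (2n choose n) with the Stirling estimate 4^n / sqrt(pi n); the function names are made up for illustration.

```python
import math


def count_alignments(n):
    """Exact value of (2n choose n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n) from Stirling's formula."""
    return 4.0 ** n / math.sqrt(math.pi * n)


for n in (5, 10, 50, 100):
    exact = count_alignments(n)
    print(n, exact, exact / stirling_estimate(n))
```

The ratio in the last column approaches 1 as n grows, and even for n = 100 the count is far too large for exhaustive enumeration.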

Dynamic Programming Approach

This method takes O(n^2) time and space for two sequences of length n.

The optimal alignment of x, y up to the ith and jth position, respectively, looks like one of the following:

(optimal alignment of x1 ... xi-1, y1 ... yj-1) and xi matched with yj
(optimal alignment of x1 ... xi-1, y1 ... yj ) and xi matched with a gap
(optimal alignment of x1 ... xi , y1 ... yj-1) and yj matched with a gap

So the optimal score is defined as:
F(i, j) = max { F(i-1, j-1) + s(xi, yj), F(i-1, j) - d, F(i, j-1) - d }

F(i, j) is the optimal score for aligning the prefixes x1 x2 ... xi and y1 y2 ... yj.

The base cases are F(0, 0) = 0, F(i, 0) = -di (since x1 ... xi must be matched with i gaps) and F(0, j) = -dj.
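
The recurrence and base cases can be turned into code fairly directly. What follows is a minimal sketch, not the pseudo-code referenced in these notes: it assumes a linear gap penalty d, a caller-supplied substitution function s, and "-" as the gap symbol. The name global_align and the toy +1/-1 scores are made up for illustration.

```python
def global_align(x, y, s, d):
    """Fill the table F and trace back one optimal global alignment.

    Returns (score, x_aligned, y_aligned). Uses linear gap penalty d.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base case F(i, 0) = -d*i
        F[i][0] = -d * i
    for j in range(1, m + 1):          # base case F(0, j) = -d*j
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # xi matched with yj
                F[i - 1][j] - d,                          # xi matched with a gap
                F[i][j - 1] - d,                          # yj matched with a gap
            )

    # Traceback from F(n, m): at each cell, follow a move that produced it.
    xa, ya = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            xa.append(x[i - 1]); ya.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or F[i][j] == F[i - 1][j] - d):
            xa.append(x[i - 1]); ya.append("-")
            i -= 1
        else:
            xa.append("-"); ya.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(xa)), "".join(reversed(ya))


def toy_s(a, b):
    # Toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


score, xa, ya = global_align("HEAGAWGHEE", "PAWHEAE", toy_s, 1)
print(score)
print(xa)
print(ya)
```

When several moves tie, this sketch prefers the diagonal move, so it returns only one of possibly many optimal alignments.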

See Biological Sequence Analysis by Durbin, Eddy, Krogh and Mitchison, page 21, for a diagram of how to fill in the table of F values and how to retrieve the optimal alignment. See also Anne's pseudo-code for the method.