Jan 30, 2001
Today: How is it done in practice?
Progressive / Iterative Multiple Sequence Alignment (Progressive / Iterative MSA)
Give up: SP score and optimality constraints
These methods are heuristic: unlike the previous ones, they are not guaranteed to find the optimal solution, and they do not even optimize the SP score. But they are fast, efficient and reasonably accurate in practice.
It turns out that most of these methods are based on a pretty similar idea.
General idea (underlying most algorithms used in practice to date): construct the multiple alignment as a succession of pairwise sequence alignments (PSA).
What to do?
Questions: In which order do we choose the pairwise alignments, and how do we align new sequences to an already constructed alignment?
Answers:
A concrete implementation of all previous ideas was done by Feng and Doolittle in 1987.
Feng - Doolittle Algorithm
a) Compute pairwise alignments (and their raw similarity scores) for all pairs of sequences.
b) Convert the raw scores obtained from these alignments to approximate pairwise evolutionary distances (of a kind).
How to do this?
Assuming x_i and x_j are aligned, then:

    d_ij = -log S_eff(i, j)

where:

    S_eff(i, j) = (S_obs(i, j) - S_rand(i, j)) / (S_max(i, j) - S_rand(i, j))

    S_max(i, j) = (S(x_i, x_i) + S(x_j, x_j)) / 2   (average of the two self-alignment scores, the best the pair could achieve)

    S_rand(i, j) = expected score of aligning two random sequences with the same lengths and residue compositions as x_i and x_j.

In practice: shuffle x_i and x_j several times, align the shuffled sequences, and compute the average score.
Question: Why do we have a "-log" in front of the effective scores?
Answer: Similarity decays roughly exponentially with divergence time, so the "-log" makes the quantity roughly linear in evolutionary distance.
Then (for the Feng - Doolittle algorithm):
c) Construct a guide tree from these distances by clustering (Feng and Doolittle used the Fitch-Margoliash method).
d) Progressively align the sequences in the order given by the tree, most similar first: sequence vs sequence, then sequence vs alignment, then alignment vs alignment.
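The score-to-distance conversion can be sketched as follows. This is only a sketch: the identity-count scorer and the shuffle count are illustrative stand-ins (a real implementation would use Needleman-Wunsch scores with a substitution matrix):

```python
import math
import random

def identity_score(x, y):
    # Toy stand-in for a real pairwise alignment score
    # (e.g. a Needleman-Wunsch score with a BLOSUM matrix).
    return sum(a == b for a, b in zip(x, y))

def expected_random_score(x, y, score, n_shuffles=100, seed=0):
    # S_rand: average score over alignments of shuffled sequences.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_shuffles):
        xs, ys = list(x), list(y)
        rng.shuffle(xs)
        rng.shuffle(ys)
        total += score("".join(xs), "".join(ys))
    return total / n_shuffles

def feng_doolittle_distance(x, y, score=identity_score):
    s_obs = score(x, y)
    s_max = 0.5 * (score(x, x) + score(y, y))    # self-alignment average
    s_rand = expected_random_score(x, y, score)
    s_eff = (s_obs - s_rand) / (s_max - s_rand)  # normalized to (0, 1]
    return -math.log(s_eff)                      # ~linear in divergence
```

Identical sequences give S_eff = 1 and hence distance 0; more diverged pairs give larger distances.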
Example: aligning a new sequence x to an already aligned group Y = {y1, y2, .., yM}:

    y1: .. .. D - V .. ..    <- gap inserted by the pairwise alignment
        |  |  |    |  |
    x:  .. .. L - V .. ..

Whenever the pairwise alignment introduces a new gap, the same gap column is inserted into every sequence of Y.
Idea: 1) Set S('-', '-') = 0, i.e. aligning a gap against a gap is neither penalized nor rewarded.
2) This rule encourages gaps to occur in the same column and "guides" the alignment.
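A minimal sketch of these two rules (the function names and the concrete match/mismatch/gap values are mine, not from the lecture):

```python
def insert_gap_column(group, pos):
    # "Once a gap, always a gap": a gap required by the new pairwise
    # alignment is inserted as a whole column into the group alignment.
    return [seq[:pos] + "-" + seq[pos:] for seq in group]

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    # Toy column-pair score with the special rule S('-', '-') = 0.
    if a == "-" and b == "-":
        return 0                      # neither penalize nor reward
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch
```

Because gap-against-gap costs nothing, alignments that stack gaps in the same column score no worse, which is what "guides" new gaps into existing gap columns.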
Can we do better than FD?
Note: In Feng - Doolittle all alignments are determined by PSA. Why is this not the best idea?
Example:
.. .. .. N .. .. .. .. .. .. .. V .. .. .. ..
.. .. .. N .. .. .. .. .. .. .. L .. .. .. ..
.. .. .. N .. .. .. .. .. .. .. N .. .. .. ..
.. .. .. P .. .. .. .. .. .. .. P .. .. .. ..
Idea: When aligning two groups, we exploit the aggregated position-specific information from the groups' MSAs.
So in fact we want to penalize mismatches at conserved positions more severely than at variable positions. We also want to lower the gap penalties at positions where many gaps already occur.
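The second point can be sketched with a simple heuristic (illustrative only — this exact formula is not from the lecture): scale the base gap penalty by the fraction of residues in the column, so gap-rich columns become cheaper to put gaps into.

```python
def column_gap_penalty(column, base_penalty=8.0):
    # Illustrative position-specific gap penalty: columns that already
    # contain many gaps get a proportionally lower penalty.
    non_gap_fraction = sum(c != "-" for c in column) / len(column)
    return base_penalty * non_gap_fraction
```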
One technique that does this is:
Sequence profile alignment
To do:
1. Align X1 = {x1, x2, .. , xn} to X2 = {xn+1, .. , xN}
2. Align profile vs profile

Here: assume the linear gap model F(g) = -g d (a gap of length g costs g·d).
Scoring procedure:

Under SP scoring the total score of the multiple alignment decomposes as

    S = sum_{i<j<=N} s(x_i, x_j)
      = sum_{i<j<=n} s(x_i, x_j) + sum_{n<i<j<=N} s(x_i, x_j) + sum_{i<=n<j<=N} s(x_i, x_j)
        (pairs within X1)          (pairs within X2)            (cross pairs)

Gap scoring:

    S('-', a) = S(a, '-') = -d
    S('-', '-') = 0

For residue pairs we can use BLOSUM matrices or something similar.
Note: Sum terms 1 and 2 are independent of how X1 and X2 are aligned to each other. To get the best alignment we therefore only need to optimize the third sum term. This is analogous to standard PSA, scoring columns against columns with the SP score, and it can be done using a straightforward generalisation of the standard dynamic programming algorithm for global PSA.
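The cross-pair term, evaluated column against column, can be sketched like this (toy match/mismatch/gap scores stand in for a BLOSUM matrix):

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    # Toy column-pair score with the rule S('-', '-') = 0.
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def cross_column_score(col1, col2):
    # SP score over all cross pairs between a column of the X1 profile
    # and a column of the X2 profile; this is the only part of the
    # total score that depends on how X1 and X2 are aligned.
    return sum(pair_score(a, b) for a in col1 for b in col2)
```

This per-column score is what the dynamic programming recursion maximizes.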
CLUSTALW Program [Thompson, Higgins and Gibson, 1994]
CLUSTALW is one widely used implementation of profile-based progressive multiple alignment.
It is very similar to the Feng - Doolittle algorithm and it works as follows:
1. Construct a distance matrix of all N(N-1)/2 pairs of sequences by pairwise sequence alignment. Then convert the similarity scores to evolutionary distances using a specific model of evolution proposed by Kimura in 1983.
2. Construct a guide-tree from this matrix using a clustering method called neighbor-joining proposed by Saitou and Nei in 1987.
3. Progressively align at the nodes of the tree in order of decreasing similarity, using sequence vs sequence, sequence vs profile, and profile vs profile alignments.
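The progressive step over a guide tree can be sketched as a recursion. The tree encoding and pad_merge are mine: pad_merge is only a placeholder for a real profile-vs-profile aligner.

```python
def pad_merge(group1, group2):
    # Placeholder for real profile-vs-profile alignment: just pads
    # all sequences to equal length with gaps and merges the groups.
    width = max(len(s) for s in group1 + group2)
    return [s + "-" * (width - len(s)) for s in group1 + group2]

def progressive_align(tree, align_groups=pad_merge):
    # tree: a sequence (leaf) or a pair of subtrees of the guide tree,
    # e.g. (("ACG", "AC"), "AG"); inner nodes are aligned bottom-up,
    # so the most similar sequences are merged first.
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return align_groups(progressive_align(left, align_groups),
                        progressive_align(right, align_groups))
```

Swapping in a proper group aligner for pad_merge gives the skeleton of the CLUSTALW-style progressive stage.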
We have to add various heuristics to get good accuracy [see Durbin et al., Chapter 6.4, page 148]. Overall, CLUSTALW is a very "handcrafted" algorithm. It would be nice to have something with a better theoretical foundation and comparable or better performance.
New directions: