CPSC536A - Notes for Class 7 - Multiple Sequence Alignment

Jan 25, 2001

Index

1. Introduction to Multiple Sequence Alignment

1.1 What is multiple sequence alignment?
1.2 Why do we need it? - Motivations
1.3 An example
1.4 Definition: Global Multiple Sequence Alignment

2. What is a good alignment? - Scoring Systems for MSA

2.1 some assumptions
2.2 The ideal solution (and why we can't use it)
2.3 The practical solution: The SP score
2.3.1 Assumptions
2.3.2 Definition : The Sum of Pairs score (SP-score)
2.3.3 Critique of the SP score

3. Constructing MSA's using the SP-score

3.1 The Algorithm
3.2 Complexity
3.3 An Example
3.4 NP-completeness and how to get around it...

Next lecture, we'll be looking at...

References

0. Some notes

This is a huge topic, which will only be covered in part during our lecture.

[Gusfield]

[Durbin et al.]

Most of the time when we talk of sequence alignment from now on, we will mean the alignment of amino-acid sequences, i.e. proteins.

1. Introduction to Multiple Sequence Alignment

1.1 What is multiple sequence alignment?

Multiple Sequence alignment (MSA) can be seen as a generalization of Pairwise Sequence Alignment - instead of

1.2 Why do we need it? - Motivations

Computer scientists take it as a challenge, simply because it's possible - generalization cannot be a bad thing
Biologists think beyond that and wonder if it is of any practical relevance. And in fact, it is:

MSA allows us to extract and represent biologically important but faintly/ widely dispersed sequence

MSA can help us to elucidate biological facts about proteins, etc.

analysis of the secondary/tertiary structure of, for example, proteins
critical consensus motifs (DNA/proteins)

MSA can be seen as the inverse of Pairwise Sequence Alignment (PSA):

When an alignment is good using PSA, we usually conclude that there exists a functional relationship between

In MSA, we already know that there exists a functional similarity, and want to find out where exactly it comes from

1.3 An example

see [Durbin et al.], fig. 6.1
The figure shows the alignment of ten subsequences of a particular family of proteins

1.4 Definition: Global Multiple Sequence Alignment

A Global Multiple Sequence Alignment of N > 2 sequences is obtained by inserting gaps ("_") into the

The ability to actually determine a good alignment depends strongly on how diverged the sequences are
An effect that is often encountered in practice is that in proteins, some regions within all sequences can be well aligned without spending

A plausible explanation for this is that not all residues within a protein are important - some have little or no function at all.

Typical sets of sequences only have a sequence similarity of about 30%
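
The conditions behind definition 1.4 can be checked mechanically. The sketch below assumes the usual textbook form of the definition (all rows have the same length, removing the gaps from a row gives back the original sequence, and no column consists of gaps only); the function name is_global_msa is made up for illustration.

```python
GAP = "_"  # gap symbol as used in these notes


def is_global_msa(rows, sequences, gap=GAP):
    """Return True if `rows` is a valid global alignment of `sequences`.

    Assumed conditions (standard textbook definition, not quoted from the notes):
    equal-length rows, gaps removable to recover each input, no all-gap column.
    """
    if not rows or len(rows) != len(sequences):
        return False
    length = len(rows[0])
    # All rows must have the same length.
    if any(len(r) != length for r in rows):
        return False
    # Removing the gaps from row j must give back sequence j.
    if any(r.replace(gap, "") != s for r, s in zip(rows, sequences)):
        return False
    # No column may consist of gaps only.
    return all(any(r[i] != gap for r in rows) for i in range(length))
```

For example, the rows AC_GT, A_TGT and ACTG_ pass this check for the sequences ACGT, ATGT and ACTG.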

2. What is a good alignment? - Scoring Systems for MSA

2.1 some assumptions

The sequences are usually not independent - they have some sort of evolutionary relationship

Some positions within the sequences are more conserved than others.

E.g. all agree, except for one:

=> here, the "importance" of N seems obvious, and so P should be given a high penalty

2.2 The ideal solution (and why we can't use it)

What we would like to have is a complete and precise model M of molecular sequence evolution, such that given

P(msa) = P(T|M)

i.e. we could compute the correct model and use it as a basis for scoring

However, this model would have to be VERY complex, simply because life and evolution is so complex and full of

2.3 The practical solution: The SP score

2.3.1 Assumptions

the columns of the alignment matrix are statistically independent, and
we don't make use of phylogenetic trees (for now)

2.3.2 Definition : The Sum of Pairs score (SP-score)

given:

The SP-score for the i-th column $m_i$ of the MSA-matrix is calculated as

$$S(m_i) = \sum_{1 \le j < k \le N} s(m_i^j, m_i^k)$$

($m_i^j$ being the j-th entry in the i-th column, and N the number of sequences), and the score for the whole matrix M with L columns is

$$S(M) = \sum_{1 \le i \le L} S(m_i)$$

Simply speaking, the SP score is calculated by first adding up all possible (pairwise alignment) scores for one column and then summing up the scores for all columns.

2.3.3 Critique of the SP score

Problems:

There's no probabilistic justification for the SP score
Each sequence is treated as if it were directly evolutionary related to all other N-1 sequences, where in fact

Nevertheless:

The SP score is easy to work with, and widely used
The results are reasonably good
Other (theoretically better founded) methods that are efficient are not known
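
As a concrete reference point for the SP score discussed in section 2.3, here is a minimal sketch. The pair score s below is a toy assumption (match +1, mismatch -1, residue against gap -2, gap against gap 0); in practice it would come from a PAM/BLOSUM matrix plus a gap cost. The names sp_column and sp_score are illustrative only.

```python
from itertools import combinations

GAP = "_"


def s(a, b):
    """Toy pair score (an assumption, not a real substitution matrix)."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def sp_column(column):
    """Sum of s over all unordered pairs of entries in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """SP score of an alignment given as equal-length rows: sum over columns."""
    return sum(sp_column(col) for col in zip(*rows))
```

With this toy scoring, a fully conserved column such as NNNNN scores 10, while NNNNP scores 2: the single deviating residue changes only four of the ten pair scores.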

3. Constructing MSA's using the SP-score

3.1 The Algorithm

Idea: generalize the dynamic programming approach for pairwise alignment (Needleman-Wunsch Algorithm)
Let's do this for an alignment of three sequences:

For three sequences, seven cases have to be considered. In general there are $2^N - 1$ cases (N = number of sequences): at the (current) end of each sequence we can add either a gap or the next character, which gives $2^N$ combinations, minus one because adding only gaps is not allowed.
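
The case count can be made concrete by enumerating the possible column extensions. This is a small sketch; the function name moves and the 0/1 encoding are assumptions chosen for illustration.

```python
from itertools import product


def moves(n):
    """All ways to extend an alignment of n sequences by one column.

    In each tuple, 1 means "take the next character of that sequence"
    and 0 means "insert a gap". The all-gap column is excluded.
    """
    return [m for m in product((0, 1), repeat=n) if any(m)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, len(moves(n)))
```

For two sequences this reduces to the three familiar cases of pairwise alignment (character/character, character/gap, gap/character).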

for initialization, we use

(s(x,y) is the score we get from the PAM/BLOSUM/... matrix)
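
A minimal sketch of the generalized dynamic programme, written for any number of sequences rather than just three. It is an illustration under stated assumptions, not the recurrence from the lecture: the pair score s is a toy stand-in for a PAM/BLOSUM lookup with a linear gap cost, the origin cell is initialized to 0, each column is scored with the sum of pairs, and only the optimal score is returned (no traceback). The name msa_score is made up.

```python
from itertools import combinations, product

GAP = "_"


def s(a, b):
    """Toy pair score (assumption): match +1, mismatch -1, gap -2, gap/gap 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def msa_score(seqs):
    """Optimal SP score of a global alignment of `seqs` by N-dimensional DP."""
    n = len(seqs)
    # The 2^N - 1 cases: 1 = consume a character, 0 = gap; never all gaps.
    steps = [m for m in product((0, 1), repeat=n) if any(m)]
    best = {(0,) * n: 0}  # assumed initialization: empty alignment scores 0
    # Lexicographic order guarantees every predecessor cell is already filled.
    for cell in product(*(range(len(x) + 1) for x in seqs)):
        if cell in best:
            continue
        candidates = []
        for step in steps:
            prev = tuple(c - d for c, d in zip(cell, step))
            if min(prev) < 0:
                continue  # this case would step outside the matrix
            column = [seq[c - 1] if d else GAP
                      for seq, c, d in zip(seqs, cell, step)]
            col_score = sum(s(a, b) for a, b in combinations(column, 2))
            candidates.append(best[prev] + col_score)
        best[cell] = max(candidates)
    return best[tuple(len(x) for x in seqs)]
```

For two sequences this is ordinary Needleman-Wunsch; the dictionary of cells is used only to keep the sketch short and would be far too slow and memory-hungry for real protein sequences.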

3.2 Complexity

For N sequences of length L, the above algorithm needs

$\Theta(2^N L^N)$ time (see [Gusfield] for an explanation of $\Theta$ - intuitively it means that the complexity is quite exactly $2^N L^N$)

and

$\Theta(L^N)$ space.
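
To get a feeling for these numbers, the following sketch simply counts matrix cells and cell updates, assuming a full N-dimensional matrix with L + 1 entries per dimension and $2^N - 1$ cases per cell as in section 3.1. The function name dp_cost and the choice L = 300 are illustrative assumptions.

```python
def dp_cost(n, length):
    """Return (cells, case_evaluations) for naive N-dimensional DP.

    cells            = (L + 1)^N entries in the full matrix
    case_evaluations = cells * (2^N - 1), one per case per cell (upper bound,
                       since border cells have fewer valid cases)
    """
    cells = (length + 1) ** n
    return cells, cells * (2 ** n - 1)


if __name__ == "__main__":
    for n in range(2, 8):
        cells, work = dp_cost(n, 300)
        print(f"N={n}: {cells:.3e} cells, {work:.3e} case evaluations")
```

Each additional sequence multiplies the number of cells by roughly L and doubles the number of cases per cell.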

3.3 An Example

Fig. 6.3, [Durbin et al.?]. It shows the cube you get for an alignment of three sequences.

It turns out that not all cells of this cube (and in general, of the N-dimensional matrix) need to be computed, and the order of computation can also be heuristically optimised. The rather sophisticated algorithm by Carrillo and Lipman exploits these ideas and thus achieves "slightly" improved time and space usage over "naive" N-dimensional dynamic programming. In 1988, Lipman, Altschul, and Kececioglu implemented a (further refined) version of this algorithm in their program "MSA" (still in use). In practice, MSA is restricted to 5-7 protein sequences of typical length (200-300 residues).

3.4 NP-completeness and how to get around it...

Bad news:

The problem we are trying to solve here is intrinsically very hard - "NP-complete", as computer scientists say. For NP-complete problems, there is (almost) no hope of finding an algorithm whose running time is not exponential.

...But

Bounded Error Approximation

Ideas:

do the alignment consistent with a tree ( not necessarily a phylogenetic tree) that relates the sequences to each other

use the so-called "centre-star method"

this gives us, in polynomial time, an SP-score SP that satisfies

still, this method isn't good enough to be used as a standalone method, but

the theory behind it is interesting
it can be very useful for constructing the "really good" algorithms (which are heuristics)
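
The centre-star idea can be sketched as follows: align every pair, pick as centre the sequence that fits all others best, then merge the pairwise alignments with the centre one at a time, keeping every gap once it has been introduced. This is only an illustration under assumptions: it maximises a toy similarity score (match +1, mismatch -1, gap -2) instead of minimising a distance, so no approximation bound is claimed for it, and the names align_pair and centre_star are made up.

```python
GAP = "_"


def align_pair(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append(GAP); i -= 1
        else:
            ra.append(GAP); rb.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ra)), "".join(reversed(rb))


def centre_star(seqs):
    """Return (centre_index, rows), with rows in the same order as `seqs`."""
    n = len(seqs)
    totals = [sum(align_pair(seqs[c], seqs[k])[0] for k in range(n) if k != c)
              for c in range(n)]
    c = max(range(n), key=lambda i: totals[i])
    msa, order = [seqs[c]], [c]  # msa[0] is always the (gapped) centre
    for k in range(n):
        if k == c:
            continue
        _, ca, kb = align_pair(seqs[c], seqs[k])
        new = [[] for _ in range(len(msa) + 1)]
        p = q = 0  # p walks the centre row of the MSA, q the pairwise alignment
        while p < len(msa[0]) or q < len(ca):
            x = msa[0][p] if p < len(msa[0]) else None
            y = ca[q] if q < len(ca) else None
            if x == y:  # same centre symbol in both: shared column
                col, extra = [r[p] for r in msa], kb[q]; p += 1; q += 1
            elif x == GAP:  # gap only in the MSA: pad the new sequence
                col, extra = [r[p] for r in msa], GAP; p += 1
            else:  # gap only in the pairwise alignment: pad all old rows
                col, extra = [GAP] * len(msa), kb[q]; q += 1
            for row, ch in zip(new, col + [extra]):
                row.append(ch)
        msa = ["".join(r) for r in new]
        order.append(k)
    rows = [None] * n
    for row, k in zip(msa, order):
        rows[k] = row
    return c, rows
```

Calling centre_star(["AT", "ACGT", "AGT"]) returns the index of the chosen centre and three equal-length rows, one per input sequence. Only pairwise alignments are computed, which is why the whole procedure runs in polynomial time.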

Next lecture, we'll be looking at...

Heuristics and phylogenetic trees

References

[Gusfield] D. Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.
Cambridge University Press, 1997.

[Durbin et al.] Durbin, Eddy, Krogh, Mitchison: Biological sequence analysis: Probabilistic models of proteins and nucleic acids.
Cambridge University Press, 1998. (Available from CICSR Reading Room)