Heuristics for Sequence Alignment

Motivation: When thousands (or hundreds of thousands) of pairwise sequence alignments need to be done quickly, perhaps in searching for similarities between a query sequence and all entries in a sequence database, the O(n^2) running time of dynamic programming algorithms is too costly. Heuristics are faster, but are not guaranteed to find the best alignments with respect to the underlying score matrices.

Two heuristics are in frequent use: FASTA and BLAST. Both run in roughly linear time in practice, and the two can report different alignments for the same query.

FASTA (Lipman & Pearson)
Find good local alignments between sub-sequences of two sequences X = X[1...n] and Y = Y[1...m].

Algorithm for FASTA

With respect to a parameter k, FASTA first finds "hot spots", namely pairs (i,j) for which:
X[i]..X[i+k-1] = Y[j]..Y[j+k-1].

Typically, k ~= 6 for DNA or 2 for proteins. Viewed in a 2D matrix with rows indexed by positions of X and columns by positions of Y, each hot spot (i,j) is a short segment on the diagonal j - i.
Next, find the 10 best diagonal runs of hot spots. A diagonal run is a sequence of nearby hot spots on the same diagonal, not necessarily adjacent. The score associated with each diagonal run is the sum of its hot spot scores plus a (negative) score for the spaces between hot spots.
Hot spot scores can be computed using a score matrix.
Using a trusted score matrix, compute a score for each of the 10 chosen subalignments, using a dynamic programming method.
Find the best alignment that "connects" one or more of these subalignments:
- Construct a graph with 1 node per subalignment. Place a directed edge between two nodes u and v if the subalignment v could follow u in an alignment of X and Y, and weight the edge according to the cost of aligning the intermediate region between u and v.
- Find the max weight path in the graph.
This path defines a substring of X and a substring of Y; output the local alignment between them.
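The graph step above can be sketched as a longest-path computation on a DAG. This is only an illustration under simplifying assumptions: each subalignment is a hypothetical tuple (x_start, x_end, y_start, y_end, score) with half-open, non-empty ranges, and the edge weight is a made-up linear penalty on the size of the region between two subalignments rather than a real alignment cost for that region. The name best_chain and the parameter gap_penalty are not from the notes.

```python
def best_chain(subs, gap_penalty=1.0):
    """Return (score, path) for the max-weight chain of subalignments.

    subs: list of (x_start, x_end, y_start, y_end, score), half-open ranges.
    v may follow u when u ends before v starts in both X and Y.
    """
    if not subs:
        return 0.0, []
    # Sorting by x_start is a topological order, because any predecessor
    # must end (and therefore start) before its successor starts in X.
    order = sorted(range(len(subs)), key=lambda n: subs[n][0])
    best, prev = {}, {}
    for pos, v in enumerate(order):
        v_xs, _, v_ys, _, v_score = subs[v]
        best[v], prev[v] = v_score, None
        for u in order[:pos]:
            _, u_xe, _, u_ye, _ = subs[u]
            if u_xe <= v_xs and u_ye <= v_ys:
                # Assumed edge weight: linear penalty on the skipped region.
                gap = (v_xs - u_xe) + (v_ys - u_ye)
                cand = best[u] + v_score - gap_penalty * gap
                if cand > best[v]:
                    best[v], prev[v] = cand, u
    # Walk back from the best endpoint to recover the path.
    end = max(best, key=best.get)
    total, path = best[end], []
    while end is not None:
        path.append(end)
        end = prev[end]
    return total, path[::-1]
```

The first and last subalignments on the returned path give the endpoints of the substrings of X and Y to output.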

How do we implement this?

Find the hot spots:

Checking every pair (i,j) directly is too computationally expensive.
Use hashing: process X to build a hash table of its length-k substrings, then look up each length-k substring of Y in the table.

As hot spots are found, create ordered linked lists of them along the diagonals.
Search the linked lists, scoring runs and remembering the best solutions as you go.
Score the best runs using the full score matrix.
Create the graph and find the max weight path.
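The first steps above (hashing, grouping by diagonal, scoring runs) can be sketched as follows. This is a minimal illustration, not FASTA's actual implementation: positions are 0-based, Python dictionaries and sorted lists stand in for the hash table and the ordered linked lists, and the function names, hot_score, and gap_penalty are hypothetical. The run score uses a flat score per hot spot instead of a score matrix.

```python
from collections import defaultdict

def find_hot_spots(x, y, k):
    """Return all 0-based (i, j) with x[i:i+k] == y[j:j+k]."""
    # Hash table: k-mer -> list of start positions in x.
    table = defaultdict(list)
    for i in range(len(x) - k + 1):
        table[x[i:i + k]].append(i)
    # Look up every k-mer of y in the table.
    hot = []
    for j in range(len(y) - k + 1):
        for i in table.get(y[j:j + k], ()):
            hot.append((i, j))
    return hot

def group_by_diagonal(hot):
    """Map each diagonal d = j - i to its hot spots, sorted by i."""
    diagonals = defaultdict(list)
    for i, j in hot:
        diagonals[j - i].append((i, j))
    for spots in diagonals.values():
        spots.sort()
    return dict(diagonals)

def best_run_score(spots, k, hot_score=1.0, gap_penalty=0.1):
    """Best-scoring run of consecutive hot spots on one diagonal.

    Assumed scoring: +hot_score per hot spot, minus gap_penalty for each
    position left uncovered between consecutive hot spots.
    """
    best = current = 0.0
    prev_i = None
    for i, _ in spots:
        if prev_i is None:
            current = hot_score
        else:
            gap = max(0, i - (prev_i + k))
            # Either extend the current run or start a new one here.
            current = max(hot_score, current + hot_score - gap_penalty * gap)
        best = max(best, current)
        prev_i = i
    return best
```

Building the table takes time proportional to the length of X, and the lookup pass takes time proportional to the length of Y plus the number of hot spots found.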

BLAST (Altschul, Gish, Miller, Myers, and Lipman)
Early versions of BLAST were faster than FASTA, and were also popular because a range of solutions is output, each associated with a statistical likelihood. (Improved versions of FASTA are now available.) Different versions of BLAST exist for proteins and DNA.

BLASTP (Proteins)
Fix a word length w and a score threshold t (w ~= 2).

1. Find all length-w substrings of Y that align to a length-w substring of X with an alignment score > t:
- Process X: for each length-w substring s, generate all w-tuples that align with s with score > t.
- Store these in a keyword tree and use them to process Y.
2. Extend the w-tuples found in step 1, on both ends, to find longer alignments (with no gaps) with score > C, for a chosen cutoff C.
- Discard extensions that have low scores.
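The two steps above can be sketched as follows. Several things here are assumptions made for illustration: toy_score is a hypothetical stand-in for a real score matrix, a Python dictionary replaces the keyword tree, the alphabet is a tiny subset of amino acid letters, and the extension stops once the running score falls more than drop below the best seen so far (the notes do not specify a stopping rule). Positions are 0-based.

```python
from itertools import product

def toy_score(a, b):
    # Hypothetical stand-in for a real score matrix such as PAM or BLOSUM.
    return 2 if a == b else -1

def neighborhood(x, w, t, alphabet, score):
    """Map every w-tuple scoring > t against some w-mer of x to x positions."""
    words = {}
    for i in range(len(x) - w + 1):
        s = x[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(score(a, b) for a, b in zip(s, cand)) > t:
                words.setdefault("".join(cand), []).append(i)
    return words

def find_hits(words, y, w):
    """Scan y and return seed hits (i, j): x position i, y position j."""
    return [(i, j) for j in range(len(y) - w + 1)
            for i in words.get(y[j:j + w], ())]

def extend_one_way(x, y, i, j, step, score, drop):
    """Ungapped extension from (i, j) in one direction; (best gain, length)."""
    best = run = length = steps = 0
    while 0 <= i < len(x) and 0 <= j < len(y):
        run += score(x[i], y[j])
        steps += 1
        if run > best:
            best, length = run, steps
        if best - run > drop:
            break
        i += step
        j += step
    return best, length

def extend_hit(x, y, i, j, w, score, drop):
    """Extend a seed on both ends; returns (score, x_start, y_start, length)."""
    seed = sum(score(x[i + d], y[j + d]) for d in range(w))
    right, r_len = extend_one_way(x, y, i + w, j + w, 1, score, drop)
    left, l_len = extend_one_way(x, y, i - 1, j - 1, -1, score, drop)
    return seed + left + right, i - l_len, j - l_len, w + l_len + r_len
```

A caller would keep only the extensions whose returned score exceeds the cutoff C and discard the rest. Note that enumerating all w-tuples is feasible only because w is small.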

PAM Unit/Matrices
PAM - Point/Percentage Accepted Mutation

Definition: Sequences S1, S2 are 1 PAM unit diverged if a series of accepted point mutations has converted S1 to S2 with an average of 1 accepted point mutation per 100 amino acids.

Accepted Mutations don't change the function of S1, or at the very least are non-lethal. No insertions or deletions are allowed.

PAM Matrices

Summarize the expected number of evolutionary changes at the amino acid level. The Nth PAM matrix is intended for comparisons of sequences that are N PAM units diverged.

Entry (i,j) reflects the frequency with which Ai is expected to replace Aj in 2 sequences that are N PAM units diverged.

To build the matrix, use a large number of sequence pairs that are N PAM units apart.

F(i,j) = (# of times Ai is aligned opposite Aj) / (total # of aligned pairs)

Let f(i), f(j) be the frequency of Ai, Aj.

The PAM matrix entry at i,j = log( F(i,j) / (f(i) * f(j)) )
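As a worked illustration, the entry can be computed by dividing the observed pair frequency F(i,j) by the frequency expected by chance, taken here as the product f(i) * f(j), and taking the logarithm. The sketch below makes a few assumptions that are not in the notes: aligned columns are given as a list of (a, b) residue pairs, each column is counted in both orders so the result is symmetric, and the logarithm is base 2. The function name log_odds is hypothetical.

```python
import math
from collections import Counter

def log_odds(columns):
    """Return {(a, b): log2(F(a,b) / (f(a) * f(b)))} for observed pairs.

    columns: list of (a, b) residue pairs taken from aligned sequences.
    """
    pair = Counter()
    single = Counter()
    for a, b in columns:
        # Count each column in both orders so that F(a,b) == F(b,a).
        pair[(a, b)] += 1
        pair[(b, a)] += 1
        single[a] += 1
        single[b] += 1
    total = 2 * len(columns)
    scores = {}
    for (a, b), n in pair.items():
        observed = n / total                                   # F(a,b)
        expected = (single[a] / total) * (single[b] / total)   # f(a) * f(b)
        scores[(a, b)] = math.log2(observed / expected)
    return scores
```

A positive entry means the pair is seen more often than chance would predict, and a negative entry means it is seen less often. Pairs that never occur in the data get no entry here; a real table would need some treatment for them.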

This can't be done using real data for large N. In practice, real data is used for small N, and this information is then used, along with various assumptions, to build tables for larger N.
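One commonly described assumption is that mutations at each position follow a Markov model, so the substitution probabilities for N PAM units are obtained by raising the 1-PAM probability matrix to the Nth power. The sketch below illustrates only that idea; the 2-letter matrix is made-up toy data, and mat_mul and mat_pow are hypothetical helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Raise square matrix m to the non-negative integer power n."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = [row[:] for row in m]
    # Repeated squaring: O(log n) matrix multiplications.
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# Toy 1-PAM probabilities for a 2-letter alphabet (illustrative values only):
# entry [i][j] is the probability that letter i becomes letter j.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
```

Each row of the result still sums to 1, so it remains a probability matrix. Log-odds scores for N PAM units would then be derived from these extrapolated probabilities rather than from directly observed pairs.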

Blosum Scores
Scores are derived from data in the BLOCKS database, which contains blocks of highly conserved regions in proteins, using principles similar to those used in construction of the PAM matrices. However, an important refinement of the method is that, when constructing the BLOSUM x matrix, sequences that are too similar (at least x percent identical) are clustered together, so that highly similar pairs do not contribute to the counts. Thus, the scores are believed to better capture more distant, yet conserved, sequence motifs from diverged sequences, compared with the PAM matrices.