The success of this approach on real input sequence data presumably depends in part on how well the data fits the probabilistic model of evolution that underlies the maximum likelihood algorithm. Unfortunately, it seems that the models of evolution underlying available implementations of maximum likelihood that biologists typically use are very simplistic. Ideally, one would like to understand to what extent the simplicity of the model adversely affects the quality of the trees produced by the algorithm (if at all). The goal of this project is to take a first step towards understanding this.
One approach might be to obtain working implementations of the maximum likelihood algorithm for two evolutionary models: the simple one commonly used, and a more sophisticated one. You could then compare the two algorithms on data generated by both models of evolution and see whether there are differences in the trees produced. (One could possibly also test the algorithms on real biological data, time permitting.)
A simple model of evolution is the Jukes-Cantor model, which assumes that every base in a sequence evolves independently, and at the same mutation rate. It does not allow for insertions and deletions in the sequences.
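To make the model concrete, here is a minimal sketch (in Python, with parameter names of our own choosing) of how one might simulate Jukes-Cantor evolution of a sequence along a single branch. Under this model, if mu is the overall substitution rate per site, a site ends the branch in its original base with probability 1/4 + 3/4*exp(-4*mu*t/3) and in each of the other three bases with probability 1/4 - 1/4*exp(-4*mu*t/3).

```python
import math
import random

BASES = "ACGT"

def evolve_jc(sequence, t, mu=1.0, rng=random):
    """Evolve `sequence` along a branch of length t under the Jukes-Cantor model,
    where mu is the overall substitution rate per site. Each site is treated
    independently; insertions and deletions never occur."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * mu * t / 3.0)
    result = []
    for base in sequence:
        if rng.random() < p_same:
            result.append(base)
        else:
            # Change to one of the three other bases, chosen uniformly.
            result.append(rng.choice([b for b in BASES if b != base]))
    return "".join(result)

# Example: one ancestral sequence and two descendants on branches of length 0.2.
ancestor = "".join(random.choice(BASES) for _ in range(60))
left, right = evolve_jc(ancestor, 0.2), evolve_jc(ancestor, 0.2)
```

Applying this recursively down a tree, from the root to the leaves, is one way to generate the synthetic data sets mentioned above.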
One way to increase the sophistication of the model is to remove the independence assumption by allowing pairwise correlations between the evolution of some pairs of bases. Another way is to allow for insertions and deletions; the so-called Thorne-Kishino-Felsenstein model allows for this.
There is an efficient algorithm that, given a tree, can determine its likelihood according to the Jukes-Cantor model. Is there such an algorithm in the case of pairwise correlations? What about for the Thorne-Kishino-Felsenstein model? These are interesting questions in their own right that would need to be addressed before embarking on an implementation.
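The efficient algorithm alluded to here is commonly attributed to Felsenstein (the "pruning" algorithm). As a point of reference for the Jukes-Cantor case, here is a minimal single-site sketch; the nested-list tree representation and function names are ours, and a real implementation would combine such values over all sites (typically as a sum of logs) and search over trees and branch lengths.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t, mu=1.0):
    """Jukes-Cantor probability of observing base b at the end of a branch of
    length t that starts with base a."""
    e = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(tree, leaf_base):
    """Likelihood of the bases observed at the leaves for a single site.
    `tree` is a leaf name (string) or a list of (subtree, branch_length) pairs;
    `leaf_base` maps leaf names to their base at this site."""
    def partial(node):
        if isinstance(node, str):  # leaf: likelihood 1 for the observed base
            return {b: (1.0 if b == leaf_base[node] else 0.0) for b in BASES}
        L = {b: 1.0 for b in BASES}
        for child, t in node:
            C = partial(child)
            for b in BASES:
                L[b] *= sum(jc_prob(b, c, t) * C[c] for c in BASES)
        return L
    root = partial(tree)
    return sum(0.25 * root[b] for b in BASES)  # uniform root distribution

# Example: the tree ((X:0.1, Y:0.1):0.2, Z:0.3) with bases A, A, G at one site.
tree = [([("X", 0.1), ("Y", 0.1)], 0.2), ("Z", 0.3)]
print(site_likelihood(tree, {"X": "A", "Y": "A", "Z": "G"}))
```

Whether a comparably efficient computation exists once pairwise correlations or insertions and deletions are allowed is exactly the question raised above.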
References: (I don't have these; you would need to look in the library)
In other situations, one might want to add one or more newly discovered sequences to an already existing tree. A natural approach would be to consider the places where the new sequence(s) could be hung onto the existing tree, and choose the one(s) with the highest likelihood value. The goal of this project is to learn something about when this approach is viable.
You could investigate this experimentally or theoretically. One experimental approach might be to compare the tree output by the traditional maximum likelihood algorithm on n+1 (or n+m) input sequences with what you get if you apply the algorithm to n sequences and then add the remaining 1 (or m) sequence(s) onto the tree. Even simpler, you could run the maximum likelihood algorithm on n sequences and also on n+m sequences, and see how often the tree obtained with n sequences is a subtree (in the "right" sense) of the tree obtained with n+m sequences. (Other variables, such as the sequence length and the method for generating the input sequences, would also have to be considered.) From a theoretical point of view, perhaps it is possible to prove that for any tree with n+1 leaves, for sufficiently large sequence length, if the leaf sequences are generated according to the Jukes-Cantor model then both the tree with n+1 leaves and any subtree with n leaves are maximum likelihood trees for that data with high probability. Can one bound the length of the sequences for which such a result might be true?
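For the experimental comparison, one small ingredient you would need is a routine that restricts a tree to a subset of its leaves, so that the "subtree in the right sense" check can be carried out. A minimal sketch for rooted trees, using a nested-tuple representation of our own choosing (unrooted trees would instead be compared via their splits), might look as follows.

```python
def restrict(tree, keep):
    """Restrict a rooted tree (nested tuples; leaves are strings) to the leaf
    set `keep`, suppressing internal nodes left with a single child."""
    if isinstance(tree, str):                       # leaf
        return tree if tree in keep else None
    children = [restrict(child, keep) for child in tree]
    children = [c for c in children if c is not None]
    if not children:
        return None
    if len(children) == 1:                          # suppress unary node
        return children[0]
    return tuple(children)

def canonical(tree):
    """Order the children recursively so two topologies can be compared."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(c) for c in tree), key=repr))

# Is the tree on {A, B, C} the induced subtree of the tree on {A, B, C, D}?
big, small = (("A", "B"), ("C", "D")), (("A", "B"), "C")
print(canonical(restrict(big, {"A", "B", "C"})) == canonical(small))  # True
```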
Buhler and Tompa recently developed an interesting algorithm for finding certain types of planted motifs in sequences. Specifically, the problem they addressed is as follows (quoted from their paper):
"Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point mutations."
In their conclusions they suggest that finding good algorithms based on other probabilistic models for planting motif variants, such as allowing insertions and deletions, would be valuable. How could this be done? Alternatively, suppose that the planted motif is present not in all sequences, but only in a subset of them. Could you develop an algorithm to predict such motifs?
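Whichever variant you pursue, you will want a generator for synthetic test instances. Here is a minimal sketch of the basic (l, d) planted-motif setting described above; the parameter names follow the quoted problem statement, and the relaxations mentioned (indels, or planting the motif in only a subset of the sequences) would be small modifications of this.

```python
import random

ALPHABET = "ACGT"

def plant_instance(t, n, l, d, rng=random):
    """Generate t sequences of length n over ACGT, each containing a variant of
    a random length-l motif M with exactly d point mutations."""
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences = []
    for _ in range(t):
        # Mutate exactly d distinct positions of the motif.
        variant = list(motif)
        for p in rng.sample(range(l), d):
            variant[p] = rng.choice([b for b in ALPHABET if b != motif[p]])
        # Plant the variant at a random offset in a random background sequence.
        seq = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        seq[start:start + l] = variant
        sequences.append("".join(seq))
    return motif, sequences

motif, seqs = plant_instance(t=20, n=600, l=15, d=4)
```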
Reference:
The goal of this project is to consider the inverse problem, that of designing an RNA molecule that will fold so as to have a desired secondary structure.
Given the desired secondary structure, one way to design the molecule would be to place one base, say "A", in the positions that are intended to be unpaired, and use C-G and G-C pairs for the desired base pairs. As a first cut, it would seem reasonable to choose the assignment of C and G bases for the helices so that the assignments of bases to "adjacent" helices are as different as possible, e.g. have high Hamming distance from each other. It would be interesting to understand how well this approach works, and to compare it to other approaches that you might think of.
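As an illustration of the baseline just described, here is a minimal sketch that reads a secondary structure in dot-bracket notation (dots for unpaired positions, matched parentheses for base pairs) and emits one candidate sequence; it picks the orientation of each C-G pair at random, which is exactly the choice a Hamming-distance-based heuristic for helices would try to make more cleverly.

```python
import random

def design_from_structure(dotbracket, rng=random):
    """Return a sequence with 'A' at every unpaired position and a C-G or G-C
    pair at every base pair of the given dot-bracket structure."""
    seq = ["A"] * len(dotbracket)
    stack = []
    for i, ch in enumerate(dotbracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()                    # position j pairs with position i
            seq[j], seq[i] = rng.choice([("C", "G"), ("G", "C")])
    return "".join(seq)

# A small hairpin followed by a second short stem.
print(design_from_structure("((((...))))..((...))"))
```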
To do this, you might write two or possibly more heuristics for assigning bases to helices, given as input a description of a secondary structure. To compare these heuristics, you could build on RNA secondary structure prediction code (downloadable from the web; see Zuker's site) to obtain a program that takes as input a description of a secondary structure, along with several candidate assignments of bases to the structure, and outputs information on the quality of each of these candidate assignments.
You could try your heuristic(s) on RNA structures that arise in nature. Can you design structures on which your heuristics fail? You could also compare with an existing algorithm of Hofacker et al. (see below).
References:
The goal of this project is to implement and empirically study the behaviour of iterative refinement methods for multiple sequence alignment using the SP (sum of pairs) score.
After familiarising yourself with the general approach and conducting a literature survey, you might want to implement two or three different methods (these could all be rather closely related) and study their performance on a set of homologous biological sequences. You might also consider additional tests on sets of sequences obtained by simulated evolution (according to one of the commonly used sequence evolution models, e.g. the Jukes-Cantor model).
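As a concrete starting point, here is a minimal sketch of the SP score itself for a gapped alignment. The scoring values are placeholders (a substitution matrix and affine gap costs would be the more usual choice), and scoring gap-gap pairs as zero is only one of several conventions found in the literature.

```python
def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    score = 0
    for col in range(len(alignment[0])):
        column = [row[col] for row in alignment]
        for i in range(len(column)):
            for j in range(i + 1, len(column)):
                a, b = column[i], column[j]
                if a == "-" and b == "-":
                    continue                     # gap-gap pairs score 0 here
                elif a == "-" or b == "-":
                    score += gap
                elif a == b:
                    score += match
                else:
                    score += mismatch
    return score

print(sp_score(["ACG-T", "AC-GT", "ACGGT"]))
```

An iterative refinement method would then repeatedly remove one sequence (or a group of sequences) from the alignment, realign it against the rest, and keep the change whenever the SP score improves.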
References:
The project is to implement at least two algorithms for protein folding in the 2D HP model (ideally, one of these should not have previously been applied to the problem) and to carefully evaluate their performance against each other and against results known from the literature.
The goal of the project is not to improve on state-of-the-art algorithms for this problem, but rather to familiarise yourself with the problem and the approaches to solving it, and then to get first-hand experience with designing, implementing, and evaluating an algorithm for it.
Various types of stochastic local search algorithms can be and have been applied to this problem. One of the algorithms you implement can be fairly basic, such as Randomised Hill-climbing or a simple Simulated Annealing approach. For the other, you might want to consider simple variants of Iterated Local Search, Ant Colony Optimisation, or Tabu Search.
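Whichever local search method you choose, you will need a representation of conformations and an energy (objective) function. A minimal sketch for the 2D HP model, using an absolute-move encoding of our own choosing (relative moves or pull moves are common alternatives), might look like this; the local search itself would then perturb the move string and accept or reject candidate conformations according to this energy.

```python
def hp_energy(sequence, moves):
    """Energy of a 2D HP conformation: -1 for each pair of H residues that are
    lattice neighbours but not consecutive in the chain. `moves` is a string
    over {'U','D','L','R'} giving the absolute direction of each chain step.
    Returns None if the walk is not self-avoiding."""
    step = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
    coords = [(0, 0)]
    for m in moves:
        x, y = coords[-1]
        dx, dy = step[m]
        coords.append((x + dx, y + dy))
    if len(set(coords)) != len(coords):          # chain crosses itself
        return None
    index = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in step.values():
            j = index.get((x + dx, y + dy))
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1                      # each H-H contact counted once
    return energy

# A 9-residue sequence folded with 8 moves; this fold has one H-H contact.
print(hp_energy("HPHPPHHPH", "RRDLLDRR"))        # prints -1
```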
References: