CPSC 545 - Project Ideas
General Remarks:
- Projects should be worked on by groups of three or four students
all of whom are taking the course for credit.
Each group needs to include at least two CS or ECE graduate students.
- The following are just project ideas. If you have other ideas for potential
projects, come talk to Holger - I look forward to hearing
about your ideas!
- Start discussing project ideas you are interested in with your fellow students
and with me (Holger) as soon as possible - you need to have selected
a project by September 22nd and a proposal has to be submitted by
October 6th. The proposal should be about 2-3 pages long and needs to reflect
your understanding of the problem, briefly discuss existing
and related work, describe the proposed work,
specify a work schedule and distribution of work between the group members,
and list relevant literature.
1. Collecting and Characterising RNA Molecules with Known Secondary Structure
Predicting the secondary structure of RNA molecules provides
an important basis for understanding their function.
Programs to predict such structures exist,
such as mfold [1] and the fold function of the
Vienna RNA package [3]. Also, a program to predict
secondary structure of a pair of RNA molecules is available on the RNAsoft
web page [5]. However, the prediction accuracy of these programs needs
to be further improved,
especially for long molecules.
The first stage of this project is to create a database of RNA
sequences and their known secondary structure. The database would contain two
types of such data: single RNA molecules and pairs of RNA molecules. The
database should comprise biological molecules as well as molecules whose
structures have been studied in vitro.
Other information about each entry in the database
may be needed: RNA type (tRNA, rRNA, snRNA, ribozyme, artificial, etc.),
references, etc.
Databases with single RNA sequence-structures already exist [6,7], but
they are specialised on specific types of RNAs. Also, many more
known pairs of RNA sequences and structures exist, but
are spread out over the Internet. As
regards pairs of RNA molecules, it is questionable whether any database
exists, but many research papers give examples [8], which may be collected
into the new database.
Assembling such a database would provide an important basis
for improving our understanding of RNA secondary structure
and may ultimately help to design new methods for better
prediction of secondary structures.
The second stage of the project is to use a programme
for statistically analysing collections of RNA structures,
RNA Analyser, to investigate and characterise
the properties of the RNA structures in the newly created database.
RNA Analyser takes as its input a set of RNA secondary structures, and
outputs statistics related to the number of multiloops, length of stems,
size of internal loops, etc. The goal of this stage is to characterise
the properties of various types of biological RNA structures.
For this purpose, the existing RNA Analyser will likely have to be extended
in various ways.
This project will involve a substantial amount of literature and web
search. It will improve your research skills and your knowledge on RNA
molecules.
References:
- Zuker's web site on RNA secondary structure prediction:
bioinfo.math.rpi.edu/~zukerm/rna.
- For background on RNA secondary structure prediction, see the lecture
notes at U. Washington: www.cs.washington.edu/education/courses/527/00wi
- I. Hofacker, ``RNA folding packages". 3 October, 2000.
www.tbi.univie.ac.at/~ivo/RNA
- I.L. Hofacker, W. Fontana, P.F. Stadler, S. Bonhoffer, M. Tacker, and
P. Schuster, ``Fast folding and comparison of RNA secondary structures.''
Monatshefte f. Chemie, Volume 125, 1994. pages 167-188. (Available from
the above web site.)
- RNAsoft web page: www.rnasoft.ca.
- Gutell Lab Comparative RNA Web Site: www.rna.icmb.utexas.edu.
- M. Sprinzl, C. Horn, M. Brown, A. Ioudovitch and S. Steinberg,
Compilation of tRNA sequences and sequences of tRNA genes,
Nucl. Acids. Res. (1998) 26: 148-153.
- P. B. Rupert and A. R. Ferre-d'Amare, Crystal structure of a
hairpin ribozyme - inhibitor complex with implications for
catalysis, Nature (2001) 410.
2. Comparison of Empirical Potential Functions for
Reduced Off-lattice Models of Protein Folding
The protein structure prediction problem is one of the most important problems in
computational biology. Simplified models of protein folding have been
widely used to study protein folding. Success in solving the protein
structure prediction problem relies on the choice of an accurate potential
energy function that can descriminate between native (functional) state of
the protein and misfolded states. Reduced off-lattice models are models
that are usually used for defining the configuration of the backbone of
the protein: the protein main chain (consisting of C, O, N and C_alpha
atoms) is modelled by means of two angles per residue (phi and psi).
Side chains are modelled as units and the side
chain propensities are taken into account by the energy potential function.
This project is to carry out an empirical comparison of at least
two known energy potential functions of your choice
(see references below) with respect to their accuracy in
descriminating between native structures
and misfolded states of given proteins
using the standard RMSD measure of similarity between two
structures (RMSD); furthermore, the correlation between
the similarity of various structures of the same protein
and the respective energy potential values should be studied.
References:
- S. Vajda, M. Sippl, and J. Novotny: Empirical potentials and
functions for protein folding and binding. Current Opinion in Structural
Biology, Vol 7, pp. 222-228, 1997.
[ electronic version ]
- B. Park, M. Levitt: The Complexity and Accuracy of Discrete State
Models of Protein Structure. J. Mol. Biol., Vol 249, pp. 493-507, 1995.
- S. Wallin, J. Farwer, and U. Bastolla: Testing Similarity Measure
with Continuous and Discrete Protein Models. Proteins: Structure, Function
and Genetics, Vol 50, pp. 144-157, 2003.
- more references can be found in the first reference above,
see section "Off-lattice folding simulations".
3. Protein Design in the 2D HP Model
De novo design of proteins is an approach for designing
novel proteins with predetermined properties.
It gives rise to the following protein design problem or
inverse protein folding problem:
Given a protein structure find an amino-acid sequence
that folds into the specified structure as its native state. The
two-dimensional Hydrophobic-Hydrophilic (2D HP) model is an extremely
simplified model of protein structure that has been extensively studied by
biochemists and physicists, and models certain properties of real proteins.
The project is to implement an algorithms for protein design in the 2D HP
model (ideally, a new algorithm that has not been previously applied to the
problem) and to carefully evaluate its performance against results known
from the literature. Furthermore, given protein structures
can be investigated in terms of "designability", e.g., by determining
the number of sequences that have a given structure as
their native structure.
Implementations of high-performance algorithms for 2D HP protein structure
prediction will be provided.
References:
- J. Deutsch, and T. Kurosky: New Algorithm for Protein Design.
Physical Review Letters, Vol 76, Num 2, pp. 323-326, 1996.
- A. Irback, C. Peterson, F. Pottharst, and E. Sandelin: Monte Carlo
procedure for protein design. Physical Review E, Vol 58. Num 5, pp.
5249-5252, 1998.
- C. Micheletti, A. Maritan, J. Banavar: A comparative study of
existing and new design techniques for protein models. J. of Chem.
Physics, Vol 110, Num 19, pp. 9730-9738.
- Brian Hayes: Prototeins. American Scientist, Volume 86, No. 3, 1998.
[ electronic version ]
- 2D HP Benchmark Instances
4. Motif Finding Algorithm
In order to find potential sites in DNA sequences that play roles in
control processes such as regulation of genes or splicing of introns,
researchers use algorithms that can find certain types of "motifs"
in DNA sequences.
Buhler and Tompa recently developed an interesting algorithm for
finding certain types of planted motifs in sequences. Specifically,
the problem they addressed is as follows (quoted from their paper):
"Suppose there is a fixed but unknown nucleotide sequence M (the
motif) of length l. The problem is to determine M, given t nucleotide
sequences each of length n, and each containing a planted variant of
M. More precisely, each such planted variant is a substring that is M
with exactly d point mutations."
In their conclusions they suggest that finding good algorithms based
on other probabilistic models for planting motif variants, such as
allowing insertions and deletions, would be valuable. How could this
be done? Alternatively, suppose that the planted motif is present not
in all sequences, but in a subset of them. Could you develop an
algorithm that could predict such motifs?
Reference:
- J. Buhler and M. Tompa. "Finding Motifs Using Random Projections",
Proceedings of the Fifth Annual Intl. Conference on Computational
Molecular Biology, Montreal, Canada, April 2001. (The paper is
available at Tompa's web page at U. Washington.)
5. Multiple Sequence Alignment
The problem of finding alignments of multiple sequences (DNA or protein)
arises in many bioinformatics applications and often forms the basis
for further computations, e.g., in phylogenetic tree construction.
Unfortunately, determining optimal multiple sequence alignments
is computationally hard (under certain, not realistic conditions,
the problem has been shown to be NP-hard). A variety of
heuristic techniques have been applied to this problem, including
Genetic Algorithms and Simulated Annealing.
The goal of this project is to implement and empirically
study the behaviour of iterative refinement methods
for multiple sequence alignment
using the SP (sum of pairs) score.
After familiarising yourself with the
general approach and conducting a literature survey,
you might want to implement two or three different methods
(these could be all rather closely related) and study their
performance on a set of homologous biological sequences.
You might also consider additional tests on sets of sequences
obtained from applying simulated evolution (according to
one of the commonly used sequence evolution models, e.g., the
Jukes-Cantor model).
References:
-
Durbin, Eddy, Krogh, Mitchison: Biological Sequence Analysis.
Cambridge University Press, 1998; Ch.6.
(This book is available from the Reading Room; it contains
a good introduction to the problem and contains further references.)
-
O. Gotoh: Multiple sequence alignment: algorithms and applications.
Adv. Biophys. 36, 159-206, 1999.
- Udo Tönges, Sören W. Perrey, Jens Stoye, Andreas W. M. Dres:
A General Method for Fast Multiple Sequence Alignment;
Preprint, Universität Bielefeld, 1996.
[ electronic version ]
- Multiple Alignment Resource Page of the VSNS BioComputing Division:
www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
6. Combinatorial DNA Design: A Coding Theory Approach
A DNA strand can be seen as a string over a quaternary alphabet
with the letters A,C,G,T.
Combinatorial methods for designing DNA strands have been used
to obtain sets of DNA strands that do not hybridize between
themselves and only with perfect complements.
Some methods used for this purpose can be found in the coding theory
literature [1,2,3]. Nevertheless, few publications report
results on time complexity and algorithmic behavior for their methods.
The project is to implement and possibly expand
one (or more) of the algorithms presented in [1,2] and/or [3],
and to characterise its performance,
with the goal of gaining a better understanding of the empirical
performance of these methods and
possibly achieving improved results.
References:
- K.U. Koshnick (1991), "Some new constant weight codes," IEEE
Transactions on Information Theory, 37, 370-371.
- G. Dueck and T. Scheuer (1990), "Threshold accepting: a general
purpose optimization algorithm appearing superior to simulated
annealing," Journal of Computational Physics, 90, 161-175
- F. Comellas and R. Roca (1993), "Using genetic algorithms to
design constant weight codes," Proceedings of the International
Workshop on Applications of Neural Networks to Telecomunication,
LAwrence Erlbaum, Hillsdale, NJ, 119-124.
- A.E. Brouwer, J.B. Shearer, N.J.A. Sloane, W.D. Smith (1990),
"A new table of constant weight codes," IEEE Transactions on
Information Theory, 36, 1334-1380.
7. Design of DNA Strand Sets with Unequal Length
The accurate design of DNA molecules for biomolecular computations and
nano-assemblies is of great interest to many researchers today.
Many approaches to the problem of designing sets of DNA strand with
desirable properties have been applied so far. Most of these
methods, however, are designed to generate sets of strands that
are all of equal length.
The project is to develop an algorithm (or
eventually to modify a currently existing method [1]) to design DNA
strands of different length. All strands must fullfil
similar constraints (combinatorial and thermodynamic). The
algorithm should be able to build new sets
from scratch as well as to extend a initial set of DNA
strands and/or to generate new DNA strands without modifying the
existing ones, such that the entire resulting set
fulfils all given constraints.
The new algorithm should be analysed empirically
and its performance should be compared to existing methods.
References:
- Dan C. Tulpan, Holger H. Hoos, Anne Condon, "Stochastic Local
Search Algorithms for DNA Word Design," DNA 8 2002: 229-241.
last update 03/09/16, hh