Please vote which abstracts you would like us to discuss during the following meetings

Subject: Interesting recent abstracts
Presenters: All old group members
Tasks:
Alena ISMB 2005
Sanja RNA Journal
Mirela RECOMB 2005
Baharak Bioinformatics
Viann PLoS
Dan NAR
Holger Nature + Science
Rosalia JMB

ISMB 2005 (selected by Alena)

[A1] Three-stage prediction of protein beta-sheets by neural networks, alignments and graph algorithms
Jianlin Cheng and Pierre Baldi
Motivation: Protein beta-sheets play a fundamental role in protein structure, function, evolution and bioengineering. Accurate prediction and assembly of protein beta-sheets, however, remains challenging because protein beta-sheets require formation of hydrogen bonds between linearly distant residues. Previous approaches for predicting beta-sheet topological features, such as beta-strand alignments, in general have not exploited the global covariation and constraints characteristic of beta-sheet architectures.

Results: We propose a modular approach to the problem of predicting/assembling protein beta-sheets in a chain by integrating both local and global constraints in three steps. The first step uses recursive neural networks to predict pairing probabilities for all pairs of interstrand beta-residues from profile, secondary structure and solvent accessibility information. The second step applies dynamic programming techniques to these probabilities to derive binding pseudoenergies and optimal alignments between all pairs of beta-strands. Finally, the third step uses graph matching algorithms to predict the beta-sheet architecture of the protein by optimizing the global pseudoenergy while enforcing strong global beta-strand pairing constraints. The approach is evaluated using cross-validation methods on a large non-homologous dataset and yields significant improvements over previous methods.

Availability: http://www.igb.uci.edu/servers/psss.html

[A2] Detecting coevolving amino acid sites using Bayesian mutational mapping
Matthew W. Dimmic, Melissa J. Hubisz, Carlos D. Bustamante and Rasmus Nielsen
Motivation: The evolution of protein sequences is constrained by complex interactions between amino acid residues. Because harmful substitutions may be compensated for by other substitutions at neighboring sites, residues can coevolve. We describe a Bayesian phylogenetic approach to the detection of coevolving residues in protein families. This method, Bayesian mutational mapping (BMM), assigns mutations to the branches of the evolutionary tree stochastically, and then test statistics are calculated to determine whether a coevolutionary signal exists in the mapping. Posterior predictive P-values provide an estimate of significance, and specificity is maintained by integrating over uncertainty in the estimation of the tree topology, branch lengths and substitution rates. A coevolutionary Markov model for codon substitution is also described, and this model is used as the basis of several test statistics.

Results: Results on simulated coevolutionary data indicate that the BMM method can successfully detect nearly all coevolving sites when the model has been correctly specified, and that non-parametric statistics such as mutual information are generally less powerful than parametric statistics. On a dataset of eukaryotic proteins from the phosphoglycerate kinase (PGK) family, interdomain site contacts yield a significantly greater coevolutionary signal than interdomain non-contacts, an indication that the method provides information about interacting sites. Failure to account for the heterogeneity in rates across sites in PGK resulted in a less discriminating test, yielding a marked increase in the number of reported positives at both contact and non-contact sites.

[A3] Improved detection of DNA motifs using a self-organized clustering of familial binding profiles
Shaun Mahony, Aaron Golden, Terry J. Smith and Panayiotis V. Benos
Motivation: One of the limiting factors in deciphering transcriptional regulatory networks is the effectiveness of motif-finding software. An emerging avenue for improving motif-finding accuracy aims to incorporate generalized binding constraints of related transcription factors (TFs), named familial binding profiles (FBPs), as priors in motif identification methods. A motif-finder can thus be biased towards finding motifs from a particular TF family. However, current motif-finders allow only a single FBP to be used as a prior in a given motif-finding run. In addition, current FBP construction methods are based on manual clustering of position specific scoring matrices (PSSMs) according to the known structural properties of the TF proteins. Manual clustering assumes that the binding preferences of structurally similar TFs will also be similar. This assumption is not true, at least not for some TF families. Automatic PSSM clustering methods are thus required for augmenting the usefulness of FBPs.

Results: A novel method is developed for automatic clustering of PSSM models. The resulting FBPs are incorporated into the SOMBRERO motif-finder, significantly improving its performance when finding motifs related to those that have been incorporated. SOMBRERO is thus the only existing de novo motif-finder that can incorporate knowledge of all known PSSMs in a given motif-finding run.

Availability: The methods outlined will be incorporated into the next release of SOMBRERO, which is available from http://bioinf.nuigalway.ie/sombrero

[A4] Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps
Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle and Mona Singh
Motivation: Determining protein function is one of the most important problems in the post-genomic era. For the typical proteome, there are no functional annotations for one-third or more of its proteins. Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression and shared evolutionary history. Taken together, these pairwise linkages can be used to build whole-proteome protein interaction maps.

Results: We develop a network-flow based algorithm, FunctionalFlow, that exploits the underlying structure of protein interaction maps in order to predict protein function. In cross-validation testing on the yeast proteome, we show that FunctionalFlow has improved performance over previous methods in predicting the function of proteins with few (or no) annotated protein neighbors. By comparing several methods that use protein interaction maps to predict protein function, we demonstrate that FunctionalFlow performs well because it takes advantage of both network topology and some measure of locality. Finally, we show that performance can be improved substantially as we consider multiple data sources and use them to create weighted interaction networks.

Availability: http://compbio.cs.princeton.edu/function

[A5] Alignments anchored on genomic landmarks can aid in the identification of regulatory elements
Kannan Tharakaraman, Leonardo Mario-Ramrez, Sergey Sheetlin , David Landsman and John L. Spouge
Motivation: The transcription start site (TSS) has been located for an increasing number of genes across several organisms. Statistical tests have shown that some cis-acting regulatory elements have positional preferences with respect to the TSS, but few strategies have emerged for locating elements by their positional preferences. This paper elaborates such a strategy. First, we align promoter regions without gaps, anchoring the alignment on each promoter's TSS. Second, we apply a novel word-specific mask. Third, we apply a clustering test related to gapless BLAST statistics. The test examines whether any specific word is placed unusually consistently with respect to the TSS. Finally, our program A-GLAM, an extension of the GLAM program, uses significant word positions as new anchors to realign the sequences. A Gibbs sampling algorithm then locates putative cis-acting regulatory elements. Usually, Gibbs sampling requires a preliminary masking step, to avoid convergence onto a dominant but uninteresting signal from a DNA repeat. However, since the positional anchors focus A-GLAM on the motif of interest, masking DNA repeats during Gibbs sampling becomes unnecessary.

Results: In a set of human DNA sequences with experimentally characterized TSSs, the placement of 791 octonucleotide words was unusually consistent (multiple test corrected P < 0.05). Alignments anchored on these words sometimes located statistically significant motifs inaccessible to GLAM or AlignACE.

Availability: The A-GLAM program and a list of statistically significant words are available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/.


RNA journal (selected by Sanja)

[S1] RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble
New approach for RNA secondary structure prediction
Prediction of RNA secondary structure by free energy minimization has been the standard for over two decades. Here we describe a novel method that forsakes this paradigm for predictions based on Boltzmann-weighted structure ensemble. We introduce the notion of a centroid structure as a representative for a set of structures and describe a procedure for its identification. In comparison with the minimum free energy (MFE) structure using diverse types of structural RNAs, the centroid of the ensemble makes 30.0% fewer prediction errors as measured by the positive predictive value (PPV) with marginally improved sensitivity. The Boltzmann ensemble can be separated into a small number (3.2 on average) of clusters. Among the centroids of these clusters, the "best cluster centroid" as determined by comparison to the known structure simultaneously improves PPV by 46.5% and sensitivity by 21.7%. For 58% of the studied sequences for which the MFE structure is outside the cluster containing the best centroid, the improvements by the best centroid are 62.5% for PPV and 31.4% for sensitivity. These results suggest that the energy well containing the MFE structure under the current incomplete energy model is often different from the one for the unavailable complete model that presumably contains the unique native structure. Centroids are available on the Sfold server at http://sfold.wadsworth.org
[S2] Molecular mimicry: Quantitative methods to study structural similarity between protein and RNA
Comparing RNA and protein structures
With rapidly increasing availability of three-dimensional structures, one major challenge for the post-genome era is to infer the functions of biological molecules based on their structural similarity. While quantitative studies of structural similarity between the same type of biological molecules (e.g., protein vs. protein) have been carried out intensively, the comparable study of structural similarity between different types of biological molecules (e.g., protein vs. RNA) remains unexplored. Here we have developed a new bioinformatics approach to quantitatively study the structural similarity between two different types of biopolymers - proteins and RNA - based on the spatial distribution of conserved elements. We applied it to two previously proposed tRNA - protein mimicry pairs whose functional relatedness between two molecules has been recently determined experimentally. Our method detected the biologically meaningful signals, which are consistent with experimental evidence.
[S3] Weighted sequence motifs as an improved seeding step in microRNA target prediction algorithms
microRNA target prediction
We present a new microRNA target prediction algorithm called TargetBoost, and show that the algorithm is stable and identifies more true targets than do existing algorithms. TargetBoost uses machine learning on a set of validated microRNA targets in lower organisms to create weighted sequence motifs that capture the binding characteristics between microRNAs and their targets. Existing algorithms require candidates to have (1) near-perfect complementarity between microRNAs' 5' end and their targets; (2) relatively high thermodynamic duplex stability; (3) multiple target sites in the target's 3' UTR; and (4) evolutionary conservation of the target between species. Most algorithms use one of the two first requirements in a seeding step, and use the three others as filters to improve the method's specificity. The initial seeding step determines an algorithm's sensitivity and also influences its specificity. As all algorithms may add filters to increase the specificity, we propose that methods should be compared before such filtering. We show that TargetBoost's weighted sequence motif approach is favorable to using both the duplex stability and the sequence complementarity steps. (TargetBoost is available as a Web tool from http://www.interagon.com/demo/.)
[S4] Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency
Statistical analysis of free energies; Z-scores
We present results of computer experiments that indicate that several RNAs for which the native state (minimum free energy secondary structure) is functionally important (type III hammerhead ribozymes, signal recognition particle RNAs, U2 small nucleolar spliceosomal RNAs, certain riboswitches, etc.) all have lower folding energy than random RNAs of the same length and dinucleotide frequency. Additionally, we find that whole mRNA as well as 5'-UTR, 3'-UTR, and cds regions of mRNA have folding energies comparable to that of random RNA, although there may be a statistically insignificant trace signal in 3'-UTR and cds regions. Various authors have used nucleotide (approximate) pattern matching and the computation of minimum free energy as filters to detect potential RNAs in ESTs and genomes. We introduce a new concept of the asymptotic Z-score and describe a fast, whole-genome scanning algorithm to compute asymptotic minimum free energy Z-scores of moving-window contents. Asymptotic Z-score computations offer another filter, to be used along with nucleotide pattern matching and minimum free energy computations, to detect potential functional RNAs in ESTs and genomic regions.
[S5] The contributions of dsRNA structure to Dicer specificity and efficiency
Effect of secondary structure on protein (Dicer) function
Dicer processes long double-stranded RNA (dsRNA) and pre-microRNAs to generate the functional intermediates (short interfering RNAs and microRNAs) of the RNA interference pathway. Here we identify features of RNA structure that affect Dicer specificity and efficiency. The data presented show that various attributes of the 3' end structure, including overhang length and sequence composition, play a primary role in determining the position of Dicer cleavage in both dsRNA and unimolecular, short hairpin RNA (shRNA). We also demonstrate that siRNA end structure affects overall silencing functionality. Awareness of these new features of Dicer cleavage specificity as it is related to siRNA functionality provides a more detailed understanding of the RNAi mechanism and can shape the development of hairpins with enhanced functionality.

RECOMB 2005, LNCS 3500/2005 (selected by Mirela)

[M1] A Polynomial Time Solvable Formulation of Multiple Sequence Alignment
Sing-Hoi Sze, Yue Lu, Qingwu Yang
Since traditional multiple alignment formulations are NP-hard, heuristics are commonly employed to find acceptable alignments with no guaranteed performance bound. This causes a substantial difficulty in understanding what the resulting alignment means and in assessing the quality of these alignments. We propose an alternative formulation of multiple alignment based on the idea of finding a multiple alignment of k sequences which preserves k - 1 pairwise alignments as specified by edges of a given tree. Although it is well known that such a preserving alignment always exists, it did not become a mainstream method for multiple alignment since it seems that a lot of information is lost from ignoring pairwise similarities outside the tree. In contrast, by using pairwise alignments that incorporate consistency information from other sequences, we show that it is possible to obtain very good accuracy with the preserving alignment formulation. We show that a reasonable objective function to use is to find the shortest preserving alignment, and, by a reduction to a graph-theoretic problem, that the problem of finding the shortest preserving multiple alignment can be solved in polynomial time. We demonstrate the success of this approach on three sets of benchmark multiple alignments by using consistency-based pairwise alignments from the first stage of two of the best performing progressive alignment algorithms TCoffee and PROBCONS, and replace the second heuristic progressive step of these algorithms by the exact preserving alignment step (we ignore the iterative refinement step in this study). We apply this strategy to TCoffee and show that the new approach outperforms TCoffee on two of the three test sets. We apply the strategy to a variant of PROBCONS with no iterative refinements and show that the new approach achieves a similar accuracy except on one test set. The most important advantage of the preserving alignment formulation is that we are certain that we can solve the problem in polynomial time without using a heuristic. A software program implementing this approach (PSAlign) is available at http://faculty.cs.tamu.edu/shsze.
[M2] Maximum Likelihood of Evolutionary Trees Is Hard
Benny Chor and Tamir Tuller
Maximum likelihood (ML) is an increasingly popular optimality criterion for selecting evolutionary trees (Felsenstein, 1981). Finding optimal ML trees appears to be a very hard computational task, but for tractable cases, ML is the method of choice. In particular, algorithms and heuristics for ML take longer to run than algorithms and heuristics for the second major character based criterion, maximum parsimony (MP). However, while MP has been known to be NP-complete for over 20 years (Day, Johnson and Sankoff [5], reduction from vertex cover), such a hardness result for ML has so far eluded researchers in the field. An important work by Tuffley and Steel (1997) proves quantitative relations between parsimony values and the corresponding log likelihood values. However, a direct application of it would only give an exponential time reduction from MP to ML. Another step in this direction has recently been made by Addario-Berry et al. (2004), who proved that ancestral maximum likelihood (AML) is NP-complete. AML “lies in between ” the two problems, having some properties of MP and some properties of ML. We resolve the question, showing that “regular” ML on phylogenetic trees is indeed intractable. Our reduction follows those for MP and AML, but starts from an approximation version of vertex cover, known as gap vc. The crux of our work is not the reduction, but its correctness proof. The proof goes through a series of tree modifications, while controlling the likelihood losses at each step, using the bounds of Tuffley and Steel. The proof can be viewed as correlating the value of any ML solution to an arbitrarily close approximation to vertex cover.
[M3] A High-Throughput Approach for Associating microRNAs with Their Activity Conditions
Chaya Ben-Zaken Zilberstein, Michal Ziv-Ukelson, Ron Y. Pinter, Zohar Yakhini
Plant microRNAs (miRNAs) are short RNA sequences that bind to target mRNAs and change their expression levels by influencing their stabilities and marking them for cleavage. We present a high throughput approach for associating between microRNAs and conditions in which they act, using novel statistical and algorithmic measures. Our new prototype tool, miRNAXpress, computes a (binary) matrix T denoting the potential targets of microRNAs. Then, using T and an additional predefined matrix X indicating expression of genes under various conditions, it produces a new matrix that predicts associations between microRNAs and the conditions in which they act. The computational intensive part of miRNAXpress is the calculation of T. We provide a hybridization search algorithm which given a query microRNA, a text mRNA, and a predefined energy cutoff threshold, finds and reports all targets (putative binding sites) of the query in the text with binding energy below the predefined threshold. In order to speed it up, we utilize the sparsity of the search space without sacrificing the optimality of the results. Consequently, the time complexity of the search algorithm is almost linear in the size of a sparse set of locations where base-pairs are stacked at a height of three or more. We employed our tool to conduct a study, using the plant Arabidopsis thaliana as our model organism. By applying miRNAXpress to 98 microRNAs and 380 conditions, some biologically interesting and statistically strong relations were discovered. Further details, including figures and pseudo-code, can be found at: http://www.cs.technion.ac.il/~michalz/LinearRNA.ps

Bioinformatics (selected by Baharak)

[B1] A parsimonious tree-grow method for haplotype inference
Zhenping Li, Wenfeng Zhou, Xiang-Sun Zhang and Luonan Chen
Motivation: Haplotype information has become increasingly important in analyzing fine-scale molecular genetics data, such as disease genes mapping and drug design. Parsimony haplotyping is one of haplotyping problems belonging to NP-hard class.

Results: In this paper, we aim to develop a novel algorithm for the haplotype inference problem with the parsimony criterion, based on a parsimonious tree-grow method (PTG). PTG is a heuristic algorithm that can find the minimum number of distinct haplotypes based on the criterion of keeping all genotypes resolved during tree-grow process. In addition, a block-partitioning method is also proposed to improve the computational efficiency. We show that the proposed approach is not only effective with a high accuracy, but also very efficient with the computational complexity in the order of O(m^2 n) time for n single nucleotide polymorphism sites in m individual genotypes.

Availability: The software is available upon request from the authors, or from http://zhangroup.aporc.org/bioinfo/ptg/

[B2] Alignment of metabolic pathways
Ron Y. Pinter, Oleg Rokhlenko, Esti Yeger-Lotem and Michal Ziv-Ukelson
Motivation: Several genome-scale efforts are underway to reconstruct metabolic networks for a variety of organisms. As the resulting data accumulates, the need for analysis tools increases. A notable requirement is a pathway alignment finder that enables both the detection of conserved metabolic pathways among different species as well as divergent metabolic pathways within a species. When comparing two pathways, the tool should be powerful enough to take into account both the pathway topology as well as the nodes' labels (e.g. the enzymes they denote), and allow flexibility by matching similar - rather than identical - pathways.

Results: MetaPathwayHunter is a pathway alignment tool that, given a query pathway and a collection of pathways, finds and reports all approximate occurrences of the query in the collection, ranked by similarity and statistical significance. It is based on a novel, efficient graph matching algorithm that extends the functionality of known techniques. The program also supports a visualization interface with which the alignment of two homologous pathways can be graphically displayed.

We employed this tool to study the similarities and differences in the metabolic networks of the bacterium Escherichia coli and the yeast Saccharomyces cerevisiae, as represented in highly curated databases. We reaffirmed that most known metabolic pathways common to both the species are conserved. Furthermore, we discovered a few intriguing relationships between pathways that provide insight into the evolution of metabolic pathways. We conclude with a description of biologically meaningful meta-queries, demonstrating the power and flexibility of our new tool in the analysis of metabolic pathways.

[B3] Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces
A. Selim Aytuna , Attila Gursoy and Ozlem Keskin
Motivation: Elucidation of the full network of protein-protein interactions is crucial for understanding of the principles of biological systems and processes. Thus, there is a need for in silico methods for predicting interactions. We present a novel algorithm for automated prediction of protein-protein interactions that employs a unique bottom-up approach combining structure and sequence conservation in protein interfaces.

Results: Running the algorithm on a template dataset of 67 interfaces and a sequentially non-redundant dataset of 6170 protein structures, 62 616 potential interactions are predicted. These interactions are compared with the ones in two publicly available interaction databases (Database of Interacting Proteins and Biomolecular Interaction Network Database) and also the Protein Data Bank. A significant number of predictions are verified in these databases. The unverified ones may correspond to (1) interactions that are not covered in these databases but known in literature, (2) unknown interactions that actually occur in nature and (3) interactions that do not occur naturally but may possibly be realized synthetically in laboratory conditions. Some unverified interactions, supported significantly with studies found in the literature, are discussed.

Availability: http://gordion.hpc.eng.ku.edu.tr/prism

[B4] Differential network expression during drug and stress response
Lawrence Cabusora, Electra Sutton, Andy Fulmer and Christian V. Forst
Motivation: The application of microarray chip technology has led to an explosion of data concerning the expression levels of the genes in an organism under a plethora of conditions. One of the major challenges of systems biology today is to devise generally applicable methods of interpreting this data in a way that will shed light on the complex relationships between multiple genes and their products. The importance of such information is clear, not only as an aid to areas of research like drug design, but also as a contribution to our understanding of the mechanisms behind an organism's ability to react to its environment.

Results: We detail one computational approach for using gene expression data to identify response networks in an organism. The method is based on the construction of biological networks given different sets of interaction information and the reduction of the said networks to important response sub-networks via the integration of the gene expression data. As an application, the expression data of known stress responders and DNA repair genes in Mycobacterium tuberculosis is used to construct a generic stress response sub-network. This is compared to similar networks constructed from data obtained from subjecting M.tuberculosis to various drugs; we are thus able to distinguish between generic stress response and specific drug response. We anticipate that this approach will be able to accelerate target identification and drug development for tuberculosis in the future.

[B5] Efficient sorting of genomic permutations by translocation, inversion and block interchange
Sophia Yancopoulos, Oliver Attie and Richard Friedberg
Motivation: Finding genomic distance based on gene order is a classic problem in genome rearrangements. Efficient exact algorithms for genomic distances based on inversions and/or translocations have been found but are complicated by special cases, rare in simulations and empirical data. We seek a universal operation underlying a more inclusive set of evolutionary operations and yielding a tractable genomic distance with simple mathematical form.

Results: We study a universal double-cut-and-join operation that accounts for inversions, translocations, fissions and fusions, but also produces circular intermediates which can be reabsorbed. The genomic distance, computable in linear time, is given by the number of breakpoints minus the number of cycles (b-c) in the comparison graph of the two genomes; the number of hurdles does not enter into it. Without changing the formula, we can replace generation and re-absorption of a circular intermediate by a generalized transposition, equivalent to a block interchange, with weight two. Our simple algorithm converts one multi-linear chromosome genome to another in the minimum distance.


PLoS Computational Biology (selected by Viann)

[V1] Evolution of Genetic Potential
Lauren Ancel Meyers, Fredric D. Ancel, and Michael Lachmann,
PLoS Comput Biol 2005, 1(3):e32
Variation is the fuel of natural selection. Understanding the mutational processes that underlie evolution has long been a central objective of population genetics. Today, amidst a computational revolution in biology, such understanding is pivotal to progress in many biological disciplines. For example, neutral mutations make the molecular clock tick, and this clock is fundamental to reconstructing phylogenies, measuring recombination rates, and detecting genetic functionality. In this manuscript, the researchers provide an original perspective on a long-standing question in evolutionary biology: to what extent do mutation rates evolve? They argue that to cope with environmental fluctuation, populations can evolve their phenotypic mutation rate without changing their genetic mutation rate. That is, populations can evolve .genetic potential..a heightened sensitivity to the effects of mutation. The researchers use a simple mathematical model of amino acid evolution to illustrate the evolution of genetic potential, and show that as environmental variability decreases, evolving populations reach three distinct states. In a rapidly fluctuating environment, organisms evolve the flexibility to cope with variation within an individual lifetime; in moderately variable environments, populations evolve the ability to evolve rapidly; and in fairly constant environments, populations evolve robustness against the adverse effects of mutation.
[V2] Comparative Genomics and Disorder Prediction Identify Biologically Relevant SH3 Protein Interactions
Pedro Beltrao and Luis Serrano
PLoS Comput Biol 2005, 1(3):e26
How can we tackle the complexity of a living cell? It is commonly said that living organisms are complex and display .emergent. properties. Emergence is perceived in this context as behaviors that appear at the system level but are not observable at the level of the system's components. In the cell this would be equivalent to saying that the cellular complexity could be explained if we could understand the interplay between the cellular components: that is, not just describe the .parts. that make up a cell but understand how they interact with each other to perform the necessary tasks.

A big step on the road to understanding cellular complexity will be a complete list of all relevant interactions between the cellular components. Although a lot of progress as been made in this direction, we are often dependent on experimental methods that are costly and time consuming. It's a big challenge for computational biology to process the current available knowledge and to propose new ways of predicting the interactions between cellular components.

Here the researchers studied protein interactions that are mediated by small linear peptide motifs,specifically interactions between a protein's SH3 domain and its targets, usually small peptide stretches containing a PXXP motif (where P is proline and X is any amino acid). The results showed that the putative target motifs that are conserved in ortholog proteins and are within regions that do not have a defined secondary structure are more likely to be relevant binding sites. Besides proposing a way to combine secondary structure information with comparative genomics to predict protein.protein interactions, the researchers highlight a possible role of intrinsically disordered proteins in SH3 protein interactions. The results also show that when looking for conservation of these motifs, it is important to carefully select the species used in the study: comparisons between species that have diverged to a certain extent- not too little and not too much- are the most informative.
[V3] microRNA Target Predictions across Seven Drosophila Species and Comparison to Mammalian Targets
Dominic Grun, Yi-Lu Wang, David Langenberger, Kristin C Gunsalus, Nikolaus Rajewsky
PLoS Comput Biol 2005, 1(1):e13
MicroRNA genes are a recently discovered large class of small noncoding genes. These genes have been shown to regulate the expression of target genes by binding to partially complementary sites in the mRNAs of the targets. To understand microRNA function it is thus important to identify their targets. Here, the authors use their bioinformatic method, PicTar, and cross-species comparisons of several newly sequenced fly species to predict, genome wide, targets of microRNAs in Drosophila. They find that known fly microRNAs control at least 15% of all genes in D. melanogaster. They also show that genomic clusters of microRNAs are likely to coordinately regulate target genes. Analysis of the functional annotation of target genes furthermore suggests specific biological functions for many microRNAs. All predictions are freely accessible at http://pictar.bio.nyu.edu. Finally, Grn et al. compare the function of microRNAs across flies and mammals. They find that (a) the overall extent of microRNA gene regulation is comparable between both clades, (b) the number of targets for a conserved microRNA in flies correlates with the number of targets in mammals, (c) some conserved microRNAs may function in clade-specific modes of gene regulation, and (d) some specific microRNA.target regulatory relationships may be conserved between both clades.

Nucleic Acids Research (selected by Dan)

[D1] Limitations and potentials of current motif discovery algorithms
Jianjun Hu, Bin Li, and Daisuke Kihara,
Nucl. Acids Res. 2005 33: 4899-4913
Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved to be useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. Factors that affect the prediction accuracy, scalability and reliability are characterized. It is revealed that the nucleotide and the binding site level accuracy are very low, while the motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, a consensus ensemble algorithm has been developed, which achieved 645% improvement over the base algorithms by increasing both the sensitivity and specificity. Our study illustrates limitations and potentials of existing sequence-based motif discovery algorithms. Taking advantage of the revealed potentials, several promising directions for further improvements are discussed. Since the sequence-based algorithms are the baseline of most of the modern motif discovery algorithms, this paper suggests substantial improvements would be possible for them.
[D2] Detecting seeded motifs in DNA sequences
Cinzia Pizzi, Stefania Bortoluzzi, Andrea Bisognin, Alessandro Coppe, and Gian Antonio Danieli,
Nucl. Acids Res. 2005 33: e135
The problem of detecting DNA motifs with functional relevance in real biological sequences is difficult due to a number of biological, statistical and computational issues and also because of the lack of knowledge about the structure of searched patterns. Many algorithms are implemented in fully automated processes, which are often based upon a guess of input parameters from the user at the very first step. In this paper, we present a novel method for the detection of seeded DNA motifs, composed by regions with a different extent of variability. The method is based on a multi-step approach, which was implemented in a motif searching web tool (MOST). Overrepresented exact patterns are extracted from input sequences and clustered to produce motifs core regions, which are then extended and scored to generate seeded motifs. The combination of automated pattern discovery algorithms and different display tools for the evaluation and selection of results at several analysis steps can potentially lead to much more meaningful results than complete automation can produce. Experimental results on different yeast and human real datasets proved the methodology to be a promising solution for finding seeded motifs. MOST web tool is freely available at http://telethon.bio.unipd.it/bioinfo/MOST
[D3] Length-dependent energetics of (CTG)n and (CAG)n trinucleotide repeats
Samir Amrane, Barbara Saccà, Martin Mills, Madhu Chauhan, Horst H. Klump, and Jean-Louis Mergny,
Nucl. Acids Res. 2005 33: 4065-4077
Trinucleotide repeats are involved in a number of debilitating diseases such as myotonic dystrophy. Twelve to seventy-five base-long (CTG)n oligodeoxynucleotides were analysed using a combination of biophysical [UV-absorbance, circular dichroism and differential scanning calorimetry (DSC)] and biochemical methods (non-denaturing gel electrophoresis and enzymatic footprinting). All oligomers formed stable intramolecular structures under near physiological conditions with a melting temperature that was only weakly dependent on oligomer length. Thermodynamic analysis of the denaturation process by UV-melting and calorimetric experiments revealed an unprecedented length-dependent discrepancy between the enthalpy values deduced from model-dependent (UV-melting) and model-independent (calorimetry) experiments. Evidence for non-zero molar heat capacity changes was also derived from the analysis of the Arrhenius plots and DSC profiles. Such behaviour is analysed in the framework of an intramolecular branched-hairpin model, in which long CTG oligomers do not fold into a simple long hairpinstem intramolecular structure, but allow the formation of several independent folding units of unequal stability. We demonstrate that, for sequences ranging from 12 to 25 CTG repeats, an intramolecular structure with two loops is formed which we will call bis-hairpin. Similar results were also found for CAG oligomers, suggesting that this observation may be extended to various trinucleotide repeats-containing sequences.
[D4] Topological constraints in nucleic acid hybridization kinetics
Justin S. Bois, Suvir Venkataraman, Harry M. T. Choi, Andrew J. Spakowitz, Zhen-Gang Wang, and Niles A. Pierce,
Nucl. Acids Res. 2005 33: 4090-4095
A theoretical examination of kinetic mechanisms for forming knots and links in nucleic acid structures suggests that molecules involving base pairs between loops are likely to become topologically trapped in persistent frustrated states through the mechanism of helix-driven wrapping. Augmentation of the state space to include both secondary structure and topology in describing the free energy landscape illustrates the potential for topological effects to influence the kinetics and function of nucleic acid strands. An experimental study of metastable complementary kissing hairpins demonstrates that the topological constraint of zero linking number between the loops effectively prevents conversion to the minimum free energy helical state. Introduction of short catalyst strands that break the topological constraint causes rapid conversion to full duplex.
[D5] Genome-wide selection of unique and valid oligonucleotides
Heikki Hyyrö, Martti Juhola, and Mauno Vihinen,
Nucl. Acids Res. 2005 33: e115
Functional genomics methods are used to investigate the huge amount of information contained in genomes. Numerous experimental methods rely on the use of oligo- or polynucleotides. Nucleotide strand hybridization forms the underlying principle for these methods. For all these techniques, the probes should be unique for analyzed genes. In addition to being unique for the studied genes, the probes should fulfill a large number of criteria to be usable and valid. The criteria include for example, avoidance of self-annealing, suitable melting temperature and nucleotide composition. We developed a method for searching unique and valid oligonucleotides or probes for genes so that there is not even a similar (approximate) occurrence in any other location of the whole genome. By using probe size 25, we analyzed 17 complete genomes representing a wide range of both prokaryotic and eukaryotic organisms. More than 92% of all the genes in the investigated genomes contained valid oligonucleotides. Extensive statistical tests were performed to characterize the properties of unique and valid oligonucleotides. Unique and valid oligonucleotides were relatively evenly distributed in genes except for the beginning and end, which were somewhat overrepresented. The flanking regions in eukaryotes were clearly underrepresented among suitable oligonucleotides. In addition to distributions within genes, the effects on codon and amino acid usage were also studied.

Science and Nature (selected by Holger)

[H1] Toward High-Resolution de Novo Structure Prediction for Small Proteins
Philip Bradley, Kira M. S. Misura, and David Baker
Science, Vol 309, Issue 5742, 1850-1854, 16 September 2005: 1868-1871.
The prediction of protein structure from amino acid sequence is a grand challenge of computational molecular biology. By using a combination of improved low- and high-resolution conformational sampling methods, improved atomically detailed potential functions that capture the jigsaw puzzle.like packing of protein cores, and high-performance computing, high-resolution structure prediction (<1.5 angstroms) can be achieved for small protein domains (<85 residues). The primary bottleneck to consistent high-resolution prediction appears to be conformational sampling.
[H2] Parallel Patterns of Evolution in the Genomes and Transcriptomes of Humans and Chimpanzees
Philipp Khaitovich, Ines Hellmann, Wolfgang Enard, Katja Nowick, Marcus Leinweber, Henriette Franz, Gunter Weiss, Michael Lachmann, and Svante Paabo
Science, Vol 309, Issue 5742, 1850-1854, 16 September 2005: 1850-1854.
The determination of the chimpanzee genome sequence provides a means to study both structural and functional aspects of the evolution of the human genome. Here we compare humans and chimpanzees with respect to differences in expression levels and protein-coding sequences for genes active in brain, heart, liver, kidney, and testis. We find that the patterns of differences in gene expression and gene sequences are markedly similar. In particular, there is a gradation of selective constraints among the tissues so that the brain shows the least differences between the species whereas liver shows the most. Furthermore, expression levels as well as amino acid sequences of genes active in more tissues have diverged less between the species than have genes active in fewer tissues. In general, these patterns are consistent with a model of neutral evolution with negative selection. However, for X-chromosomal genes expressed in testis, patterns suggestive of positive selection on sequence changes as well as expression changes are seen. Furthermore, although genes expressed in the brain have changed less than have genes expressed in other tissues, in agreement with previous work we find that genes active in brain have accumulated more changes on the human than on the chimpanzee lineage.
[H3] Genomics: Massively parallel sequencing
Yu-Hui Rogers and J. Craig Venter
Nature, Volume 437 Number 7057 pp295-450:
A sequencing system has been developed that can read 25 million bases of genetic code - the entire genome of some fungi - within four hours. The technique may provide an alternative approach to DNA sequencing.
[H4] DNA polymerases: Hoogsteen base-pairing in DNA replication?
Jimin Wang
Nature, Volume 437 Number 7057 pp295-450: pE6
Full article: Human polymerase-iota belongs to the error-prone Y family of polymerases, which frequently incorporate incorrect nucleotides during DNA replication but can efficiently bypass DNA lesions. On the basis of X-ray diffraction data, Nair et al. propose that Hoogsteen base-pairing is adopted by DNA during its replication by this enzyme. Here I re-examine their X-ray data and find that the electron density is very weak for a Hoogsteen base pair formed between a template adenine deoxyribonucleotide in the syn conformation and a deoxythymidine 5'-triphosphate (dTTP), and that the fit is better for a normal Watson-Crick base pair. As a guanine-cytosine (G-C) base pair has no potential to form a Hoogsteen base pair at physiological pH, Hoogsteen base-pairing is unlikely to be used in replication by this polymerase.
[H5] Viruses in the sea
Curtis A. Suttle
Nature 437, 356-361 (15 September 2005)
Viruses exist wherever life is found. They are a major cause of mortality, a driver of global geochemical cycles and a reservoir of the greatest genetic diversity on Earth. In the oceans, viruses probably infect all living things, from bacteria to whales. They affect the form of available nutrients and the termination of algal blooms. Viruses can move between marine and terrestrial reservoirs, raising the spectre of emerging pathogens. Our understanding of the effect of viruses on global systems and processes continues to unfold, overthrowing the idea that viruses and virus-mediated processes are sidebars to global processes.
[H6] Research grants: The nightmare before funding
Nature, Volume 437 Number 7057 pp295-450: p308
Asked to name one thing they hate about their jobs, many scientists say grant applications. Nature's reporters have asked researchers just why the process is so frustrating, and what can be done to improve matters.
[H7] Harry Potter and the prisoner of presumption
Antony N. Dodd, Carlos T. Hotta and Michael J. Gardner
Nature, Volume 437 Number 7057 pp295-450: p318
Full article: Jeffrey Craig and colleagues, in Correspondence ("Harry Potter and the recessive allele" Nature 436, 776; 2005), recommend the use of analogies as tools for introducing young people to scientific concepts. Taking their example from J. K. Rowling's stories about the young wizard Harry Potter, they suggest that wizarding is a monogenic trait, with the wizard allele (W) recessive to the muggle allele (M). We believe the assumption that wizarding has a genetic basis to be deterministic and unsupported by available evidence.

Following Craig and colleagues' analogy, Hermione, as a muggle-born witch, must have WM parents. However, as Rowling fans could point out, Hermione's parents were muggle dentists who lack any family history of wizarding. It's true, of course, that chance may not have thrown up a witch or wizard for many generations, or that any who did have magical powers may have kept them secret to avoid a witch hunt.

What about Neville's apparently poor wizarding skills? These cannot be explained by incomplete penetrance, as Craig and colleagues suggest. In incomplete penetrance, individuals either display the trait or not: they do not display an intermediate degree of the trait. Poor wizarding skills might be indicative of variable expressivity of an allele. However, both variable expressivity and incomplete penetrance are associated with dominant alleles. If the wizarding allele were dominant, rather than recessive as suggested, wizarding children such as Hermione could not be born to non-wizarding parents.

Neville's clumsiness may, perhaps, be an individual characteristic unrelated to his potential powers. However, it is not possible, from the evidence presented so far, to conclude that wizarding is a heritable trait.

[H8] Fear of the future (book review)
Nature, Volume 437 Number 7057 pp295-450: p319
Will scientific innovation bring progress and benefits, or just risks and dangers?

Journal of Molecular Biology (selected by Rosalia)

[R1] High Affinity Targets of Protein Kinase Inhibitors Have Similar Residues at the Positions Energetically Important for Binding
Felix B. Sheinerman, Elie Giraud and Abdelazize Laoui
JMB, Volume 352, Issue 5 , 7 October 2005, Pages 1134-1156
Inhibition of protein kinase activity is a focus of intense drug discovery efforts in several therapeutic areas. Major challenges facing the field include understanding of the factors determining the selectivity of kinase inhibitors and the development of compounds with the desired selectivity profile. Here, we report the analysis of sequence variability among high and low affinity targets of eight different small molecule kinase inhibitors (BIRB796, Tarceva, NU6102, Gleevec, SB203580, balanol, H89, PP1). It is observed that all high affinity targets of each inhibitor are found among a relatively small number of kinases, which have similar residues at the specific positions important for binding. The findings are highly statistically significant, and allow one to exclude the majority of kinases in a genome from a list of likely targets for an inhibitor. The findings have implications for the design of novel inhibitors with a desired selectivity profile (e.g. targeted at multiple kinases), the discovery of new targets for kinase inhibitor drugs, comparative analysis of different in vivo models, and the design of a-la-carte chemical libraries tailored for individual kinases.
[R2] Transition State Contact Orders Correlate with Protein Folding Rates
Emanuele Paci, Kresten Lindorff-Larsen, Christopher M. Dobson, Martin Karplus and Michele Vendruscolo
JMB Volume 352, Issue 3 , 23 September 2005, Pages 495-500
We have used molecular dynamics simulations restrained by experimental values derived from protein engineering experiments to determine the structures of the transition state ensembles of ten proteins that fold with two-state kinetics. For each of these proteins we then calculated the average contact order in the transition state ensemble and compared it with the corresponding experimental folding rate. The resulting correlation coefficient is similar to that computed for the contact orders of the native structures, supporting the use of native state contact orders for predicting folding rates. The native contacts in the transition state also correlate with those of the native state but are found to be about 30% lower. These results show that, despite the high levels of heterogeneity in the transition state ensemble, the large majority of contributing structures have native-like topologies and that the native state contact order captures this phenomenon.
[R3] In silico Discovery of EnzymeSubstrate Specificity-determining Residue Clusters
Gong-Xin Yu, Byung-Hoon Park, Praveen Chandramohan, Rajesh Munavalli, Al Geist and Nagiza F. Samatova
JMB, Volume 352, Issue 5 , 7 October 2005, Pages 1105-1117
The binding between an enzyme and its substrate is highly specific, despite the fact that many different enzymes show significant sequence and structure similarity. There must be, then, substrate specificity-determining residues that enable different enzymes to recognize their unique substrates. We reason that a coordinated, not independent, action of both conserved and non-conserved residues determine enzymatic activity and specificity. Here, we present a surface patch ranking (SPR) method for in silico discovery of substrate specificity-determining residue clusters by exploring both sequence conservation and correlated mutations. As case studies we apply SPR to several highly homologous enzymatic protein pairs, such as guanylyl versus adenylyl cyclases, lactate versus malate dehydrogenases, and trypsin versus chymotrypsin. Without using experimental data, we predict several single and multi-residue clusters that are consistent with previous mutagenesis experimental results. Most single-residue clusters are directly involved in enzymesubstrate interactions, whereas multi-residue clusters are vital for domaindomain and regulatorenzyme interactions, indicating their complementary role in specificity determination. These results demonstrate that SPR may help the selection of target residues for mutagenesis experiments and, thus, focus rational drug design, protein engineering, and functional annotation to the relevant regions of a protein.