CG User Manual

Last updated on Thursday June 11, 2009, at 12:17 hrs by Aashish Dattani

About the document

This document is a user manual for the Constraint Generation software written by Mirela Andronescu.

Thanks to Aashish Dattani, Monir Hajiaghayi and Hosna Jabbari for contributions to this document.

Index

1. Introduction to CG
2. Copyright and disclaimer
3. Components of CGcore
4. Using prediction algorithms with CG

4.1 Simfold

4.2 HotKnots

4.3 HFold

5. Using the cluster

5.1 CPLEX

5.2 IPOPT

6. Output
7. Troubleshooting
8. History
9. References

1. Introduction to CG

Constraint generation (CG) is a computational approach to RNA free energy parameter estimation that can be efficiently trained on large sets of structural as well as thermodynamic data. Our constraint generation approach employs a novel iterative scheme, whereby the energy values are first computed as the solution to a constrained optimization problem. Then the newly-computed energy parameters are used to update the constraints on the optimization function, so as to better optimize the energy parameters in the next iteration.
Using our method on biologically sound data, we obtain revised parameters for the Turner99 and other energy models. We show that by using our new parameters, we obtain significant improvements in prediction accuracy over current state-of-the-art methods.

2. Copyright and disclaimer

Copyright

The CG algorithm and code are copyright Mirela Andronescu, Anne Condon and Holger Hoos, Department of Computer Science, University of British Columbia, and are distributed under the GNU General Public License.

Disclaimer

Although the authors have made every effort to ensure that CG correctly implements the underlying models and fulfills the functions described in the documentation, neither the authors nor the University of British Columbia guarantee its correctness, fitness for a particular purpose, or future availability.

3. Components of CGcore

CG has two main parts:

CGcore - the core algorithm that performs all the CG iterations. It is currently implemented in Perl and uses an SGE (Sun Grid Engine) cluster. This manual refers to its directory as CGcore. [Download CGcore]
- CGlearn.pl is the main file. It calls various other files.
The predictor algorithm, which is described in detail in section 4.

4. Using prediction algorithms with CG

So far, I have used CG with the Simfold, HotKnots and HFold predictor programs. However, you can use your own predictor program with CG. To use your own predictor program, please follow this link.

4.1 Simfold

Download: To run the Simfold predictor with CG, you will need to download the following packages:

MultiRNAFold - preferably version 2.0 or higher. Simfold is part of the MultiRNAFold package. Alternatively you may just download Simfold, but this version might not be updated to the latest version of MultiRNAFold.

Compile: To compile these packages, use gcc version 4.2.1 (higher versions are known to give errors). At the BETA lab, UBC, it is advisable to log into the hydra.cs.ubc.ca machine. Then follow these steps:

Create a directory named CGpred and extract Simfold in it.
Extract MultiRNAFold under the same parent directory as CGpred and rename the MultiRNAFold-x.x directory to just MultiRNAFold.
Change into the MultiRNAFold directory and type make.
Change directory to CGpred/Simfold-template/tools/.
If you have a different directory structure than the one described above, then you need to edit the file Makefile. The MDIR variable should be set to the directory where MultiRNAFold is installed and compiled, relative to this directory.
Type make clean, then make depend, then make.
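
The build steps above can be collected into a small shell function. This is only a sketch: the function name build_simfold and its two arguments are hypothetical, and it assumes the directory layout described in this section (MultiRNAFold and CGpred under the same parent directory). The tools directory is passed explicitly so that the function does not depend on how the Simfold archive unpacks.

```shell
#!/bin/sh
# Sketch of the Simfold build steps (hypothetical helper, not part of CG).
# $1: parent directory that holds both MultiRNAFold and CGpred
# $2: Simfold tools directory, relative to $1
build_simfold() {
    parent="$1"
    tools="$2"

    # Both directories must exist before anything is built.
    if [ ! -d "$parent/MultiRNAFold" ] || [ ! -d "$parent/$tools" ]; then
        echo "expected $parent/MultiRNAFold and $parent/$tools" >&2
        return 1
    fi

    # Build MultiRNAFold first; the tools Makefile refers to it via MDIR.
    ( cd "$parent/MultiRNAFold" && make ) || return 1

    # Then rebuild the Simfold tools from scratch.
    ( cd "$parent/$tools" && make clean && make depend && make )
}
```

Call it with the parent directory and the tools directory from the steps above. If your layout differs, the MDIR edit in Makefile is still needed before the tools will build.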

Run: Make sure that you are logged into the cluster and are able to use it (see section 5). CG takes as input one configuration file with a variety of options. Change directory into CGcore and run:

perl CGlearn.pl ../CGpred/Simfold-template/config/config_sample.txt

If CGlearn.pl stops during an iteration, run the same command again; it will continue from the last completed iteration, so you do not have to start over. (Note: I'm not sure this still works well in version 2.0, although it worked in version 1.0.)
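
As a sketch, the run step can be wrapped in a function that checks its inputs before starting. The name run_cg and its return codes are hypothetical; only the perl CGlearn.pl invocation with a configuration file comes from this manual.

```shell
#!/bin/sh
# Hypothetical wrapper around the documented command:
#   perl CGlearn.pl <configuration file>
# $1: path to the CGcore directory
# $2: configuration file, relative to CGcore (as in the command above)
run_cg() {
    core="$1"
    config="$2"

    if [ ! -f "$core/CGlearn.pl" ]; then
        echo "CGlearn.pl not found in $core" >&2
        return 2
    fi
    # CGlearn.pl is started from inside CGcore, so the configuration
    # path is resolved from there as well.
    if ! ( cd "$core" && [ -f "$config" ] ); then
        echo "configuration file not found: $config" >&2
        return 3
    fi
    ( cd "$core" && perl CGlearn.pl "$config" )
}
```

Calling run_cg again with the same arguments after an interruption is the same as re-running the command by hand.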

4.2 HotKnots

Download: Click here to download the HotKnots package.

Compile: To compile this package, use gcc version 4.2.1 (higher versions are known to give errors). At the BETA lab, UBC, it is advisable to log into the hydra.cs.ubc.ca machine. Then follow these steps:

Extract the package and change directory to HotKnots-template/HotKnotsDP-template/tools.
Type make clean, then make depend, then make.

Run: Make sure that you are logged into the cluster and are able to use it (see section 5). Place the CGcore and HotKnots-template directories under the same parent directory. Change directory into CGcore and run:

perl CGlearn.pl ../HotKnots-template/config/config_sample.txt

4.3 HFold

This is not working yet.

5. Using the cluster

To use the ICICS SGE cluster at CS-UBC, make sure that you can log in (ssh) to any one of the following servers (each reachable as <server>.cs.ubc.ca):

agistri
antiparos
anaphi
guinness
blobel

If you cannot log in to any of the above, contact your supervisor to get permissions. If you are using the beta cluster, simply type

use sge

Or if you are using the arrow cluster, then:

If you are using csh or tcsh as your command interpreter, type the following:

source /cs/beta/lib/pkg/sge/beta_grid/common/settings.csh

If you are using sh, ksh or bash as your command interpreter, type the following:

. /cs/beta/lib/pkg/sge/beta_grid/common/settings.sh

For all clusters, to check if your workstation is allowed to submit jobs to the cluster, type

qconf -ss
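
The qconf -ss check can be scripted. This is a minimal sketch, assuming that qconf -ss prints one submit host per line and that the name printed by hostname has the same form as the entries in that list (on some systems one is fully qualified and the other is not). The function name is hypothetical.

```shell
#!/bin/sh
# Succeeds if host name $1 appears as a whole line in the list on stdin.
is_submit_host() {
    grep -Fqx -- "$1"
}

# Intended use on a machine with SGE set up (not run here):
#   qconf -ss | is_submit_host "$(hostname)" && echo "may submit jobs"
```

Reading the list from standard input keeps the function independent of SGE, so it can be tried with a hand-written list first.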

Now test your workstation with a simple example:

qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh

The qsub command should confirm the successful job submission with a message of the following form (the job number and script name will be those of your own job):

your job 1 (“simple.sh”) has been submitted
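
When submitting from a script, the job number can be pulled out of that confirmation line. This is a sketch, assuming the message has the form shown above (with either "your" or "Your" at the start); the function name is hypothetical.

```shell
#!/bin/sh
# Print the job number from a qsub confirmation line read on stdin.
# Prints nothing if no line matches the expected form.
job_id_from_qsub() {
    sed -n 's/^[Yy]our job \([0-9][0-9]*\) .*has been submitted.*$/\1/p'
}

# Intended use (not run here; job.sh is a placeholder script name):
#   id=$(qsub -cwd -o . -e . job.sh | job_id_from_qsub)
```

An empty result then signals that the submission did not produce the expected confirmation.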

If you are able to successfully run a script on the cluster, then you are ready to run CG with the predictor. See section 4 for more details on running the code. For more details and FAQs about using the cluster, click here. This wiki page is also a good place to start.

5.1 CPLEX

If your energy model is linear (e.g. Simfold), you need to be able to use CPLEX to solve the QP. If you run CG at CS-UBC, you'll have to talk to your supervisor and/or notify Kevin Leyton-Brown.

5.2 IPOPT

If your energy model is quadratic (for example, HotKnots or HFold), you need to be able to use IPOPT to solve the QP. If you run CG at CS-UBC, there is an installed version at /ubc/cs/research/beta/People/Andronescu/bio/Software/CoinIpopt/, and you (probably) don't have to do anything. If you don't have it installed, or you want to install it somewhere else, follow the IPOPT documentation, and then edit the file data/Makefile_qcp_template to point to the right path.

Note that, if your training file is large, compiling the C++ program that runs in conjunction with the IPOPT library takes quite a bit of memory. For that reason, CGlearn.pl logs into the machine saria.cs.ubc.ca, which has 4GB of RAM (not very much, but it should be enough). To make this login smooth, you should be able to log into saria.cs.ubc.ca without a password (otherwise it will ask you for the password at every iteration). To do that, follow these steps:
- Generate id_dsa and id_dsa.pub in your .ssh directory (use the default output file, and don't give a passphrase): ssh-keygen -t dsa
- Append the content of the .ssh/id_dsa.pub file to the .ssh/authorized_keys2 file: cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys2
- From some machine other than saria.cs.ubc.ca, test whether you can ssh into saria.cs.ubc.ca without a password.
- Note: The above should work at UBC CS, where all the machines share the home directory. For two machines that do not share the home directory, generate the id_dsa files on one machine, and append the content of id_dsa.pub to the authorized_keys2 file on the other machine.
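
The login test in the steps above can be made non-interactive. In the sketch below, the ssh-keygen and cat commands are the ones listed above; the BatchMode option makes ssh fail instead of prompting for a password. The function name and the SSH override variable are hypothetical, added so that the check can be tried without a real connection. Recent OpenSSH releases may reject DSA keys, so treat the key type as specific to the setup described here.

```shell
#!/bin/sh
# One-time key setup, as in the steps above (commented out so that
# running this file does not overwrite existing keys):
#   ssh-keygen -t dsa
#   cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys2

# Succeeds only if host $1 accepts a login without asking for a password.
# BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
can_login_without_password() {
    ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# Intended use, from a machine other than saria.cs.ubc.ca (not run here):
#   can_login_without_password saria.cs.ubc.ca || echo "key setup incomplete"
```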

6. Output

CG creates a directory with a complicated name, depending on the input options provided. This directory will be in <CGpred>/results and contains several files of interest:

results_final.txt contains the accuracy of the prediction on the training set, at each CG iteration. After all iterations are done, CG finds the best parameter set according to the F-measure on the training set (or validation set, if given), and tests it on the testing set specified as an input option. This information is written in the results_final.txt file, at the end.
params_sub-*.txt are the parameters estimated at each iteration.
output_verbose.txt contains the input options and a log of the run.
The configuration file you provided as input is copied into the output directory.

The output of running:
perl CGlearn.pl ../CGpred/Simfold-template/config/config_sample.txt
is in CGpred/Simfold-template/results/EXAMPLE_OUTPUT_*
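
Because the name of the results directory depends on the input options, it can be convenient to locate the most recent run by modification time. This is a sketch, assuming each run has its own subdirectory under <CGpred>/results as described above; the function name is hypothetical.

```shell
#!/bin/sh
# Print the most recently modified subdirectory of results root $1.
# Prints nothing if the root has no subdirectories.
latest_results_dir() {
    ls -td "$1"/*/ 2>/dev/null | head -n 1
}

# Intended use from CGcore (not run here; the path is this section's example):
#   dir=$(latest_results_dir ../CGpred/Simfold-template/results)
#   tail "$dir/results_final.txt"
```

The printed path ends with a slash, which is harmless when a file name is appended to it.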

7. Troubleshooting

If you get ERROR, was not parsed correctly, make sure you can use the SGE cluster, and you can run qsub properly.

If CG stops unexpectedly early, check if the file <results_directory>/qcp_main was modified after <results_directory>/qcp.cpp (use ls -l to check that). If this is not the case, then the qcp program didn't compile at that iteration. Try the following:
- Make sure you can log into saria.cs.ubc.ca without a password (if at UBC CS, otherwise into a machine that has at least 4GB of memory). To do that, follow the instructions in section 5.2.
- Make sure saria.cs.ubc.ca is alive by trying to ssh into it. If it is not, find another machine with enough memory and modify CGlearn.pl to use that machine (for this, search for ssh saria).
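
The timestamp comparison described above can be done without reading ls -l output by eye. This is a sketch using find -newer; the function name and its return codes are hypothetical, while the file names qcp_main and qcp.cpp are the ones given above.

```shell
#!/bin/sh
# Check whether qcp_main in results directory $1 is newer than qcp.cpp.
# Returns 0 if it is, 1 if it is not, 2 if either file is missing.
qcp_was_compiled() {
    dir="$1"
    if [ ! -f "$dir/qcp_main" ] || [ ! -f "$dir/qcp.cpp" ]; then
        return 2
    fi
    # find prints the path only when qcp_main was modified after qcp.cpp.
    [ -n "$(find "$dir/qcp_main" -newer "$dir/qcp.cpp")" ]
}
```

A return value of 1 or 2 corresponds to the case where the qcp program did not compile at that iteration. To locate the login that needs changing, grep -n 'ssh saria' CGlearn.pl lists the relevant lines.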

If, in the file results_final.txt, you get something like: ERROR!, File <some_file> is empty! An error must have happened!, then try the following:
- The error might be caused by an incompatibility of HotKnots with newer g++ compilers. It shows up if you compile HotKnots on openSuSE 11.1, but seems to be fine if you compile it on an earlier version, for example openSuSE 10.3. At the time of this writing (June 9, 2009), it works if you compile HotKnots on hydra.cs.ubc.ca (if at UBC, or on an openSuSE 10.3 machine if somewhere else). To compile, go to the directory tools, and type make clean, then make. You only have to recompile HotKnots if you change anything in the directory tools or in the HotKnots code. Once you have it compiled, you can run CG from another machine.
- This error may also happen when the cluster jobs crashed for some reason. If that was the case, then simply rerunning CG might solve the problem. In any case, I suggest you follow the advice above too.

8. History

March 3, 2009, version 2.0 was released. Major changes have been applied since version 1.0.

CG v2.0 uses SGE (Sun Grid Engine) to run training and prediction on a computing cluster.
Two CG versions (namely DIM-CG and LAM-CG) have been added, see Mirela Andronescu's PhD thesis.
CG is now independent (more or less) of the prediction program, and the user can provide a comprehensive configuration file as input.
So far, CG was used in conjunction with:

Simfold, for pseudoknot-free models. This uses CPLEX to solve the optimization program.
HotKnots, for pseudoknotted models. This uses IPOPT to solve the optimization program.

June 2007, version 1.0 was released. That version is now deprecated, but you can download it and read the user manual here.

9. References

CG v1.0 was used in:

M. Andronescu, A. Condon, H.H. Hoos, D.H. Mathews and K.P. Murphy, "Efficient parameter estimation for RNA secondary structure prediction", Bioinformatics (2007) 23(13):i19-i28, ISMB/ECCB 2007.

CG v2.0 was used in:

M. Andronescu, "Computational approaches for RNA energy parameter estimation", PhD thesis, Dept. of Computer Science, University of British Columbia (2008).

Please cite one of the above if you use CG in your work.

Constraint Generation v2.0 - User Manual