Last updated on Thursday, June 11, 2009, at 12:17 hrs by Aashish Dattani
About the document
This document is a user manual for the Constraint
Generation software written by Mirela Andronescu.
Thanks to Aashish Dattani, Monir Hajiaghayi
and Hosna Jabbari for contributions to this document.
1. Introduction to CG
Constraint generation (CG) is a computational approach to RNA free
energy parameter estimation that can be efficiently trained on large
sets of structural as well as thermodynamic data. Our constraint
generation approach employs a novel iterative scheme, whereby the
energy values are first computed as the solution to a constrained
optimization problem. Then the newly-computed energy parameters are
used to update the constraints on the optimization function, so as to
better optimize the energy parameters in the next iteration.
Using our method on biologically sound data, we obtain revised
parameters for the Turner99 and other energy models. We show that by
using our new parameters, we obtain significant improvements in
prediction accuracy over current state-of-the-art methods.
2. Copyright and disclaimer
Copyright
The CG algorithm and code are copyrighted under the GNU General
Public License by Mirela Andronescu, Anne Condon and Holger
Hoos, Department of Computer Science, University of British Columbia.
Disclaimer
Although the authors have made every effort to ensure that CG
correctly implements the underlying models and fulfills the functions
described in the documentation, neither the authors nor the University
of British Columbia guarantee its correctness, fitness for a particular
purpose, or future availability.
3. Components of CGcore
CG has two main parts:
- CGcore: the core algorithm that performs all the CG iterations.
Currently, this is implemented in Perl and uses an SGE (Sun Grid
Engine) cluster. Let's call this directory CGcore. [Download CGcore]
- CGlearn.pl is the main file. It
calls various other files, outlined below.
- start_sge_job.pl: This is a
wrapper for the SGE jobs. It helps CGlearn.pl know when the SGE job
finished.
- split_dataset_into_many_files.pl:
Takes as input the training structural set and splits it into many
files that will be used for the SGE jobs.
- get_percentage_lb_ub.pl: A Perl
script that creates files with the upper and lower bounds on the
initial parameters.
- average_analysed_results.pl: After
the prediction and analysis step has run via SGE, this script averages
the results and adds the average to the results file.
- cat_all_files.pl: A replacement for cat, which doesn't work when
there are too many files.
- pick_best_training.pl is a script
which picks the best
parameter set according to the f-measure on the training set (or
validation set, if provided). It is called by CGlearn.pl, but it can
also be called separately.
- The predictor algorithm that is described in detail in
section 4.
4. Using prediction algorithms with CG
So far, I have used CG with the Simfold, HotKnots and HFold predictor
programs. However, you can use your own predictor program with CG. To
use your own predictor program, please
follow this link.
4.1 Simfold
Download: To run the Simfold
predictor with CG, you will need to download the following package:
- MultiRNAFold, preferably version 2.0 or higher. Simfold is part of
the MultiRNAFold package. Alternatively, you may download just Simfold,
but that version might not be up to date with the latest version of
MultiRNAFold.
Compile: To compile these packages, use gcc
version 4.2.1 (higher versions are known to give errors). At the BETA
lab, UBC, it is advisable to log into the hydra.cs.ubc.ca
machine. After that, follow these steps:
- Create a directory named CGpred and extract Simfold in it.
- Extract MultiRNAFold under the same
parent directory as CGpred and rename the MultiRNAFold-x.x
directory to just MultiRNAFold.
- Change into the MultiRNAFold directory and
type make.
- Change directory to CGpred/Simfold-template/tools/.
- If you have a different directory structure than the one
described
above, then you need to edit the file Makefile. The MDIR variable
should be set to the directory where MultiRNAFold is installed and
compiled, relative to this directory.
- Type make clean
- make depend
- make
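The compile steps above can be sketched as one shell function. This is only a sketch: it assumes that MultiRNAFold and CGpred sit under the same parent directory and that the Simfold directory is named Simfold-template, as in the run command. The names build_simfold and CG_MAKE are hypothetical and are not part of CG.

```shell
# Sketch only: build MultiRNAFold, then the Simfold tools used by CG.
# Assumes <parent>/MultiRNAFold and <parent>/CGpred/Simfold-template/tools.
# build_simfold and CG_MAKE are illustrative names, not part of CG.
build_simfold() (
    # Resolve the parent directory to an absolute path.
    parent=$(cd "$1" && pwd) || exit 1
    make_cmd=${CG_MAKE:-make}
    tools="$parent/CGpred/Simfold-template/tools"

    # Stop early if the expected layout is not there.
    for dir in "$parent/MultiRNAFold" "$tools"; do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            exit 1
        fi
    done

    # Compile MultiRNAFold first.
    cd "$parent/MultiRNAFold" || exit 1
    "$make_cmd" || exit 1

    # Then rebuild the Simfold tools from a clean state.
    cd "$tools" || exit 1
    "$make_cmd" clean && "$make_cmd" depend && "$make_cmd"
)

# Usage: build_simfold /path/to/parent
```

The function body runs in a subshell, so the directory changes do not affect the calling shell.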
Run: First make sure that you are logged
into the cluster
and are able to use it (see section 5). CG takes as input one
configuration file with a variety of options. First cd into CGcore,
then run:
- perl CGlearn.pl ../CGpred/Simfold-template/config/config_sample.txt
If, for some reason, CGlearn.pl stops during some iteration, you can
run the same command, and it will continue from the last completed
iteration, so you don't have to run it all over again. (Note: I'm not
sure this still works well in version 2.0, although it worked
in version 1.0.)
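Since the configuration path in the run command is relative to CGcore, a small wrapper can check both files before starting. This is a minimal sketch: run_cg and CG_PERL are hypothetical names, and the only thing taken from CG itself is the command perl CGlearn.pl followed by the configuration file.

```shell
# Sketch only: run CGlearn.pl from inside the CGcore directory.
# run_cg and CG_PERL are illustrative names, not part of CG.
run_cg() (
    cgcore=$1
    config=$2
    perl_cmd=${CG_PERL:-perl}

    cd "$cgcore" || exit 1
    if [ ! -f CGlearn.pl ]; then
        echo "CGlearn.pl not found in $cgcore" >&2
        exit 1
    fi
    # The configuration path is interpreted relative to CGcore.
    if [ ! -f "$config" ]; then
        echo "configuration file not found: $config" >&2
        exit 1
    fi
    "$perl_cmd" CGlearn.pl "$config"
)

# Usage: run_cg CGcore ../CGpred/Simfold-template/config/config_sample.txt
```

Running the same call again after an interruption issues the same command as before, which is what the manual suggests for continuing a stopped run.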
4.2 HotKnots
Download: Click
here to download the HotKnots package.
Compile: To compile this package, use
gcc version 4.2.1 (higher versions are known to give errors). At the BETA
lab, UBC, it is advisable to log into the hydra.cs.ubc.ca
machine. After that, follow these steps:
- Extract the package and change directory to HotKnots-template/HotKnotsDP-template/tools.
- Type make clean.
- make depend
- make
Run: First make sure that you are logged
into the cluster and are able to use it (see section 5). Place the CGcore
and HotKnots-template directories under the same
parent directory. Change directory into CGcore and
run:
- perl CGlearn.pl ../HotKnots-template/config/config_sample.txt
If, for some reason, CGlearn.pl stops during some iteration, you can
run the same command, and it will continue from the last completed
iteration, so you don't have to run it all over again. (Note: I'm not
sure this still works well in version 2.0, although it worked
in version 1.0.)
4.3 HFold
This is not working yet.
5. Using the cluster
To use the ICICS SGE cluster at CS-UBC, make sure that you can
log in (ssh) to any one of the following (server@cs.ubc.ca):
- agistri
- antiparos
- anaphi
- guinness
- blobel
If you cannot log in to any of the above, contact your supervisor
to get permissions. If you are using the beta cluster, simply type
Or if you are using the arrow cluster, then:
- If you are using csh or tcsh as your command interpreter,
type the following:
- source /cs/beta/lib/pkg/sge/beta_grid/common/settings.csh
- If you are using sh, ksh or bash as your command
interpreter, type the following:
- . /cs/beta/lib/pkg/sge/beta_grid/common/settings.sh
For all clusters, to check if your workstation is allowed to submit jobs to the cluster,
type
Now test your workstation with a simple example:
- qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh
The qsub command should confirm the successful job submission as
follows:
- your job 1 (“simple.sh”) has been submitted
If you are able to successfully run a script on the cluster, then
you are ready to run CG with the predictor. See section 4 for more
details on running the code. For more details and FAQs about using the
cluster, click
here. This wiki
page is also a good place to start off.
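Before submitting the test job, it can help to confirm that the SGE client commands are available in the current shell. The following is a minimal sketch that only checks whether qsub and qstat are on the PATH. The function name sge_ready is hypothetical, and the check does not tell you whether your workstation is allowed to submit jobs.

```shell
# Sketch only: check that the SGE client tools are on PATH.
# sge_ready is an illustrative name, not part of CG or SGE.
sge_ready() {
    for tool in qsub qstat; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found; source the SGE settings file first" >&2
            return 1
        fi
    done
    return 0
}

# Usage: sge_ready && qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh
```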
5.1 CPLEX
If your energy model is linear (e.g. Simfold), you need to be able
to use CPLEX to solve the QP. If you run CG at CS-UBC, you'll have to
talk to your supervisor and/or notify Kevin Leyton-Brown.
5.2 IPOPT
If your energy model is quadratic (for example, HotKnots or
HFold),
you need to be able to use IPOPT to solve the QP. If you run CG at
CS-UBC, there is an installed version at
/ubc/cs/research/beta/People/Andronescu/bio/Software/CoinIpopt/, and
you (probably) don't have to do anything. If you don't have it
installed, or you want to install it somewhere else, follow the IPOPT documentation,
and then edit the file data/Makefile_qcp_template to point to the right
path.
- Note that, if your training file is large,
compiling the C++ program that runs in conjunction with the IPOPT
library will take quite a bit of memory. For that reason, CGlearn.pl
logs into the machine saria.cs.ubc.ca, which has 4 GB of RAM (not very
much, but should be enough). To make this login smooth, you should be
able to log in to saria.cs.ubc.ca without a password (otherwise it will
ask you for the password at every iteration). To do that, follow these steps:
- Generate id_dsa and id_dsa.pub in your .ssh directory
(use the default output file, and don't give a passphrase):
ssh-keygen -t dsa
- Add the content of the .ssh/id_dsa.pub file into the
.ssh/authorized_keys2 file:
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys2
- From some machine other than saria.cs.ubc.ca, test if
you can ssh into saria.cs.ubc.ca without a password.
- Note: The above should work at UBC CS, where all the
machines share
the home directory. To do this for 2 different machines that do not
share the home directory, generate the id_dsa files on one machine, and
edit the authorized_keys2 file of the other machine.
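Repeating the cat step above appends the same key again each time. A minimal sketch of an append that is safe to repeat is shown below. The function name add_key is hypothetical, and the file names in the usage line are the ones used in the steps above.

```shell
# Sketch only: append a public key to an authorized-keys file once.
# add_key is an illustrative name; it is not part of CG or OpenSSH.
add_key() {
    pub=$1
    auth=$2
    if [ ! -f "$pub" ]; then
        echo "no public key at $pub" >&2
        return 1
    fi
    # Create the file if needed and keep it private to the owner.
    touch "$auth" || return 1
    chmod 600 "$auth" || return 1
    key=$(cat "$pub")
    # -F fixed string, -x whole line, -q quiet: skip keys already present.
    if ! grep -qxF -e "$key" "$auth"; then
        cat "$pub" >> "$auth"
    fi
}

# Usage: add_key ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys2
```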
6. Output
- CG creates a directory with a complicated name, depending
on the
input options provided. This directory will be in <CGpred>/results and
contains several files of
interest:
- results_final.txt
contains the accuracy of the prediction on the training set, at each CG
iteration. After all iterations are done, CG finds the best parameter
set according to the F-measure on the training set (or validation set,
if given), and tests it on the
testing set specified as an input option. This information is written
in the results_final.txt file, at the end.
- params_sub-*.txt
are the parameters estimated at each iteration.
- output_verbose.txt
contains the input options and a log of the run.
- the configuration file you provided as input is copied in
the output directory
- The output of running:
- perl CGlearn.pl ../CGpred/Simfold-template/config/config_sample.txt
is in CGpred/Simfold-template/results/EXAMPLE_OUTPUT_*
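To take a quick look at a finished run, the files listed above can be summarized with a short function. This is a sketch that relies only on the file names given in this section. The function name summarize_run is hypothetical, and the number of lines shown from the end of results_final.txt is an arbitrary choice, since the layout of that file is not described here.

```shell
# Sketch only: summarize one CG results directory.
# summarize_run is an illustrative name, not part of CG.
summarize_run() {
    dir=$1
    if [ ! -f "$dir/results_final.txt" ]; then
        echo "no results_final.txt in $dir" >&2
        return 1
    fi
    # Count the per-iteration parameter files (params_sub-*.txt).
    count=0
    for f in "$dir"/params_sub-*.txt; do
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "parameter files: $count"
    # The test-set result of the best parameter set is written at the end.
    tail -n 5 "$dir/results_final.txt"
}

# Usage: summarize_run CGpred/Simfold-template/results/<results_directory>
```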
7. Troubleshooting
- If you get ERROR, was not parsed correctly, make
sure you can use the SGE cluster, and you can run qsub
properly.
- If CG stops unexpectedly early, check if the file
<results_directory>/qcp_main was modified after
<results_directory>/qcp.cpp (use ls -l to check that). If
this is
not the case, then the qcp program didn't compile at that iteration.
Try the following:
- Make sure you can log in to saria.cs.ubc.ca without a password
(if at UBC CS; otherwise, to a machine that has at least 4 GB of
memory). To do that, follow the instructions in section 5.2.
- Make sure saria.cs.ubc.ca is alive -- try to ssh into
it. If it's
not, find another machine with enough memory and modify CGlearn.pl
to use the new machine (for this, search for ssh saria).
- If, in the file results_final.txt, you
get something like: ERROR!, File <some_file> is
empty! An error must have happened!, then try the following:
- The error might be happening because of some
incompatibility
of HotKnots with the newer g++ compilers. This error shows up if you
compile HotKnots with openSuSE 11.1, but seems to be fine if you
compile HotKnots with an earlier version, for example openSuSE 10.3. At
the time of this writing (June 9, 2009), it works if you compile
HotKnots on hydra.cs.ubc.ca (if at UBC, or an openSuSE 10.3 if
somewhere else). To compile, go to the directory tools,
and type make clean, then make.
You only have to recompile HotKnots if you change anything in the
directory tools or in the HotKnots code. Once you
have it compiled, you can run CG from another machine.
- This error may also happen when the cluster jobs
crashed for
some reason. If that was the case, then simply rerunning CG might solve
the problem. In any case, I suggest you follow the advice above too.
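Two of the checks above can be scripted. The sketch below compares the modification times of qcp_main and qcp.cpp, and looks for the word ERROR in results_final.txt. The function names qcp_compiled and results_have_error are hypothetical, and the directory argument stands for <results_directory>.

```shell
# Sketch only: scripted versions of two troubleshooting checks.
# qcp_compiled and results_have_error are illustrative names, not part of CG.

# Succeeds if qcp_main exists and was modified after qcp.cpp.
qcp_compiled() {
    dir=$1
    [ -f "$dir/qcp_main" ] && [ "$dir/qcp_main" -nt "$dir/qcp.cpp" ]
}

# Succeeds if results_final.txt contains an ERROR line.
results_have_error() {
    grep -q 'ERROR' "$1/results_final.txt"
}

# Usage:
#   qcp_compiled <results_directory> || echo "qcp did not compile"
#   results_have_error <results_directory> && echo "check the cluster jobs"
```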
8. History
- March 3, 2009, version 2.0 was released. Major
changes have been applied since version 1.0.
- CG v2.0 uses SGE (Sun Grid Engine) to run training and
prediction on a computing cluster.
- Two CG versions (namely DIM-CG and LAM-CG) have been
added, see Mirela Andronescu's PhD thesis.
- CG is now independent (more or less) of the prediction program, and the
user can provide a comprehensive configuration file as input.
- So far, CG was used in conjunction with:
- SimFold, for pseudoknot-free models. This uses CPLEX to
solve the optimization program.
- HotKnots, for pseudoknotted models. This uses IPOPT to
solve the optimization program.
- June 2007, version 1.0 was released. That version is now
deprecated, but you can download it and read the user manual here.
9. References
- CG v1.0 was used in:
- M. Andronescu, A. Condon, H.H. Hoos, D.H. Mathews and K.P. Murphy,
"Efficient parameter estimation for RNA secondary structure
prediction", Bioinformatics (2007) 23(13):i19-i28,
ISMB/ECCB 2007.
- CG v2.0 was used in:
- M. Andronescu, "Computational approaches for RNA energy parameter
estimation", PhD thesis, Dept. of Computer Science, University of
British Columbia (2008).
Please cite one of the above if you use CG in your work.