This software serves as a reference implementation of a dynamic programming algorithm proposed by Anne Condon and Chris Thachuk for optimizing codon usage of a coding DNA sequence while simultaneously removing undesirable motifs and adding desirable motifs. See the conference slides for an overview of how the algorithm works and the journal paper for details.
This work was published in a special issue of the Journal of Discrete Algorithms.
A preliminary version appeared at the IWOCA conference.
A tarball and a zip file of the source code are available; both also contain the experimental data used in the conference and journal papers. A separate tarball and zip file containing only the experimental data are also available.
The software can be built in the typical Unix way. The configure script will ensure the software can be built on your system and will check that the required Boost libraries (version 1.48+) are installed. You can optionally specify a prefix path where the software should be installed; otherwise, it will be installed to the standard directories for your system. Note that the software does not need to be installed to be used: after the make command, the binary codon-optimizer will be available in the build directory for use.
$ ./configure --prefix=${HOME}
$ make
$ make install
The program assumes input sequence files are in FASTA format. After building the software, command line options and usage can be determined with:
./codon-optimizer
At present, the usage is:
Usage: codon-optimizer [options] <fasta_file>
Allowed options:

Generic:
  -h [ --help ]                         produce this help message

Design Specifications:
  -s [ --start-index ] arg (=1)         first index in FASTA file of sequences to optimize
  -e [ --end-index ] arg (=1)           last index in FASTA file of sequences to optimize
  -f [ --forbidden-motif-file ] arg     a newline separated file containing forbidden motifs
  -d [ --desired-motif-file ] arg       a newline separated file containing desired motifs

Other:
  -o [ --optimized-sequence-file ] arg (=optimized.fasta)
                                        output file for optimized sequences
  -t [ --trace-file ] arg (=optimized.trace)
                                        trace file for optimized sequences
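The usage text describes the motif files given to -f and -d only as "newline separated". As a rough illustration of that format, the Python sketch below reads such a file's contents into a list. It is not part of codon-optimizer: the function name parse_motifs is hypothetical, and skipping blank lines, upper-casing, and restricting motifs to A/C/G/T are assumptions rather than documented behaviour of the tool.

```python
VALID_BASES = set("ACGT")  # assumption: motifs use the plain DNA alphabet


def parse_motifs(text):
    """Return the motifs listed one per line in `text`, upper-cased.

    Hypothetical helper for illustration; the tool's own parsing rules
    may differ.
    """
    motifs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        motif = line.strip().upper()
        if not motif:
            continue  # assumption: blank lines are ignored
        if set(motif) - VALID_BASES:
            raise ValueError(
                f"line {line_number}: unexpected character in motif {motif!r}"
            )
        motifs.append(motif)
    return motifs


if __name__ == "__main__":
    # Two of the forbidden motifs named in the example run of this document.
    print(parse_motifs("CCG\nCGG\n"))
```

To check a motif file before a long run, pass the file's text to the function, for example parse_motifs(open(path).read()) with path pointing at your own motif file.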
The data directory contains the experimental data used in the published manuscripts. To repeat the CAI optimization of all 3,157 sequences in the data/gencode_filtered.fasta file using the forbidden motifs in the data/motifs/forbidden.cpg file and the desirable motifs in the data/motifs/desirable.cpg file, issue the following command:
./codon-optimizer -s 1 -e 3157 -f data/motifs/forbidden.cpg -d data/motifs/desirable.cpg data/gencode_filtered.fasta
The progress will be indicated as the sequences are updated:
Eliminating the following forbidden motifs:
CCG CGG <snip>
Adding the following desirable motifs:
AACGTT AACGTTCG <snip>
Optimizing 3157 sequence(s).
100 %
Warnings occurred while optimizing sequences. See 'optimized.trace' for details.
If the sequences being optimized contain invalid bases, or if other warnings are generated, this will be indicated at the end of the run. A trace file, optimized.trace, will be produced; it gives statistics of the optimization for each sequence and lists any warnings:
#Warning: sequence 1 length is not a multiple of 3. It has been truncated.
#seq_id length CAI_before Forbidden_before Desirable_before CAI_after Forbidden_after Desirable_after CPU_runtime
1 351 0.588861 18 2 0.868277 0 21 0.050000
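For post-processing many sequences, the trace file can be loaded into a script. The following Python sketch assumes the layout of the sample above: lines starting with "#Warning:" carry warnings, other "#" lines are headers or comments, and data lines hold nine whitespace-separated columns in the order given by the header. The function name parse_trace and the column types are assumptions made for illustration; they are not taken from the tool's source.

```python
# Column names copied from the header line of the sample trace; the
# converter for each column is an assumption based on the sample values.
COLUMNS = [
    ("seq_id", int),
    ("length", int),
    ("CAI_before", float),
    ("Forbidden_before", int),
    ("Desirable_before", int),
    ("CAI_after", float),
    ("Forbidden_after", int),
    ("Desirable_after", int),
    ("CPU_runtime", float),
]


def parse_trace(text):
    """Split trace text into (warnings, rows); each row is a dict."""
    warnings, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#Warning:"):
                warnings.append(line[len("#Warning:"):].strip())
            continue  # header line or other comment
        parts = line.split()
        if len(parts) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got: {line!r}")
        rows.append({name: conv(value)
                     for (name, conv), value in zip(COLUMNS, parts)})
    return warnings, rows


if __name__ == "__main__":
    sample = (
        "#Warning: sequence 1 length is not a multiple of 3. "
        "It has been truncated.\n"
        "#seq_id length CAI_before Forbidden_before Desirable_before "
        "CAI_after Forbidden_after Desirable_after CPU_runtime\n"
        "1 351 0.588861 18 2 0.868277 0 21 0.050000\n"
    )
    warnings, rows = parse_trace(sample)
    for row in rows:
        # Report the CAI gain and any forbidden motifs that remain.
        gain = row["CAI_after"] - row["CAI_before"]
        print(row["seq_id"], round(gain, 6), row["Forbidden_after"])
```

Returning plain dictionaries keeps the sketch free of dependencies; the rows can be passed to a table library afterwards if one is preferred.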
The optimized sequences will be written to the file optimized.fasta. Note that alternative filenames can be specified for both the trace file (-t) and the optimized sequence file (-o).
>hg17_chr7_26907301_26907654_+
GTGGGTGGCTCGCAGAGCGTTTAAGGTCGTCGTCCACGTGGACGTAACGCTGGTCGACGTGCAGGCCTGCTGTAAGGCTTTCTGGGCCTGCTGGTCGTTAGCTGGCGTGGTGGTCGTCGCGATGGCCAGGAGACGTTACTGCTGACGTTTCTGCTGCTGCATAGCGCTGTGCTGGAACCAGATCTGCATCTGGGCCTCGTAGAACTGCAGGGCTGCAGCGATCTGCATCCAGCAGGCGCTCGACAGGTGCTCGTAGAAGTGGAACTGCTGCTGCAGTTTCGCGAACTGCTGGGCAGCGAAGTGGGTGCGCATCGTGTGGGCCTGACCCAGGTCGCTGTGCTGAGCAACTTT
>hg17_chr7_26908119_26908771_+
CTGTTCTGGGAAGGCTTCTTCTAACTGTCGTCGTCTCAGAAATCAGCGCTGGAGAAGATGAGCCCAATGCGTGGCAGCGATCGATTACTGGGTGGCTGGCGTGGTGAAGGCACGCGTAGCTATTATACGTAACCAGGCCCAGGCTCTGGTGCGCCAGTGCATATGTCTGGCGAATGCATTGAAGCGAGCCCACCTCGACCACAGCATAATCCAGGTGGTGGTGGCGACGCTGGCCCATGGGAAATGCGTGATTTCCAGAGCAAACAGCGTGAACGTACAGGTGGCACCCATCATTTACGTTTACTGCCAGATTTAACTCGACGTGGCTGCAAAGCGCATTAGTCGTCTGTCTCGCATAGCCTCGACCATAATCTGTCTGGCTCTCGTACGCCACCAGGCTCTCGTAAATCTGGACGTTAACCAGCAGGTGGTGGCGATGGTGGTGGTGGTGGTGGTGGTGGCGCGAATCGTAGCGCTCCACCATGCCCACTGGGCTCTGGTCGTCGTCGACGTAATTGCTGGCGTTAACCTCGTACCACAGGCAAACTGTAAAGCTATGGCCCACGTGGACGTTTAGGCCTGTCTCGTAGCCCAAGTCGTCATTGCTAAGTGTGGGGTATTCCAGGACGTTCGTCGTTTCTGCATTGCCCA
>hg17_chr7_26912637_26912720_+
<snip>
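As an independent check on the Forbidden_after column of the trace, the optimized FASTA output can be scanned for motif occurrences. The Python sketch below is one possible way to do this and is not part of codon-optimizer. The helper names parse_fasta and count_motif are hypothetical, and counting overlapping occurrences is an assumption; the tool's own counting convention is not stated here.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def count_motif(sequence, motif):
    """Count occurrences of `motif` in `sequence`, including overlaps."""
    if not motif:
        raise ValueError("motif must be non-empty")
    count, start = 0, sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)  # step by 1 to allow overlaps
    return count


if __name__ == "__main__":
    # Toy input for illustration only; not taken from the experimental data.
    toy = ">example_1\nACGTCC\nGGAT\n>example_2\nAACGTT\n"
    forbidden = ["CCG", "CGG"]
    for header, seq in parse_fasta(toy):
        hits = {m: count_motif(seq, m) for m in forbidden}
        print(header, hits)
```

To apply the check to real output, read the text of optimized.fasta (or the file named with -o) and use the motifs from the file passed to -f.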