DanielMayArchive < BETA

---+++05/03/10

Meeting with Chris, some things to do and their status:
   * vectorized SW (assigned to Jay)
      * testing and integration
   * readaligner index integration (assigned to Daniel)
      * substitute rank structure
      * integrate query classes (strategies)
      * memory-time trade off with pre-built kmer index
      * cmake refactoring
   * RNA-Seq (later)
      * splice graph
         * determine input format and minimum required
         * output of spliced reads in SAM (check <nop>TopHat)
   * colourspace (much later)
      * colour <--> letter space converter

   * git (done)
      * reorganize repo
   * tools directory re-organization (done)
      * each tool in separate directory
   * refactoring (to do...)

Besides the meeting, also did a lot of housekeeping for the project, and setup for Jay
   * cleaned repo (merge all branches to dev, and delete everything else)
   * setup a working version of ticgit (well... Chris did this, but had some troubles getting it to work on my end too)
   * lots of small guides for wiki

Up next:
   * integrating readaligner

---+++05/04/10

Integrating the readaligner was a major headache, so rather than struggling to adapt it to our current structure, we decided to give a major overhaul of directory organization:
   * merged =include= into =lib= (separating them was causing headaches when integrating external libraries)
   * lots of new cmake code to:
      * generate an =include= directory in the build directory, containing public interfaces via header files
      * copy data files automatically for tests and tools
   * discovered that my gcc-3.4.4 was extremely "robust" (well... sloppy is more like it) compared to the gcc-4.3.4 that the lab was using
      * this may explain why Chris always had errors while my compiler was happy with the sloppy code
      * I've also discovered that the newer compiler does a better job with optimization, cutting run time to a third

With this reorganization, the readaligner library at least compiles... but I have to yet to actually call functions from it.  Thus, up next:
   * integrating readaligner into the aligner


---+++05/05/10

Integrated readaligner, starting to construct a minimal aligner using classes from the software:
   * The basic flow:
      * Obtain reference as string
      * Build forward index with string
      * Reverse string
      * Build reverse index
      * Initialize output writer
      * Initialize query (aligner)
      * Feed reads into query
   * The query initialization takes the outer writer as an argument
      * Thus, the actual output operation is abstracted away complete
      * This is slightly problematic, as the SAM output they currently use isn't quite complete
   * I also don't quite understand it handles mismatches
      * The original executable handles mismatches
      * I have tried repeatedly and failed to reproduce this, even after copying and pasting blocks of code...

Next up:
   * Reproduce mismatches with the minimal aligner
   * Unabstract the index, and the query...
   * Then integrate back into bacterial aligner

---+++05/06/10

Still attempting to unabstract the index:
   * I have failed and given up on reproducing mismatches
   * Instead I have turned to unabstracting the the index
      * Instead of relying on the Query classes to output SAM entries, I completely ignoring them and using =<nop>TextCollection= as an index
      * I tried using =<nop>FMIndex= directly, but it directly uses another sublibrary, and the compiler crashed when I tried, so I still to accessing it indirectly via =<nop>TextCollection=
      * Unlike the classes in <nop>PizzaChili, I cannot seem to find any clear =locate= or =count= functions
         * Instead, the query relies purely on using the =LF= function and =GetPositions= (which is not locate... unfortunately) and a few stacks, which I assume may corresponding to something similar to =ep= and =sp=
      * I have found a private function in =<nop>FMIndex.cpp= that doesn't seem to be used in the alignments (as commenting it out doesn't break anything)
         * Oddly, this function looks almost identical to =count= I've seen in other indices

Next up then:
   * Modifying this =Search= function to create a =locate= function, and integrate it into my existing interface
   * Possibly adapt the Query classes to handle inexact alignments

---+++05/07/10

Integrated the =<nop>FMIndex= from readaligner:
   * After discovering it was more trouble than it was worth to use the unabstracted =<nop>FMIndex= class directly, I decided to use it via =<nop>TextCollection= & =<nop>TextCollectionBuilder=
   * However, neither =<nop>FMIndex= or =<nop>TextCollection= had standard =locate= or =count= functions, so I had to go in and piece together a few functions to get what I believe is a working =locate=
   * I have thus built a =<nop>NGS_FMIndex= class that inherits my own =<nop>IndexInterface= on top of the slightly modified =<nop>FMIndex=
      * As before, I can now use it interchangeably with other indices in my =bacterial_aligner=
   * I've also made is indices now support two new functions
      * =Extract=, so an original copy of the genome no longers needs to be maintained
         * Currently, I simply keep a copy of the original genome inside the index object, but later this may use true succinct index =extract= methods
      * =name=, to retrieve the reference name of the indexed genome
         * Saves the trouble of passing this value separately
         * Assumes that a single index has a single reference_name... this may not be true later

I have also taken some steps in reducing the sheer length of =bacterial_aligner=:
   * Pulled out some reference reading methods into a new =<nop>FastaFile= class
   * Pulled out some =<nop>SamEntry= creation methods into the =<nop>SamEntry= class as generator methods

I also figured out some errors that =<nop>KmerIndex= can run into, and added somewhat informative error messages if anyone ever hits them.

I believe that I should finally return to tackling the "inexact match" function:
   * Either write a variation of the =locate= function
   * ...or figure out the ingenuity behind the huge branching looping pathway that the =Query= classes use to achieve the same effect

---+++05/10/10

Rather than building an inexact aligner from scratch, I've taken the suggestion of Chris, and started to directly integrate readaligner's =Query= classes
   * I've recovered an older version of =maligner= which used =readaligner= classes directly
   * With this =maligner=, I'm now attempting to reproduce the desired inexact matching produced in =readaligner=
      * I assume there are currently some bugs that make =maligner= unable to do allow any mismatches
   * I have confirmed that bugs both exist in the building of the index and the actual alignment using the index
      * The former was confirmed and resolved, by running indices built in =maligner= on =readaligner= with correct output 
      * The latter is to come

Up next:
   * Find the bug in the alignment portion of =maligner= that prevents it from allowing inexact matches

---+++05/11/10

Debugged maligner, traced readaligner, ready to fully integrate =Query= classes to handle mismatches:
   * The "bug" was really an initialization problem, and solved by initializing a certain "limit" to a very high number (so the recursion wouldn't terminate prematurely)
   * I have placed =couts= throughout the =Query= classes and made some sense of the flow of control
      * The simplest perfect match case seems simple enough
      * I haven't quite understood yet how mismatches are allows (how they're disallowed is simple though)
   * Nonetheless, I have determined the exact lines responsible for the final output
      * While I'd love to simply have it return a usable object/value, I cannot for a few reasons
         * It's not the last step of the function calls
         * It's in the last function on a very large stack of function calls, so returning would involve changing a lot of signatures to propagate the value

Up next then:
   * Modify =Query= classes to work with existing classes and functions
      * I plan to introduce a "buffer" class member that will be written to at the aforementioned "output line"
      * After the true final call, I just have to return the value in this buffer
      * In theory then, I only have to modify the output line, and the signature and return line of the first function

---+++05/12/10

Integrated the =<nop>MismatchQuery= class:
   * I overrode the "output" line in the original class, and replaced it with a line that fills a vector
      * A separate call to the object returns the now filled vector of positions
      * These two steps are encapsulated in the =<nop>LocateWithMismatches= function
   * =<nop>LocateWithMismatches= takes in a string sequence, an integer value (say k) and returns all positions that the sequence is found in the index with *at most* k mismatches
      * This also means that setting k = 1 will return all perfect matches (k = 0) + all matches with 1 mismatch
      * You can imagine how this gets slightly out of hand with higher mismatch values
         * Before that even occurs though, the sheer amount of recursion required by high =k= values would make the run time unbearable

Asking Chris what to tackle next... although I feel that a rather horrific organization may be required some time...

---+++05/13/10

Did some rudimentary performance testing and comparisons:
   * The data:
      * the first 125000 reads extracted from s_1_2_sequence.txt obtained from [[http://www.clcbio.com/index.php?id=1290][here]]
      * The _E.coli_ genome provided from the same link
   * The results (index building time is excluded):
      * Altering the allowed mismatches (=k=):
|  *k*  |*unmapped*|*time (s)*|
|  0  |  17660|  15.7|
|  1  |  7886|  18.6|
|  2  |  5449|  21.4|
|  3  |  4468|  31.7|
|  4  |  3974|  178.3|

      * Compared with other aligners on the same dataset (125000 reads)
|*name*|*time (s)*|*notes*|
|baligner|  20.9|=k = 2=, 5449 mismatches (no indels)|
|readaligner|  9.7|=k = 2= (no indels)|
|bwa|  14.4|=k = 2=, also handles indels|
|bowtie|  4.1|unknown =k=, 6644 mismatches, assumed not to handle indels|

I should also note that I've fixed the SAM output to be tview-friendly (I was missing a header).

From these results, much can be improved upon, so up next:
   * Add indel support
   * Find bottlenecks and optimize...?

---+++05/14/10

Starting integrating the =<nop>IndelQuery= from Makinen et al's readaligner:
   * Ran some gapped =readaligner= runs
      * Even at =k = 0=, the speed is considerably slower compared to =<nop>MismatchQuery=
      * The speed becomes quite unbearable at any =k= larger than 0...
      * Mismatches and gaps are currently mutually exclusive in =readaligner=
   * Using the integrated =<nop>MismatchQuery= as a template, I integrated =<nop>IndelQuery=
      * I encapsulated it all in a =<nop>LocateWithGaps= function in =<nop>NGS_FMIndex=
      * It compiled fine...
      * But it didn't work... (it returned nothing...)

Next up then:
   * Figuring out why indels aren't working.

---+++05/17/10

Integrated indels into the bacterial aligner:
   * The error was previously caused by an uninitialized matrix (oddly, it works in Makinen's readaligner fine... with or without the initialization call)
   * Currently, the basic flow is as follows:
      * Aligned with up to 2 mismatches
         * If missed, attempt a gapped locate with up to 2 gaps
   * This slightly change has increased the run time to 2 minutes (from the previous 20 seconds)...

Up next:
   * Abstracting everything but the main function out of bacterial aligner
   * Some small tasks I've ticketed on ticgit (Jay could do these... actually...)
   * Determining bottlenecks in bacterial aligner

---+++05/18/10

Meeting with Anne and debriefed Jay about his upcoming task:
   * Jay will use [[http://valgrind.org/info/tools.html][Valgrind]] to profile the =baligner=
      * Subsequent data will be used for optimization to reduce run time
   * I will refactor =baligner=
      * Will reduce code bulk
      * Will comment and clarify existing classes

After this, we will direct our efforts to optimization, and only after do we consider RNA-Seq specific strategies, and finally colourspace alignments.

---+++05/19/10

Begun to refactor =baligner=:
   * Pulled out =pairend= resolution functions into a new =MatchMaker= class
   * Logged a lot of small errors and improvements as =ticgits=

Up next:
   * Pulling out the =mapping= functions from =baligner=
   * Adding to and resolving the current list of bugs/improvements

---+++05/20/10

Pulled out all non-main() functions out of =baligner=:
   * Created a new =mapper= directory in =lib/ngsa= that contains all the mapper functions previously in =baligner=

Next up:
   * Pulling out more stuff from =baligner=
   * After the above step, build minimal mode-specific aligners
      * e.g. single end, pair end
      * The goal is to be able to build an aligner in 20 lines

---+++05/21/10

Refactored out large chunks of =baligner= logic into =driver= classes:
   * You can now literally create an aligner with less than 20 lines of code (excluding import and cout statements)

Next up:
   * Debug and document
   * Profile and optimize

---+++05/24/10

Tried a few things before the cold settled in and knocked me out...
   * Got =valgrind= running out =leros= via =ssh= (it doesn't work on cygwin)
   * Couldn't install nor run =<nop>KCacheGrind= on =leros=
      * Attempted and failed to get it running on =cygwin= (it's supposedly possible...)

Next up:
   * Tackle tickets left on =ticgit=...

---+++05/25/10

Cold improved, but I believe the CS server went down, taking out my email, the git repo, and the wiki for a good while...
   * Managed to install =<nop>KCacheGrind= on my windows via [[http://windows.kde.org/][this project]]
      * Generated and displayed a file from =callgrind=... everything but the Call Graph works...

Next up:
   * Tackle tickets left on =ticgit=... hopefully when the repo connects again and the tickets sync up...

---+++05/26/10

See [[https://bugs.cs.ubc.ca/cgi-bin/twiki/view/BETA/JaysJournal][Jay's Journal]]...
   * Discussion about class hierarchy
 
Besides that, I have been terribly unproductive lately, by a combination of this cold and being somewhat lost as what to do... Seeing as how Jay's been hard at work with low level implementation specific optimizations, I guess I should tackle some higher level algorithmic optimizations.

Up next:
   * Forget about those darn ticgits... I've been telling myself to do them for days to no avail...
   * Create an uncompressing extract() function, then compare speed, and consider if keeping an uncompressed copy of the genome is worth the speed-memory trade off
   * Confirm the speed differences between our implementation and bowtie, then see what they're doing that makes all the difference
   * Kickstart locate... first by avoiding a number of bases to simulate the effect
      * If a dramatic improvement is seen, implement the pre-computed kmer table
   * Try a difference rank structure
   * To accommodate almost all of the above steps, buckle down, stare at some code, and read up on full-succinct indices...

---+++05/27/10

Wanted to implement a true =extract()= function:
   * Stared at code from =<nop>PizzaChili= indexes...
   * Couldn't understand any of the notations (such as the =C= array)
   * Started reading various papers
      * Started with [[http://www.dcc.uchile.cl/~gnavarro/ps/psc05.3.pdf][this]] paper's relatively simple description of the FMIndex in Section 2, leading me to the need to understand suffix arrays and BWT... which are highly related... in an odd twisted backwards kind of way I'm trying to wrap my head around
      * Wikipedia's entries on [[http://en.wikipedia.org/wiki/Suffix_array][suffix arrays]] and [[http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform][BWT]] were a somewhat lacking in depth, but served as a good entry
      * [[http://sary.sourceforge.net/docs/suffix-array.html][This]] site provided a more pictorial intro to suffix arrays, and its searching algorithm (defined best in Fig. 4 of Navarro's [[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.186&rep=rep1&type=pdf][review]]... supplemented by notation specified in the [[http://webglimpse.net/pubs/suffix.pdf][original paper]], about the greater than symbols between strings in section 2)
      * Next was to understand the backward search in a search using the suffix array with =C= (that keeping an odd cumulative count...) and =L= (L for Last... which is really the BWT of the text) array, best described in in section 4.1 of Navarro's review (although this is done in the context of a suffix array... not BWT)
      * Finally understood what =LF()= was, and exactly how it does this with the =C= array and =Occ= (whose details I don't quite know... but I know it has to do with rank... so I'll probably have to know one day if I'm replacing a rank structure)
   * Unfortunately, armed with new knowledge of =LF()=, =BWT=, and =C=, I've realized that =extract()= is not all that straight forward...
      * When in doubt, copy others... so I returned to check out how =extract= was implemented in the Alphabet Friendly FM Index
         * It uses a few functions that still elude me... so more reading may be in order.
         * Alternatively, I'll also check out the =extract= of the original FM Index... in hopes that these elusive functions are unique to the Alphabet Friendly FM Index...

Next up:
   * Continue reading papers and code, until I understand the FM Index enough to make a simple =extract()= function... after which I hope to do more with it.

---+++05/31/10

   * Finished up reading on FM Indexes, mainly section two of Ferragina and Manzini's [[http://www.google.ca/url?sa=t&source=web&ct=res&cd=1&ved=0CBQQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.4.1585%26rep%3Drep1%26type%3Dpdf&rct=j&q=Opportunistic+Data+Structures+with+Applications&ei=YlQETI2uKoyINvm7gbYK&usg=AFQjCNGJ99QvYVRibe8FbFsP9MSiTvjRow][original paper]]
      * I will ignore the other sections of the paper for now... since they talk about compression, and make little sense to me...
   * Start to check out the code in bowtie, which lead me to discover 4 distinct modes of alignment, and some performance testing done on =skorpios= (all time data is taken from program output...):
      * Reference: =NC_010473.fasta=
      * Read: =e_coli_125000_1.fq= (unmodified)
      * Default flags: =--time -S= (to diplay time and output in SAM)
      * Build (displayed via =./bowtie --version=):
<verbatim>
64-bit
Built on skorpios
Mon May 31 14:32:21 PDT 2010
Compiler: gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux) 
Options: -O3  
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
</verbatim>

|  *Mode*  |  *Flags*  |  *Mapped*  |  *Unmapped*  |  *Search Time*  |  *Average Search Time*  |
| 0-Mismatch  |  =-v 0=  |  105606|  19394|  19s  |  0.000152s  |
| 0-Mismatch  |  =-v 0=  |   2250708|  371674|  406s  |  0.000155s  | with 2622382 set |
| 1-Mismatch  |  =-v 1=  |  115631|  9369|  20s  |  0.00016s  |
| 2/3-Mismatch  |  =-v 2=  |  118356|  6644|  20s  |  0.00016s  |
| 2/3 Mismatch  |  =-v 3=  |  119430|  5570|  21s  |  0.000168s  |
| Seeded Quality  |  =-n 0=  |   112317|  12683|  20s  |  0.00016s  |
| Seeded Quality  |  =-n 1=  |   117659|  7341|  20s  |  0.00016s  |
| Seeded Quality  |  =-n 2=  |   118356|  6644|  20s  |  0.00016s  |
| Seeded Quality  |  =-n 3=  |   118330|  6670|  21s  |  0.000168s  |
| Seeded Quality  |  =-n 3=  |   2488915|  133467|  422s  |  0.000161s  | with 2622382 set |

Some notes:
   * =bowtie= doesn't support gapped alignments
   * =bowtie= doesn't supported any value greater than 3 (i.e. =-v 4= and =-n 4= are not implemented)
   * Mismatch modes are similar to FM-based mappers (i.e. everything but =<nop>KmerMapper=), as they are not quality-wary, and align the entire read
   * Seeded Quality mode, (aka Maq-like mode) is somewhat similar to our =<nop>KmerMapper=, as it first establishes a list of seeds, then aligns using these seeds
   * I find it quite odd how the time is nearly constant... maybe a larger dataset will shed some new light (*UPDATE*: ran a few trials with the large dataset, same average times)

For a quick comparison, =saligner= (commit =5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa=), using =MismatchMapper= with various =k= settings on =skorpios= with the same datset (=Invalids= are reads with =N=):

|  *k*  | *Mapped*  |  *Unmapped + Invalid*  |  *Search Time*  |  *Average Search Time*  |
|  0  |  105606|  17732 + 1662|  5.22s  |  0.00004176s  |
|  1  |  115424|  7914 + 1662|  6.21s  |  0.00004968s  |
|  2  |  117872|  5466 + 1662|  8.31s  |  0.00006648s  |
|  3  |  118854|  4484 + 1662|  15.06s  |  0.00006608s  |
|  4  |  119349|  3989 + 1662 |  116.61s  |  0.00093288s  |

One potential reason that we currently outperform =bowtie= so much could be the =report= setting.  In the past, I set =report= to 10, and Jay has recently dropped it to 1.  By dropping it to one, we no longer detect multimaps, but I believe this is a fair comparison with bowtie, which also doesn't report multimaps.

However, =bwa= does multimaps, gaps and mismatches, simultaneously, so we'll have to adjust our program to run on equal ground when doing those comparisons.

One odd note... after it prints all the information, the program won't actually terminate until about 20 seconds after on skorpios... so there is a large constant (about 20 seconds) discrepency between actual runtime and reported runtime.  For example, with =k = 2=:

<verbatim>
real   0m26.654s
user   0m7.940s
sys   0m0.360s
</verbatim>

...And for =k = 4=

<verbatim>
real   2m14.341s
user   1m54.987s
sys   0m0.504s
</verbatim>

Is this the problem with the program? Or something unique to running programs on =skorpios= via SSH?  It doesn't seem to occur when I run =saligner= locally (which takes 15.61 seconds).

<verbatim>
real    0m16.159s
user    0m15.077s
sys     0m0.656s
</verbatim>

By the timing of the pause, I would suspect that it's the destructors... but that doesn't make much sense either.

Finally, all my builds and test files should be openly accessible in =/ubc/cs/research/condon/people/jujubix/= in case anyone wants to double check
This topic: BETA > TipsAndTricks > WebHome > NGSAlignerProject > DanielsJournal > DanielMayArchive
Topic revision: r1 - 2010-06-02 - jujubix