DanielJuneArchive < BETA

Tags: view all tags
---+++06/01/10

Meeting with Jay and Chris, up next:
   * Create a simple minimal locate function
   * Modify the aforementioned function to make a single mismatch locate
   * Compare with bowtie's analogous functions 
   * Copy bowtie if required

Also, future performance tests on =skopios= should read and write files from /tmp/ so results aren't affected by network lag... which caused 19 seconds of additional runtime in yesterday's results, making them useless.

---+++06/02/10

Making a simple minimal locate function:
   * Read another [[http://www.google.ca/url?sa=t&source=web&ct=res&cd=3&ved=0CCcQFjAC&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.77.2997%26rep%3Drep1%26type%3Dpdf&rct=j&q=Indexing+Compressed+Text&ei=Q_0GTIfEIsSqlAekh_2LDg&usg=AFQjCNHFq9u2yRmKDlS4LvByeEcNH1p0ag][paper]] on the <nop>FMIndex, which describes the basic logic behind the =count= and =locate= algorithms in sections 3.1 and 3.3.
   * Confirmed that the =locate= I implemented for =ExactMapper= follows the minimal basic algorithm described by the above paper
   * From Jay's journal, it seems that our code base is thrice as slow as bowtie's... either that, or bowtie's taking some shortcuts to speed things up.
   * Started implementing a simple =<nop>OneMismatchLocate= inside =/external/readaligner/FMIndex.cc=, as it needs access to 2 private class members
      * Unfortunately, this also means I have to add yet another virtual function in =TextCollection=
      * I've tried using =<nop>FMIndex= class directly, but including the =<nop>FMIndex.h= header in any file causes the compiler to complain a lot... due to the way includes are set up
         * The quick dirty resolution would be to force the cmake to copy a lot more files
         * The real solution would be to resolve ticgit 4b282e

Next up:
   * Finish implementing the 1 mismatch locate... which does not look efficient in the very least...
   * Investigate bowtie code

---+++06/03/10

Implemented =<nop>OneMismatchLocate= inside =<nop>OneMismatchMapper=:
   * Uses =<nop>LF= and =<nop>getPositions= from =<nop>TextCollection=
   * Non-rigorous testing using a single read, and manually confirming the result seem to indicate that it is correct (I had to work out a few edge cases though...)
   * It runs in three phases, which I assume is a bruteforce depth first search
      1. Locate the unmodified read, and store the =sp= and =ep= positions generated
      2. If the read missed, backtrack to last valid position range, and change the base (repeat for every base and position (increasing) until a hit is made)
      3. A special case is required to handle mismatches in the very last base (first base read in matching algorithm), as no pre-existing position range was stored
   * Preliminary results seems to indicate that it is slower than =<nop>MismatchQuery= with =k = 1= due to excessive backtracking
      * In case anyone is interested, by uncommenting all the "printf" statements in the function (use replace), you get a very nice visual presentation of the backtracking
      * Can be easily remedied by copying readaligner's strategy by using both a forward and reverse (not complemented) index
      * The number of read mapped and unmapped are identical, which I hope is good news
   * A slightly more complicated but potentially beneficial change is to swap bases with lowest qualities first

Next up:
   * Improve performance of =<nop>OneMismatchLocate= by using the reverse index to reduce excessive backtracking

---+++06/04/10

Improved =<nop>OneMismatchLocate=:
   * By using the reverse index along with the forward index, I have improved the performance of the =<nop>OneMismatchMapper=
   * Details best summarized in the following table, generated using =saligner= on my computer, with the 125000 set, just for a rough comparison:

|  *Mapper*  |  *max_results = 10*  |  *max_results = 1*  |
|<nop>ExactMapper  |  7.2s|  6.5s|
|<nop>OneMismatchMapper (old)  |  33.4s|  22.0s|
|<nop>OneMismatchMapper (new)  |  11.1s|  9.4s|
|<nop>MismatchMapper (=k = 1=)  |  12.3s|  11.0s|
|<nop>MismatchMapper (=k = 2=)  |  15.5s|  13.8s|

   * I'm pleased to see that the improved =<nop>OneMismatchMapper= outperforms =<nop>MismatchMapper= at equivalent settings (=k = 1=), barely...
   * Of course, =bowtie= does the same thing in 3 seconds... so I'll have to see what it's doing

Staring at =bowtie=:
   * With the aid of =callgrind=, I have located the core logic of the =bowtie= alignments to be in =backtrack= function in =ebwt_search_backtrack=
   * There are actually 3 =backtrack= functions, which call each other in a sequential fashion (and the last and largest one is self-recursive)
   * Combined, they have about 1000 lines of code that makes little sense to me... even with the descriptive comments

Next up:
   * =bowtie=? Although it feels somewhat futile at the moment...
   * Reviewing some old entries, maybe I'll give that <nop>KickStartLocate a try, which would theoretically work with a pre-computed table to speed up alignments.

---+++06/07/10

Stared at =bowtie=:
   * Tracing the exact alignment of a single read on the reverse strand with =gdb= and =callgrind= in 0-mismatch mode, I have determined a few things that contributes to its speed:
      * Firstly, it uses the standard reverse depth first search algorithm standard to FM Indexes, but it has a lot of extra conditionals and branches to deal with mismatches and backtracking, =top= and =bot= are used in the code analogously to =sp= and =ep=
      * It "kick starts" the traditional =locate= from the 11th (to last) base (i.e. it skips the first 10 iterations)
         * This is effectively the same as Chris' idea, although I am not sure if =bowtie= uses a precomputed matrix or not
         * The exact functions (one for =top= and =bot=) called on =ebwt_search_backtrack.h:258-259= and defined in =ebwt.h=
         * By the comments, this "chomp" can actually be longer than 10 bases, and there are conditional to catch cases where the "chomp" locates the entire read
      * The =top= and =bot= ranges converge much quicker, leading to much quicker rejections for inexact hits
         * I have no clue how this is achieved... short of assuming that the suffix indexes are built different
         * *UPDATE*: On second thought, this is merely a side effect of the aforementioned seeding process.  With the 10 base seed, the starting range is much narrower, and the convergence much quicker.
      * There is a special function called =when top + 1 equals bot= (i.e. when only 1 suffix is valid in the search range) to update the two pointer... at =ebwt_search_backtrack.h:563=, defined in =ebwt.h:2256=
         * I'm not entirely sure if this is contributing to anything
      * I am highly confused by this =<nop>SideLocus= structure that is passed around...
         * From =LF= signatures, it seems analogous to the =<nop>BWT(T)=, but it doesn't quite make sense why =top= and =bot= use different unique copies of this structure
         * It is defined in =ebwt.h=

I've simulated what it would be like to kick start the locate 10 bases in on =saligner=, and I can shave about 10% off the current runtime, so it may be worthwhile to check this out.

Next up:
   * Trace =bowtie= in 1-mismatch mode
   * Implement the kick started locate

---+++06/08/10

Meeting today, some things to do:
   * Confirm whether readaligner supports multiple references
      * If so, adapt it
      * If not, we'll need to design something...
   * Continue tracing bowtie's exact alignment mode
   * Implement a kick started version of locate, also known as "chomp"
      * This may involve pre-computing and saving a structure
      * Probably a separate structure than readaligner's two index files
   * Look into unencapsulating <nop>FMIndex from <nop>TextCollection
      * May involve breaking of dependencies with Makinen's class structure
   * Change C++ strings to char arrays for performance
   * Look into <nop>FMIndex's rank structure, and see whether the binary rank is separated from the alphabet rank
      * If so, try substituting the binary rank by [[http://sux.dsi.unimi.it/][Vigna's rank9]] implementation
      * If not, forget about this for now
   * Standardize the integer types used... currently in discussion via email

---+++06/09/10

Tackling the items from yesterday:
   * Confirmed that =readaligner= supports multiple references
      * Given a FASTA file with separate reference entries (each with a unique =>= header) and a single index is built (well, okay, two: forward and reverse)
      * The resultant SAM output has reads mapped to all three references
      * Some points of concern
         * Currently, I keep a =reference_name= member inside =<nop>NGS_FMIndex=, which is a single constant name
            * We'll need to find how to extract the correct reference name from Makinen's <FMIndex>, which I suspect may be related to that second variable returned by the locate function (the one besides the =position=, which I've been ignoring completely... and even considered removing it)
            * I've yet to confirm if the positions are correct... but I'll assume they are
   * Traced =bowtie= to find clues of its high performance, some observations:
      * The alphabet used in =bowtie= sequence is much smaller, which consists only of four symbols ={0, 1, 2, 3}= defined in the [[http://www.seqan.de/dddoc/html/SPEC_Dna.html][SeqAn]] library
         * Consequently, they don't need a wavelet tree, which may be contributing to some slowdown in Makinen's implementation
      * I've confirmed that =chomp= is constant table lookup
         * The table, called =ftab= is a =uint32_t= array of =4^10= entries
         * I would've imagined it would be a table of pairs (for =ep= and =sp=)... but it's not
         * Related to the above observation, I doubt =ftab= is a simple list of position, as occasionally, the value of an =ftab= entry will return a value out of range of the largest suffix array index
            * This is caught by a special conditional and an additional lookup is done on a supplementary =eftab= table ('e' for extended)
      * I've discovered a peculiar behavior of the =SideLocus= structure that may give hint as to what it actually is
         * There is a =_fw= boolean that's set to true if it's forward side (else it's the backward side)
         * In the =LF= function, there is a conditional, that branches depending on how =_fw= is set (=ebwt.h:2204=)
            * If =_fw= is =true=, then =LF= is calculated normally, by adding the results of =C= and =Occ= (=ebwt.h:1913=)
            * When =_fw= is =false=, =LF= is calculated by *subtracting* =Occ= from =C= (=ebwt.h:2017=), maybe Chris can deduce something from this...
               * Of course, when I say =C=, that's not technically correct at all... it's actually a count function for a char across the <nop>SideLocus, but I'm hoping it's analogous
      * Also, by inspecting the functions called in the general case (=LF=), and when =top + 1 equals bot= (=LF1=), I believe nothing special is done
         * However, in the general case, =LF= is called twice, once per pointer
         * When =top + 1 equals bot=, =LF1= is called once and the resulting value is assigned to both (followed by =++bot=)
         * This saves one call, which I imagine may add up to a lot, as the special case occurs quite often
   * Concerning integer types:
      * I believe we've settled on trying to stick to =uint32_t= to handle all potential positions given a human reference genome
         * Although, knowing that =readaligner= handles multiple references, we no longer have the need to concatenate all the chromosomes to obtain a references of about 3 billion bp... so maybe this wasn't that big of a concern afterall?
      * For performance critical functions like =rank=, =uint64_t= can be used to take advantage of 64-bit architecture
      * We'll avoid the system-dependently-sized =size_t= to ensure constant sized structures
   * Jay is handling std::string to char* conversions

Next up:
   * Implement =chomp=
      * I'll ignore whatever =bowtie= may be doing and make a 2 by 4^10 =uint32_t= array to store the =ep= and =sp= positions of all suffixes of length 10 composed of the 4 bases
   * Figure out how Makinen's supporting multiple references and adapt it to our library
      * Only after this should we start unencapsulating =<nop>FMIndex=, to avoid breaking any essential dependencies
   * Integrate Vigna's 64-bit broadword [[http://sux.dsi.unimi.it/][structure]]
      * Chris has mentioned that this can be done by using the =static_bitsequence.h= interface
   * Look into =bowtie's= mismatch modes

---+++06/10/10

Started implementing =chomp=:
   * Rather that worry about exactly where this structure would go in our class hierarchy, I've simply started making everything in a huge =main=
   * There are 4 main parts to creating the kick start table:
      1. Generating all 1048576 unique sequences of length 10
         * Fortunately, the =<nop>FastqEntryIterator= already does this (to create mutants)
      1. Map these unique sequences to unique numerical values in the range [0, 1048575]
         * Function already present, used to build =<nop>KmerIndex=
      1. Get the =sp= and =ep= positions for the sequence
         * Using the =Search= function inside =<nop>FMIndex=
      1. Save the structure
         * I'll have to figure this out... using "old school serializable"
            * While I'm at this, I could potentially implement load and save in =<nop>KmerIndex=
   * Unfortunately, a simple program using the first three parts died very silent and very odd death
      * I believe I have traced it to an infinite recursive loop occurring in =fastq_entry_iterator.cc:76=, although I've never seen a program that crashes silently in an infinite loop
      * ...unless, of course, the infinite loop was also leaking resources, but I wasn't able to confirm that either...
      * Oddly, it only occurs when I try to generate all sequences greater than length of 8
      * *UPDATE*: After some thought, and a quick google, I believe the error is a stack overflow, caused by infinite recursion.

Next up:
   * Resolve the bug and continue implementing chomp

---+++06/11/10

Finished implementing =<nop>JumpStartTable=, which I've placed into the =/index= directory for now...
   * The bug mentioned yesterday was not infinite recursion
      * It was a very very very long... but finite recursion, which sort of makes sense, as the algorithm worked fine with smaller input
      * Nonetheless, it overflowed the stack before the final answer was evaluated
      * I resolved it by making the loop iterative, rather than recursive
      * That broke something else... which I caught with a conditional when the input length is 1
   * A new =<nop>JumpStartExactMapper= has been implemented
   * I've intergrated it into =saligner=
      * On my computer, without any loss in specificity, I get a 33% decrease in runtime

Next up:
   * Investigate and integrate Makinen's multi-reference support
   * Replace rank structure
   * Look into bowtie's mismatch mode

---+++06/14/10

Started to integrate Makinen's multi-reference support:
   * Two main parts:
      1. Modifying the code that reads the FASTA reference and builds the file
         * Straightforward loop that extracts multiple FASTA entries from the file and adds each of them to the index
      1. Modifying the code that queries the index
         * In the past, such queries would only return the position
         * I'm currently changing it so two values are returned, the position + the reference it occurs on
         * Unfortunately, this requires changing just about every single class... as queries are key to everything
            * I'm currently bombarded by this huge cascading effect... and it'll probably take a while to make it compile... then reconfirm that I haven't broken anything
   * Unfortunately, we'll have to build something else to support multi-reference support for the Kmer Index
      * As I'm forced to change the Index Interface to support multi-references, the Kmer Index implements a lot of dummy functions...

Next up:
   * Continue integrating multi-reference support...

---+++06/15/10

Still integrating multi-reference support
   * Cleared all compile errors, fixed a few bugs, and got multi-reference support integrated into the <nop>FMIndex
   * Once that was done, started to modify =<nop>KmerIndex= as follows:
      * Implemented multi-reference support
      * Implemented IO support (save and load)
         * I've yet to test the entire class though...
   * Also, instead of returning a pair of values (reference_id + position) as a pair, as previously mentioned, it is now a two-member =struct= known as =<nop>IndexCoordinates=, some notes:
      * Returned by all locate-like functions
      * Main parameter for extract

Next up:
   * Continue integration
      * Remove all dummy functions
      * Fully implement =<nop>NGS_FMIndex= IO operations, i.e. remove =<nop>TempInit()=
      * Test correctness of new code
      * Redefine =SamFile::CreateHeader= to support multiple references (copy bowtie output)

---+++06/16/10

Finished integrating multi-reference support into the entire code base:
   * Removed all dummy functions
   * Implemented =<nop>NGS_FMIndex= IO operations
   * Redefine =SamFile::CreateHeader= to support multiple references
   * Made sure that all newly implemented code would run... not correctly... simply run

Up next:
   * Track down this bug in =<nop>OneMismatchMapper= quality values
   * Ensure correct SAM entry output for other mappers
   * Replace rank structure
   * Look into bowtie mismatch modes

---+++06/17/10

Tracked down bugs in =<nop>OneMismatchMapper=:
   * The quality bug was caused by a bad reference
      * The bad reference was caused by an off by one index access error...
   * Another error was found, which would cause positions returned to be incorrect when querying the reverse index
      * Firstly, I wasn't querying the reverse... but the forward index
      * After querying the location, I didn't recalculate the position relative to the forward orientation...
      * Tracing =readaligner= fixed this

Started to integrate =rank9=, Vigna's 64-bit rank implementation:
   * The interface =static_bitsequence= already has =virtual= base implementations for =rank0=, =select0= and =select1= which depend on =rank1=
      * This was very good news, as the =rank9= class I'm using only implements =rank1= and =select1=
   * The only real function I have to implement from scratch are the I/O operations
      * This is fairly straight forward after checking out what other classes do, and after doing something similar with the jump table
   * Unfortunately, as I'm on a 32-bit platform, I'm using the 32-bit class for now... although changing to 64 bit should be very simple

I also looked up a [[http://hgdownload.cse.ucsc.edu/downloads.html#human][human genome]] we can use... although I hesitate, as it's almost a gig:
   * The full genome [[http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/][here]] in =<nop>chromFa.tar.gz=  
   * As separate chromosomes [[http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/][here]]
   * *UPDATE*: The huge file from the first link is simply a tarball of all the separate files from the second link
      * Our aligner currently assumes that multiple chromosomes come in a single FASTA file, but we'll probably have to support multiple chromosomes from multiple files
      * The sequences in the files contain lots of Ns, along with upper case (for normal) and lower case (for high repetition regions) bases.  Our aligner would probably break with these...

Next up:
   * Finish integrating =rank9=
   * Look into bowtie mismatch modes

---+++06/18/10

Integrated the 64-bit rank implementation:
   * Created a =static_bitsequence_brw64.cpp= class that extends the =static_bitsequence= interface and encapsulates the 64-bit =rank9sel= class
   * Implemented =save= and =load= functions
   * Made 2 changes in =<nop>FMIndex.cpp= at lines 323 and 426 to use the new class

Tried to run the new implementation:
   * Long story short... it crashed
   * I initially attempted to use the 32-bit implementation, but that ran into a very odd array access error 
      * Very odd, since the the function called works right after the object is built... but after some time, the same call with the same argument causes an memory access error
   * Rather than attempting to resolve the bug in the 32-bit implementation, I swapped out the 32-bit for the 64-bit implementation
      * Unfortunately, my computer is not 64-bits, so I'm settings things up on skorpios for further testing and development

Next up:
   * Integrate and debug 64-bit rank implementation on skorpios
   * Check out bowtie's mismatch modes

---+++06/21/10

Integrating 64-bit rank on skorpios:
   * Compiled and ran previous code on skorpios
      * The rank structure was constructed successfully, confirming that the previous error was most likely 32-bit platform specfic
   * Unfortunately, the program goes into an seemingly infinite loop during wavelet tree construction
      * Using two parallel instances of =gdb=, one using 32-bit and one using the 64-bit, I have located the point of divergence at =wt_node_internal.cpp:83=
      * Long story short, the same rank query returned two differing values, causing a proceeding conditional to evaluate differently
         * The 32-bit implementation would proceed normally
         * The 64-bit implementation would go into a self-recursive pithole with no end in sight
   * Thus, I'm now attempting to construct two simple rank structures, a 32-bit and a 64-bit, in an attempt to determine where and why their answers diverge
      * This involved some rather messy manual shuffling of header files inside the =include= folders, due to the way I included Makinen's rank library... (i.e. I didn't change the header declarations to conform to our current style...)
   * Finally, once I had every set up, I found that the destructor would cause a seg fault... so I'll have to solve that even before I begin testing =rank1= outputs

Next up:
   * Resolve seg fault in =rank9= destructor
   * Consider changing Makinen's rank structure library header declarations to conform to current styles... just to make things easier with cmake down the road
   * Determined and resolve divergence of answers between different implementations

---+++06/22/10

Had a meeting today, some notes:
   * Shift focus away from optimization, and back to tailoring the specific needs required by client
      * This is, of course, assuming our software can handle data at the scale they require
      * Look into supporting human sized references
   * Concentrate on algorithmic improvements, by checking out other software (e.g. bowtie mismatch modes)
   * Don't try too hard with =rank9= integration

Thus, redirecting my efforts, my progress today:
   * Fixed a few bugs Jay informed me about
      * The =<nop>KmerMapper= was generating negative positions, which became very large unsigned integers, causing out of bound array access seg faults
      * Improved =<nop>MutantMapper= by only allowing mutations if the exact read returned no hits
   * Tried integrating =rank9=
      * Confirmed that outputs were identical for a string of 32 ones...
         * Previously untrue, as the 32-bit implementation would return the value of the rank one greater than the 64-bit implementation
      * It broke all the same when making wavelet trees
      * Putting this aside on the branch =64-bit_rank= and ignoring for now
   * Attempted to build human sized genomes on my computer (3.25GB RAM):
      * I tried this on =bowtie=, =bwa= and =readaligner=
         * =bowtie= can take multiple FASTA files in its command line (separated by commas)
         * =bwa= and =readaligner= cannot take multiple FASTA files, so I appended all the files in a large 3 gig file I called =big.fa=
      * =readaligner= failed:
<verbatim>
Inserting: chr1 (0%, elapsed 0.00 s, 0.00 hours)
Inserting: chr10 (1%, elapsed 1.00 s, 0.00 hours)
Inserting: chr11 (1%, elapsed 4.00 s, 0.00 hours)
terminate called after throwing an instance of 'std::bad_alloc'
</verbatim>
      * =bwa= failed
<verbatim>
C:\cygwin\home\Daniel\ngs-aligner\bwa-0.5.7>bwa index -a bwtsw ..\chromFa\big.fa
[bwa_index] Pack FASTA... 79.77 sec
[bwa_index] Reverse the packed sequence...       1 [main] bwa 2772 exception::handle: Exception: STATUS_ACCESS_VIOLATION
</verbatim>
      * =bowtie= also failed with both forms of input, single and multiple...
<verbatim>
Input files DNA, FASTA:
  ..\chromFa\chr1.fa
  ..\chromFa\chr2.fa
  ..\chromFa\chr3.fa
  ..\chromFa\chr4.fa
  ..\chromFa\chr5.fa
  ..\chromFa\chr6.fa
  ..\chromFa\chr7.fa
  ..\chromFa\chr8.fa
  ..\chromFa\chr9.fa
  ..\chromFa\chr10.fa
  ..\chromFa\chr11.fa
  ..\chromFa\chr12.fa
  ..\chromFa\chr13.fa
  ..\chromFa\chr14.fa
  ..\chromFa\chr15.fa
  ..\chromFa\chr16.fa
  ..\chromFa\chr17.fa
  ..\chromFa\chr18.fa
  ..\chromFa\chr19.fa
  ..\chromFa\chr20.fa
  ..\chromFa\chr21.fa
  ..\chromFa\chr22.fa
  ..\chromFa\chrX.fa
  ..\chromFa\chrY.fa
  ..\chromFa\chrM.fa
Reading reference sizes
  Time reading reference sizes: 00:02:04
Calculating joined length
Writing header
Reserving space for joined string
Could not allocate space for a joined string of 2861343702 elements.
Switching to a packed string representation.
Reading reference sizes
  Time reading reference sizes: 00:01:44
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:01:32
bmax according to bmaxDivN setting: 715335925
Using parameters --bmax 536501944 --dcv 1024
  Doing ahead-of-time memory usage test
  Ran out of memory; automatically trying more memory-economical parameters.
Using parameters --bmax 402376458 --dcv 2048
  Doing ahead-of-time memory usage test
  Ran out of memory; automatically trying more memory-economical parameters.

[...etc...]

Using parameters --bmax 42 --dcv 4096
  Doing ahead-of-time memory usage test
  Ran out of memory; automatically trying more memory-economical parameters.
Using parameters --bmax 32 --dcv 4096
  Doing ahead-of-time memory usage test
  Ran out of memory; automatically trying more memory-economical parameters.
Could not find approrpiate bmax/dcv settings for building this index.
Already tried a packed string representation.
Please try indexing this reference on a computer with more memory.
If this computer has more than 4 GB of memory, try using a 64-bit executable;
this executable is 32-bit.
Total time for call to driver() for forward index: 00:05:34
Command: bowtie-build ..\chromFa\chr1.fa,..\chromFa\chr2.fa,..\chromFa\chr3.fa,..\chromFa\chr4.fa,..\chromFa\chr5.fa,..\chromFa\chr6.fa,..\chromFa\chr
7.fa,..\chromFa\chr8.fa,..\chromFa\chr9.fa,..\chromFa\chr10.fa,..\chromFa\chr11.fa,..\chromFa\chr12.fa,..\chromFa\chr13.fa,..\chromFa\chr14.fa,..\chro
mFa\chr15.fa,..\chromFa\chr16.fa,..\chromFa\chr17.fa,..\chromFa\chr18.fa,..\chromFa\chr19.fa,..\chromFa\chr20.fa,..\chromFa\chr21.fa,..\chromFa\chr22.
fa,..\chromFa\chrX.fa,..\chromFa\chrY.fa,..\chromFa\chrM.fa big
</verbatim>
      * I'll need to ask Chris what do to about this...

Up next:
   * Somehow, somewhere, obtain an index for Makinen's =readaligner=
      * Our software can use indices built by Makinen's software
   * Trace bowtie mismatch modes for strategies

---+++06/23/10

   * Built a human index using 64-bit =bowtie= on =skorpios=
      * Took about an hour per direction (forward and reverse)
      * I couldn't include the X and Y chromosome, as /tmp ran out of space
   * While that was going on, I also started tracing =bowtie= in 1-mismatch mode
      * The general search has two phases, defined in =search_1mm_phase1.c= and =search_1mm_phase2.c=
         * Phase 1 tries to align it with exact alignments on the forward and reverse strand
            * If that fails, it attempts to align it allowing 1 mismatch on the 3' end on the forward and reverse strand
         * Phase two commences when both subphases of Phase 1 fail
            * Attempts to align it allowing 1 mismatch on the 5' end (via usage of the reverse index)
      * The overall strategy is the same as to what I'm doing in =<nop>OneMismatchMapper=
      * The actual details is slightly more cryptic, and depends on the =backtrack= function again (same as the one used in exact match mode)
         * The difference lies in the initialization, changing many variables, resulting in a different conditional evaluations
      * Some notes (more for myself...)
         * =<nop>bt.SetOffs()= in =search_1mm_phase1.c= sets =unrevOff= at line 50 and 61 to half the length of the reads
         * =<nop>unrevOff= cause the boolean =<nop>curIsAlternative= to be set to true at =ebwt_searchb_backtrack.h:486= under specific conditions
         * =top= and =bot= are subsequently updated by a difference non-LF function at =ebwt_searchb_backtrack.h:553= (previously at =566=)
         * =<nop>curIsAlternative= set to =true= also triggers the updating of an =elims= array at =580= (=altNum++= at 623 also plays a role)
         * When a mismatch is detected, and alternatives exist, the =elim= array is used to select a potential match at =745=
      * Honestly, I still don't quite understand what's happening at all...

Next up:
   * Try to build a large index using readaligner
   * Continue tracing =bowtie= 1 mismatch mode

---+++06/24/10

   * Attempted to build a large index using =readaligner= and discovered a few general guidelines, after trial and error, and staring at the memory usage of the process:
      * Given a FASTA file with multiple sequences, I assume something similar to the following happens:
         1. Each reference is inserted into the index
            * The memory required for each insertion is roughly 10 time the size of the reference to be inserted
         1. After all references are inserted, the index needs to be initialized
            * The memory required for this initialization is roughly 2 times the size of all the reference combined
         1. This entire process is repeated twice, to build the reverse index
            * The same memory requirements apply
      * My computer allows a single process up to roughly 1GB of memory (despite having 3.25GB RAM... for reasons I'm unsure of), so the largest index I've been able to build is:
         * A single FASTA file of 580MB
            * Step 2 requires about 1.2GB
         * Composed of chromosomes 14 to 20
            * Chromosome 14 is the largest reference, at 107MB, requiring 1GB in Step 1
      * Knowing this, in theory, we should be able to build in human index (3GB) with a computer that allows a process 6GB of memory
         * The largest human chromosome (=chr1.fa=) is 250MB, which requies 2.5 gigs of memory to insert, well below the 6GB requirement

   * Continued tracing =bowtie= 1 mismatch mode:
      * A bit more detail about the overall 2 phase strategy mentioned yesterday, which is more accurately described in 6 steps
         1. Try exact alignment on forward strand using forward index
         1. Try exact alignment on reverse strand using forward index
         1. Allow 1 mismatch in the 3' half on *reverse strand* using forward index
         1. Allow 1 mismatch in the 3' half on forward strand using forward index
         1. Allow 1 mismatch in the 5' half on *reverse strand* using reverse index
         1. Allow 1 mismatch in the 5' half on forward strand using reverse index
         * Step 2 and step 3 both attempt alignments on the reverse strand, much like readaligner, which I assume saves one reversal operations
         * Along the same lines though, it makes little sense why step 5 isn't on the forward strand...
         * Nonetheless, as =bowtie= and =readaligner= both use the strategy seen between steps 2 and 3, it may be something worth noting
      * I believe I have some grasp of the basic 1 mismatch flow:
         * Starting from the end of the read find the corresponding suffix ranges as usual
            * If a mismatch is encountered before the halfway point, the trace is terminated to prevent excessive backtracking, and the alignment is performed with the reverse index
         * Once past the half way point, the =<nop>curIsAlternative= is set to true, and 5 values are updated and tracked (=ebwt_search_backtrack.h:580=):
            * =pairs=: an array of =uint32_t= values.  Adds 8 entries per position past the halfway point, corresponding to the =top= and =bot= values for the 4 possible bases at each position
               * As mentioned yesterday, once past the halfway point, the =mapLF= function (line =566=) becomes replaced with =mapLFEx= (line =550=), which determines the 8 =top= and =bot= ranges for each position (instead of 2)
               * From these pairs, by subtracting =bot= from respective =top= values, you can determine the =spread= at a specific position with a specific base
            * =elims=: an array of =uint8_t= flags.  Each position has a 4 bit flag, corresponding to the 4 bases.  When the spread of a pair range is 0, the respective elims bit is set to 1 at line =591= (as a range of 0 means that no possible suffixes exist, and is thus "eliminated" from the eligible list of potential alignments).  Note that an integer value of 15 is =1111= in binary, meaning all four bases are eliminated.
            * =eligibleNum= is the number of eligible paths (think suffix tree branch).  A path is only eligible if is has not been eliminated, and progressing down the path doesn't "blow the mismatch quality budget" (this latter point is irrelevant in in the mode I'm currently in)
            * =eligibleSz= is the number of eligible suffixes (think leaf nodes below the aforementioned branches)
            * =altNum= is the number of alternative paths.  A path is alternative if it hasn't been eliminated, and isn't affected by quality.  In the quality-independent mode I'm currently in, I assume this value is the same as =eligibleNum=
         * When a mismatch is encountered and alternatives exist (=745=):
            * Goes back to the last valid position (the one position right before the mismatch)
            * Using =pairs= and =elims=, randomly selects one base with a eligible path (=791=)
         * Once an eligible path is selected, there are two cases I've seen:
            * The mismatch occurred in the very last character of the search, and the result is returned (=908=)
            * The mismatch occurred before the end, and the search continues with a recursive call at =961=.
               * Note that in this call, no mismatches are allowed, and is effectively the same as a call during exact match mode
               * Where exact match mode typically seeds the function with =top= and =bot= generated from the =ftab= chomp function, the function is seeded with a pair of values from =pairs=
               * The call completes the iteration across the remaining part of the read as usual

Next up:
   * Try more interesting 1 mismatch nodes, as all my current traces have only 1 alternative path
      * May be faster to create my own reference
   * Try 2/3 mismatch modes to see the full effect of =pairs= and =elims=, as their effects don't seem too impressive in 1 mismatch mode
   * Think about potential integration of similar strategies in our aligner... only if they prove useful

---+++06/25/10

   * Built the complete human genome using =skorpios= overnight
      * Two 3GB references were created, one per direction using =readaligner=
         * Accessible at =/ubc/cs/research/condon/share/projects/ngs-aligner/data/full=
      * Currently attempting to align to it... although I worry how the program will load 6GB worth of structures...
         * *UPDATE*: The two indices have been successfully loaded into memory on =skorpios=... if not barely.  Using =top=, 6017912k/6055126k memory is being used, leaving 37214k free
      * Not to mention that the input FASTQ file is 1.6 GB itself...
         * Accessible at =/ubc/cs/research/condon/share/projects/ngs-aligner/data=
         * *UPDATE*: I believe this isn't being completely loaded into memory... and being read in small bits from the disk... fortunately
      * *UPDATE*: The results using =-k 0=:
<verbatim>
Number of reads: 9563070 (skipped first 0)
Successfully aligned: 2055421 reads (21.49 %)
Number of reported alignments: 2055421
Wall-clock time: 2489.00 seconds (0.69 hours)
Average wall-clock time per read: 0.26 milliseconds

real    135m3.209s
user    8m30.408s
sys     0m27.482s
</verbatim>

      * Once this is confirmed working, then I'll try it on =saligner=...
   * Continued tracing =bowtie=
      * By constructing my own reference, I was able to test some more interesting 1 mismatch traces
         * When a mismatch is encountered at a position, another alternative path is selected randomly (=791=)
            * If the path is valid, a match is found, as described yesterday
            * If the path was invalid (i.e. another mismatch is encountered), the recursive call returns false (=974= fails), the loop tries again (=745=) at a new position (=772=)
               * Using the =elims= array, a lot of unproductive traceback is avoided (=777=)
            * I need to confirm if this random choice only selects 1 base at a position... and see if it dismisses valid paths at the same position but using a different base
      * Started looking into 2 mismatch mode
         * There are 3 phases, best described as 6 cases (2 cases per phase)
            1. Forward index: match exact reads on the forward strand
            1. Forward index: match reads with 0 to 2 mismatches on the right half of reads on the reverse strand
            1. Reverse index: match reads with 1 to 2 mismatches on the right half of reads on the forward strand
            1. Reverse index: match reads with 1 to 2 mismatches on the left half of reads on the reverse strand
            1. Forward index: match reads with 1 to 2 mismatches on the left half of reads on the forward strand
            1. Forward index: match reads with 2 mismatches on different halves of reads on either strand
      * In the general cases (cases 1-5), the first mismatch causes a recursive 1 mismatch backtrace, described in detail yesterday
         * Note that each recursive layer keeps its own unique =pairs= and =elims= array
         * Due to the depth wise method it searches to the next alternative, it will sometimes prefer a 2 mismatch alignment over a 1 mismatch alignment
            * This doesn't seem to be the case when qualities are considered... but that's only in the seeded quality mode
      * I've yet to determine the exact mechanism of case 6, which I suspect will be somewhat different, as one mismatch occurs in the =chomp= region
   * Slightly off topic, but I was browsing around and confirmed that as of 5 months ago, =bowtie= was still in the midst of [[http://seqanswers.com/forums/showthread.php?p=13124][implementing]] gapped alignment support
      * Consequently, I'll either have to look into how =readaligner= or =bwa= does gapped alignments
         * =readaligner= is very slow though...
         * ...and =bwa= code is very gruesome

Next up:
   * Pray that the alignment goes well with =readaligner=, then repeat the process with =saligner=
   * Continue tracing =bowtie= 2 mismatch mode
   * Trace =bowtie= 3 mismatch mode and seeded quality mode

---+++06/28/10

   * Starting to debug and rewrite classes to support the human genome
      * As mentioned last time, simply loading the index from file takes 90 minutes (assumed to be from mounting remove drives)
         * This can by reduced to 90 seconds by accessing the index file directly on skorpios' disk.
            * =/var/tmp/= has sufficient memory and access to achieve this for all future runs
      * Previously, besides the two indices created by Makinen's =<nop>FMIndex= class, I saved an additional structure to disk:
         * It contained:
            * A set of original genomes for quick extraction
            * A list of lengths
            * A list of reference names
         * As =readaligner= doesn't build this, I had no such file for the human genome, and if I did, it would be 3GB (for a total of 9GB, over skoprios's memory)
         * I attempted to completely remove the need for this file, and was able to eliminate everything but the actual sequences themselves
            * In theory, a true extract function using the compressed index could replace it
               * However, after consulting Chris, I've decided not to pursue this option (not to mention that we'll be replacing the entire index)
            * Alternatively, I'll replace the 3GB genome with a 750MB packed DNA string
               * In the likely case where 3GB + 750MB is too much for skorpios to handle, I'll rebuild the 3GB index at a more memory friendly setting (at the cost of query speed)

Next up:
   * Implement a quick DNA bit string class to compact the genome
   * Run and debug alignments on the human index
   * Investigate =bowtie= mismatch modes and incorporate strategies
      * Things readily adaptable:
         * The =elims= and =pairs= system
         * The phases
      * I imagine I'll probably just copy the general flow of =bowtie= functions instead of reinventing something and incorporating it

---+++06/29/10

   * Created a =<nop>DnaBitstring= class, which packs the 3GB reference genome into 750MB
   * Integrated the class into the code base
   * Built and saved a file of the 3GB reference
   * Currently running an alignment with =saligner=
      * Results will be posted later...
      * *UPDATE:* =saligner= with =<nop>MismatchMapper= at =-k 2= on a =Debug= build:
         * I wasn't concerned about speed, and was more interested in whether it would crash or not... which it didn't.
<verbatim>
Elapsed time: 13489.8 seconds
Processed   : 9563070
Average time: 0.00141062 s/read
Mapped      : 3601713
Unmapped    : 5463992
Invalid     : 497365

real    333m35.102s
user    223m31.726s
sys     1m35.954s
</verbatim>

Next up:
   * Confirm that =saligner= can handle human genome alignments
   * =bowtie= investigation and strategy integration

---+++06/30/10

   * Some faster alignment results on =saligner= with =OneMismatchMapper= on a Release build
<verbatim>
Elapsed time: 1295.97 seconds
Processed   : 9563070
Average time: 0.000135518 s/read
Mapped      : 2928602
Unmapped    : 6137103
Invalid     : 497365

real    57m46.685s
user    21m16.380s
sys     0m36.042s
</verbatim>

      * Loading the index takes 90 seconds, like =readaligner=, however, after loading the index, and prior to aligning, there is a roughly 30 minute "pause" (difference between =real= and =user=)
         * Staring at the memory usage and the code, I believe this 30 minutes is processing the 1.6GB FASTA file
            * After the index is loaded, 6GB of RAM and 0.4GB of the virtual memory is occupied
            * The virtual memory starts to slowly grow until reaching a max at around 2GB 15 minutes after the index is loaded (I assume this is the FASTA file)
            * There is then a 19 minute gap, with no visible change in memory usage...
            * Only half an hour after the start of the program does actual aligning begin, which lasts about 20 minutes for 10 million reads
      * Although, when I try aligning the 1.6GB file against a negligibly small reference, there is no lag
         * Instead the 1.6GB is quickly loaded into RAM and the alignment begins
         * The RAM usage gradually increases from 1.6GB until it peaks at 3.0GB, which I believe is holding the 1.4GB output file
      * From these experiments, I believe this problem shouldn't be too major if our custom index is small enough to allow both the index and all input and output files to fit on the RAM...

   * Continued to trace =bowtie=
      * Went through case 6 described on 06/25/10
         * It handles a "split" mismatch, where there is 1 mismatch on both ends of the read
         * Such a read usually returns an invalid range after =chomp=
         * The alignment starts from the very end (instead of 10 bases in), until the first mismatch is hit
         * After the mismatch is hit, a valid base is chosen (i.e. a base that has a valid range) and chomp is performed once again
            * I believe they also compute a "KmerValue" at =923=
         * The alignment then proceeds as a normal 1 mismatch alignment
      * Started tracing the maq-like quality aware mode... which has 4 phases and 9 or so steps...

Next up:
   * Note to self... Try the variant "split" mismatch with mismatches occuring just outside of chomp range, but still before the halfway point...
   * Continue tracing =bowtie= maq-like mode
   * Implement =bowtie= like strategies
Raw edit | More topic actions
Topic revision: r1 - 2010-07-01 - jujubix