Tags:
create new tag
view all tags

06/01/10

Meeting with Jay and Chris, up next:

  • Create a simple minimal locate function
  • Modify the aforementioned function to make a single mismatch locate
  • Compare with bowtie's analogous functions
  • Copy bowtie if required

Also, future performance tests on skopios should read and write files from /tmp/ so results aren't affected by network lag... which caused 19 seconds of additional runtime in yesterday's results, making them useless.

06/02/10

Making a simple minimal locate function:

  • Read another paper on the FMIndex, which describes the basic logic behind the count and locate algorithms in sections 3.1 and 3.3.
  • Confirmed that the locate I implemented for ExactMapper follows the minimal basic algorithm described by the above paper
  • From Jay's journal, it seems that our code base is thrice as slow as bowtie's... either that, or bowtie's taking some shortcuts to speed things up.
  • Started implementing a simple OneMismatchLocate inside /external/readaligner/FMIndex.cc, as it needs access to 2 private class members
    • Unfortunately, this also means I have to add yet another virtual function in TextCollection
    • I've tried using FMIndex class directly, but including the FMIndex.h header in any file causes the compiler to complain a lot... due to the way includes are set up
      • The quick dirty resolution would be to force the cmake to copy a lot more files
      • The real solution would be to resolve ticgit 4b282e

Next up:

  • Finish implementing the 1 mismatch locate... which does not look efficient in the very least...
  • Investigate bowtie code

06/03/10

Implemented OneMismatchLocate inside OneMismatchMapper:

  • Uses LF and getPositions from TextCollection
  • Non-rigorous testing using a single read, and manually confirming the result seem to indicate that it is correct (I had to work out a few edge cases though...)
  • It runs in three phases, which I assume is a bruteforce depth first search
    1. Locate the unmodified read, and store the sp and ep positions generated
    2. If the read missed, backtrack to last valid position range, and change the base (repeat for every base and position (increasing) until a hit is made)
    3. A special case is required to handle mismatches in the very last base (first base read in matching algorithm), as no pre-existing position range was stored
  • Preliminary results seems to indicate that it is slower than MismatchQuery with k = 1 due to excessive backtracking
    • In case anyone is interested, by uncommenting all the "printf" statements in the function (use replace), you get a very nice visual presentation of the backtracking
    • Can be easily remedied by copying readaligner's strategy by using both a forward and reverse (not complemented) index
    • The number of read mapped and unmapped are identical, which I hope is good news
  • A slightly more complicated but potentially beneficial change is to swap bases with lowest qualities first

Next up:

  • Improve performance of OneMismatchLocate by using the reverse index to reduce excessive backtracking

06/04/10

Improved OneMismatchLocate:

  • By using the reverse index along with the forward index, I have improved the performance of the OneMismatchMapper
  • Details best summarized in the following table, generated using saligner on my computer, with the 125000 set, just for a rough comparison:

Mapper max_results = 10 max_results = 1
ExactMapper 7.2s 6.5s
OneMismatchMapper (old) 33.4s 22.0s
OneMismatchMapper (new) 11.1s 9.4s
MismatchMapper (k = 1) 12.3s 11.0s
MismatchMapper (k = 2) 15.5s 13.8s

  • I'm pleased to see that the improved OneMismatchMapper outperforms MismatchMapper at equivalent settings (k = 1), barely...
  • Of course, bowtie does the same thing in 3 seconds... so I'll have to see what it's doing

Staring at bowtie:

  • With the aid of callgrind, I have located the core logic of the bowtie alignments to be in backtrack function in ebwt_search_backtrack
  • There are actually 3 backtrack functions, which call each other in a sequential fashion (and the last and largest one is self-recursive)
  • Combined, they have about 1000 lines of code that makes little sense to me... even with the descriptive comments

Next up:

  • bowtie? Although it feels somewhat futile at the moment...
  • Reviewing some old entries, maybe I'll give that KickStartLocate a try, which would theoretically work with a pre-computed table to speed up alignments.

06/07/10

Stared at bowtie:

  • Tracing the exact alignment of a single read on the reverse strand with gdb and callgrind in 0-mismatch mode, I have determined a few things that contributes to its speed:
    • Firstly, it uses the standard reverse depth first search algorithm standard to FM Indexes, but it has a lot of extra conditionals and branches to deal with mismatches and backtracking, top and bot are used in the code analogously to sp and ep
    • It "kick starts" the traditional locate from the 11th (to last) base (i.e. it skips the first 10 iterations)
      • This is effectively the same as Chris' idea, although I am not sure if bowtie uses a precomputed matrix or not
      • The exact functions (one for top and bot) called on ebwt_search_backtrack.h:258-259 and defined in ebwt.h
      • By the comments, this "chomp" can actually be longer than 10 bases, and there are conditional to catch cases where the "chomp" locates the entire read
    • The top and bot ranges converge much quicker, leading to much quicker rejections for inexact hits
      • I have no clue how this is achieved... short of assuming that the suffix indexes are built different
      • UPDATE: On second thought, this is merely a side effect of the aforementioned seeding process. With the 10 base seed, the starting range is much narrower, and the convergence much quicker.
    • There is a special function called when top + 1 equals bot (i.e. when only 1 suffix is valid in the search range) to update the two pointer... at ebwt_search_backtrack.h:563, defined in ebwt.h:2256
      • I'm not entirely sure if this is contributing to anything
    • I am highly confused by this SideLocus structure that is passed around...
      • From LF signatures, it seems analogous to the BWT(T), but it doesn't quite make sense why top and bot use different unique copies of this structure
      • It is defined in ebwt.h

I've simulated what it would be like to kick start the locate 10 bases in on saligner, and I can shave about 10% off the current runtime, so it may be worthwhile to check this out.

Next up:

  • Trace bowtie in 1-mismatch mode
  • Implement the kick started locate

06/08/10

Meeting today, some things to do:

  • Confirm whether readaligner supports multiple references
    • If so, adapt it
    • If not, we'll need to design something...
  • Continue tracing bowtie's exact alignment mode
  • Implement a kick started version of locate, also known as "chomp"
    • This may involve pre-computing and saving a structure
    • Probably a separate structure than readaligner's two index files
  • Look into unencapsulating FMIndex from TextCollection
    • May involve breaking of dependencies with Makinen's class structure
  • Change C++ strings to char arrays for performance
  • Look into FMIndex's rank structure, and see whether the binary rank is separated from the alphabet rank
    • If so, try substituting the binary rank by Vigna's rank9 implementation
    • If not, forget about this for now
  • Standardize the integer types used... currently in discussion via email

06/09/10

Tackling the items from yesterday:

  • Confirmed that readaligner supports multiple references
    • Given a FASTA file with separate reference entries (each with a unique > header) and a single index is built (well, okay, two: forward and reverse)
    • The resultant SAM output has reads mapped to all three references
    • Some points of concern
      • Currently, I keep a reference_name member inside NGS_FMIndex, which is a single constant name
        • We'll need to find how to extract the correct reference name from Makinen's , which I suspect may be related to that second variable returned by the locate function (the one besides the position, which I've been ignoring completely... and even considered removing it)
        • I've yet to confirm if the positions are correct... but I'll assume they are
  • Traced bowtie to find clues of its high performance, some observations:
    • The alphabet used in bowtie sequence is much smaller, which consists only of four symbols {0, 1, 2, 3} defined in the SeqAn library
      • Consequently, they don't need a wavelet tree, which may be contributing to some slowdown in Makinen's implementation
    • I've confirmed that chomp is constant table lookup
      • The table, called ftab is a uint32_t array of 4^10 entries
      • I would've imagined it would be a table of pairs (for ep and sp)... but it's not
      • Related to the above observation, I doubt ftab is a simple list of position, as occasionally, the value of an ftab entry will return a value out of range of the largest suffix array index
        • This is caught by a special conditional and an additional lookup is done on a supplementary eftab table ('e' for extended)
    • I've discovered a peculiar behavior of the SideLocus structure that may give hint as to what it actually is
      • There is a _fw boolean that's set to true if it's forward side (else it's the backward side)
      • In the LF function, there is a conditional, that branches depending on how _fw is set (ebwt.h:2204)
        • If _fw is true, then LF is calculated normally, by adding the results of C and Occ (ebwt.h:1913)
        • When _fw is false, LF is calculated by subtracting Occ from C (ebwt.h:2017), maybe Chris can deduce something from this...
          • Of course, when I say C, that's not technically correct at all... it's actually a count function for a char across the SideLocus, but I'm hoping it's analogous
    • Also, by inspecting the functions called in the general case (LF), and when top + 1 equals bot (LF1), I believe nothing special is done
      • However, in the general case, LF is called twice, once per pointer
      • When top + 1 equals bot, LF1 is called once and the resulting value is assigned to both (followed by ++bot)
      • This saves one call, which I imagine may add up to a lot, as the special case occurs quite often
  • Concerning integer types:
    • I believe we've settled on trying to stick to uint32_t to handle all potential positions given a human reference genome
      • Although, knowing that readaligner handles multiple references, we no longer have the need to concatenate all the chromosomes to obtain a references of about 3 billion bp... so maybe this wasn't that big of a concern afterall?
    • For performance critical functions like rank, uint64_t can be used to take advantage of 64-bit architecture
    • We'll avoid the system-dependently-sized size_t to ensure constant sized structures
  • Jay is handling std::string to char* conversions

Next up:

  • Implement chomp
    • I'll ignore whatever bowtie may be doing and make a 2 by 4^10 uint32_t array to store the ep and sp positions of all suffixes of length 10 composed of the 4 bases
  • Figure out how Makinen's supporting multiple references and adapt it to our library
    • Only after this should we start unencapsulating FMIndex, to avoid breaking any essential dependencies
  • Integrate Vigna's 64-bit broadword structure
    • Chris has mentioned that this can be done by using the static_bitsequence.h interface
  • Look into bowtie's mismatch modes

06/10/10

Started implementing chomp:

  • Rather that worry about exactly where this structure would go in our class hierarchy, I've simply started making everything in a huge main
  • There are 4 main parts to creating the kick start table:
    1. Generating all 1048576 unique sequences of length 10
      • Fortunately, the FastqEntryIterator already does this (to create mutants)
    2. Map these unique sequences to unique numerical values in the range [0, 1048575]
      • Function already present, used to build KmerIndex
    3. Get the sp and ep positions for the sequence
      • Using the Search function inside FMIndex
    4. Save the structure
      • I'll have to figure this out... using "old school serializable"
        • While I'm at this, I could potentially implement load and save in KmerIndex
  • Unfortunately, a simple program using the first three parts died very silent and very odd death
    • I believe I have traced it to an infinite recursive loop occurring in fastq_entry_iterator.cc:76, although I've never seen a program that crashes silently in an infinite loop
    • ...unless, of course, the infinite loop was also leaking resources, but I wasn't able to confirm that either...
    • Oddly, it only occurs when I try to generate all sequences greater than length of 8
    • UPDATE: After some thought, and a quick google, I believe the error is a stack overflow, caused by infinite recursion.

Next up:

  • Resolve the bug and continue implementing chomp

06/11/10

Finished implementing JumpStartTable, which I've placed into the /index directory for now...

  • The bug mentioned yesterday was not infinite recursion
    • It was a very very very long... but finite recursion, which sort of makes sense, as the algorithm worked fine with smaller input
    • Nonetheless, it overflowed the stack before the final answer was evaluated
    • I resolved it by making the loop iterative, rather than recursive
    • That broke something else... which I caught with a conditional when the input length is 1
  • A new JumpStartExactMapper has been implemented
  • I've intergrated it into saligner
    • On my computer, without any loss in specificity, I get a 33% decrease in runtime

Next up:

  • Investigate and integrate Makinen's multi-reference support
  • Replace rank structure
  • Look into bowtie's mismatch mode

06/14/10

Started to integrate Makinen's multi-reference support:

  • Two main parts:
    1. Modifying the code that reads the FASTA reference and builds the file
      • Straightforward loop that extracts multiple FASTA entries from the file and adds each of them to the index
    2. Modifying the code that queries the index
      • In the past, such queries would only return the position
      • I'm currently changing it so two values are returned, the position + the reference it occurs on
      • Unfortunately, this requires changing just about every single class... as queries are key to everything
        • I'm currently bombarded by this huge cascading effect... and it'll probably take a while to make it compile... then reconfirm that I haven't broken anything
  • Unfortunately, we'll have to build something else to support multi-reference support for the Kmer Index
    • As I'm forced to change the Index Interface to support multi-references, the Kmer Index implements a lot of dummy functions...

Next up:

  • Continue integrating multi-reference support...

06/15/10

Still integrating multi-reference support

  • Cleared all compile errors, fixed a few bugs, and got multi-reference support integrated into the FMIndex
  • Once that was done, started to modify KmerIndex as follows:
    • Implemented multi-reference support
    • Implemented IO support (save and load)
      • I've yet to test the entire class though...
  • Also, instead of returning a pair of values (reference_id + position) as a pair, as previously mentioned, it is now a two-member struct known as IndexCoordinates, some notes:
    • Returned by all locate-like functions
    • Main parameter for extract

Next up:

  • Continue integration
    • Remove all dummy functions
    • Fully implement NGS_FMIndex IO operations, i.e. remove TempInit()
    • Test correctness of new code
    • Redefine SamFile::CreateHeader to support multiple references (copy bowtie output)

06/16/10

Finished integrating multi-reference support into the entire code base:

  • Removed all dummy functions
  • Implemented NGS_FMIndex IO operations
  • Redefine SamFile::CreateHeader to support multiple references
  • Made sure that all newly implemented code would run... not correctly... simply run

Up next:

  • Track down this bug in OneMismatchMapper quality values
  • Ensure correct SAM entry output for other mappers
  • Replace rank structure
  • Look into bowtie mismatch modes

06/17/10

Tracked down bugs in OneMismatchMapper:

  • The quality bug was caused by a bad reference
    • The bad reference was caused by an off by one index access error...
  • Another error was found, which would cause positions returned to be incorrect when querying the reverse index
    • Firstly, I wasn't querying the reverse... but the forward index
    • After querying the location, I didn't recalculate the position relative to the forward orientation...
    • Tracing readaligner fixed this

Started to integrate rank9, Vigna's 64-bit rank implementation:

  • The interface static_bitsequence already has virtual base implementations for rank0, select0 and select1 which depend on rank1
    • This was very good news, as the rank9 class I'm using only implements rank1 and select1
  • The only real function I have to implement from scratch are the I/O operations
    • This is fairly straight forward after checking out what other classes do, and after doing something similar with the jump table
  • Unfortunately, as I'm on a 32-bit platform, I'm using the 32-bit class for now... although changing to 64 bit should be very simple

I also looked up a human genome we can use... although I hesitate, as it's almost a gig:

  • The full genome here in chromFa.tar.gz
  • As separate chromosomes here
  • UPDATE: The huge file from the first link is simply a tarball of all the separate files from the second link
    • Our aligner currently assumes that multiple chromosomes come in a single FASTA file, but we'll probably have to support multiple chromosomes from multiple files
    • The sequences in the files contain lots of Ns, along with upper case (for normal) and lower case (for high repetition regions) bases. Our aligner would probably break with these...

Next up:

  • Finish integrating rank9
  • Look into bowtie mismatch modes

06/18/10

Integrated the 64-bit rank implementation:

  • Created a static_bitsequence_brw64.cpp class that extends the static_bitsequence interface and encapsulates the 64-bit rank9sel class
  • Implemented save and load functions
  • Made 2 changes in FMIndex.cpp at lines 323 and 426 to use the new class

Tried to run the new implementation:

  • Long story short... it crashed
  • I initially attempted to use the 32-bit implementation, but that ran into a very odd array access error
    • Very odd, since the the function called works right after the object is built... but after some time, the same call with the same argument causes an memory access error
  • Rather than attempting to resolve the bug in the 32-bit implementation, I swapped out the 32-bit for the 64-bit implementation
    • Unfortunately, my computer is not 64-bits, so I'm settings things up on skorpios for further testing and development

Next up:

  • Integrate and debug 64-bit rank implementation on skorpios
  • Check out bowtie's mismatch modes

06/21/10

Integrating 64-bit rank on skorpios:

  • Compiled and ran previous code on skorpios
    • The rank structure was constructed successfully, confirming that the previous error was most likely 32-bit platform specfic
  • Unfortunately, the program goes into an seemingly infinite loop during wavelet tree construction
    • Using two parallel instances of gdb, one using 32-bit and one using the 64-bit, I have located the point of divergence at wt_node_internal.cpp:83
    • Long story short, the same rank query returned two differing values, causing a proceeding conditional to evaluate differently
      • The 32-bit implementation would proceed normally
      • The 64-bit implementation would go into a self-recursive pithole with no end in sight
  • Thus, I'm now attempting to construct two simple rank structures, a 32-bit and a 64-bit, in an attempt to determine where and why their answers diverge
    • This involved some rather messy manual shuffling of header files inside the include folders, due to the way I included Makinen's rank library... (i.e. I didn't change the header declarations to conform to our current style...)
  • Finally, once I had every set up, I found that the destructor would cause a seg fault... so I'll have to solve that even before I begin testing rank1 outputs

Next up:

  • Resolve seg fault in rank9 destructor
  • Consider changing Makinen's rank structure library header declarations to conform to current styles... just to make things easier with cmake down the road
  • Determined and resolve divergence of answers between different implementations

06/22/10

Had a meeting today, some notes:

  • Shift focus away from optimization, and back to tailoring the specific needs required by client
    • This is, of course, assuming our software can handle data at the scale they require
    • Look into supporting human sized references
  • Concentrate on algorithmic improvements, by checking out other software (e.g. bowtie mismatch modes)
  • Don't try too hard with rank9 integration

Thus, redirecting my efforts, my progress today:

  • Fixed a few bugs Jay informed me about
    • The KmerMapper was generating negative positions, which became very large unsigned integers, causing out of bound array access seg faults
    • Improved MutantMapper by only allowing mutations if the exact read returned no hits
  • Tried integrating rank9
    • Confirmed that outputs were identical for a string of 32 ones...
      • Previously untrue, as the 32-bit implementation would return the value of the rank one greater than the 64-bit implementation
    • It broke all the same when making wavelet trees
    • Putting this aside on the branch 64-bit_rank and ignoring for now
  • Attempted to build human sized genomes on my computer (3.25GB RAM):
    • I tried this on bowtie, bwa and readaligner
      • bowtie can take multiple FASTA files in its command line (separated by commas)
      • bwa and readaligner cannot take multiple FASTA files, so I appended all the files in a large 3 gig file I called big.fa
    • readaligner failed:
Inserting: chr1 (0%, elapsed 0.00 s, 0.00 hours)
Inserting: chr10 (1%, elapsed 1.00 s, 0.00 hours)
Inserting: chr11 (1%, elapsed 4.00 s, 0.00 hours)
terminate called after throwing an instance of 'std::bad_alloc'
    • bwa failed
C:\cygwin\home\Daniel\ngs-aligner\bwa-0.5.7>bwa index -a bwtsw ..\chromFa\big.fa
[bwa_index] Pack FASTA... 79.77 sec
[bwa_index] Reverse the packed sequence...       1 [main] bwa 2772 exception::handle: Exception: STATUS_ACCESS_VIOLATION
    • bowtie also failed with both forms of input, single and multiple...
Input files DNA, FASTA:
  ..\chromFa\chr1.fa
  ..\chromFa\chr2.fa
  ..\chromFa\chr3.fa
  ..\chromFa\chr4.fa
  ..\chromFa\chr5.fa
  ..\chromFa\chr6.fa
  ..\chromFa\chr7.fa
  ..\chromFa\chr8.fa
  ..\chromFa\chr9.fa
  ..\chromFa\chr10.fa
  ..\chromFa\chr11.fa
  ..\chromFa\chr12.fa
  ..\chromFa\chr13.fa
  ..\chromFa\chr14.fa
  ..\chromFa\chr15.fa
  ..\chromFa\chr16.fa
  ..\chromFa\chr17.fa
  ..\chromFa\chr18.fa
  ..\chromFa\chr19.fa
  ..\chromFa\chr20.fa
  ..\chromFa\chr21.fa
  ..\chromFa\chr22.fa
  ..\chromFa\chrX.fa
  ..\chromFa\chrY.fa
  ..\chromFa\chrM.fa
Reading reference sizes
  Time reading reference sizes: 00:02:04
Calculating joined length
Writing header
Reserving space for joined string
Could not allocate space for a joined string of 2861343702 elements.
Switching to a packed string representation.
Reading reference sizes
  Time reading reference sizes: 00:01:44
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:01:32
bmax according to bmaxDivN setting: 715335925
Using parameters --bmax 536501944 --dcv 1024
  Doing ahead-of-time memory usage test
  Ran out of memory; automatically trying more memory-economical parameters.
Using parameters --bmax 402376458 --dcv 2048
  Doing ahead-of-time memory usage test
  Ran out of memory; automatically trying more memory-economical parameters.

[...etc...]

Using parameters --bmax 42 --dcv 4096
  Doing ahead-of-time memory usage test
  Ran out of memory; automatically trying more memory-economical parameters.
Using parameters --bmax 32 --dcv 4096
  Doing ahead-of-time memory usage test
  Ran out of memory; automatically trying more memory-economical parameters.
Could not find approrpiate bmax/dcv settings for building this index.
Already tried a packed string representation.
Please try indexing this reference on a computer with more memory.
If this computer has more than 4 GB of memory, try using a 64-bit executable;
this executable is 32-bit.
Total time for call to driver() for forward index: 00:05:34
Command: bowtie-build ..\chromFa\chr1.fa,..\chromFa\chr2.fa,..\chromFa\chr3.fa,..\chromFa\chr4.fa,..\chromFa\chr5.fa,..\chromFa\chr6.fa,..\chromFa\chr
7.fa,..\chromFa\chr8.fa,..\chromFa\chr9.fa,..\chromFa\chr10.fa,..\chromFa\chr11.fa,..\chromFa\chr12.fa,..\chromFa\chr13.fa,..\chromFa\chr14.fa,..\chro
mFa\chr15.fa,..\chromFa\chr16.fa,..\chromFa\chr17.fa,..\chromFa\chr18.fa,..\chromFa\chr19.fa,..\chromFa\chr20.fa,..\chromFa\chr21.fa,..\chromFa\chr22.
fa,..\chromFa\chrX.fa,..\chromFa\chrY.fa,..\chromFa\chrM.fa big
    • I'll need to ask Chris what do to about this...

Up next:

  • Somehow, somewhere, obtain an index for Makinen's readaligner
    • Our software can use indices built by Makinen's software
  • Trace bowtie mismatch modes for strategies

06/23/10

  • Built a human index using 64-bit bowtie on skorpios
    • Took about an hour per direction (forward and reverse)
    • I couldn't include the X and Y chromosome, as /tmp ran out of space
  • While that was going on, I also started tracing bowtie in 1-mismatch mode
    • The general search has two phases, defined in search_1mm_phase1.c and search_1mm_phase2.c
      • Phase 1 tries to align it with exact alignments on the forward and reverse strand
        • If that fails, it attempts to align it allowing 1 mismatch on the 3' end on the forward and reverse strand
      • Phase two commences when both subphases of Phase 1 fail
        • Attempts to align it allowing 1 mismatch on the 5' end (via usage of the reverse index)
    • The overall strategy is the same as to what I'm doing in OneMismatchMapper
    • The actual details is slightly more cryptic, and depends on the backtrack function again (same as the one used in exact match mode)
      • The difference lies in the initialization, changing many variables, resulting in a different conditional evaluations
    • Some notes (more for myself...)
      • bt.SetOffs() in search_1mm_phase1.c sets unrevOff at line 50 and 61 to half the length of the reads
      • unrevOff cause the boolean curIsAlternative to be set to true at ebwt_searchb_backtrack.h:486 under specific conditions
      • top and bot are subsequently updated by a difference non-LF function at ebwt_searchb_backtrack.h:553 (previously at 566)
      • curIsAlternative set to true also triggers the updating of an elims array at 580 (altNum++ at 623 also plays a role)
      • When a mismatch is detected, and alternatives exist, the elim array is used to select a potential match at 745
    • Honestly, I still don't quite understand what's happening at all...

Next up:

  • Try to build a large index using readaligner
  • Continue tracing bowtie 1 mismatch mode

06/24/10

  • Attempted to build a large index using readaligner and discovered a few general guidelines, after trial and error, and staring at the memory usage of the process:
    • Given a FASTA file with multiple sequences, I assume something similar to the following happens:
      1. Each reference is inserted into the index
        • The memory required for each insertion is roughly 10 time the size of the reference to be inserted
      2. After all references are inserted, the index needs to be initialized
        • The memory required for this initialization is roughly 2 times the size of all the reference combined
      3. This entire process is repeated twice, to build the reverse index
        • The same memory requirements apply
    • My computer allows a single process up to roughly 1GB of memory (despite having 3.25GB RAM... for reasons I'm unsure of), so the largest index I've been able to build is:
      • A single FASTA file of 580MB
        • Step 2 requires about 1.2GB
      • Composed of chromosomes 14 to 20
        • Chromosome 14 is the largest reference, at 107MB, requiring 1GB in Step 1
    • Knowing this, in theory, we should be able to build in human index (3GB) with a computer that allows a process 6GB of memory
      • The largest human chromosome (chr1.fa) is 250MB, which requies 2.5 gigs of memory to insert, well below the 6GB requirement

  • Continued tracing bowtie 1 mismatch mode:
    • A bit more detail about the overall 2 phase strategy mentioned yesterday, which is more accurately described in 6 steps
      1. Try exact alignment on forward strand using forward index
      2. Try exact alignment on reverse strand using forward index
      3. Allow 1 mismatch in the 3' half on reverse strand using forward index
      4. Allow 1 mismatch in the 3' half on forward strand using forward index
      5. Allow 1 mismatch in the 5' half on reverse strand using reverse index
      6. Allow 1 mismatch in the 5' half on forward strand using reverse index
      • Step 2 and step 3 both attempt alignments on the reverse strand, much like readaligner, which I assume saves one reversal operations
      • Along the same lines though, it makes little sense why step 5 isn't on the forward strand...
      • Nonetheless, as bowtie and readaligner both use the strategy seen between steps 2 and 3, it may be something worth noting
    • I believe I have some grasp of the basic 1 mismatch flow:
      • Starting from the end of the read find the corresponding suffix ranges as usual
        • If a mismatch is encountered before the halfway point, the trace is terminated to prevent excessive backtracking, and the alignment is performed with the reverse index
      • Once past the half way point, the curIsAlternative is set to true, and 5 values are updated and tracked (ebwt_search_backtrack.h:580):
        • pairs: an array of uint32_t values. Adds 8 entries per position past the halfway point, corresponding to the top and bot values for the 4 possible bases at each position
          • As mentioned yesterday, once past the halfway point, the mapLF function (line 566) becomes replaced with mapLFEx (line 550), which determines the 8 top and bot ranges for each position (instead of 2)
          • From these pairs, by subtracting bot from respective top values, you can determine the spread at a specific position with a specific base
        • elims: an array of uint8_t flags. Each position has a 4 bit flag, corresponding to the 4 bases. When the spread of a pair range is 0, the respective elims bit is set to 1 at line 591 (as a range of 0 means that no possible suffixes exist, and is thus "eliminated" from the eligible list of potential alignments). Note that an integer value of 15 is 1111 in binary, meaning all four bases are eliminated.
        • eligibleNum is the number of eligible paths (think suffix tree branch). A path is only eligible if is has not been eliminated, and progressing down the path doesn't "blow the mismatch quality budget" (this latter point is irrelevant in in the mode I'm currently in)
        • eligibleSz is the number of eligible suffixes (think leaf nodes below the aforementioned branches)
        • altNum is the number of alternative paths. A path is alternative if it hasn't been eliminated, and isn't affected by quality. In the quality-independent mode I'm currently in, I assume this value is the same as eligibleNum
      • When a mismatch is encountered and alternatives exist (745):
        • Goes back to the last valid position (the one position right before the mismatch)
        • Using pairs and elims, randomly selects one base with a eligible path (791)
      • Once an eligible path is selected, there are two cases I've seen:
        • The mismatch occurred in the very last character of the search, and the result is returned (908)
        • The mismatch occurred before the end, and the search continues with a recursive call at 961.
          • Note that in this call, no mismatches are allowed, and is effectively the same as a call during exact match mode
          • Where exact match mode typically seeds the function with top and bot generated from the ftab chomp function, the function is seeded with a pair of values from pairs
          • The call completes the iteration across the remaining part of the read as usual

Next up:

  • Try more interesting 1 mismatch nodes, as all my current traces have only 1 alternative path
    • May be faster to create my own reference
  • Try 2/3 mismatch modes to see the full effect of pairs and elims, as their effects don't seem too impressive in 1 mismatch mode
  • Think about potential integration of similar strategies in our aligner... only if they prove useful

06/25/10

  • Built the complete human genome using skorpios overnight
    • Two 3GB references were created, one per direction using readaligner
      • Accessible at /ubc/cs/research/condon/share/projects/ngs-aligner/data/full
    • Currently attempting to align to it... although I worry how the program will load 6GB worth of structures...
      • UPDATE: The two indices have been successfully loaded into memory on skorpios... if not barely. Using top, 6017912k/6055126k memory is being used, leaving 37214k free
    • Not to mention that the input FASTQ file is 1.6 GB itself...
      • Accessible at /ubc/cs/research/condon/share/projects/ngs-aligner/data
      • UPDATE: I believe this isn't being completely loaded into memory... and being read in small bits from the disk... fortunately
    • UPDATE: The results using -k 0:
Number of reads: 9563070 (skipped first 0)
Successfully aligned: 2055421 reads (21.49 %)
Number of reported alignments: 2055421
Wall-clock time: 2489.00 seconds (0.69 hours)
Average wall-clock time per read: 0.26 milliseconds

real    135m3.209s
user    8m30.408s
sys     0m27.482s

    • Once this is confirmed working, then I'll try it on saligner...
  • Continued tracing bowtie
    • By constructing my own reference, I was able to test some more interesting 1 mismatch traces
      • When a mismatch is encountered at a position, another alternative path is selected randomly (791)
        • If the path is valid, a match is found, as described yesterday
        • If the path was invalid (i.e. another mismatch is encountered), the recursive call returns false (974 fails), the loop tries again (745) at a new position (772)
          • Using the elims array, a lot of unproductive traceback is avoided (777)
        • I need to confirm if this random choice only selects 1 base at a position... and see if it dismisses valid paths at the same position but using a different base
    • Started looking into 2 mismatch mode
      • There are 3 phases, best described as 6 cases (2 cases per phase)
        1. Forward index: match exact reads on the forward strand
        2. Forward index: match reads with 0 to 2 mismatches on the right half of reads on the reverse strand
        3. Reverse index: match reads with 1 to 2 mismatches on the right half of reads on the forward strand
        4. Reverse index: match reads with 1 to 2 mismatches on the left half of reads on the reverse strand
        5. Forward index: match reads with 1 to 2 mismatches on the left half of reads on the forward strand
        6. Forward index: match reads with 2 mismatches on different halves of reads on either strand
    • In the general cases (cases 1-5), the first mismatch causes a recursive 1 mismatch backtrace, described in detail yesterday
      • Note that each recursive layer keeps its own unique pairs and elims array
      • Due to the depth wise method it searches to the next alternative, it will sometimes prefer a 2 mismatch alignment over a 1 mismatch alignment
        • This doesn't seem to be the case when qualities are considered... but that's only in the seeded quality mode
    • I've yet to determine the exact mechanism of case 6, which I suspect will be somewhat different, as one mismatch occurs in the chomp region
  • Slightly off topic, but I was browsing around and confirmed that as of 5 months ago, bowtie was still in the midst of implementing gapped alignment support
    • Consequently, I'll either have to look into how readaligner or bwa does gapped alignments
      • readaligner is very slow though...
      • ...and bwa code is very gruesome

Next up:

  • Pray that the alignment goes well with readaligner, then repeat the process with saligner
  • Continue tracing bowtie 2 mismatch mode
  • Trace bowtie 3 mismatch mode and seeded quality mode

06/28/10

  • Starting to debug and rewrite classes to support the human genome
    • As mentioned last time, simply loading the index from file takes 90 minutes (assumed to be from mounting remove drives)
      • This can by reduced to 90 seconds by accessing the index file directly on skorpios' disk.
        • /var/tmp/ has sufficient memory and access to achieve this for all future runs
    • Previously, besides the two indices created by Makinen's FMIndex class, I saved an additional structure to disk:
      • It contained:
        • A set of original genomes for quick extraction
        • A list of lengths
        • A list of reference names
      • As readaligner doesn't build this, I had no such file for the human genome, and if I did, it would be 3GB (for a total of 9GB, over skoprios's memory)
      • I attempted to completely remove the need for this file, and was able to eliminate everything but the actual sequences themselves
        • In theory, a true extract function using the compressed index could replace it
          • However, after consulting Chris, I've decided not to pursue this option (not to mention that we'll be replacing the entire index)
        • Alternatively, I'll replace the 3GB genome with a 750MB packed DNA string
          • In the likely case where 3GB + 750MB is too much for skorpios to handle, I'll rebuild the 3GB index at a more memory friendly setting (at the cost of query speed)

Next up:

  • Implement a quick DNA bit string class to compact the genome
  • Run and debug alignments on the human index
  • Investigate bowtie mismatch modes and incorporate strategies
    • Things readily adaptable:
      • The elims and pairs system
      • The phases
    • I imagine I'll probably just copy the general flow of bowtie functions instead of reinventing something and incorporating it

06/29/10

  • Created a DnaBitstring class, which packs the 3GB reference genome into 750MB
  • Integrated the class into the code base
  • Built and saved a file of the 3GB reference
  • Currently running an alignment with saligner
    • Results will be posted later...
    • UPDATE: saligner with MismatchMapper at -k 2 on a Debug build:
      • I wasn't concerned about speed, and was more interested in whether it would crash or not... which it didn't.
Elapsed time: 13489.8 seconds
Processed   : 9563070
Average time: 0.00141062 s/read
Mapped      : 3601713
Unmapped    : 5463992
Invalid     : 497365

real    333m35.102s
user    223m31.726s
sys     1m35.954s

Next up:

  • Confirm that saligner can handle human genome alignments
  • bowtie investigation and strategy integration

06/30/10

  • Some faster alignment results on saligner with OneMismatchMapper on a Release build
Elapsed time: 1295.97 seconds
Processed   : 9563070
Average time: 0.000135518 s/read
Mapped      : 2928602
Unmapped    : 6137103
Invalid     : 497365

real    57m46.685s
user    21m16.380s
sys     0m36.042s

    • Loading the index takes 90 seconds, like readaligner, however, after loading the index, and prior to aligning, there is a roughly 30 minute "pause" (difference between real and user)
      • Staring at the memory usage and the code, I believe this 30 minutes is processing the 1.6GB FASTA file
        • After the index is loaded, 6GB of RAM and 0.4GB of the virtual memory is occupied
        • The virtual memory starts to slowly grow until reaching a max at around 2GB 15 minutes after the index is loaded (I assume this is the FASTA file)
        • There is then a 19 minute gap, with no visible change in memory usage...
        • Only half an hour after the start of the program does actual aligning begin, which lasts about 20 minutes for 10 million reads
    • Although, when I try aligning the 1.6GB file against a negligibly small reference, there is no lag
      • Instead the 1.6GB is quickly loaded into RAM and the alignment begins
      • The RAM usage gradually increases from 1.6GB until it peaks at 3.0GB, which I believe is holding the 1.4GB output file
    • From these experiments, I believe this problem shouldn't be too major if our custom index is small enough to allow both the index and all input and output files to fit on the RAM...

  • Continued to trace bowtie
    • Went through case 6 described on 06/25/10
      • It handles a "split" mismatch, where there is 1 mismatch on both ends of the read
      • Such a read usually returns an invalid range after chomp
      • The alignment starts from the very end (instead of 10 bases in), until the first mismatch is hit
      • After the mismatch is hit, a valid base is chosen (i.e. a base that has a valid range) and chomp is performed once again
        • I believe they also compute a "KmerValue" at 923
      • The alignment then proceeds as a normal 1 mismatch alignment
    • Started tracing the maq-like quality aware mode... which has 4 phases and 9 or so steps...

Next up:

  • Note to self... Try the variant "split" mismatch with mismatches occuring just outside of chomp range, but still before the halfway point...
  • Continue tracing bowtie maq-like mode
  • Implement bowtie like strategies
Topic revision: r1 - 2010-07-01 - jujubix
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback