DanielJulyArchive < BETA

07/01/10

Traced bowtie in maq-like quality aware alignments
- The maq like part with seeds and 4 phases confuses me and the results seem inconsistent compared to what's explained in the papers
  - I've decided to ignore it... for the most part, not seeing any real benefit simulating the behavior of a hash-based implementation with a succinct index
- The quality aware part was worth looking into though
  - It's a rather crude mechanism, which tracks the lowest quality seen (best candidate to swap in a mismatch)
    - When evaluating new alternatives (see previous entries), branches are only added if the quality they have is equal to the lowest quality seen
    - If the quality is less than the tracked quality, all previous branches are removed from consideration
    - If the quality is greater than the tracked quality, the branch is completely ignored
  - When a mismatch is encountered, there are a few cases:
    - Only 1 valid path exists (i.e. only 1 position has the lowest quality)
      - The backtrack jumps straight to the position (marked in special buffers), replaces the base, and continues with a recursive call as usual
    - When multiple paths exists (i.e. multiple position have the same lowest quality)
      - A loop will iterate backwards until it hits the first position with the lowest quality and attempt to align the rest after replacing the base
    - Due to this, I've seen many cases where the algorithm prefers a 2 mismatched alignment when a 1 mismatch solution exists... (then again, maybe it's the Maq-like behavior causing it)

Some bowtie times on default setting (which I believe is -n 3) for comparison... well, more out of curiosity, as I didn't really set equivalent settings in the following tests:
- Note the very fast index loading time and higher hit percentage... not to mention similar alignment times despite more mismatches

Time loading forward index: 00:00:03
Time loading mirror index: 00:00:01
Seeded quality full-index search: 00:19:26
# reads processed: 9563070
# reads with at least one reported alignment: 4836833 (50.58%)
# reads that failed to align: 4726237 (49.42%)
Reported 4836833 alignments to 1 output stream(s)
Time searching: 00:19:30
Overall time: 00:19:30

bwa times on the same data set... the actual alignment, followed by converting to SAM
- Unable to confirm how many successfully mapped reads, but these alignments include indels

jujubix@skorpios:bwa-0.5.8a$ time ./bwa aln -f /var/tmp/kdecache-jujubix/ngsa_temp/bwa_output.sai /var/tmp/kdecache-jujubix/ngsa_temp/small.fa /var/tmp/kdecache-jujubix/ngsa_temp/SRR002807.fastq
real    40m45.378s
user    40m36.156s
sys     0m4.148s

jujubix@skorpios:bwa-0.5.8a$ time ./bwa samse -f /var/tmp/kdecache-jujubix/ngsa_temp/bwa_output.sam /var/tmp/kdecache-jujubix/ngsa_temp/small.fa /var/tmp/kdecache-jujubix/ngsa_temp/bwa_output.sai /var/tmp/kdecache-jujubix/ngsa_temp/SRR002807.fastq
real    2m26.611s
user    0m47.927s
sys     1m8.752s

Next up:

Implement bowtie strategies in aligner

07/02/10

Started implementing a mapper based off of bowtie 1 mismatch mode (see 06/24/10):

Originally wanted to make a backtrack function like the one in bowtie from scratch, but the thought of it was slightly daunting
Instead, I will takes things in baby steps, building on previous aligners and potentially working my way to the full recursive backtrack in bowtie
- I've begun to implemented a 1 mismatch mapper that uses the jump table, and will follow a similar 6 step strategy described on 06/24/10
- I plan on incorporating quality awareness, either as a difference function or a switch of some sort...
I'm also relieved to see that the jump table seems to be constant in size, so I can use the jump table with the human genome
- I had to modify the table a bit to handle queries with both the forward and reverse index

Next up:

Continue implementing bowtie k 1 mapper
- Include quality awareness after basic flow is complete
Make function recursive to handle more than 2 mismatches

07/05/10

Partway through the new bowtie-like mapper:

Tracks pairs and elims
Essentially a combination of two previously implemented mappers: OneMismatchMapper and JumpStartExactMapper
I haven't added quality awareness yet... just to keep things simpler

Next up:

Finished implementing the 1 mismatch bowtie like mapper
Add quality awareness to mapper
Implement the multi mismatch bowtie like mapper

07/06/10

Still implementing bowtie-like mapper:

Finished core logic of bowtie-like mapper, the recursive-in-theory backtracking locate function:
- Confirmed correctness with forward index
- Confirmed incorrectness with reverse index
- Recursive in theory... as I haven't tried it out yet (I instead call a simplified version of the function, instead of the function itself)... but it should work, in theory

Next up:

Complete bowtie-like mapper by getting the reverse index to work
Try having mappers output, instead of drivers
Confirm correct recursion of function
- In theory, with a small change, I should be able to support any number of mismatches...
- Essentially, bowtie keeps track of 3 "halfway" values, after which mismatches are allows
  - After every mismatch is encountered, the halfway values are updated by taking the value of the one after it
  - In theory then, instead of tracking 3 values, tracking a vector of values will allow more than 3 mismatches
Add quality-awareness

07/07/10

Completed the bowtie-like mapper, and went through a lot of debugging:

All 6 steps work as intended
- After a lot of small bug fixes and going back to confirm things in bowtie
- I may still have to do some changes to ensure that hits returned by one step aren't duplicated in another...
Currently, the mapper returns all possible results
- I still need to change some things so it caps at max_result
- This also raised the question of whether I would return 1 mismatch hits when 0 mismatch hits exist
  - I decided to follow the -a behavior in bowtie, which answers yes to the above question
  - This also explains A LOT of the conditionals bowtie does, which essentially allow the fine control of the type of results returned

Next up:

Make modifications so mapper returns up to max_result number of hits
Try having mappers output, instead of drivers
Make a recursive version of the mapper
Make a quality-aware version of the recursive mapper
Something to try: Quick hamming distance check when only 1 suffix exists in the range

07/08/10

Finished implementing a fully working of the bowtie-like mapper:

Fixed a bug... which would occur randomly, due to a time-seeded random number generator... (I completely missed this in last nights testing)
Made a few modifications:
- So it would obey max_results
- To improve performance (see below)
Ran it on various sets of data and recorded some crude times below...

Some stats for comparison sake on my computer, using multiple.fna as the reference on e_coli_125000_1.fq with saligner with various mappers:

NOTE: the following times are extremely crude, and are typically the lowest time seen after a few trials... they are more for comparison sake rather than benchmarking.

`max_mismatches`	Mapper	Time (s)		note
`max_mismatches`	Mapper	`max_result = 1`	`max_result = 10`	note
0	`bowtie_k1`	4.093	4.594	effectively equal to `jump_start`
1	`bowtie_k1`	4.688	15.781	note large change...
1	`bowtie_k1`	14.828		same observed at `max_result = 2`
1	`one_mismatch`	7.095	8.766	equivalent to `bowtie_k1` with 1 mismatch without jump table
0	`mismatch`	8.579	9.297	`readaligner` implementation
1	`mismatch`	10.704	11.985	`readaligner` implementation
0	`exact`	5.047	5.734	equivalent to `jump_start` without jump table
0	`jump_start`	4.219	4.608

The large jump in runtime by changing max_result in the bowtie like mapper is due to the following:

The algorithm is 6 steps:
- The first two are very fast, searching for exact hits on the forward and reverse
- The latter four are relatively slower, searching for all possible single mismatches (incidentally, these are skipped when max_mismatches is set to 0)
When max_results = 1, nearly all reads are taken care of by the first two
When max_results > 1, nearly all reads will continue to seek non-existant mismatch hits when they don't exist
- Non-existant mismatch hits are even worse than existant ones, as existing ones will typically terminate the search early, while a search for a non-existant match will continue until all latter four steps are exhausted

Based on this theory, I ran some tests on bowtie to see if it had the same behavior:

`max_mismatches`	Time (s)			note
`max_mismatches`	`max_result = 1`	`max_result = 2`	`max_result = 10`	note
0	2.694	2.765	2.890	difference arises from the fact that some reads terminate at step 1, while other proceed to step 2
1	2.968	6.3227	6.733	relatively large jump also observed when `max_results > 1`

To eliminate this in bowtie, one can specify a strata option, which doesn't search for mismatch hits when exact hits are found (mode doesn't work with max_results = 1):

`max_mismatches`	Time (s)		note
`max_mismatches`	`max_result = 2`	`max_result = 10`	note
0	3.077	3.249	oddly, time increases even with exact hits... so it must by doing something additional
1	3.702	3.952	will not search for mismatches when exact hits are found... reducing run time

I inserted a conditional boolean check (report_mismatch_with_exacts_) between the exact search steps and mismatch search steps to yield the following results with my bowtie-like mapper:

`max_mismatches`	Time (s)
`max_mismatches`	`max_result = 1`	`max_result = 2`	`max_result = 10`
0	4.016	4.156	4.531
1	4.047	4.172	4.656

The large gap has been eliminated, at the expense of not reporting mismatch hits when to do in fact exist
Both times roughly equivalent to searching for exact hits only
I have yet to confirm this, but I believe that all other test mappers should have strata behavior... or else some very magical method to search for mismatches

And finally... just for fun, I ran bowtie and saligner at equivalent settings for strata, mismatch = 1 and max_results = 10 on the 2.6 mil reads set against the multiple reference:

bowtie: 83 seconds
saligner: 96 seconds
- I'm surprised it's this close... but again, these are extremely crude quick trials on my 32 bit computer, and bowtie slows down significant with strata mode... for reasons unknown

The same settings, but without strata and setting max_results = 1:

bowtie: 62 seconds
saligner: 94 seconds

Next up:

Try having mappers output instead of drivers
- I imagine there will be two improvements
  - A lot of function signatures will be shortened, as results variables not longer have to be propagated back to the top caller
  - Some performance benefits, as results no longer have to be converted and stored intermediately (i.e. pushed onto vectors), and instead are used and discarded immediately after they're computed
Make bowtie like mapper recursive and handle multiple mismatches
Add quality awareness
About that hamming distance thing when top equals bot...
- The quick suffix extraction and hamming distance computation will return whether its a hit or miss...
  - But after finding its a hit, how do we compute the final top and bot values required to extract the position of the final hit...? Or is it only good to quickly reject misses... Need to ask Chris someday...

07/09/10

Finished modifying the bowtie-like mapper so it outputs in the mapper instead of the driver.

Firstly, I realized that I previously chose to have drivers output, since this was key for handling pair ends...
- I match reads up as a post-process, so results of individual alignments had to be passed back up to the driver
- To see how other aligners solve this, I did some looking and tracing:
  - readaligner doesn't have this problem... as it doesn't support pair-end reads
  - bowtie uses a completely different algorithm to handle pair-ends... so that didn't help either
The change was actually a lot less destructive than I thought:
- I kept the same interface, and instead simply ignored the result vector
- Added a new function in the mapper class so it would accept an output reference, instead of changing the constructor
I soon realized that instead of passing values back up to the caller, I was instead forced to pass values into callees (as the output function was at the top of the call stack, instead of the bottom)
- I chose to hold values in class variables... taking into account of the recursion that may come soon
- Although this made quite a few const functions break...
Long story short, I believe I may have shaved about 10% off the runtime when testing with the 2.6 mil read set
- However, during this implementation, I realized by implementation of strata was incorrect and the times changed quite a bit after fixing, so I'll need to redo a few tests with the older implementation

Next up:

Compare times of implementation and possible some formal benchmarking on skorpios
Make recursive mapper
Add quality awareness

07/12/10

Struck by a flood of bugs while trying to setup things for comparison tests...

Modified a single line (dividing exact and inexact matching logic) in the original version (output in driver) to make the strata behavior correct...
- In the old version, inexact matching would never occur. It has been corrected so it occurs if no exact matches have been found, but is skipped when exact matches exist
Unfortunately, there was a memory leak inside the inexact matching logic (I never caught this in past trials, as the input size was small)
Using valgrind, I was able to track down the allocations that were never freed... and freed them
That eliminated the leak, but in turn, revealed that freed memory was still being accessed illegally
- These accesses are actually fine most of the time, except in rare cases where key values (e.g. array index) are changed (yet to determine cause... but I suspect it's being illegally overwritten) and seg faults occur
- It turns out these rare cases are a combination of the actual read, and the random number generator... so I had a hard time replicating the error consistently at first
I have inserted a constant in place of the random number to make debugging better, and I currently have an input file consisting of the single read...
valgrind has given be the exact lines where the erroneous access occurs, but rather than being a bad array index, I still can't quite characterize the exact cause of the problem...

Next up:

Squash bug
Recommence comparison tests between outputting in the driver vs mapper
Make recursive mapper
Add quality awareness

07/13/10

Resolved bugs in the driver output implementation
- Through a somewhat trial-and-error-ish method, I have resolved all known bugs in the current bowtie-like mapper
- I assume that the memory bug was caused when a certain chunk of memory wasn't being freed consistently (depending on how the function exited)
  - Of course, this still doesn't quite explain how it's tied to the RNG, and purposely trying to create the bug was unsuccessful
- Confirmed that the output of selected reads matched bowtie with and without strata mode on
  - One read had multiple mismatch hits, and should return hits with and without strata
  - The other read had multiple exact and mismatch hits, and should only return mismatched hits with strata off
- UPDATE: No... apparently there's still something wrong with the output
  - I'm seeing mismatched hits when I set mismatch to 0
Output comparison
- Ran some preliminary tests which showed dramatic improvement in runs with strata off
- This prompted me to confirm that the results were the same for both methods of output
- They weren't... the output in mapper was missing some reads...
- I've also discovered a few other bugs (e.g. incorrect CIGAR) in strata off with the mapper output implementation
Thought about "that hamming distance thing when top equals bot"...
- I haven't tested this out yet, but I believe that to extract the proper suffix corresponding to the entire query, we would need specific sp and ep for the entire sequence, which we cannot obtain unless we iterate through the entire sequence in the first place, making this strategy impractical.
- I'll need to play around with the getSuffix function in the FM Index to confirm this
- Also of concern is the implementation of so said getSuffix, which may simply be yet another loop, which may make this strategy (assuming it works) not that effective

Next up:

Resolve logic bugs in mapper output implementation UPDATE: and driver output implementation
Compare implementations
Make bowtie mapper recursive
Add quality awareness

07/14/10

Resolved all known bugs in both output versions (to replace values of the bowtie like mapper in 07/08/10):

Apparently the bug I saw that caused those UPDATE addendums were simply because I was staring the the wrong output file...
The output in mapper had a few bugs and inconsistencies with the output of the original version:
- It didn't output unmapped reads
- It didn't output invalid reads (with N)
- All mismatched reads had incorrect CIGARs
- Mismatched reads would sometimes not output
These have been resolved, and the output of the two versions were confirmed to be identical using diff on the 125000 read set
- This isn't exhaustive though... so I'm not too sure if they're still identical for all possible settings
- Also, the output in mapper doesn't append .# after multimapped reads, as it cannot tell if a read is multimapped during its output...
  - I had to use a quick regex search and replace to rid those to make the two files truly identical

Redid previous time trials on the correct bowtie-like mapper with both version:

Reference: multiple.fna
Reads: e_coli_125000_1.fq
On my computer via saligner... 8deba for output in driver and 7f4aab for output in mapper
Parameters adjusted by editing the saligner main and recompiling
All times are averages (of 4 to 5 trials) taken from the Elapsed time of the program output
strata means that mismatches are not considered when exact matches occur

Output in Driver
		Time (s)
`max_mismatch`		1	0	1	0	1	0
`max_result`		10	10	2	2	1	1
`strata`	`false`	15.184	3.973	14.238	3.866	4.441	3.602
`strata`	`true`	4.922	3.957	5.805	3.856	4.531	3.762

Output in Mapper
		Time (s)
`max_mismatch`		1	0	1	0	1	0
`max_result`		10	10	2	2	1	1
`strata`	`false`	12.314	4.147	11.669	3.947	4.294	3.763
`strata`	`true`	4.644	4.047	4.485	3.972	4.197	3.778

Some observations:

Concerning strata:
- When max mismatch is 0, strata makes little difference
- When max mismatch is 1, strata makes a big difference when max results > 1
Concerning output method:
- Output in mapper is slower when max mismatch is 0
- Output in mapper is faster when max mismatch is 1
- Output in mapper really shines when strata is off (i.e. mismatches are allowed when exacts are found)
  - Especially evident when strata is off and max results is high

Some follow-up trials to confirm the above output observations:

Same setup, but with the 2.6 mil read set
All trials run with strata on (ignore mismatches when exacts are found)
- I believe the advantage of output in mapper when strata is off is highly obvious from previous trials
All times are single runs instead of averages

		Time (s)
	`max_mismatch`	1	0	1	0
	`max_results`	1	1	10	10
Output	driver	93.719	78.313	104.406	84.750
Output	mapper	91.328	80.031	98.594	87.812

Some conclusions about output:

Same trends are the smaller trials, although the differences are less than 5% either way...
I actually thought it would be faster with a high max results regardless of mismatch count
- Since I tweaked the logic, so a search would terminate as soon as possible...
- While output in driver typically completes the search, returns all the results, and filters them for output afterwards...

Overall then, there is a clear advantage of output in mapper (at least the way I implemented it) in the bowtie-like mapper when strata is off and multiple reads are required...

However, I'm not too sure about how common this setting is...
Also, I'm sure my implementation (for both types of output) can be improved, so the those very small differences may simply be due to sloppy implementation (e.g. not reserving space, passing by reference, making things const)

For now, I'll stick with the original output in driver, as there's no clear speed advantage in the common case, and it'll save me the trouble of converting all the other classes. On the other hand, this also provides some clues as to where output in driver is slowing down...

Next up:

Make recursive mapper (note: not make mapper recursive)
Make quality aware mapper

07/15/10

Made mapper recursive:

Relative simple... although it still only supports 1 mismatch at maximum

Did some bowtie traces to confirm how 2 and 3 mismatch modes were done:

2 mismatch mode was described on 06/25/10 and is relatively straight forward
3 mismatch mode actually uses a completely different algorithm (the same one used for pair ends...)
- This is supposedly "much faster", and controlled by a boolean at ebwt_search.cpp:872
I commented out this boolean, and confirmed that 3 mismatch was possible with the six step 2 mismatch model, as follows:

Step	Index	Strand	Mismatches
1	fw	fw	none
2	fw	rc	0 to 3 on right
3	rv	fw	1 to 3 on right
4	rv	rc	1 to 3 on left
5	fw	fw	1 to 3 on left
6	?	both	mismatches on both sides

It becomes quite apparent that this strategy only works well for mismatches less than 2, and becomes highly inefficient once mismatches reach 3, as most cases fall into the final the most complicated case (not to mention that getting a match in case 6 requires failing in the first 5 beforehand)
It would also be hard to adopt this model for an arbitrary number of mismatches, as the jump table and strategies to prevent excessive backtracking probably don't do well for anything more than 2 mismatches

Next up:

Finish recursive 2/3 mismatch mapper
Add quality awareness

07/16/10

Implementing bowtie like 2 mismatch mapper:

Modified the k1_mapper to support cases 1 to 5 mentioned above for 2 mismatches.
- Relative smoothly, due to the recursive work done yesterday
- Most of the work went into tracking the actual position and bases of mismatches
  - Required yet another specialized SamEntry generator which should support an arbitrary number of mismatches
Traced bowtie to confirm how the last case that handles split mismatches works:
- There are two general cases:
  1. Mismatch on one half within jump zone, mismatch on other half
    - After finding the first mismatch, use the jump table and continue as usual allowing 1 mismatch across halfway point
  2. Mismatch on one half outside of jump zone (but before halfway point), mismatch on other half
    - After finding the first mismatch, do not jump and continue as usual allowing 1 mismatch across halfway point

Next up:

Finish final step of 2 mismatch mapper
Modify mapper to support 3 mismatches
Add quality awareness

07/19/10

Finished 2/3 mismatch mismatch mapper:

Has a 10 step process:
- Steps 1 to 2 handle exacts
  - 0 mismatch on forward strand
  - 0 mismatch on reverse strand
- Steps 3 to 6 handle all possible 1 mismatch cases
  - 1-3 mismatch on left half of forward strand
  - 1-3 mismatch on right half of reverse strand
  - 1-3 mismatch on left half of reverse strand
  - 1-3 mismatch on right half of forward strand
- Steps 7 to 8 handle all remaining 2 mismatch cases
  - 1 mismatch on right half, 1-2 mismatches on left half of forward strand
  - 1 mismatch on left half, 1-2 mismatches on right half of reverse strand
- Steps 9 and 10 handle all remaining 3 mismatch cases
  - 1 mismatch on left half, 2 mismatches on right half of forwards strand
  - 2 mismatch on right half, 1 mismatch on left half of reverse strand
This is a somewhat hybrid version of bowtie's cases, and up to 2 mismatches, the number of DPS backtracks are roughly equivalent
For 3 mismatches, it becomes increasing inefficient, and bowtie switches to a completely different algorithm

Next up:

Confirm correctness of mapper
Make quality aware mapper
Pair end...? Human genome...? Bowtie's other algorithm? Gapped alignments?

07/20/10

Tested the correctness of the 2/3 mismatch mapper, which went a lot smoother than I thought...

Created a correctness_test.fq file of reads that covers most general and edge cases up to 3 mismatches
Discovered and resolved an edge case bug
Committed and pushed to main repo

Some times on skorpios, writing to /var/tmp/ with the 2.6 mil read set on multiple reference, max_results set to 1 (making the state of strata irrelevant):

Mismatches	3	2	1	0
Time (s)	111	89	61	50

Compared to the results here, we have improved in terms of running time and ratios, but both versions of bowtie still outpace us considerably...

I also compared the actual number of mapped and unmapped reads, and bowtie has the exact same number of exact hits, but slightly more on all mismatch hits
- I believe this is because bowtie will align some reads with N, while we completely ignore them and treat them as invalid

Started new mapper based off the above mapper that incorporates quality-awareness:

The overall algorithm and recursive logic is identical, which the sole difference being the order of the backtrack
- In the previous version, it was strictly DPS (favouring the last valid branch)
- With quality-awareness, it favours the last valid branch with the lowest quality
This strategy assumes that there is some dependency between mismatches and the quality of the base...
I reviewed the bowtie code a bit, and the addition should be relatively straight forward

Next up:

Finish quality aware mapper
Pair end...? Human genome...? Bowtie's other algorithm (for > 3 mismatches)? Gapped alignments?
- Personally, I feel the most urgent is gapped alignments followed by pair end
- While both of these have been implemented previously, they are older implementations which may not be fast enough
  - Bluntly put, I haven't the slightly clue how to do gapped alignments efficiently with a succinct index
    - bowtie doesn't support gapped alignments yet...
    - The one from readaligner is horrendously slow
    - I believe bwa may have some algorithm in there, which is supposedly a heuristic deviation of the SW...
- Only after these two are completed, do we have a minimal aligner for actual human sized data
  - And only after confirming that the results display in tview

07/21/10

Meeting today:

Updated progress with Anne
- I'm scheduled to have a talk about this project in 6 weeks
Discussed overall picture with Chris
1. Quality (both using and producing)
  - Using quality: my current work in progress
  - Producing quality... I can probably make something up along the way one day...
2. FM Index (Jay's work in progress)
3. Strategies (my next work in progress)
4. Maq-like
  - Hash-based strategies
  - Modification of pre-existing kmer and packstring code
- We may not have big guns, but we have many guns... it'll be like the swiss army knife of aligners!

Next up:

See yesterday

07/22/10

Finished quality-aware mapper... but hit a rather nasty bug during testing:

Lots of small bugs were found, and fixed
- Everything looked fine with small data sets...
An odd bug was discovered using the 125000 set
- It causes an assertion to fail
- It was rather difficult to pin down, as I believe it depends not on the read per se, also what was previously existing in memory (so previously aligned reads)
- Other than that, I have a slightly idea of where to begin debugging it...
  - Add an edge case, when the mismatched position is the first position of the recursive call (although I'm pretty sure I caught this... or else my correctness test should fail, nonetheless, check it out)
  - I suspect the parameters used when then recursive call was made may be problematic (potentially a bad low_alt_qual? Then again... I'm pretty sure I sorted these out too...)

Next up:

Squash bug in quality mapper...
Look into creating an efficient 1 gap mapper... possible looking at bwa code
Confirm that pair end still works
Confirm that hash based mappers and indexes still work

07/23/10

Resolved all known bugs in quality mapper:

The odd bug was caused by bad loop termination (I was looping over values I shouldn't have)...
Fixing it revealed a few other bugs
- One was a bad edge case in the mapper
- Another was a bad edge case in the FASTQ quality parser (which truncated the quality if it didn't end with '\n', which occasionally occurs at the end of the file...)
Fixed all issues, committed and pushed, and ran some trials on skorpios for comparison
- Times are trivially slower...
  - Which I guess isn't too bad, considering the increase in code

Mapper	Mismatches	3	2	1	0
Recursive	Time (s)	111	89	61	50
Quality	Time (s)	112	92	64	51

Started pondering about what to do next:

Checked out a bit of bwa code, although I'll probably have to run it through callgrind to determine the key functions
Went over the bwa paper, which has some cryptic pseudocode of the algorithm... and claims that the addition of two lines makes the algorithm support insertions and deletions
Found a highly illuminating and somewhat daunting paper on the current status of aligner algorithms
- Shows that I should probably give a bit more thought into supporting gaps
- Shows that my qmer classes are far from complete
- Shows that I may need to create a new FM Index based BFS algorithm... (similar to bowtie in --best mode, or bwa)

Up next:

Trace bwa to develop gapped alignments and possible BFS alignments
Ensure kmer classes still work
- Improve them...

07/26/10

Started to trace bwa:

Compared to bowtie, the code analysis was both easier and harder...
- Easier as its in C, so the entire program was extremely procedural and easy to follow, compared to bowtie, which had a bunch of classes and factories specific to the alignment mode specified
- The main algorithm is shorter... which is always a good thing
- Harder, from the complete lack of comments and ambiguous variable names (top and bot are known as k and l...)
The main algorithm is an iterative stack-based BFS of the FM index located at bwtgap.c:104 (160 lines, if you exclude the inlined functions, compared to bowtie, which was around 600)
Overall, the backwards search algorithm is the same (fortunately, there's only so much you can do with an FM index)
A few unique and adaptable features are:
- The width array, which is constructed for every sequence, prior to calling the actual alignment function, and serves to restrict the search space (described as CalculateD in the paper)
- The BFS search facilitated by the stack
  - Stack entries contain top, bot, and pos... to name a few expected values (defined in bwtgap.h as a struct)
- A stateful method of handling insertions, deletions, and extensions of both

Next up:

Continue tracing bwa, by feeding it specific reads to see indel and mismatch mechanisms at work
Create bwa-like mapper
Pair-end and kmer

07/27/10

Continued tracing bwa:

Discovered that the stack wasn't a stack... but an array of stacks
- The number of stacks in the array is determined ahead of time, and is the maximum penalty possible with the given penalty scores and allowable differences
  - On that note, the stateful method of processing indels and mismatches means that a SW-like score can be calculated (constant, linear and affined are all possible)
- Every alignment keeps track of its own score, and is pushed onto the stack containing alignments of identical scores
  - For example, perfect alignments are pushed onto stack[0] and alignments with 1 mismatch (-3 score) are pushed onto stack[3]
- Analogous to the elim array in bowtie, only alignments with valid ranges (sp < ep) are pushed onto their respective stacks
  - Key values of a stack item:
    - Score, position, and strand (uint32_t)
    - Number of mismatches, gaps, extensions, the state and mismatches in seed (uint32_t)
    - sp and ep
    - The last position with a difference
- The pop operation returns the top item of the lowest stack
  - This is the key to obtaining a BFS
  - It also confused me at first... as I assumed the stack was a simple stack... and was bewildered when something from the bottom of the stack was suddenly returned
Traced multiple varying mismatch cases
- Unlike bowtie, it doesn't prevent excessive backtracking by using the reverse index
  - It does however, still use the reverse index:
    - It aligns the reversed forward strand using the reverse index
    - ...and the reverse complement strand using the forward index
      - This allows the algorithm to access the correct positions using the same code...
- Both forward and reverse strand alignments are done in one function call (both are pushed onto the stack at the beginning)
- Instead of using some higher level strategy, the array of stack ensures that lower mismatch alignments are attempted and returned prior to higher mismatch count alignments
Tracked down and traced most of the terminating cases:
- 143 break: when no other alignments of less than or equal score exists
  - Typically occurs when the second to best stack has been exhausted
- 147 continue: when the number of differences has been exceeded
  - I've never seen this occur...
- 155 continue: when the width array shows that it is impossible to obtain a valid alignment within the number of allowable differences
  - Catches a majority of cases and limits the search space
- 162 continue: when an exact alignment fails
  - Analogous to the typical case in bowtie
Started tracing indel reads... and they appear anything but efficient...

Next up:

Continue tracing indel reads in bwa
Create a bwa-like mapper
Pairend and Kmer support

07/28/10

Finished tracing bwa:

Indels were relatively straight forward in theory within the stack-based BFS algorithm:
- At every position, multiple possible alignments are pushed on the stack
  - 1 read insertion (+ gap opening penalty): pushed onto the stack without updating new top and bot values but decrementing the position
  - 4 read deletions (+ gap opening penalty): one for each possible base, pushed onto the stack with an updated top and bot range, without decrementing the position
    - i.e. the next iteration reads the same position once again
  - A gap extension is possible if the alignment popped of the stack was an indel
    - The alignments pushed onto the stack for gap extensions are identical to gap opening insertions or deletions
  - 4 read mismatches: one for each base (will include the exact match, if one exists), updating the top and bot and decrementing the position
    - or 1 exact match match
- In the next iteration, an item is popped off the stack:
  - The item returned is dependent on the penalty scores used, but typically exacts are preferred over mismatches which are preferred over indels
- The algorithm repeats until the stack runs out or one the terminating case (see yesterday) is triggered, not when the first hit is found (to ensure that the best hit is returned)
Seeing this, it also becomes somewhat apparent that a DFS algorithm with indels and mismatches would probably be quite inefficient due to the number of potential paths possible

Started implementing the bwa like algorithm:

The algorithm will probably be a straight forward function
The stack needs to be built
- Rather than building and debugging my own, I will probably copy the code directly from bwa

Also discussed the possibility of an half-index mode with the bowtie-like mapper:

The algorithm is expected to run faster if we only use a single index in cache, instead of using both forward and reverse simultaneously
We can rearrange the algorithm to find all possible hits with the forward index, then swap in the reverse index and handle all remaining reads

Next up:

Implement the bwa like mapper
Pairend and Kmer and the half-index mode

07/29/10

Started implementing the bwa-like mapper:

Created a PriorityStack class by copying and modifying the code of the array of stacks mentioned 2 days ago
- Tested and confirmed the class was working as expected
Starting creating the BwaMapper class
- Currently, the Map() function will not be const, as it heavily modifies the internal PriorityStack
  - There are two solutions to this:
    - The first is to create and destroy the priority stack within the map function, but this probably a very bad idea due to the intense memory allocation required
    - The second is to pass in the priority stack from the driver into the mapper... (I'll probably modify the class tomorrow to accommodate this)

Next up:

Continue implementing the bwa mapper:
Pairend and Kmer and the half-index mode

07/30/10

Finished implementing the bwa-like mapper:

It compiles and runs and spits out SAM results... but the results are all wrong >_<
Also, besides having completely incorrect positions, the CIGAR and MD fields are currently dummy values
- I'll need to develop some efficient method of keeping track of the CIGAR during the algorithm
- The alternative is to generate the CIGAR afterwards... but this is rather inefficient

Next up:

Debug BWA mapper
- After it works properly, add proper SAM entry support
Pairend, kmer and half-index

Raw edit | More topic actions

Topic revision: r1 - 2010-08-04 - jujubix