JayMayArchive < BETA

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:

Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Had a meeting with Daniel and Chris where we discussed my first task:
- First, I should make or beef up the existing test suite.
- Then, I'm to make a vectorized version of the Smith-Waterman aligner that Daniel has working.
- I should also familiarize myself with existing tools (e.g. Bowtie, BWA, Mosaik, and SamTools).
I ended up spending a lot of time reading through the various guides Daniel posted, as well as the Git guide and the Google C++ style guide to get myself more familiarized with what I should be doing and how I should be doing it.
I also spent some time reading up on the Smith-Waterman algorithm and reading up on sequence alignment in general to get myself a little more familiar with the project.
Checked out Daniel's existing code, especially local_aligner.cc bit (which I'm assuming my vectorized SW will be based a lot off of). I can understand the majority of the code, although I'm a little iffy on some of the gap penalty calculations.

So after all that, I think I'm finally ready to do a little coding. Today, I'll mostly be working on the test suite and trying to get that out of the way.

End-of-the-day post:

I've looked through the local aligner source code in a bit more detail and I think I know my way around it pretty well now. I'm still a little unclear on the free gap policies (I understand the premise and a little of how it works, but I still need to wrap my head around some of the cases in the algorithm), but Chris told me not to worry too much about it, as it's not a priority for now.

I've managed to modify Daniel's test/correctness_test.cc so that it works on the existing code, and all the test cases pass. The test cases Daniel wrote were pretty complete but only tested the ScoreOnlyAlignment method in the local aligner. So next up will be writing some test cases for the FullAlignment, which may be a little harder since it doesn't just output numbers. I've written a few tests so far, and I think I may have found one or two bugs (but I'm going to have to look into that some more to see if it was my test case or the code that was failing). So tomorrow, I will mostly be doing...

Finishing up the test cases for the FullAlignment method.
Fixing any bugs that might come up along the way.
If I get a chance...maybe even start looking into the vectorized methods!

05/07/10

Finished writing the test cases for the local aligner and corrected a very minor bug I found along the way. I mostly used Daniel's test cases for the ScoreOnlyAlignment method, but added the FullAlignment method to the tests. However, I didn't manage to get middle-of-sequence mismatch and gap tests working, since there are so many different possibilities for the alignment to turn out, and it depends very heavily on the scoring method (and the values of the different penalties) as well.

Chris also gave me the source code for a vectorized Smith-Waterman function (from the SHRiMP library) and I spent the rest of the afternoon looking at that and trying to understand it. I found the documentation for the SSE2 functions on the Intel website (link to come later, not on the right now). From my understanding (and Chris' explanation), the function goes through the matrix using a diagonal vector ("positive slope") operating on 8*16bit values, and calculates the highest score from that vector. It uses the affine gap penalty method to calculate scores. From my discussion with Chris, I should now be working on:

Integrating the vectorized Smith-Waterman function into the NGSA library (I can either try using the source code given or make my own).
Configuring CMake so that it detects support for SSE2 (and maybe SSE3) so that it will compile either the vectorized Smith-Waterman or non-vectorized Smith-Waterman, depending on the machine.
Take a look at the banded Smith-Waterman function and try integrating that, too, perhaps. This function only checks a "strip" of cells going down the diagonal of the score matrix, since in practice, alignments aren't as useful when there are large variations from a full match.
Finally, in the distant future, start thinking about the free gap algorithms and vectorizing them.

05/10/10

I spent the morning trying to integrate the SHRiMP vectorized Smith-Waterman code into a new local aligner class, but I couldn't get it to pass the test cases and kept getting segmentation faults, etc. So, I decided it was more trouble than it was worth and to try building my own version of the vectorized Smith-Waterman directly. I also had a lot of trouble getting CMake to do the right thing, but I later figured out it was because I forgot to use the -msse2 -msse3 compiler flags.

I'm about a quarter complete on a score-only vectorized version for the local aligner, and I've got a basic game plan laid out for completing it; I've also found that the best way to understand some code is to try to code it yourself! I now have a much firmer grasp on the whole SHRiMP aligner and on why it does some of the things it does (for example, one question I had was why it stored values for b_gap, but no a_gap array). I think by tomorrow, I can:

Get the score-only vectorized Smith-Waterman finished and hopefully tested out.
Time permitting, get started on the full alignment with traceback.
Read up on CMake a bit more; it's still giving me a strange amount of trouble...

05/11/10

Today felt pretty pointless. I managed to finish my score-only Smith-Waterman alignment, but it ended up failing a bunch of tests. So I took a look at the SHRiMP code again and made a few changes based on that code to iron out the bugs. In the end, my code ended up looking almost exactly like the SHRiMP code, and yet was still buggy, so I ended up just copying and pasting the code into my version. I decided to cut and paste the SHRiMP code into my own class for a couple of reasons:

I'll have to use that code again, but modified when I do the full alignment (with traceback).
The SHRiMP code required a call to initialization, which was troublesome and not really required given the way the NGSA LocalAligner works.
Since SHRiMP was in C, I couldn't wrap their functions in namespaces.

I'll still have to make sure if my decision was correct with Chris, though.

The good news is, I got all the tests to pass on the score-only alignment. Now all I have to do is get a full alignment working. I also want to try to optimize for the linear gap penalty method, since that case is much simpler than the affine gap penalty. Although the linear case is just affine with a gap open penalty set to 0, I think I can do away with some of the vectors in the calculation, and so make it a little faster. That'll be a low priority, though. So for tomorrow, I think I'll be able to finish:

The full Smith-Waterman alignment (vectorized)
Hopefully get that to pass all the tests (i.e. make the behaviour the same as the non-vectorized version)
Take a look at optimizing for linear gap penalty if I have time.

05/12/10

Chris told me that the full alignment was a lower priority, since the score-only alignment is used much more often. So, I've tried optimizing for linear gap penalty instead. I managed to improve performance by not doing a few calculations required in the affine and also improve the memory usage slightly (by not having to use one genome-length-size array).

I did a few preliminary tests, using a randomly generated genome/read sequences of length 1000 each. Over 1000 iterations, my results were:

38.836576365 seconds using affine gap penalty
28.907083597 seconds using linear gap penalty

So I get a roughly 25% decrease in calculation time. Tests using more/fewer iterations and using different genome/read lengths show more or less the same amount.

I also finished a very preliminary version of the banded algorithm (non-vectorized), though I haven't tested or even compiled it yet. It's basically just a recoding of the original Smith-Waterman aligner in that it goes through the matrix left-to-right and top-to-bottom, but instead of starting from x=0 and going to x=read length, it starts and ends based on boundaries defined by two diagonals going through the matrix. I'm not too happy about this arrangement right now, since if I were to do a full alignment, I would have to keep the whole matrix in memory, which is a waste since I'm only calculating a small subset of it. Another method I've been thinking of was to "flip" the matrix, so it's a diamond shape and start calculating in vertical lines. I think this would be the way to go for the vectorized version, but I have no idea how to do that right now!

I do remember doing a "flip" in CS320, but I think it was with a square, whereas this matrix could be rectangular (shouldn't be that much different). I should look up my old notes later on, too.

For tomorrow, I will:

Finish up my banded non-vectorized algorithm and test it
Try coming up with a vectorized version (may involve having to "flip" the matrix).

05/13/10

Had a discussion with Chris about implementing the banded local aligner. He told me that I have actually been doing my non-vectorized version right and I should go left-to-right, top-to-bottom through the array with artificial bounds instead of from 0 to max x. He also told me that I can pretty much assume the matrix will be square, since we will be passing output from the index into the aligner, and the index will find a reference that is roughly the same size as the read. So, that solves the slope issue. He also told me a general way to traverse the matrix with a vector: instead of going left-to-right, then starting at the beginning, the vector will follow a right-down-right-down pattern (so it will "bounce").

I came in today and realized my method of finding the bounding lines on the banded non-vectorized version was pretty complicated; instead of finding the equations for the two bounding lines given some k distance of separation from the diagonal (which involves finding the slope and y-intercept and has square roots in it), I instead take the diagonal line and calculate all cells that are a horizontal distance k from it. This is much easier, and works out to pretty much the same thing. It has the added advantage of letting me take in a horizontal k value for my band radius, which will let me guarantee a correct score given the number of gaps <= k.

I managed to finish up the non-vectorized version as well as make up a (I think) pretty comprehensive test suite for it. In the end, I passed all my own tests, though I did miss a "-1" in one line of code that had me banging my head for a while. I also managed to improve the memory efficiency of the score-only local alignment function in general based on a few tricks I learned while combing through the vectorized Smith-Waterman code, though I've only implemented them in my banded aligner for now. Improvements include:

Keeping only one read-sized row of scores instead of 2.
Keeping only one read-sized row to keep track of gaps in the x-axis instead of 2 rows.
Having just one int to keep track of the last match, instead of 2 read-sized rows.
Having just one int to keep track of the last gap in y, instead of 2 read-sized rows.

To do:

Start tackling the vectorized banded algorithm! That'll be pretty challenging and probably keep me busy all day.
(Much later) Add in my improvements to the original local aligner.
(Still much later) Polish up the different aligners and have CMake compile either the vectorized or non-vectorized versions of the aligners based on CPU support.

05/14/10

I finished the vectorized banded implementation with all test cases passing! I really thought it would take more time and I was surprised when all the test cases passed. The implementation works only with square matrices, and uses 16x 8-bit vectors, so it should be quite a speed-up from the regular banded implementation. The strange thing with 8-bit vectors is that the instruction set is missing a few things compared to 16-bit vectors. For example, there's no insert or extract function, so I ended up casting the __m128i 's into an int8_t array and inserting/extracting manually; I'm not really sure if that's a good idea, though. Also, there was no max function for signed integers, so I had to write my own for that as well.

For Monday:

Edit the CMakeList.txt files so that it compiles vector/non-vector versions based whether SSE2 is available or not.
Polish up all the code I've written and merge.
Add in my improvements to the original local aligner.

05/17/10

I've integrated all the different aligners I've made together into one package. So now, there are two different local aligners, banded and un-banded, each of which have two different implementations, vectorized and non-vectorized. Which version of those two aligners are compiled depends on whether the target machine supports at least SSE2. I haven't really figured out how to get CMake to automatically detect whether SSE2 is supported or not, but instead I've added a variable to the CMakeLists.txt to be set manually.

I also cleaned up my implementations and optimized the existing local aligner implementation (only the ScoreOnlyAlignment method) to be more memory-efficient and faster. All public interfaces for both vectorized/non-vectorized versions of the aligners now match in both input and output and should be more-or-less equivalent in usage (with the exception of the free gap bits in the regular local aligner).

I'm not really sure what else to do for now, so I'll probably have to ask Chris/Daniel tomorrow.

05/18/10

Whoops, I realized I forgot to make an update yesterday. I merged all my changes into the main dev trunk and had a meeting with Anne and Daniel today. My next task is to profile the baligner using the Valgrind suite. I spent the rest of the day looking at how the baligner works and trying to get some working output out.

05/19/10

Did some performance tests on my local aligner implementations. Using my latest commit (5aff5ddf561ee005fb96fa01082479a53364d475), over 1,000 randomly generated sequence pairs of length 500:

Aligner	Avg Time (s)	Total Time (s)
Normal Local Aligner	0.01465	14.65
Banded Local Aligner	0.0009	0.9
Normal Vectorized Local	0.00389	3.89
Banded Vectorized Local	0.00029	0.29

Note these were not done on a Release build because the Release build messes up the vectorized banded aligner for some reason (outputs all 0). I'll have to look into why building as "Release" build is wrong but a regular build is not. I think it might be going wrong in my "custom" signed max function.

I also started profiling the baligner, after making sure all the output was correct. Since aligning the 125000 reads Daniel usually uses would have been way too slow, I used the e_coli_1000_1.fq reads file as a test. I ran memcheck first and fixed all the memory leaks that arose. I also ran callgrind over baligner and I found a significant bottleneck in the gap aligner, where a constant 256*256 matrix was being initialized and reinitialized multiple times over all the reads, when it only needed to be initialized once. After the fix, the speed increase was significant:

Test	Time (s)
Before fix, 1,000 reads	20.26
Before fix, 125,000 reads	401.71
After fix, 1,000 reads	2.72
After fix, 125,000 reads	71.98

For tomorrow:

Find out why the vectorized banded local aligner isn't working on a Release build and fix it.
Benchmark the local aligner using Release builds.
Do more profiling of the baligner.

Note to self: SSE 4.1 seems to have the _mm_max_epi8 and _mm_extract_epi8 functions that I needed! I might be able to fix the "Release" build error using that instruction set.

05/20/10

I managed to fix my vectorized banded aligner for Release mode. It turns out I calculated the value in one of my helper functions, but forgot to return it; I'm surprised it even worked in Debug mode! Anyway, here's the results for commit 5ab570185f6ba90e100f19ae10d938d7d813f08b, compiled under Release mode and doing 10,000 random pairs of sequences of length 500 each.

Aligner	Vectorized avg time (s)	Regular avg time (s)	Regular:Vectorized ratio
Normal Local Aligner	0.000213	0.003323	15.6
Banded Local Aligner	0.000023	0.00024	10.4

So in the end, there's quite a big difference between the regular and vectorized versions. It's interesting to see the improvement in the normal vectorized aligner is greater than the banded aligner, even though the banded aligner does twice as many calculations per vector. It's probably due to me having to code in my own max function in the banded aligner.

Afternoon update:

I did some more optimization (it's strangely fun!) of the banded case and managed to speed it up considerably. Following Chris' suggestions, I changed everything over to unsigned int8_t 's instead of signed (never noticed the unsigned saturated subtract function in SSE2, so I didn't know I could do this). I also changed some things over, like my "_mm_insert_epi8" workaround, and made the horizontal max calculation a bit faster (also following Chris' suggestion). Times are now: (using Release builds, 10,000 runs of 500nt each):

Aligner	Vectorized avg time (s)	Regular avg time (s)	Regular:Vectorized ratio
Normal Local Aligner	0.000215	0.003326	15.5
Banded Local Aligner	0.000007	0.000241	34.4

I realized an alignment size of 500nt isn't that realistic, since the vectorized banded version can only handle 255nt before it starts suffering from integer rollovers.

Anyways, I've also been profiling the baligner some more and trying to figure out why it's roughly 2x slower than the readaligner. I thought I had a breakthrough when I noticed baligner called Query::align() twice as many times as readaligner (for aligning reverse sequences), when Query already dealt with reverse alignments, but it turned out Daniel had already accounted for that. So, I'm pretty much back to square one!

To do:

More profiling to figure out what's wrong; maybe play with report, k, and iteration limit values.
Try making CMake compile different implementations of the local aligner based on CPU features.

05/21/10

Okay, I've been staring at the computer screen for way too long. I tried to figure out what's different between our baligner and the readaligner and makes readaligner so much faster, but I just can't seem to find it. Here are some of my findings:

On 125,000 reads, readaligner runs in about 9-10 seconds, while baligner runs in between 15-16 seconds.
If I comment out the lines that actually do the alignment (the call to Query::align()), readaligner takes <1 second, while baligner takes about 3 seconds. This means baligner has a lot more overhead when starting and finishing (i.e. not actually "aligning"), but this still doesn't account for the 6-7 second difference.
On 1000 reads, baligner and readaligner call Query::align() the same amount of times (if I use the Query class's own reverse functionality and comment out Daniel's). However, it seems calls after that start to diverge, but only slightly. For example, the call after Query::align(), MismatchQuery::firstStep(), gets called 3634 times on baligner, but only 3630 times on readaligner. It looks like this small difference is propagated downward through the calls, and ends in a difference of about 1000 more calls in baligner in a low-level function. This small difference in the beginning might be the source of the problem, since the increase in function calls could potentially get huge in 125,000 reads. However, I haven't profiled that yet, since it would take so long. Perhaps this is a good next step?
baligner is much more complicated than readaligner in terms of class structure, so this might be giving us some overhead. Also, I've noticed vectors are used a lot when passing information around; this might be a source of overhead, especially since just the vectors seem to be returned, instead of a pointer to a vector. This could be a future point of optimization, but I doubt it would save on very much time.

I think the most promising area to look into right now is still the Query::align() area, and why it's calling MismatchQuery::firstStep() slightly more than the readaligner. I think I'm going to run a profile comparing 125,000 reads on both aligners, just to confirm if it is that portion of the code that's giving us the problem. Also, it might just be a bunch of small optimization problems and overhead from using vectors, etc. that could be resulting in the performance loss, although I kind of doubt it.

Todo:

More profiling!

05/25/10

I did another profile on the baligner (now called saligner), and it looks like Query::align() is called the same number of times now. However, MismatchQuery::firstStep(), and all subsequent functions, weren't called the same number of times, which was strange, since it's only caller is Query::align(). After a lot more investigation, I finally found the problem:

In readaligner, they instantiate a single Pattern object for each read. In our saligner, we instantiate one Pattern per read per max mismatch value. When the Pattern is checked, it's first checked "as-is", and then reversed and checked again. Since readaligner's Pattern object is kept between k values, it will check the forward and reverse strands in different order than saligner. If a match is found, the alignment function terminates.

Unfortunately, this "discrepancy" isn't really a slowdown, since it's dependent on the given read set. For example, if all reads given match on the forward strand, then our saligner would be "faster". But, since I've finally found the error (or lack thereof), we can finally start to optimize.

To do:

Optimize!
Read up on policy classes (Chris mentioned this briefly but said it was a low priority, so I just wanted to jot it down before I forgot about it).

05/26/10

Spent a lot of time just thinking...Daniel and I were revisiting the class design for the aligner, and we encountered some design issues with the Mapper and Index classes. Specifically, at this point, the Mapper classes feel too dependent on the specific Index implementation, a problem arising from using external libraries (readaligner). For example, I wanted to move the NGS_FMIndex::LocateWithMismatches and NGS_FMIndex::LocateWithGaps methods into their respective Mapper classes, but I found it troublesome, since the LocateWithX methods encapsulate the Query::align() method. The Query class is a readaligner class and must take in TextCollection objects to be instantiated. Those objects are outputted from the FM Index used in readaligner only, so if the Mappers were to instantiate their own Query classes, they would essentially have to be used with only the FM Index implementation.

Daniel and I have also decided to put an option in the Mapper classes for the user to choose the maximum number of results they want. This is a minor optimization that would allow us to terminate the alignment function once a set amount of results have been reached, rather than going through with the whole alignment. This might also allow us to use fixed-length arrays instead of vectors (or at least reserve the space in the vector beforehand to prevent reallocs).

Another area I've investigated today was the cost of abstraction in our Index classes. I tried unabstracting the NGS_FMIndex and comparing runtimes. On 125k reads, I get about 9.8-10.1 seconds for both implementations, so the abstraction cost isn't really an issue. I just realized that Query objects are also abstracted those methods are called much more often. This might also be a good area to investigate.

Finally, I just realized that string 's can reserve space, too. I think this might be a good area to improve on, as I'm seeing a lot of calls to malloc from both string and vector. Considering both of these could get quite large in some cases, reserving memory beforehand may be a good idea so we don't get frequent realloc 's.

To do:

Discuss class design with Daniel and come up with a finalized design we're both happy with. This may or may not mean changing the current design (which is good, except for the bit with Index and Mapper.
Look into unabstracting the Query class and see how that fares. If it's a big improvement, look into policy classes.
Add an option to limit number of results.
Move instantiation of Query objects into the Mapper 's (if Daniel agrees)

05/27/10

Moved the NGS_FMIndex::Locate and NGS_FMIndex::LocateWithMismatches into the ExactMapper and MismatchMapper class respectively. I also "added" in multimap functionality into both these classes. I put quotes around it because it was already there to begin with (especially in MismatchMapper), but it was slightly different behaviour.

In ExactMapper, I didn't really get a speed improvement. This was pretty much expected because the alignment function it uses is part of TextCollection (as opposed to the other two mappers, which use the Query class). The method, TextCollection::locate, pretty much searches the entire index for all occurrences of the sequence, not for the first x occurences. So, adding a max results parameter pretty much just prevents overflow matches from being outputted, which wasn't that slow to begin with.

MismatchMapper gave me a better result, though the news is bittersweet. With multimaps turned off (setting max_results to 1), I get about a 10-20% speed improvement over the old mapper (all numbers are preliminary, I haven't actually tested on Skorpios yet). However, with any sort of multimapping, even with max_results = 2, I get pretty significant slowdown WRT the old mapper (set to the same max_results value). This isn't really a regression, though; the old MismatchMapper code actually had different behaviour for multimaps. In cases where alignments are found at a lower mismatch value, the old mapper would not continue searching at higher mismatch values, even if there are fewer alignments than max_results. The new MismatchMapper actually continues searching until it's reached its maximum mismatch value or it's found the maximum number of results. This behaviour is more in line with readaligner, although I'm considering pushing it back to the old behaviour because it makes a little more sense to me (and is much faster at high max_results). I should discuss with Daniel/Chris about this.

While testing the new MismatchMapper, I noticed something interesting: if I increase the max_results value (it's report in readaligner) for both saligner and readaligner, I get about the same 2-4 second difference in runtime, no matter what value I use. Since increasing max_results effectively increases the number of alignment steps, the fact that the time difference is pretty much constant supports my previous conjecture that it's the pre/post-processing (parsing of FASTQ/FASTA files, outputting to SAM format, etc) of our aligner that's leading to the time difference. So, again, that should be an area worth investigating.

To do:

Discuss class design with Daniel and come up with a finalized design we're both happy with. This may or may not mean changing the current design (which is good, except for the bit with Index and Mapper (haven't really done it today...)
Look into unabstracting the Query class and see how that fares. If it's a big improvement, look into policy classes (also hasn't been visited at all)
Add a max_results option to the GapMapper and move NGS_FMIndex::LocateWithGaps into GapMapper::Map.
Find more areas where reserving space for strings and vectors would be useful.
Benchmark the new implementations vs the old on Skorpios.
Find other areas to optimize.

05/28/10

Did some more cleanup today. I finished changing the GapMapper over to the new method (see yesterday's post). One point of concern: I saw when I first tested out gapped alignments that aligning 125,000 reads took about 71 seconds. However, I was testing today and the aligner now takes 129.64 seconds to align 125,000 reads, even with k=0 (which is pretty much "exact matching"). It's probably nothing, though, since my previous encounters with the gap mappers were long ago, before the huge refactoring job Daniel did. The test was most likely under different parameters.

Preliminary testing on my own machine shows that saligner, using the GapMapper, is slightly faster than the old version (about 5-10% faster). Further tests need to be done on Skorpios to confirm. Another interesting point is that the GapMapper actually uses the LocalAligner to output to SAM format. However, it uses LocalAligner::FullAlignment functions to do so, which have not been optimized at all. I can also see GapMapper benefiting from using the BandedLocalAligner instead, since gaps tend to be small (BandedLocalAligner supports up to 15 gaps). It might also be worth vectorizing the FullAlignment function; currently, only ScoreOnlyAlignment is vectorized.

I also came across an interesting result, today. I noticed that at k=0, GapMapper, MismatchMapper and ExactMapper all have the same output (no difference whatsoever using diff), even though ExactMapper is faster than the other two. In fact, ExactMapper is something like 16x faster than GapMapper at k=0 on my machine! So, I think an interesting optimization would be to have ExactMapper do an initial alignment and using the more complex mappers only after it has returned negative. This follows through with existing behaviour anyway (both GapMapper and MismatchMapper will try the lowest k value, k=0, first before moving on to progressively higher k values). With data that is mostly exact, this could lead to a serious improvement, since most reads won't even need to touch the more complex alignment functions. This does lead to a question of whether or not the more complex mappers should even start at k=0 at all, since in most cases, ExactMapper will be there to cover it. The question is, I'm not sure if I should have Gap/MismatchMapper call ExactMapper directly within their Map functions, or whether I should let the user decide to call ExactMapper first. I'm currently leaning towards the latter option, since, as implied by their names, Mismatch/GapMapper are there to find alignments with gaps or mismatches, not to find alignments with exact matches. So, I'm proposing to start both mappers at k=1 instead of k=0, and let ExactMapper handle the k=0 condition (but the user has to do that explicitly). Finally, one more point to note is that even if those mappers start at k=1, they still find exact results, which leads me to think that Query::align checks for k=0 and k=1 when I call it for k=1 (and etc. for higher k). This might be an area of optimization worth looking into, since there would be a lot of overlap as we go higher in k value. However, it might not actually be possible to optimize this; I'm not familiar enough with the Query::align functions to know.

Finally, Daniel and I finished our discussion on class hierarchy and I think we're both pretty happy with the design. I've sent an email off to Chris for the finalized version (available on the discussion page of the Twiki).

To do:

Add in a new mapper, SequentialMapper, which allows the user to pass in two mappers. SequentialMapper will then map using the first mapper, and use the second mapper only if the first mapper fails. This should introduce more flexibility in the aligner.
Look into unabstracting the Query class and seeing how it fares (still haven't touched this yet).
Benchmark on Skorpios
Find more areas where reserving space will benefit (haven't touched this either)
Make a vectorized LocalAligner::FullAlignment function.
Make a BandedLocalAligner::FullAlignment function (both vectorized and non-vectorized).

05/31/10

Merged in my changes and did a few benchmarks on Skorpios. I'm not really expecting much of a speed increase, since most of the changes were just refactoring the code and making the classes more cohesive. Here are my results with commit 6e1407b164c7dacb84d857ad57c24ba10eee5f65 (new) compared to commit 5b2b316746ca115f88cc315a21b298c6d3cbedcd (old):

Alignment Type	k	max_results	Old time/read (s)	New time/read (s)	Old:New ratio
Exact	-	1	4.384e-05	3.84267e-05	1.14
Exact	-	10	4.37333e-05	4.088e-05	1.07
Mismatch	2	1	7.56267e-05	6.68267e-05	1.13
Mismatch	2	10	8.016e-05	7.528e-05	1.06
Gap	2	1	8.8936e-04	8.79333e-04	1.01
Gap	2	10	9.348e-04	9.36373e-04	0.998

Tests were done using the provided E. coli genome (NC_010473.fasta) on 375,000 reads (e_coli_125000_1.fq copied 3 times). Tested using the single-end aligner (saligner) release build, times taken from saligner output.

So there were slight speed improvements to exact and mismatch mappers and a negligible speed different in the gap mapper. The speed improvements are probably from some minor fixes I made during the cleanup, but I haven't really touched the readaligner Query classes, which takes up much more time in GapMapper than the other two mappers. Also, the more calls to the readaligner align functions in Exact/Mismatch mappers, the less speed improvement we see.

I spent a while verifying the output between the two versions match, and for the most part they do (tested Exact, Mismatch and Gap mappers at max_results = 1, and max_results = 10 at k = 2).

Finally, I revisited the LocalAligner and tried to make a more efficient FullAlignment function by using arrays instead of vectors and speeding up the recursive Traceback function (it had a bunch of string-passing by value). I ended up failing, getting segmentation faults everywhere, but I think it was just because of me getting my columns and rows mixed up in my matrices. Anyways, more of that tomorrow, when I have a fresher mind.

To do:

I'm going to table the SequentialMapper thing for now, since it's not very high priority. Chris also thought it might be a good idea to look into why MismatchMapper and GapMapper are so slow at k=0, instead of just replacing them with ExactMapper (ideally, ExactMapper shouldn't even be needed).
Unabstracting Query - it's not very necessary on second look, since I usually use the specific Query I need instead of the general Query object anyway.
Find more areas where reserving space will benefit (still not touched)
Improve the AlignerInterface::FullAlignment functions.
Benchmark BWA on skorpios.

Raw edit | More topic actions

Topic revision: r1 - 2010-06-02 - jayzhang