local_aligner.cc
bit (which I'm assuming my vectorized SW will be based a lot off of). I can understand the majority of the code, although I'm a little iffy on some of the gap penalty calculations.
So after all that, I think I'm finally ready to do a little coding. Today, I'll mostly be working on the test suite and trying to get that out of the way.
I've managed to modify Daniel's test/correctness_test.cc
so that it works on the existing code, and all the test cases pass. The test cases Daniel wrote were pretty complete but only tested the ScoreOnlyAlignment
method in the local aligner. So next up will be writing some test cases for the FullAlignment
, which may be a little harder since it doesn't just output numbers. I've written a few tests so far, and I think I may have found one or two bugs (but I'm going to have to look into that some more to see if it was my test case or the code that was failing). So tomorrow, I will mostly be doing...
FullAlignment
method.
ScoreOnlyAlignment
method, but added the FullAlignment
method to the tests. However, I didn't manage to get middle-of-sequence mismatch and gap tests working, since there are so many different possibilities for the alignment to turn out, and it depends very heavily on the scoring method (and the values of the different penalties) as well.
Chris also gave me the source code for a vectorized Smith-Waterman function (from the SHRiMP library) and I spent the rest of the afternoon looking at that and trying to understand it. I found the documentation for the SSE2 functions on the Intel website (link to come later, not on the right now). From my understanding (and Chris' explanation), the function goes through the matrix using a diagonal vector ("positive slope") operating on 8*16bit values, and calculates the highest score from that vector. It uses the affine gap penalty method to calculate scores. From my discussion with Chris, I should now be working on:
-msse2 -msse3
compiler flags.
I'm about a quarter complete on a score-only vectorized version for the local aligner, and I've got a basic game plan laid out for completing it; I've also found that the best way to understand some code is to try to code it yourself! I now have a much firmer grasp on the whole SHRiMP aligner and on why it does some of the things it does (for example, one question I had was why it stored values for b_gap
, but no a_gap
array). I think by tomorrow, I can:
The good news is, I got all the tests to pass on the score-only alignment. Now all I have to do is get a full alignment working. I also want to try to optimize for the linear gap penalty method, since that case is much simpler than the affine gap penalty. Although the linear case is just affine with a gap open penalty set to 0, I think I can do away with some of the vectors in the calculation, and so make it a little faster. That'll be a low priority, though. So for tomorrow, I think I'll be able to finish:
I did a few preliminary tests, using a randomly generated genome/read sequences of length 1000 each. Over 1000 iterations, my results were:
I also finished a very preliminary version of the banded algorithm (non-vectorized), though I haven't tested or even compiled it yet. It's basically just a recoding of the original Smith-Waterman aligner in that it goes through the matrix left-to-right and top-to-bottom, but instead of starting from x=0 and going to x=read length, it starts and ends based on boundaries defined by two diagonals going through the matrix. I'm not too happy about this arrangement right now, since if I were to do a full alignment, I would have to keep the whole matrix in memory, which is a waste since I'm only calculating a small subset of it. Another method I've been thinking of was to "flip" the matrix, so it's a diamond shape and start calculating in vertical lines. I think this would be the way to go for the vectorized version, but I have no idea how to do that right now!
I do remember doing a "flip" in CS320, but I think it was with a square, whereas this matrix could be rectangular (shouldn't be that much different). I should look up my old notes later on, too.
For tomorrow, I will:
I came in today and realized my method of finding the bounding lines on the banded non-vectorized version was pretty complicated; instead of finding the equations for the two bounding lines given some k
distance of separation from the diagonal (which involves finding the slope and y-intercept and has square roots in it), I instead take the diagonal line and calculate all cells that are a horizontal distance k
from it. This is much easier, and works out to pretty much the same thing. It has the added advantage of letting me take in a horizontal k
value for my band radius, which will let me guarantee a correct score given the number of gaps <= k
.
I managed to finish up the non-vectorized version as well as make up a (I think) pretty comprehensive test suite for it. In the end, I passed all my own tests, though I did miss a "-1" in one line of code that had me banging my head for a while. I also managed to improve the memory efficiency of the score-only local alignment function in general based on a few tricks I learned while combing through the vectorized Smith-Waterman code, though I've only implemented them in my banded aligner for now. Improvements include:
int
to keep track of the last match, instead of 2 read-sized rows.
int
to keep track of the last gap in y, instead of 2 read-sized rows.
To do:
__m128i
's into an int8_t
array and inserting/extracting manually; I'm not really sure if that's a good idea, though. Also, there was no max function for signed integers, so I had to write my own for that as well.
For Monday:
I also cleaned up my implementations and optimized the existing local aligner implementation (only the ScoreOnlyAlignment method) to be more memory-efficient and faster. All public interfaces for both vectorized/non-vectorized versions of the aligners now match in both input and output and should be more-or-less equivalent in usage (with the exception of the free gap bits in the regular local aligner).
I'm not really sure what else to do for now, so I'll probably have to ask Chris/Daniel tomorrow.
baligner
using the Valgrind suite. I spent the rest of the day looking at how the baligner
works and trying to get some working output out.
5aff5ddf561ee005fb96fa01082479a53364d475
), over 1,000 randomly generated sequence pairs of length 500:
Aligner![]() |
Avg Time (s) | Total Time (s) |
---|---|---|
Banded Local Aligner | 0.0009 | 0.9 |
Banded Vectorized Local | 0.00029 | 0.29 |
Normal Local Aligner | 0.01465 | 14.65 |
Normal Vectorized Local | 0.00389 | 3.89 |
Note these were not done on a Release build because the Release build messes up the vectorized banded aligner for some reason (outputs all 0). I'll have to look into why building as "Release" build is wrong but a regular build is not. I think it might be going wrong in my "custom" signed max function.
I also started profiling the baligner, after making sure all the output was correct. Since aligning the 125000 reads Daniel usually uses would have been way too slow, I used the e_coli_1000_1.fq reads file as a test. I ran memcheck first and fixed all the memory leaks that arose. I also ran callgrind over baligner and I found a significant bottleneck in the gap aligner, where a constant 256*256 matrix was being initialized and reinitialized multiple times over all the reads, when it only needed to be initialized once. After the fix, the speed increase was significant:
Test | Time (s) |
---|---|
Before fix, 1,000 reads | 20.26 |
Before fix, 125,000 reads | 401.71 |
After fix, 1,000 reads | 2.72 |
After fix, 125,000 reads | 71.98 |
For tomorrow:
Note to self: SSE 4.1 seems to have the _mm_max_epi8
and _mm_extract_epi8
functions that I needed! I might be able to fix the "Release" build error using that instruction set.
5ab570185f6ba90e100f19ae10d938d7d813f08b
, compiled under Release mode and doing 10,000 random pairs of sequences of length 500 each.
Aligner | Vectorized avg time (s) | Regular avg time (s) | Regular:Vectorized ratio |
---|---|---|---|
Normal Local Aligner | 0.000213 | 0.003323 | 15.6 |
Banded Local Aligner | 0.000023 | 0.00024 | 10.4 |
So in the end, there's quite a big difference between the regular and vectorized versions. It's interesting to see the improvement in the normal vectorized aligner is greater than the banded aligner, even though the banded aligner does twice as many calculations per vector. It's probably due to me having to code in my own max function in the banded aligner.
unsigned int8_t
's instead of signed (never noticed the unsigned saturated subtract function in SSE2, so I didn't know I could do this). I also changed some things over, like my "_mm_insert_epi8" workaround, and made the horizontal max calculation a bit faster (also following Chris' suggestion). Times are now: (using Release builds, 10,000 runs of 500nt each):
Aligner | Vectorized avg time (s) | Regular avg time (s) | Regular:Vectorized ratio |
---|---|---|---|
Normal Local Aligner | 0.000215 | 0.003326 | 15.5 |
Banded Local Aligner | 0.000007 | 0.000241 | 34.4 |
I realized an alignment size of 500nt isn't that realistic, since the vectorized banded version can only handle 255nt before it starts suffering from integer rollovers.
Anyways, I've also been profiling the baligner some more and trying to figure out why it's roughly 2x slower than the readaligner. I thought I had a breakthrough when I noticed baligner called Query::align()
twice as many times as readaligner (for aligning reverse sequences), when Query already dealt with reverse alignments, but it turned out Daniel had already accounted for that. So, I'm pretty much back to square one!
To do:
Query::align()
), readaligner takes <1 second, while baligner takes about 3 seconds. This means baligner has a lot more overhead when starting and finishing (i.e. not actually "aligning"), but this still doesn't account for the 6-7 second difference.
Query::align()
the same amount of times (if I use the Query class's own reverse functionality and comment out Daniel's). However, it seems calls after that start to diverge, but only slightly. For example, the call after Query::align()
, MismatchQuery::firstStep()
, gets called 3634 times on baligner, but only 3630 times on readaligner. It looks like this small difference is propagated downward through the calls, and ends in a difference of about 1000 more calls in baligner in a low-level function. This small difference in the beginning might be the source of the problem, since the increase in function calls could potentially get huge in 125,000 reads. However, I haven't profiled that yet, since it would take so long. Perhaps this is a good next step?
I think the most promising area to look into right now is still the Query::align()
area, and why it's calling MismatchQuery::firstStep()
slightly more than the readaligner. I think I'm going to run a profile comparing 125,000 reads on both aligners, just to confirm if it is that portion of the code that's giving us the problem. Also, it might just be a bunch of small optimization problems and overhead from using vectors, etc. that could be resulting in the performance loss, although I kind of doubt it.
Todo:
Query::align()
is called the same number of times now. However, MismatchQuery::firstStep()
, and all subsequent functions, weren't called the same number of times, which was strange, since it's only caller is Query::align()
. After a lot more investigation, I finally found the problem:
In readaligner, they instantiate a single Pattern object for each read. In our saligner, we instantiate one Pattern per read per max mismatch value. When the Pattern is checked, it's first checked "as-is", and then reversed and checked again. Since readaligner's Pattern object is kept between k values, it will check the forward and reverse strands in different order than saligner. If a match is found, the alignment function terminates.
Unfortunately, this "discrepancy" isn't really a slowdown, since it's dependent on the given read set. For example, if all reads given match on the forward strand, then our saligner would be "faster". But, since I've finally found the error (or lack thereof), we can finally start to optimize.
To do:
Mapper
and Index
classes. Specifically, at this point, the Mapper
classes feel too dependent on the specific Index
implementation, a problem arising from using external libraries (readaligner). For example, I wanted to move the NGS_FMIndex::LocateWithMismatches
and NGS_FMIndex::LocateWithGaps
methods into their respective Mapper
classes, but I found it troublesome, since the LocateWithX
methods encapsulate the Query::align()
method. The Query
class is a readaligner class and must take in TextCollection
objects to be instantiated. Those objects are outputted from the FM Index used in readaligner only, so if the Mappers
were to instantiate their own Query
classes, they would essentially have to be used with only the FM Index implementation.
Daniel and I have also decided to put an option in the Mapper
classes for the user to choose the maximum number of results they want. This is a minor optimization that would allow us to terminate the alignment function once a set amount of results have been reached, rather than going through with the whole alignment. This might also allow us to use fixed-length arrays instead of vectors (or at least reserve the space in the vector beforehand to prevent reallocs
).
Another area I've investigated today was the cost of abstraction in our Index
classes. I tried unabstracting the NGS_FMIndex
and comparing runtimes. On 125k reads, I get about 9.8-10.1 seconds for both implementations, so the abstraction cost isn't really an issue. I just realized that Query
objects are also abstracted those methods are called much more often. This might also be a good area to investigate.
Finally, I just realized that string
's can reserve space, too. I think this might be a good area to improve on, as I'm seeing a lot of calls to malloc
from both string
and vector
. Considering both of these could get quite large in some cases, reserving memory
beforehand may be a good idea so we don't get frequent
realloc
's.
To do:
Index
and Mapper
.
Query
class and see how that fares. If it's a big improvement, look into policy classes.
Query
objects into the Mapper
's (if Daniel agrees)
NGS_FMIndex::Locate
and NGS_FMIndex::LocateWithMismatches
into the ExactMapper
and MismatchMapper
class respectively. I also "added" in multimap functionality into both these classes. I put quotes around it because it was already there to begin with (especially in MismatchMapper
), but it was slightly different behaviour.
In ExactMapper
, I didn't really get a speed improvement. This was pretty much expected because the alignment function it uses is part of TextCollection
(as opposed to the other two mappers, which use the Query
class). The method, TextCollection::locate
, pretty much searches the entire index for all occurrences of the sequence, not for the first x occurences. So, adding a max results parameter pretty much just prevents overflow matches from being outputted, which wasn't that slow to begin with.
MismatchMapper
gave me a better result, though the news is bittersweet. With multimaps turned off (setting max_results
to 1), I get about a 10-20% speed improvement over the old mapper (all numbers are preliminary, I haven't actually tested on Skorpios yet). However, with any sort of multimapping, even with max_results
= 2, I get pretty significant slowdown WRT the old mapper (set to the same max_results
value). This isn't really a regression, though; the old MismatchMapper
code actually had different behaviour for multimaps. In cases where alignments are found at a lower mismatch value, the old mapper would not continue searching at higher mismatch values, even if there are fewer alignments than max_results
. The new MismatchMapper
actually continues searching until it's reached its maximum mismatch value or it's found the maximum number of results. This behaviour is more in line with readaligner, although I'm considering pushing it back to the old behaviour because it makes a little more sense to me (and is much faster at high max_results
). I should discuss with Daniel/Chris about this.
While testing the new MismatchMapper
, I noticed something interesting: if I increase the max_results
value (it's report
in readaligner) for both saligner and readaligner, I get about the same 2-4 second difference in runtime, no matter what value I use. Since increasing max_results
effectively increases the number of alignment steps, the fact that the time difference is pretty much constant supports my previous conjecture that it's the pre/post-processing (parsing of FASTQ/FASTA files, outputting to SAM format, etc) of our aligner that's leading to the time difference. So, again, that should be an area worth investigating.
To do:
Index
and Mapper
(haven't really done it today...)
Query
class and see how that fares. If it's a big improvement, look into policy classes (also hasn't been visited at all)
max_results
option to the GapMapper
and move NGS_FMIndex::LocateWithGaps
into GapMapper::Map
.
GapMapper
over to the new method (see yesterday's post). One point of concern: I saw when I first tested out gapped alignments that aligning 125,000 reads took about 71 seconds. However, I was testing today and the aligner now takes 129.64 seconds to align 125,000 reads, even with k=0 (which is pretty much "exact matching"). It's probably nothing, though, since my previous encounters with the gap mappers were long ago, before the huge refactoring job Daniel did. The test was most likely under different parameters.
Preliminary testing on my own machine shows that saligner, using the GapMapper
, is slightly faster than the old version (about 5-10% faster). Further tests need to be done on Skorpios to confirm. Another interesting point is that the GapMapper
actually uses the LocalAligner
to output to SAM format. However, it uses LocalAligner::FullAlignment
functions to do so, which have not been optimized at all. I can also see GapMapper
benefiting from using the BandedLocalAligner
instead, since gaps tend to be small (BandedLocalAligner
supports up to 15 gaps). It might also be worth vectorizing the FullAlignment
function; currently, only ScoreOnlyAlignment
is vectorized.
I also came across an interesting result, today. I noticed that at k=0, GapMapper
, MismatchMapper
and ExactMapper
all have the same output (no difference whatsoever using diff
), even though ExactMapper
is faster than the other two. In fact, ExactMapper
is something like 16x faster than GapMapper
at k=0 on my machine! So, I think an interesting optimization would be to have ExactMapper
do an initial alignment and using the more complex mappers only after it has returned negative. This follows through with existing behaviour anyway (both GapMapper
and MismatchMapper
will try the lowest k value, k=0, first before moving on to progressively higher k values). With data that is mostly exact, this could lead to a serious improvement, since most reads won't even need to touch the more complex alignment functions. This does lead to a question of whether or not the more complex mappers should even start at k=0 at all, since in most cases, ExactMapper
will be there to cover it. The question is, I'm not sure if I should have Gap/MismatchMapper
call ExactMapper
directly within their Map functions, or whether I should let the user decide to call ExactMapper
first. I'm currently leaning towards the latter option, since, as implied by their names, Mismatch/GapMapper
are there to find alignments with gaps or mismatches, not to find alignments with exact matches. So, I'm proposing to start both mappers at k=1 instead of k=0, and let ExactMapper
handle the k=0 condition (but the user has to do that explicitly). Finally, one more point to note is that even if those mappers start at k=1, they still find exact results, which leads me to think that Query::align
checks for k=0 and k=1 when I call it for k=1 (and etc. for higher k). This might be an area of optimization worth looking into, since there would be a lot of overlap as we go higher in k value. However, it might not actually be possible to optimize this; I'm not familiar enough with the Query::align
functions to know.
Finally, Daniel and I finished our discussion on class hierarchy and I think we're both pretty happy with the design. I've sent an email off to Chris for the finalized version (available on the discussion page of the Twiki).
To do:
SequentialMapper
, which allows the user to pass in two mappers. SequentialMapper
will then map using the first mapper, and use the second mapper only if the first mapper fails. This should introduce more flexibility in the aligner.
Query
class and seeing how it fares (still haven't touched this yet).
LocalAligner::FullAlignment
function.
BandedLocalAligner::FullAlignment
function (both vectorized and non-vectorized).
6e1407b164c7dacb84d857ad57c24ba10eee5f65
(new) compared to commit 5b2b316746ca115f88cc315a21b298c6d3cbedcd
(old):
Alignment Type | k | max_results | Old time/read (s) | New time/read (s) | Old:New ratio |
---|---|---|---|---|---|
Exact | - | 1 | 4.384e-05 | 3.84267e-05 | 1.14 |
Exact | - | 10 | 4.37333e-05 | 4.088e-05 | 1.07 |
Mismatch | 2 | 1 | 7.56267e-05 | 6.68267e-05 | 1.13 |
Mismatch | 2 | 10 | 8.016e-05 | 7.528e-05 | 1.06 |
Gap | 2 | 1 | 8.8936e-04 | 8.79333e-04 | 1.01 |
Gap | 2 | 10 | 9.348e-04 | 9.36373e-04 | 0.998 |
Tests were done using the provided E. coli genome (NC_010473.fasta
) on 375,000 reads (e_coli_125000_1.fq
copied 3 times). Tested using the single-end aligner (saligner
) release build, times taken from saligner
output.
So there were slight speed improvements to exact and mismatch mappers and a negligible speed different in the gap mapper. The speed improvements are probably from some minor fixes I made during the cleanup, but I haven't really touched the readaligner Query
classes, which takes up much more time in GapMapper
than the other two mappers. Also, the more calls to the readaligner align functions in Exact/Mismatch mappers, the less speed improvement we see.
I spent a while verifying the output between the two versions match, and for the most part they do (tested Exact, Mismatch and Gap mappers at max_results = 1, and max_results = 10 at k = 2).
Finally, I revisited the LocalAligner
and tried to make a more efficient FullAlignment
function by using arrays instead of vectors and speeding up the recursive Traceback
function (it had a bunch of string-passing by value). I ended up failing, getting segmentation faults everywhere, but I think it was just because of me getting my columns and rows mixed up in my matrices. Anyways, more of that tomorrow, when I have a fresher mind.
To do:
SequentialMapper
thing for now, since it's not very high priority. Chris also thought it might be a good idea to look into why MismatchMapper
and GapMapper
are so slow at k=0, instead of just replacing them with ExactMapper
(ideally, ExactMapper
shouldn't even be needed).
Query
- it's not very necessary on second look, since I usually use the specific Query I need instead of the general Query object anyway.
AlignerInterface::FullAlignment
functions.