Also worked on the rank structure a little more. I can now build sequences greater than 64 bases in length (so as big as needed, with memory restrictions of course). I figured out how to align to 16 bytes with `posix_memalign`. I've almost completed the rank function as well, except I'm just getting a small bug with a boundary condition (I'm pretty sure I'm just missing a +1 somewhere).
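For reference, here's a minimal sketch of the alignment call (the buffer size and error handling are my own choices, not the actual code):

```cpp
#include <cstdlib>
#include <cstdint>

int main() {
  void* buf = NULL;
  // Ask for 1 MB of memory aligned to a 16-byte boundary;
  // posix_memalign returns 0 on success.
  if (posix_memalign(&buf, 16, 1 << 20) != 0) return 1;
  uint64_t* blocks = static_cast<uint64_t*>(buf);
  blocks[0] = 0;  // treat the aligned buffer as 64-bit sequence blocks
  free(buf);
  return 0;
}
```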
I was thinking about the calculation needed for the rank structure some more. Currently, the calculation goes something like this (for example, for a C): `(~second & first) >> (BITS-i-1)`. Chris told me that this takes something like three operations, since operations that are independent can be done in parallel. I figure the following would also work: `(~second & first) & (0xFFFF... >> (BITS-i))` (the two formulas aren't numerically equivalent, but they should give the same popcount result). I'm not sure if this will do anything, but it looks like I might be able to drop the operation count to 2 if I can somehow calculate the mask fast. I'm going to have to play around with that a bit more.
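To make the comparison concrete, here's a sketch of the two variants. It mirrors the expressions above rather than the real code; `__builtin_popcountll` stands in for whatever popcount is actually used, and whether the two count exactly the same bit positions depends on the block's bit ordering, which is why they're only "popcount equivalent" for this layout:

```cpp
#include <cstdint>

static const int BITS = 64;

// Variant 1: isolate the C bitmap, shift, then popcount.
inline int RankCShift(uint64_t first, uint64_t second, int i) {
  return __builtin_popcountll((~second & first) >> (BITS - i - 1));
}

// Variant 2: isolate the C bitmap, mask with a right-shifted all-ones word,
// then popcount. Assumes 1 <= i <= BITS - 1 so both shifts stay in range.
inline int RankCMask(uint64_t first, uint64_t second, int i) {
  return __builtin_popcountll((~second & first) & (~0ULL >> (BITS - i)));
}
```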
To do:
`uint64_t` arrays to store the two (first and second) bit sequences. Before I start an implementation that stores precalculated values though, I want to do a few experiments with one array that holds each block sequentially.
I also did a few tests with the new calculation method (`(~second & first) & (0xFFFF...FF >> (BITS-i))`). I tried making an array of length `BITS` to store the precalculated masks for every `i`; that might lead to fewer operations (not really sure on that point). Preliminary tests didn't show much difference, especially at `-O3`. The strange thing is, I actually compiled a minimal rank function using either of the two calculations down to assembly, and at `-O3`, there wasn't even any difference in the assembly code! Strange, since the two aren't really numerically equivalent, just "popcount equivalent".
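For what it's worth, the precalculated-mask idea looks roughly like this (the mask convention is my assumption and would have to match whichever formula the rank function actually uses):

```cpp
#include <cstdint>

static const int BITS = 64;
uint64_t masks[BITS];

// Fill masks[i] once at startup so the rank calculation can do a table
// lookup instead of building the mask with a variable shift.
void InitMasks() {
  masks[0] = 0;
  for (int i = 1; i < BITS; ++i)
    masks[i] = ~0ULL >> (BITS - i);  // i low bits set
}
```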
To do:
I also did some benchmarks on skorpios at `-O3`, using a test sequence of 1,280,000 characters, ranking positions 1,230,000-1,280,000:
Implementation | A (ms) | C (ms) | G (ms) | T (ms) |
---|---|---|---|---|
Parallel | 5164 | 4661 | 4661 | 4921 |
Sequential | 5025 | 4793 | 4792 | 4858 |
It's interesting to note that in the parallel implementation, ranking C and G is actually faster than T! This is strange because the calculation for T involves only an `&` operation on the two blocks, whereas C/G involves an `&` and a `~` operation. The times are more as expected under the sequential implementation, although T ranking should still be faster (it has fewer operations!). What's more surprising is that the C/G speeds are faster under the parallel implementation than the sequential one!

Using `cachegrind`, it seems that there are far fewer "D refs" (data references, i.e. loads and stores) when ranking C/G than A/T in the parallel implementation, about 75% fewer. Specifically, there are far fewer D writes occurring in the C/G cases (reads are fewer as well, but not as significantly as writes). Furthermore, under the sequential implementation, the D references for all bases are more on par with the A/T rank functions of the parallel implementation. It seems there might be some specific optimization the compiler is making in the parallel C/G case. I want to investigate this a bit further; ideally, I could get the parallel implementation's C/G times together with the sequential implementation's A/T times.
To do:
Also, I finished implementing count blocks; currently, each 64-bit block has a count block (which is 32 bits), with 4 count blocks per rank block, one for each base.
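My mental picture of the layout is roughly the following (the struct and field names are made up for illustration; the actual layout may differ):

```cpp
#include <cstdint>

// One group covers 64 bases: two 64-bit bit planes plus four 32-bit running
// counts (A, C, G, T), so the counts take up the space of two 64-bit blocks.
struct RankBlock {
  uint64_t first;       // first bit plane of the 64 bases
  uint64_t second;      // second bit plane of the 64 bases
  uint32_t counts[4];   // cumulative per-base counts up to this block
};
```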
So here are the results, run on skorpios at `-O3`, with 40,000,000 rank calls each. Times are wall time:
Sequence length (*10^6) | Normal (ms) | Interleaved (ms) | Normal + HW popcount (ms) | Interleaved + HW popcount (ms) |
---|---|---|---|---|
1 | 792 | 750 | 244 | 237 |
2 | 864 | 861 | 265 | 250 |
4 | 924 | 935 | 274 | 255 |
8 | 1134 | 1063 | 449 | 363 |
16 | 1567 | 1460 | 752 | 639 |
64 | 3857 | 3018 | 1242 | 1034 |
256 | 4495 | 3910 | 1901 | 1358 |
2048 | 5332 | 4639 | 2915 | 1957 |
To do:
skorpios is down right now, so I just did some preliminary testing on my laptop (32-bit, Intel Core 2 Duo, no hardware popcount, 6 MB cache, 3.2 GB RAM). Again, times are in wall time, compiled at `-O3`, single-threaded, and run with 50,000,000 ranks each.
Sequence length (*10^6) | Normal (ms) | Interleaved (ms) | Interleaved + counts (ms) | Interleaved, 3-base (ms) |
---|---|---|---|---|
1 | 1046 | 985 | 1105 | 1013 |
8 | 1382 | 1202 | 1208 | 1245 |
16 | 3276 | 2768 | 2137 | 2522 |
32 | 5059 | 4525 | 3556 | 4474 |
64 | 5708 | 5100 | 4269 | 5292 |
128 | 5901 | 5303 | 4617 | 5545 |
256 | 5985 | 5387 | 4834 | 5686 |
512 | 6111 | 5556 | 4989 | 5786 |
1024 | 7967 | 7392 | 6690 | 7653 |
The 1024M times are again a bit dubious. I fixed the segfault bug, but I think my computer was running out of memory, which would explain the big jump in time. Regardless, I'm going to test these again on skorpios when it gets back. It doesn't look like there's too much of a performance hit with the 3-base implementation, but if we stick with interleaved counts, there's really no point in going that route, since the four 32-bit counts fit nicely into two 64-bit "rank blocks". Using only three counts wouldn't do much for memory, since those 32 bits would still be sitting there anyway.
I also experimented with some different popcount functions found on the Wikipedia entry. So far, it seems like the current popcount function (which I stole from readaligner) is faster than the one proposed by Wikipedia. However, the current popcount function relies on a table of values, which might make it slower if there are a lot of cache misses on it, though it outperformed Wikipedia's function even at 1024M bases (I don't have the exact values; another thing to benchmark).
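For reference, the bit-mangling version from the Wikipedia article (the SWAR method) looks like this; it uses no table at all, which is why cache behaviour is the interesting variable here:

```cpp
#include <cstdint>

// 64-bit SWAR popcount, as described on Wikipedia's Hamming weight page.
inline int PopcountSWAR(uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);                            // 2-bit sums
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);  // 4-bit sums
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit sums
  return (x * 0x0101010101010101ULL) >> 56;                              // add the bytes
}
```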
Finally, I started implementing the sampling rates; there's currently a bug somewhere, but it shouldn't be too bad to fix.
PS: Before I forget, I realized I did actually lock the whole single-end output processing section in my multithreaded driver (I originally thought I had locked just the part where it writes to output, whereas it actually locks earlier than that, when it's processing the SamEntry into a string). I should experiment with locking at a different place (though at this point, that's not so easy unless I want to put locks in the SAM file class itself).
To do:
skorpios.
I fixed the valgrind errors, so now everything runs (hopefully) well.
I also managed to finish an interleaved count implementation, which was much harder. In fact, I pretty much had to do a complete recode of the core build/rank functions. I need to run a few more tests on the implementation, but it should be running fine.
Since skorpios is back now, I'll probably do a bunch of experiments tomorrow. I should also make up a 3-base sampling rate implementation, just to test the cost; again, however, if we end up going with the count-interleaved version, the 3-base version really isn't very practical.
To do:
Times are in milliseconds, tested on skorpios with a constant seed, and ranked 75,000,000 times per test. Squares are the interleaved, diamonds the 3-base, and triangles the interleaved-count implementations. The three different versions all use different popcount functions: blue is the function provided in readaligner, which uses a table; yellow is one provided by Wikipedia, which uses a bit-mangling method; and orange uses the hardware instruction.

In previous tests, I noted that the table version was faster than the bit-mangling version; it's interesting that in these implementations, the bit-mangling popcount is actually better. My thought is that once everything can no longer fit nicely in the cache, the popcount table may encounter cache misses, which slow it down dramatically. It's also interesting to see that when using the tabular popcount, interleaving the counts is actually the slowest implementation. Again, this might be due to excessive cache misses from having to cache the table, and it's actually what led me to try the bit-mangling popcount one more time. It's nice to see that interleaving the counts does have some effect, though, especially in the hardware popcount case.
Finally, I talked with Anne and Chris today about a new project, possibly for my honours thesis. Chris proposed a GPGPU version of the Smith-Waterman aligner I vectorized earlier, or even a GPGPU-capable succinct index. Anne also suggested that I look into the RNA folding pathway research currently going on, and I'll probably start reading some papers on that soon.
To do:
Edit: I realized I forgot to benchmark everything WRT sampling rate, so that's another item on the to-do list.
I also tried the bit-mangling popcount in place of the popcount function in basics.h for NGSA, and it did actually turn out to be faster, at least on skorpios. Times were 84.29 seconds vs 79.75 seconds, for the tabular popcount vs the bit-mangling popcount. So maybe we should change the function in basics to the bit-mangling version?
rank9 has no practical way of adjusting sampling rates, which is something we want. So, I think I'll stick with the regular two-level structure with a few modifications; the implementation shouldn't be too bad. I've decided to stick with the interleaved-counts implementation for now, where all the counts are interleaved with the sequence blocks. Adding a higher level would just mean adding more blocks, this time with a constant sampling rate of 1023. Ranks would just be the sum of the nearest low-level block and the nearest top-level block. I don't think I will interleave the top-level blocks, since they're much too far apart to benefit anyway.
Another optimization I've been thinking about is to limit the sampling rate to powers of 2. That way, when finding the nearest blocks, I can just do bit shifts instead of costly divisions. I'm also thinking about using a sampling rate of 512 for the top-level blocks, so that I can use bit shifts as well. The difference in space usage isn't very large, less than a megabyte (in theory) for a 4-billion base sequence (calculation here, multiplied by 4 for all four bases).
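A sketch of what the lookup would become (the constants and names here are my own, not the actual implementation):

```cpp
#include <cstdint>

static const uint32_t kLowShift = 6;   // low-level sampling rate  64 = 1 << 6
static const uint32_t kTopShift = 9;   // top-level sampling rate 512 = 1 << 9

// With power-of-two sampling rates, finding the nearest count blocks is a
// shift instead of a division.
inline uint64_t NearestLowBlock(uint64_t pos) { return pos >> kLowShift; }
inline uint64_t NearestTopBlock(uint64_t pos) { return pos >> kTopShift; }

// Rank(pos) is then (roughly) the top-level count at NearestTopBlock(pos),
// plus the low-level count at NearestLowBlock(pos), plus a popcount over the
// remaining partial block.
```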
Finally, here is the graph I promised on Friday, compared against sampling rate. Tests were done on skorpios with a sequence size of 512,000,000, run 25,000,000 times each with a constant seed:
*Notes: "Interleaved" is the normal interleaved implementation, "3-base" is the slightly more memory-efficient method which keeps track of only 3 bases, and "Count-Interleaved" is the version with counts interleaved into the rank blocks. Blue dots are versions with the tabled popcount, yellow dots have the bit-mangling popcount, while orange dots use the SSE4.2 hardware instruction.
It's a little hard to tell, but it looks like the count interleaved implementation shows minor improvements in speed (about 5%) up until sampling rate=128, where it starts to get much slower.
To do:
So now, any sampling factor that is input is rounded down to the nearest power of two. This way, I can use bit operations instead of divisions and multiplications, which are slightly slower. Running a few tests, I see about a 10% performance gain.
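The rounding itself is just a floor-log2, something like this (the function name is my own):

```cpp
#include <cstdint>

// Round a requested sampling factor down to a power of two and keep only the
// shift amount, e.g. a factor of 100 becomes shift 6 (effective factor 64).
inline uint32_t FloorLog2(uint32_t factor) {
  uint32_t shift = 0;
  while (factor >>= 1) ++shift;
  return shift;
}

// Later lookups can then use (pos >> shift) instead of (pos / factor).
```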
To do:
Yesterday, I managed to actually finish up my two-level implementation (turns out there were a few bugs in certain cases, due to +1/-1 errors). I also implemented save/load functionality, so that the structure itself can be saved. To use it, just call `RankStructure::Save` on a `RankStructure` that has already been initialized (by calling `RankStructure::Init`). Loading works the same way: just call `RankStructure::Load` with a filename supplied. Note that when loading a `RankStructure`, `RankStructure::Init` doesn't have to be called first. I'm not sure whether I should be taking in a file name as a string or whether I should just take in a stream pointer; readaligner's rank structure seems to just take in a pointer, so I might switch to that. The only problem is that I'd have to assume the file has been opened as binary.
Today, I managed to finish my 32-bit implementation, which runs much faster on my machine (the operating system is 32-bit). I also did a bunch of rigorous tests on both the 32-bit and 64-bit versions of the code, trying to fix all the bugs I could find. Both implementations are now pretty much valgrind-error free, and I'm pretty sure they both work correctly, too. One very annoying bug I had was during memory allocation, where I had an expression `size << 3`, where `size` was a 32-bit unsigned integer. I later changed this to the more general `size * BITS/8`, which ended up blowing up at sizes of around 2,048,000,000 characters. It turns out the multiplication is done first, which makes the intermediate value too big for a 32-bit integer, so it rolls over. Sigh...
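A small illustration of the bug (values are made up; the point is the order of evaluation):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const uint32_t BITS = 64;
  uint32_t size = 300000000;  // large enough to show the problem

  // size * BITS is evaluated before the division and wraps around in 32 bits,
  // while widening to 64 bits first (or using size << 3) stays in range here.
  uint32_t bytes_bad  = size * BITS / 8;
  uint64_t bytes_good = static_cast<uint64_t>(size) * BITS / 8;

  std::cout << bytes_bad << " vs " << bytes_good << std::endl;
  return 0;
}
```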
I also did some tests on both implementations. Initially, I found the 32-bit version to be faster on skorpios, which was very strange. However, I realized that this was because of the way I chose to represent the sampling rate (I chose to represent it in blocks, so a "sampling factor" of 1 is actually a sampling rate of either 64 or 32, depending on the implementation). So, the 64-bit implementation actually had twice the sampling rate given the same "sampling factor", which was what made it slower. Accounting for this, the 64-bit implementation is (rightly) faster.
Finally, I used the save functionality to test the memory size, just to make sure everything was going right. Using my two-level 64-bit implementation, I put in a 1,024,000,000 character sequence with a sampling factor of 1 (so sampling rate = 64) and saved it to a file. The file turned out to be 366.7 MB, which should be roughly the memory footprint of the structure. At 32 bits, with the same parameters (note sampling factor = 1, so sampling rate = 32), I get a file size of 488.8 MB. At sampling factor = 2 (sampling rate = 64), I get 366.7 MB again, which is exactly the same as the 64-bit implementation.
To do:
Also did a little reading on CUDA and using GPGPUs. I found one paper detailing a GPGPU-based Smith-Waterman implementation, which I briefly glanced through, and also went through the CUDA developer guide.
Finally, had a long discussion with Daniel about his new `BowtieK1Mapper` and multithreading. Daniel's new implementations seem to modify the index a little during the mapping process (he "flips" it), which doesn't go well with multithreading, because I assume the mapping function is constant and can be run independently. Long story short, we decided to hold off on modifications to the class, since we'll be implementing a new FM index soon, which will no doubt involve a lot of changes (the least of which is probably getting rid of the `NGS_FMIndex` class, which is just acting as a middleman for readaligner's `FMIndex`). I've also decided to hold off on committing my merge to the `dev` trunk for multithreading, because this would introduce compile errors with `BowtieK1Mapper` in its current state.
To do:
I've started reviewing how the FM index works, in preparation for building it. More details on that tomorrow, though.
To do:
utils file
Also, I revisited the 32-bit code and was actually wondering if it was significantly faster than the 64-bit rank implementation on 32-bit systems. Chris suggested I run some tests on antiparos, which probably has a 32-bit CPU. More on that later.
To do:
`operator[]` function.
Also ran some tests on antiparos using the 32-bit and 64-bit rank structures. Since antiparos's CPU is actually 32-bit, there was a huge difference in speed between the two implementations. Of course, the 32-bit version was faster, ranging from 20% to as much as 70% faster (the difference gets wider with higher sampling rates). Given this, I will be keeping both implementations, which I've also managed to merge. The only thing I don't like is that I have to have selective compilation in each of the rank functions, which makes the code look a little unwieldy. Fortunately, everything else was easy to merge by defining a constant.
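The selective compilation boils down to something like this (the macro and names are my own, not the actual constants):

```cpp
#include <cstdint>

// One switch selects the block width; the few rank functions that differ are
// wrapped in the same #if/#else.
#if defined(RANK_USE_32BIT)
typedef uint32_t Block;
static const int BITS = 32;
#else
typedef uint64_t Block;
static const int BITS = 64;
#endif
```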
I looked into Daniel's packed string class as well. I haven't actually started adding/modifying it yet, because I figure it'll be easier when I start merging the new index in; it's going to require a little CMake tweaking as well, because I want to give it a 64-bit implementation too. Some changes I would like to make:

* The `data_` member variable is a `uint32_t`. This should be `uint64_t` in the 64-bit version.
* An `operator[]` method that returns the ith base; this shouldn't be too bad.
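A rough sketch of what that `operator[]` could look like, assuming 2-bit bases packed 32 to a 64-bit word (the real class currently packs into `uint32_t`, and its layout may differ):

```cpp
#include <cstdint>

class PackedString {
 public:
  uint8_t operator[](uint64_t i) const {
    uint64_t word = i / 32;                 // which 64-bit word holds base i
    uint32_t offset = (i % 32) * 2;         // bit offset of the base inside it
    return (data_[word] >> offset) & 0x3;   // the 2-bit code of base i
  }
 private:
  const uint64_t* data_;
};
```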
Finally, I started looking into BWT builders. The BWT builder is in the `incbwt` folder (kind of obvious now that I think about it), so it's just a matter of extracting that folder and trimming down on some classes. I think the rotational index is also in that folder, which is why there are so many files in it currently. Chris also suggested an implementation he had from Karkkainen (blockwise suffix sorting), which is what bowtie uses (I think?). This version is able to build the BWT string using much less memory, but the implementation may be buggy, though it may just be that it uses signed integers (too small). As well, Chris isn't sure whether the code is open-source or not, so he's looking into that, too.
To do:
The Karkkainen builder wouldn't compile on skorpios, even though it compiled fine on my machine (both use gcc 4.4.1). So I went back to readaligner's BWT, `incbwt`. I ended up taking the whole `incbwt` folder, because they were pretty much all required classes. Building a BWT works through the `RLCSABuilder` class.
Initially, I had concerns over a FIXME-type comment in the actual `RLCSA` class (which `RLCSABuilder` builds), which said the program crashes when you enter a sequence > 2 GB. Looking through the code more carefully though, I think the `RLCSABuilder` class builds a separate `RLCSA` for every chromosome (each of which is < 2 GB), and then merges them.
Also, I don't think using packed strings with the `RLCSABuilder` is possible, because it requires that the sequence it's given has no '\0's (it uses '\0' as the end-of-sequence marker). Also, it uses `char` arrays, which is pretty much hardcoded into the code, so using anything smaller isn't really possible without some major changes.
So my plan now is to just build a BWT string, pack it, run it through my rank structure, and then see how that goes from there.
To do:
The `TextCollectionBuilder` class specifies some options for an FM-index-compatible BWT string, so I just used those settings (documented in the code). Also had a discussion with Chris on matching N's and end markers ($'s). There are pretty much two options: the first option would be easier to do and would provide support for wildcard matching (later); the second option is much faster and more memory-efficient. For the same memory footprint, my 2-bit rank structure is about twice as fast as the 3-bit structure.
Today, I started building the 3-bit rank structure, since Chris tells me that we'll need it sometime in the future anyway and the 2-bit rank structure is still fresh in my mind. It looks like the structure works pretty much as desired, though I haven't done exhaustive bug tests yet. Also, I finished the BWT string builder integration (though not for multiple chromosomes): I've successfully built a BWT string out of chromosome 10 of the human genome, run the resultant string through my 2-bit rank structure, and gotten good ranks out of it, at least for the first 10 or 15 bases.
While typing this, I realized I forgot to change the save/load methods for the 3-bit rank structure, so I'll have to get on that tomorrow. Another thing I should do is build a 1-bit ('regular') rank structure. The problem with this structure is that I can't interleave the lower-level count blocks, since I only need one 16-bit count per sample, as opposed to four different counts, so interleaving them would be a big waste of memory. Instead, I'm just going to keep the counts separate from the rank structure.
Finally, for the initialization functions, I'm really considering using iterators instead of requiring an `operator[]` to be present. A `char *` is already an iterator in and of itself, and it's just a matter of implementing an iterator for the packed string. This way, it's much more flexible, and I might later be able to make a "streaming" class that loads from a file a little bit at a time.
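As an illustration of why iterators are nice here (this is not the real Init signature, just the idea): anything that supports `*` and `++` works, so a plain `char *` can be passed straight in, and the packed string only needs a small iterator class.

```cpp
#include <cstdint>

// Toy example: count occurrences of a base over any iterator range.
template <typename Iterator>
uint64_t CountBase(Iterator begin, Iterator end, char base) {
  uint64_t n = 0;
  for (Iterator it = begin; it != end; ++it)
    if (*it == base) ++n;
  return n;
}

// Usage with a raw C string, which is already an iterator:
//   const char* seq = "ACGTACGT";
//   uint64_t g_count = CountBase(seq, seq + 8, 'G');
```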
To do:
I changed the initialization functions from `StringType` and `length` to iterators for all the rank structures and finished (I hope) the 1-bit rank structure.
The change to iterators was actually surprisingly easy, because I had the rank blocks built separately from the count blocks (even though they were part of the same array). Hopefully, this will make everything more flexible. As a bonus, I also cleaned up the rank structure classes quite a bit, removing extraneous methods and updating the documentation.
The 1-bit rank structure wasn't too bad to code, since it's so simple. I had to get rid of the interleaved count blocks because it just wasn't practical, but the memory usage is quite good because I only have to store counts for 1's (`Rank0(position) = position - Rank1(position)`). I haven't really done exhaustive testing for this class, so it might still have lots of bugs.
Finally, no speed tests yet, but I should have them up by tomorrow.
To do:
First, we will do a `count` method, which simply counts the number of occurrences of a read in a reference. This method should be relatively simple (it can be found in the FM index paper). Then, we will do a `locate` function, which will actually give the position of the sequence in the reference. This `locate` function requires an additional rank structure that stores positions, sampled every few positions in the original string. Then, it's just a matter of backtracking to the last sampled position, and from there we can calculate the position of the match. Since sampling the original sequence requires another rank structure (1-bit this time), Chris argued that it might be better to sample positions in a different way to get rid of that rank structure. The disadvantage of this approach is that it can't guarantee finding a position in a constant number of backtracks, but it does allow for better sampling for the same amount of memory.
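For reference, the count itself is the standard FM-index backward search. Here is a sketch with placeholder accessors (`Rank(c, i)` = occurrences of `c` in the first `i` characters of the BWT, `C[c]` = number of BWT characters smaller than `c`); the real structure's interface will differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the number of occurrences of pattern in the indexed text.
uint64_t Count(const std::string& pattern,
               uint64_t (*Rank)(unsigned char c, uint64_t i),
               const uint64_t C[256],
               uint64_t bwt_length) {
  uint64_t sp = 0, ep = bwt_length;            // current suffix range [sp, ep)
  for (std::size_t k = pattern.size(); k > 0 && sp < ep; --k) {
    unsigned char c = pattern[k - 1];          // process the pattern right to left
    sp = C[c] + Rank(c, sp);                   // narrow the range by one character
    ep = C[c] + Rank(c, ep);
  }
  return ep - sp;
}
```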
Chris and I have also decided on a naming scheme, as follows:
I also benchmarked the various rank structures, shown below. These benchmarks were done on skorpios in the `/var/tmp/` folder. Times are reported over 25,000,000 calls to rank on a 512,000,000-base sequence with varied sampling rates. The graph reports time on the y-axis (milliseconds) vs memory usage on the x-axis (bytes). As can be seen, the StaticEDNASequence is much slower than the StaticDNASequence given the same memory usage, but one good thing is that it's not too much slower given the same sampling rate.
Today, I successfully implemented a `count` function using the StaticEDNASequence. I also tried to work out a nice hack for getting rid of the end markers when using the StaticDNASequence, but it doesn't seem possible. So right now, I'm going to build another data structure into the StaticDNASequence that keeps track of all the $'s and subtracts them from the A ranks. It might get complicated, but hopefully the speed difference won't be too big.
To do:
I finished the `count` method using both 3-bit ranks and 2-bit ranks, to get a good idea of what speed we may be able to achieve.
Finally, I did some benchmarks on the `count` function, compiled at `-O3`:
Sampling rate (bases) | Memory usage (bytes) | Type | SSE 4.2? | Time (ms) |
---|---|---|---|---|
64 | 1,784,768 | 2-bit | No | 27,212 |
256 | 1,339,168 | 2-bit | No | 46,663 |
64 | 2,973,048 | 3-bit | No | 31,454 |
256 | 2,081,848 | 3-bit | No | 52,876 |
2048 | 1,821,912 | 3-bit | No | 156,832 |
64 | 1,784,768 | 2-bit | Yes | 19,924 |
256 | 1,339,168 | 2-bit | Yes | 33,002 |
64 | 2,973,048 | 3-bit | Yes | 24,371 |
256 | 2,081,848 | 3-bit | Yes | 38,524 |
2048 | 1,821,912 | 3-bit | Yes | 90,328 |
For a (very) rough comparison, saligner, using readaligner's FM index, takes 64.461 seconds to do an exact mapping of 2.7 million reads (each 35 bases in length). This means it would take roughly 2,387 seconds to align 100,000,000 reads. However, this is not a very good comparison, because saligner is doing a `locate`, which is more complex; it has to read from an input file and write to an output file; and it runs at 32 bits. Also, its rank structure might use a different sampling rate, although I believe it uses a sampling rate of 256, which is why I included that test. So, this means we have quite a bit of room to work with!
Another note is that the 3-bit structure requires a sampling rate of 2048 to be roughly comparable in memory usage to the 2-bit structure. However, at that rate, it's much slower (roughly 5-6 times slower).
To do: