07/01/10
Played around with multithreading some more. I tried making a lightweight test file that just adds a value a few billion (probably more) times, to see whether it's really the aligner. I noticed that there is also a pretty big jump in user time between 2 and 3 threads, so that part might not be saligner's fault. However, the ratios were a bit closer to what they should have been (though not quite there). I forgot to record the numbers, but they weren't that crazy.
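A lightweight scaling test along the lines described above could look like the sketch below: each thread adds to a private value many times, so user time should grow roughly linearly with thread count while wall time stays flat (up to the core count). This uses std::thread as an assumption; the journal doesn't say which threading API the driver actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Each thread performs `iterations` additions into its own accumulator.
// If the work really is embarrassingly parallel, wall time should stay
// roughly constant as nThreads grows while total user time scales with it.
inline std::vector<uint64_t> busy_add(unsigned nThreads, uint64_t iterations) {
    std::vector<uint64_t> results(nThreads, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nThreads; ++t) {
        threads.emplace_back([&results, t, iterations] {
            volatile uint64_t sum = 0;  // volatile: keep the loop from being optimized away
            for (uint64_t i = 0; i < iterations; ++i) sum += 1;
            results[t] = sum;
        });
    }
    for (auto& th : threads) th.join();
    return results;
}
```

Timing the call with `time ./a.out` for 1, 2, and 3 threads would reproduce the user-time comparison described above.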
Also worked on the rank structure a little more. I can now build sequences greater than 64 bases in length (so as big as needed, within memory restrictions, of course). I figured out how to align to 16 bytes with
I was thinking about the calculation needed for the rank structure some more. Currently, the calculation goes something like this (for example, for a C): To do:
07/02/10
Fixed the bug in my rank function; it turns out I did miss a +1. So now I have a working rank structure that I can build and query for any length of sequence. The current implementation uses two parallel uint64_t arrays to store the two (first and second) bitsequences. Before I start an implementation that stores precalculated values, though, I want to do a few experiments with one array that holds each block sequentially.
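A minimal sketch of a rank query over the two parallel bit arrays described above, with no precalculated counts yet (just popcount over the prefix). The 2-bit encoding here (A=00, C=01, G=10, T=11, with `first` holding the high bit and `second` the low bit) is my assumption; the journal doesn't state which bit goes in which array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// rank(base, i) = number of occurrences of `base` in positions [0, i).
struct ParallelRank {
    std::vector<uint64_t> first;   // high bits, 64 bases per word (assumed layout)
    std::vector<uint64_t> second;  // low bits, 64 bases per word
    std::size_t length = 0;

    void append(char base) {
        std::size_t word = length / 64, bit = length % 64;
        if (bit == 0) { first.push_back(0); second.push_back(0); }
        uint64_t hi = (base == 'G' || base == 'T');
        uint64_t lo = (base == 'C' || base == 'T');
        first[word]  |= hi << bit;
        second[word] |= lo << bit;
        ++length;
    }

    // Combine the two bit arrays so that set bits mark occurrences of `base`.
    static uint64_t match(uint64_t hi, uint64_t lo, char base) {
        switch (base) {
            case 'A': return ~hi & ~lo;
            case 'C': return ~hi &  lo;
            case 'G': return  hi & ~lo;
            default : return  hi &  lo;  // 'T'
        }
    }

    std::size_t rank(char base, std::size_t i) const {
        std::size_t count = 0, word = 0;
        for (; (word + 1) * 64 <= i; ++word)  // whole words before position i
            count += __builtin_popcountll(match(first[word], second[word], base));
        std::size_t rem = i - word * 64;      // partial last word, if any
        if (rem != 0)
            count += __builtin_popcountll(
                match(first[word], second[word], base) & ((uint64_t{1} << rem) - 1));
        return count;
    }
};
```

The per-base `match` masks are where the "extra operation" differences between bases come from: T needs one AND, while A needs two complements and an AND.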
I also did a few tests with the new calculation method ( To do:
07/05/10
Finished an implementation of sequential blocks instead of storing the first and second blocks in two parallel arrays. Chris argued that this might make it faster because one set of two blocks can then be cached together through one cache line, since they're side by side (I think that's the gist of it).

I also did some benchmarks: on skorpios at O3, using a test sequence of 1,280,000 characters, ranking from position 1,230,000 to 1,280,000:
It's interesting to note that on the parallel implementation, ranking C and G is actually faster than T! This is strange because the calculation for T involves an
Using To do:
07/06/10
Had a discussion with Chris about experimental protocols. Turns out my experiment yesterday pretty much all fits in the cache (which is 6 MB), so there wouldn't be that much of a difference between sequential (interleaved) and parallel (normal). So we decided on a new protocol:
Also, I finished implementing count blocks; currently, each 64-bit block has a count block (which is 32 bits), with 4 count blocks per rank block, one for each base.
So here are the results, run on
To do:
07/07/10
Implemented another structure with interleaved counts, as well as a count structure with only 3 bases. Since skorpios is down right now, I just did some preliminary testing on my laptop (32-bit, Intel Core 2 Duo, no hardware popcount, 6 MB cache, 3.2 GB RAM).
Again, times are in wall time, compiled at
The 1024M times are again a bit dubious. I fixed the segfault bug, but I think my computer was running out of memory, which would explain the big jump in time. Regardless, I'm going to test these again on
I also experimented with some different popcount functions, found on Wikipedia.

Finally, I started implementing the sampling rates; there's currently a bug somewhere, but it shouldn't be too bad to fix.

PS: Before I forget, I realized I did actually lock the whole single-end output-processing section in my multithreaded driver (I originally thought I just locked the part where it writes to output, whereas it actually locks before that, when it's processing the SamEntry into a string). I should experiment with locking at a different place (though at this point, that's not easy unless I want to put locks in the SAM file class itself).

To do:
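For reference, here are sketches of the two popcount styles being compared throughout these entries: the byte-table lookup and the "bit-mangling" (SWAR) version, the latter being the well-known algorithm from the Wikipedia Hamming-weight article. These are generic textbook versions, not necessarily the exact functions in basics.h.

```cpp
#include <cassert>
#include <cstdint>

static uint8_t kByteCounts[256];
static bool init_table() {
    // kByteCounts[i] = number of set bits in byte i (built bottom-up).
    for (int i = 0; i < 256; ++i)
        kByteCounts[i] = static_cast<uint8_t>((i & 1) + kByteCounts[i >> 1]);
    return true;
}
static const bool kTableReady = init_table();

// Table lookup: one byte at a time, eight lookups per 64-bit word.
inline unsigned popcount_table(uint64_t x) {
    unsigned n = 0;
    for (int i = 0; i < 8; ++i) {
        n += kByteCounts[x & 0xFF];
        x >>= 8;
    }
    return n;
}

// SWAR "bit-mangling": sum adjacent bit fields in parallel, then use a
// multiply to add all the byte sums into the top byte.
inline unsigned popcount_swar(uint64_t x) {
    x -= (x >> 1) & 0x5555555555555555ULL;                                    // 2-bit sums
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);     // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                               // 8-bit sums
    return static_cast<unsigned>((x * 0x0101010101010101ULL) >> 56);
}
```

Which one wins depends on the machine: the table version leans on the cache, while the SWAR version is pure register arithmetic, which matches the later observation that the winner flips between machines.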
07/08/10
Skorpios wasn't back up till late, so I didn't really have a chance to benchmark. I did, however, finish my sampling-rate implementation; hopefully there aren't any bugs. I also managed to fix all the valgrind errors, so now everything (hopefully) runs well.
I also managed to finish an interleaved count implementation, which was much harder. In fact, I pretty much had to do a complete recode of the core build/rank functions. I need to run a few more tests on the implementation, but it should be running fine.
Since To do:
07/09/10
Finished up all the implementations for the rank structure (interleaved, 3-base, and interleaved counts, all with adjustable sampling rates) and benchmarked them. Rather than provide a huge table of all the benchmarks, I decided to show a chart:
Times are in milliseconds, tested on
In previous tests, I noted that the table version was faster than the bit-mangling version; it's interesting to note that in these implementations, the bit-mangling

Finally, I talked with Anne and Chris about a new project, possibly for my honours thesis, today. Chris proposed a GPGPU version of the Smith-Waterman aligner I vectorized earlier, or even a GPGPU-capable succinct index. Anne also suggested that I look into the RNA folding pathway research currently going on as well, and I'll probably start reading some papers on that soon.

To do:
Edit: I realized I forgot to benchmark everything WRT sampling rate, so that's another item on the to-do list.
07/10/10
Had a small idea that I wanted to try out quickly: replacing the popcount function in basics.h for NGSA. It did actually turn out to be faster, at least on skorpios. Times were 84.29 seconds vs. 79.75 seconds for tabular popcount vs. bit-mangling popcount. So maybe we should change the function in basics to the bit-mangling version?
07/12/10
Read over Daniel's presentation on rank structures and thought over how I'm going to implement the two-level structure today. As far as I can tell, Vigna's rank9 has no practical way of adjusting sampling rates, which is something we want. So I think I'll stick with the regular two-level structure, with a few modifications:
The implementation shouldn't be too bad. I've decided to stick with the interleaved counts implementation for now, where all the counts are interleaved with the sequence blocks. Adding a higher level would just mean adding more blocks, this time with a constant sampling rate of 1023. Ranks would just be the sum of the nearest low-level block and the nearest top-level block. I don't think I will interleave the top level blocks, since they're much too far apart to benefit anyway.
Another optimization I've been thinking about is to limit the sampling rate to powers of 2. That way, when finding the nearest blocks, I can just do bit shifts instead of costly divisions. I'm also thinking about using a sampling rate of 512 for the top-level blocks, so that I can use bit shifts as well. The difference in space usage isn't very large, less than a megabyte (in theory) for a 4-billion base sequence (calculation here
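The two-level lookup with power-of-two rates can be sketched as below. To keep it short this is a plain 1-bit bitvector rather than the real 2-bit interleaved structure, and the rates (low-level sample every 64-bit word, top-level every 512 words) are illustrative; the point is that all the index arithmetic is shifts and masks, never division.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TwoLevelRank {
    enum { kLowShift = 6,   // 64 bits per word  -> word = pos >> 6
           kTopShift = 9 }; // 512 words per top-level block -> top = word >> 9

    std::vector<uint64_t> bits;
    std::vector<uint32_t> low;   // 1-count since the last top-level sample, per word
    std::vector<uint64_t> top;   // absolute 1-count at each top-level sample

    explicit TwoLevelRank(const std::vector<uint64_t>& words) : bits(words) {
        uint64_t total = 0;
        uint32_t sinceTop = 0;
        for (std::size_t w = 0; w < bits.size(); ++w) {
            if ((w & ((std::size_t{1} << kTopShift) - 1)) == 0) {  // w % 512 == 0
                top.push_back(total);
                sinceTop = 0;
            }
            low.push_back(sinceTop);
            unsigned c = __builtin_popcountll(bits[w]);
            total += c;
            sinceTop += c;
        }
    }

    // Number of 1s in positions [0, i): nearest top-level sample + nearest
    // low-level sample + popcount of the remaining partial word.
    uint64_t rank1(uint64_t i) const {
        if (i == 0) return 0;
        uint64_t last = (i - 1) >> kLowShift;    // word containing bit i-1
        uint64_t t = last >> kTopShift;          // enclosing top-level block
        uint64_t rem = i - (last << kLowShift);  // 1..64 bits of that word
        uint64_t mask = (rem == 64) ? ~uint64_t{0} : ((uint64_t{1} << rem) - 1);
        return top[t] + low[last] + __builtin_popcountll(bits[last] & mask);
    }
};
```

With a 32-bit low-level count rolling over every 512 x 64 = 32768 bits, the low-level counters could even be narrower, which is the space argument for keeping the top-level rate fixed.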
Finally, here is the graph I promised on Friday, compared to sampling rate. Tests were done on
It's a little hard to tell, but it looks like the count interleaved implementation shows minor improvements in speed (about 5%) up until sampling rate=128, where it starts to get much slower. To do:
07/13/10
Finished up the two-level rank structure. I ended up deciding to take a sample for the top level every 512 blocks instead of every 1023 blocks. There's little speed difference, it's easier to calculate, and it's not much of a space difference, since the top level takes up so little anyway. I also decided to go with using bit shifts. So now, any sampling factor that is input is rounded down to the next smallest power of two. This way, I can use bit operations instead of divisions and multiplications, which are slightly slower. Running a few tests, I see about a 10% performance gain.

To do:
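Rounding down to the next smallest power of two has a standard branch-free form: smear the highest set bit rightward, then keep only that bit. This is a generic sketch of the trick, not the project's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Round x down to the nearest power of two (floor_pow2(0) == 0).
inline uint64_t floor_pow2(uint64_t x) {
    if (x == 0) return 0;
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;   // smear the top bit
    x |= x >> 8;  x |= x >> 16; x |= x >> 32;  // ...all the way down
    return x - (x >> 1);                       // keep only the top bit
}

// log2 of a power of two, so pos / rate == pos >> shift
// and pos % rate == pos & (rate - 1).
inline unsigned log2_pow2(uint64_t pow2) {
    unsigned s = 0;
    while (pow2 >>= 1) ++s;
    return s;
}
```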
07/15/10
Whoops, I completely forgot to make a post yesterday, so I'll make a post for both yesterday and today, today.
Yesterday, I managed to actually finish up my two-level implementation (turns out there were a few bugs in certain cases, due to +1/-1 errors). I also implemented save/load functionality, so that the structure itself can be saved. To use, just call
Today, I managed to finish my 32-bit implementation, which runs much faster on my machine (the operating system is 32-bit). I also did a bunch of rigorous tests on both the 32-bit and 64-bit versions of the code, trying to fix all the bugs I could find. Both implementations are now pretty much valgrind-error-free, and I'm pretty sure they both work correctly, too. One very annoying bug I had was during memory allocation, where I had an expression,
I also did some tests on both implementations. Initially, I found the 32-bit version to be faster on

Finally, I used the save functionality to test the memory size, just to make sure everything was going right. Using my two-level 64-bit implementation, I put in a 1,024,000,000-character sequence with a sampling factor of 1 (so sampling rate = 64) and saved it to a file. The file turned out to be 366.7 MB, which should be roughly the memory footprint of the structure. At 32 bits, with the same parameters (note sampling factor = 1, so sampling rate = 32), I get a file size of 488.8 MB. At sampling factor = 2 (sampling rate = 64), I get 366.7 MB again, exactly the same as the 64-bit implementation.

To do:
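Those file sizes can be sanity-checked with back-of-the-envelope arithmetic. The assumptions here are mine, not stated in the journal: 2 bits per base for the sequence, four 16-bit low-level counts per sample, "MB" meaning MiB, and top-level blocks and headers ignored (so the estimate should come out slightly under the observed number).

```cpp
#include <cassert>
#include <cstdint>

// Estimated structure size in MiB for a given sequence length and
// low-level sampling rate (bases per sample).
inline double estimated_mib(uint64_t bases, uint64_t samplingRate) {
    double seqBits   = 2.0 * bases;                               // 2-bit encoding
    double countBits = (double)bases / samplingRate * 4 * 16;     // 4 x 16-bit counts/sample
    return (seqBits + countBits) / 8.0 / (1024.0 * 1024.0);
}
// Under these assumptions:
//   rate 64 -> ~366.2 MiB (observed file: 366.7 MB)
//   rate 32 -> ~488.3 MiB (observed file: 488.8 MB)
```

Both estimates land within half a megabyte of the observed files, which is consistent with the leftover being the top-level blocks plus header.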
07/16/10
Hopefully, I've finalized my rank structure implementation. I ran as many tests as I could think of on both versions and managed to fix a small bug. Hopefully, it's all done now.
Also did a little reading on
Finally, had a long discussion with Daniel about his new To do:
07/19/10
Started finalizing the rank structure. I've integrated the 32-bit and 64-bit versions using a CMake file. That took a stupidly long time, because I've never been that great with CMake; I ended up reading the getting-started guide to get it working. I also went through the documentation in my rank structure and made corrections here and there. Chris also suggested implementing a "memory budget" feature, but I'll have to get more details on that before I get started.

I've started reviewing how the FM index works, in preparation for building it. More details on that tomorrow, though.

To do:
07/20/10
Had a long discussion with Chris today about the rank structure, and on implementing the FM index in the near future. Some important points:
Also, I revisited the 32-bit code and was actually wondering if it was significantly faster than the 64-bit rank implementation on 32-bit systems. Chris suggested I run some tests on To do:
07/21/10
Had a meeting with Chris and Anne today and talked about goals for the project. See Daniel's journal for the details.
Also ran some tests on

I looked into Daniel's packed string class as well. I haven't actually started adding/modifying it yet, because I figure it'll be easier when I start merging the new index in; it's going to require a little CMake tweaking as well, because I want to give it a 64-bit implementation too. Some changes I would like to make:
Finally, I started looking into BWT builders. The BWT builder is in the To do:
07/23/10
Started to look into the BWT builders. First, I tried using Karkkainen's builder, but it wouldn't compile on skorpios, even though it compiled fine on my machine (both use gcc 4.4.1). So I went back to readaligner's BWT, incbwt. I ended up taking the whole incbwt folder, because they were pretty much all required classes. Building a BWT works through the class RLCSABuilder.
Initially, I had concerns over a FIXME type comment in the actual
Also, I don't think using packed strings with the

So my plan now is to just build a BWT string, pack it, run it through my rank structure, and then see how that goes from there.

To do:
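For small correctness checks of that build-then-rank pipeline, a toy BWT builder is enough; the sketch below sorts rotations directly. This is nothing like incbwt/RLCSABuilder (it is O(n² log n), so only for test strings), and it assumes the text already ends with a unique, lexicographically smallest end marker '$'.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build the BWT by sorting all rotations of `text` and taking the last
// column (the character preceding each sorted rotation).
inline std::string naive_bwt(const std::string& text) {
    std::size_t n = text.size();
    std::vector<std::size_t> rot(n);
    for (std::size_t i = 0; i < n; ++i) rot[i] = i;
    std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t k = 0; k < n; ++k) {  // compare rotations character by character
            char ca = text[(a + k) % n], cb = text[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    std::string bwt(n, '\0');
    for (std::size_t i = 0; i < n; ++i)
        bwt[i] = text[(rot[i] + n - 1) % n];
    return bwt;
}
// Classic example: naive_bwt("banana$") == "annb$aa"
```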
07/26/10
Whoops, forgot to do a post on Friday. On Friday, I worked on getting the BWT string builder integrated. Readaligner's TextCollectionBuilder class specifies some options for an FM-index-compatible BWT string, so I just used those settings (documented in the code). Also had a discussion with Chris on matching N's and end markers ($'s). There are pretty much two options:
The first option would be easier to do, and would provide support for wildcard matching (later). The second option is much faster and more memory-efficient: for the same memory footprint, my 2-bit rank structure is about twice as fast as the 3-bit structure.

Today, I started building the 3-bit rank structure, since Chris tells me that we'll need it sometime in the future anyway and the 2-bit rank structure is still fresh in my mind. It looks like the structure works pretty much as desired, though I haven't done exhaustive bug tests yet. Also, I finished the BWT string builder integration (though not for multiple chromosomes): I've successfully built a BWT string out of chromosome 10 of the human genome, run the resultant string through my 2-bit rank structure, and got good ranks out of it, at least in the first 10 or 15 bases.

While typing this, I realized I forgot to change the save/load methods for the 3-bit rank structure, so I'll have to get on that tomorrow. Another thing I should do is build a 1-bit ('regular') rank structure. The problem with this structure is that I can't interleave the lower-level count blocks, since I only need one 16-bit count per sample, as opposed to four different counts, so it would be a big waste of memory if they were interleaved. Instead, I'm just going to keep the counts separate from the rank structure.
Finally, for the initialization functions, I'm really considering using iterators instead of requiring an To do:
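The iterator-based initialization being considered would look something like the sketch below: the builder takes any forward-iterator range of bases instead of a specific string type, so std::string, a packed string with its own iterator, or a plain char array all work. The BaseCounts struct and build_counts name are hypothetical stand-ins for the real build functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct BaseCounts {
    std::size_t counts[4] = {0, 0, 0, 0};  // A, C, G, T
};

// Templated on the iterator type, so the rank structure never needs to
// know what container the bases live in.
template <typename ForwardIt>
BaseCounts build_counts(ForwardIt begin, ForwardIt end) {
    BaseCounts bc;
    for (ForwardIt it = begin; it != end; ++it) {
        switch (*it) {
            case 'A': ++bc.counts[0]; break;
            case 'C': ++bc.counts[1]; break;
            case 'G': ++bc.counts[2]; break;
            case 'T': ++bc.counts[3]; break;
        }
    }
    return bc;
}
```

The same signature accepts `s.begin(), s.end()` for a std::string or `p, p + n` for a raw buffer, which is the flexibility argument for the change.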
07/27/10
Pretty much finished up what I wanted to do from yesterday. The 3-bit rank structure has its bugs fixed, I changed the initialization from a StringType and length to iterators for all the rank structures, and I finished (I hope) the 1-bit rank structure.
The change to iterators was actually surprisingly easy, because I had the rank blocks built separately from the count blocks (even though they were part of the same array). Hopefully, this will make everything more flexible. As a bonus, I also cleaned up the rank structure classes quite a bit, removing extraneous methods and updating the documentation.
The 1-bit rank structure wasn't too bad to code, since it's so simple. I had to get rid of the interleaved count blocks because they just weren't practical, but the memory usage is quite good because I only have to store counts for 1's (

Finally, no speed tests yet, but I should have them up by tomorrow.

To do:
07/29/10
Yesterday, Chris and I met to discuss the next steps in building the FM index. Since the rank structures are done, we can pretty much get started on all the FM index functions. First off, we will start with implementing a count method, which simply counts the number of occurrences of a read in a reference. This method should be relatively simple (it can be found in the FM index paper). Then we will do a locate function, which will actually give the position of the sequence in the reference. This locate function requires an additional structure that stores positions, sampled every few positions in the original string; then it's just a matter of backtracking to the last sampled position, from which we can calculate the position of the match. Since sampling the original sequence requires another rank structure (1-bit this time), Chris argued that it might be better to sample positions in a different way to get rid of that rank structure. The disadvantage of this approach is that it can't guarantee finding a position in a constant number of backtracks, but it does allow for better sampling for the same amount of memory.
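The count method described above is standard FM-index backward search; here is a sketch using a naive O(n) rank over the BWT string in place of the real 2-bit rank structure. C[c] is the number of BWT characters smaller than c, and occ(c, i) counts occurrences of c in bwt[0, i).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Count occurrences of `pattern` in the text whose BWT is `bwt`,
// by shrinking the suffix-array interval [sp, ep) one pattern
// character at a time, right to left.
inline std::size_t fm_count(const std::string& bwt, const std::string& pattern) {
    // Build C: cumulative counts of all characters smaller than each key.
    std::map<char, std::size_t> C;
    for (char c : bwt) ++C[c];
    std::size_t before = 0;
    for (auto& kv : C) { std::size_t n = kv.second; kv.second = before; before += n; }

    // Naive rank; the real index would answer this in O(1) from count blocks.
    auto occ = [&](char c, std::size_t i) {
        return (std::size_t)std::count(bwt.begin(), bwt.begin() + i, c);
    };

    std::size_t sp = 0, ep = bwt.size();
    for (auto it = pattern.rbegin(); it != pattern.rend(); ++it) {
        if (!C.count(*it)) return 0;       // character never occurs at all
        sp = C[*it] + occ(*it, sp);
        ep = C[*it] + occ(*it, ep);
        if (sp >= ep) return 0;            // interval emptied: no matches
    }
    return ep - sp;
}
// With the BWT of "banana$" ("annb$aa"): fm_count("annb$aa", "ana") == 2
```

The locate function then extends this with sampled positions and LF-mapping backtracks from sp..ep, exactly as described above.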
Chris and I have also decided on a naming scheme, as follows:
I also benchmarked the various rank structures, shown below. These benchmarks were done on
Today, I successfully implemented a To do:
07/30/10
Started to implement the new StaticDNASequence. The build process is going to get a little ugly, but the ranks should still be relatively clean (which is what matters). To make it easier on myself, I'm just going to do one pass through the whole sequence first to find all the N's and build the new data structure, then build the rest of the rank structure (rank blocks + count blocks). This is kind of wasteful, because I could actually build the $ structure while building the rank structure (in just one pass), but I think it might be harder given my current code. So the build process will be a bit slower (which I think is okay...).
Finally, I did some benchmarks on the
For a (very) rough comparison,

Another note is that the 3-bit structure requires a sampling rate of 2048 to be roughly comparable in memory usage. However, at this rate, it's much slower than the 2-bit structure (roughly 5-6 times slower).

To do: