
07/01/10

Played around with multithreading some more. I tried making a lightweight test file that just adds a value a few billion (probably more) times, to see whether the slowdown really comes from the aligner. I noticed that there's also a pretty big jump in user time between 2 and 3 threads, so that part might not be saligner's fault. However, the ratios were closer to what they should have been (though still not quite right). I forgot to record the numbers, but they weren't that far off.

Also worked on the rank structure a little more. I can now build sequences greater than 64 bases in length (so as big as needed, within memory restrictions of course). I figured out how to align to 16 bytes with posix_memalign. I've almost completed the rank function as well; I'm just hitting a small bug with a boundary condition (I'm pretty sure I'm just missing a +1 somewhere).

I was thinking about the calculation needed for the rank structure some more. Currently, the calculation goes something like this (for example, for a C): (~second & first) >> (BITS-i-1). Chris told me that this takes something like three operations, since operations that are independent can be done in parallel. I figure the following equation would also work: (~second & first) & (0xFFFF... >> (BITS-i)) (the two formulas aren't numerically equivalent, but they should give the same popcount result). I'm not sure if this will do anything, but it looks like I might be able to drop the operation count to 2 if I can somehow calculate the mask quickly. I'm going to have to play around with that a bit more.
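A minimal sketch of the identity the trick relies on (the helper names are mine, and the assumption that C encodes as first=1, second=0 is mine too, not saligner's actual code):

```cpp
#include <cstdint>

// Bits at or above position k contribute the same popcount whether we
// shift them down or mask everything below them away; the two results
// differ numerically but are "popcount equivalent".
uint64_t keep_by_shift(uint64_t x, int k) { return x >> k; }
uint64_t keep_by_mask(uint64_t x, int k)  { return x & (~0ULL << k); }

// C-matching bits of a block split into two bit-planes:
uint64_t c_bits(uint64_t first, uint64_t second) { return ~second & first; }
```

Whether the mask variant actually wins depends on how cheaply the mask itself can be produced, which is exactly the open question above.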

To do:

  • Fix that bug with the rank.
  • Play around with the rank calculation.
  • Find out what's going on with multithreading.
  • Start implementing precalculated blocks.

07/02/10

Fixed the bug in my rank function; it turns out I did miss a +1. So now I have a working rank structure that I can build and query for any length of sequence. The current implementation uses two parallel uint64_t arrays to store the two (first and second) bit-sequences. Before I start an implementation that stores precalculated values, though, I want to do a few experiments with a single array that holds each block sequentially.

I also did a few tests with the new calculation method, (~second & first) & (0xFFFF...FF >> (BITS-i)). I tried making an array of length BITS to store the precalculated masks for every i. That might lead to fewer operations (I'm not really sure on that point). Preliminary tests didn't show much difference, especially at -O3. The strange thing is, I compiled a minimal rank function using each of the two calculations to assembly, and at -O3 there wasn't even any difference in the assembly code! Strange, since the two aren't numerically equivalent, just "popcount equivalent".

To do:

  • Explore sequential vs parallel blocks
  • Add precalculated blocks.

07/05/10

Finished an implementation of sequential blocks, instead of storing the first and second blocks in two parallel arrays. Chris argued that this might be faster because a pair of blocks can then be cached together in one cache line, since they're side by side (I think that's the gist of it).
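Illustrative index math for the two layouts (a sketch, not the actual saligner code):

```cpp
#include <cstddef>

// Parallel: first[i] and second[i] live in two separate arrays, so a
// block's two planes may fall in different cache lines. Sequential
// (interleaved): both planes of block i sit side by side in one array,
// so a single cache line can cover the pair.
inline std::size_t first_index(std::size_t block)  { return 2 * block; }
inline std::size_t second_index(std::size_t block) { return 2 * block + 1; }
```

With 64-bit words the pair spans 16 bytes, which is also where the 16-byte posix_memalign alignment mentioned earlier comes in handy.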

I also did some benchmarks on skorpios at -O3, using a test sequence of 1,280,000 characters and ranking positions 1,230,000-1,280,000:

Implementation   A (ms)   C (ms)   G (ms)   T (ms)
Parallel           5164     4661     4661     4921
Sequential         5025     4793     4792     4858

It's interesting to note that in the parallel implementation, ranking C and G is actually faster than ranking T! This is strange because the calculation for T involves only an & operation on the two blocks, whereas C/G involves an & and a ~ operation. The times are more in line with expectations under the sequential implementation, although T ranking should still be faster (it has fewer operations!). What's more surprising is that the C/G speeds are faster under parallel!

Using cachegrind, it seems there are far fewer (about 75% fewer) "D refs" when ranking C/G than A/T in the parallel implementation ("D refs" are data references, i.e. loads and stores). Specifically, far fewer D writes occur in the C/G cases (reads are fewer as well, but not as markedly as writes). Furthermore, under the sequential implementation, the D references for all bases are on par with the A/T rank functions of the parallel implementation. It seems there may be some specific optimization the compiler makes in the parallel C/G case. I want to investigate this further; ideally, I can somehow end up with the parallel C/G times alongside the sequential A/T times.

To do:

  • Investigate above behaviour a little more.
  • Start implementing precomputed blocks.

07/06/10

Had a discussion with Chris about experimental protocols. It turns out my experiment yesterday pretty much fit entirely in the cache (which is 6 MB), so there wouldn't be much benefit from sequential (interleaved) over parallel (normal) blocks. So we decided on a new protocol:
  • Use larger sequences (>=16,000,000 bases)
  • Use random access. More specifically, build an array of random numbers, then use those numbers to call random ranks. This should prevent streaming.
  • Try both hardware and software popcount.
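A sketch of the random-access part of the protocol (make_positions is an illustrative helper, not project code; the real driver would time the rank calls made over these positions):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Precompute random rank positions with a fixed seed, so the timed loop
// jumps around memory instead of streaming through it sequentially.
std::vector<std::size_t> make_positions(std::size_t n_calls,
                                        std::size_t seq_len,
                                        unsigned seed) {
    std::srand(seed);
    std::vector<std::size_t> pos(n_calls);
    for (std::size_t i = 0; i < n_calls; ++i)
        pos[i] = static_cast<std::size_t>(std::rand()) % seq_len;
    return pos;
}
```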

Also, I finished implementing count blocks; currently, each 64-bit block has a count block (which is 32 bits), with 4 count blocks per rank block, one for each base.
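One way the count-block layout just described could look; the field order and exact packing are my assumptions:

```cpp
#include <cstdint>

// Four 32-bit counts (one per base) attached to each rank block,
// which itself is two 64-bit bit-planes covering 64 bases.
struct CountedBlock {
    uint32_t counts[4];  // A, C, G, T occurrences before this block
    uint64_t first;      // first bit-plane of the 64 bases
    uint64_t second;     // second bit-plane
};
```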

So here are the results, run on skorpios at -O3, with 40,000,000 rank calls each. Times are wall time:

Sequence length (*10^6)   Normal (ms)   Interleaved (ms)   Normal + HW popcount (ms)   Interleaved + HW popcount (ms)
   1                          792           750                244                         237
   2                          864           861                265                         250
   4                          924           935                274                         255
   8                         1134          1063                449                         363
  16                         1567          1460                752                         639
  64                         3857          3018               1242                        1034
 256                         4495          3910               1901                        1358
2048                         5332          4639               2915                        1957
Note that the 2048M times may not be quite as accurate. The test at 1024M segfaulted for all four implementations (later note: this turned out to be a problem in the driver, where I used a signed int by accident).

To do:

  • Try out using a different count structure (only store 3/4 bases, and calculate the last base) and see the performance hit.
  • Try out interleaving the count structure as well.
  • Implement adjustable sampling rates.

07/07/10

Implemented another structure with interleaved counts, as well as a count structure with only 3 bases. Since skorpios is down right now, I just did some preliminary testing on my laptop (32-bit OS, Intel Core 2 Duo, no hardware popcount, 6 MB cache, 3.2 GB RAM).

Again, times are in wall time, compiled at -O3, single threaded and run with 50,000,000 ranks each.

Sequence length (*10^6)   Normal (ms)   Interleaved (ms)   Interleaved + counts (ms)   Interleaved, 3-base (ms)
   1                         1046           985                 1105                       1013
   8                         1382          1202                 1208                       1245
  16                         3276          2768                 2137                       2522
  32                         5059          4525                 3556                       4474
  64                         5708          5100                 4269                       5292
 128                         5901          5303                 4617                       5545
 256                         5985          5387                 4834                       5686
 512                         6111          5556                 4989                       5786
1024                         7967          7392                 6690                       7653

The 1024M times are again a bit dubious. I fixed the segfault bug, but I think my computer was running out of memory, which would explain the big jump in time. Regardless, I'm going to test these again on skorpios when it's back. It doesn't look like there's too much of a performance hit with the 3-base implementation, but if we stick with interleaved counts, there's really no point in going that route, since the four 32-bit counts fit nicely into two 64-bit "rank blocks". Using only 3 counts wouldn't save memory anyway, since those 32 bits would still be there.

I also experimented with some different popcount functions found in the Wikipedia entry. So far, it seems the current popcount function (which I took from readaligner) is faster than the one proposed by Wikipedia. However, the current function relies on a table of values, which might make it slower when there are a lot of cache misses on the table, though it outperformed Wikipedia's function even at 1024M bases (I don't have the exact values; that's another thing to benchmark).
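For reference, here are the two software popcounts under comparison, sketched from memory (the table version approximates readaligner's approach; the "bit-mangling" one is the SWAR method from the Wikipedia Hamming weight entry):

```cpp
#include <cstdint>

// Table version: one lookup per byte of the word.
static uint8_t kPopTable[256];
static struct TableInit {
    TableInit() {
        for (int i = 0; i < 256; ++i) {
            int c = 0;
            for (int b = i; b; b >>= 1) c += b & 1;
            kPopTable[i] = static_cast<uint8_t>(c);
        }
    }
} kTableInit;

int popcount_table(uint64_t x) {
    int c = 0;
    for (int i = 0; i < 8; ++i) { c += kPopTable[x & 0xFF]; x >>= 8; }
    return c;
}

// "Bit-mangling" SWAR version: registers only, so no cache misses.
int popcount_swar(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return static_cast<int>((x * 0x0101010101010101ULL) >> 56);
}
```

The table version touches memory on every call, so it competes with the rank blocks for cache, which is one plausible reason its advantage could evaporate on large sequences.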

Finally, I started implementing the sampling rates; there's currently a bug somewhere, but it shouldn't be too bad to fix.

PS: Before I forget, I realized I actually locked the whole single-end output processing section in my multithreaded driver (I originally thought I only locked the part that writes the output, but it actually locks earlier, while processing the SamEntry into a string). I should experiment with locking at a different place (though at this point, that's not easy unless I put locks in the SAM file class itself).

To do:

  • Benchmark above on skorpios.
  • Finish sampling rate implementation.
  • Get started on 32-bit implementation?

07/08/10

Skorpios wasn't back till late so I didn't really have a chance to benchmark. I did, however, finish my sampling rate implementation. Hopefully, there aren't any bugs. I also managed to fix all valgrind errors, so now everything runs (hopefully) well.

I also managed to finish an interleaved count implementation, which was much harder. In fact, I pretty much had to do a complete recode of the core build/rank functions. I need to run a few more tests on the implementation, but it should be running fine.

Since skorpios is back now, I'll probably do a bunch of experiments tomorrow. I should also make up a 3-base sampling rate implementation, just to test the cost; again, however, if we end up going with the count-interleaved version, the 3-base version really isn't very practical.

To do:

  • Benchmark!
  • 3-base implementation (for sampling-rate enabled, interleaved)
  • 32-bit implementation, look into two-level rank structure.

07/09/10

Finished up all the implementations of the rank structure (interleaved, 3-base, and interleaved counts, all with adjustable sampling rates) and benchmarked them. Rather than provide a huge table of all the benchmarks, I decided to show a graph:

Times are in milliseconds, tested on skorpios with a constant seed and 75,000,000 rank calls per test. Squares are interleaved, diamonds are 3-base, and triangles are interleaved-count implementations. The three colours correspond to different popcount functions: blue is the function provided in readaligner, which uses a table; yellow is the one provided by Wikipedia, which uses a bit-mangling method; and orange uses the hardware instruction.

Rank graph

In previous tests, I noted that the table version was faster than the bit-mangling version; it's interesting that in these implementations, the bit-mangling popcount actually does better. My thought is that once everything no longer fits nicely in the cache, the popcount table may encounter cache misses, which slow it down dramatically. It's also interesting that when using the tabular popcount, interleaving the counts is actually the slowest implementation. Again, this might be due to excessive cache misses from having to cache the table, and it's actually what led me to try the bit-mangling popcount one more time. It's nice to see that interleaving the counts has some effect, though, especially in the hardware popcount case.

Finally, I talked with Anne and Chris about a new project, possibly for my honours thesis, today. Chris proposed a GPGPU version of the Smith-Waterman aligner I vectorized earlier, or even a GPGPU-capable succinct index. Anne also suggested that I look into the RNA folding pathway research currently going on, and I'll probably start reading some papers on that soon.

To do:

  • Look into both the GPGPU project and get more familiarized with the RNA folding pathway project
  • Start implementing a second level in the rank structure
  • Look into a 32-bit implementation

Edit: I realized I forgot to benchmark everything WRT sampling rate, so that's another item on the to-do list.

07/10/10

Had a small idea that I wanted to try out quickly: replacing the popcount function in basics.h of NGSA. It did actually turn out to be faster, at least on skorpios: 84.29 seconds for the tabular popcount vs 79.75 seconds for the bit-mangling version. So maybe we should change the function in basics to the bit-mangling version?

07/12/10

Read over Daniel's presentation on rank structures and thought over how I'm going to implement the two-level structure today. As far as I can tell, Vigna's rank9 has no practical way of adjusting sampling rates, which is something we want. So, I think I'll stick with the regular two-level structure with a few modifications:
  • There is one "block" per 64 characters.
  • The top level will be 32-bit unsigned integers.
  • The bottom level will be made of 16-bit unsigned integers.
  • There will be one bottom level count every (sampling rate) blocks.
• There will be one top-level count every 1023 blocks (the maximum number of blocks that can be counted with 16 bits, since 1023 × 64 < 2^16).

The implementation shouldn't be too bad. I've decided to stick with the interleaved counts implementation for now, where all the counts are interleaved with the sequence blocks. Adding a higher level would just mean adding more blocks, this time with a constant sampling rate of 1023. Ranks would just be the sum of the nearest low-level block and the nearest top-level block. I don't think I will interleave the top level blocks, since they're much too far apart to benefit anyway.
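A sketch of the two-level sum (widths as described above; the names are illustrative and the final in-block popcount scan is elided):

```cpp
#include <cstdint>
#include <vector>

// top[] holds one absolute 32-bit count every top_rate blocks; bottom[]
// holds one 16-bit count, relative to the last top sample, every
// sample_rate blocks. A rank is their sum plus an in-block scan.
struct TwoLevelCounts {
    std::vector<uint32_t> top;
    std::vector<uint16_t> bottom;
};

uint64_t rank_base(const TwoLevelCounts& c, uint64_t block,
                   uint64_t top_rate, uint64_t sample_rate) {
    return static_cast<uint64_t>(c.top[block / top_rate]) +
           c.bottom[block / sample_rate];
}
```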

Another optimization I've been thinking about is to limit the sampling rate to powers of 2. That way, when finding the nearest blocks, I can do bit shifts instead of costly divisions. I'm also thinking about using a sampling rate of 512 for the top-level blocks, so that I can use bit shifts there as well. The difference in space usage isn't very large: less than a megabyte (in theory) for a 4-billion base sequence, multiplied by 4 for all four bases.
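The shift trick in miniature (illustrative helpers):

```cpp
#include <cstdint>

// With the sampling rate restricted to a power of two (rate = 1 <<
// log_rate), the nearest-sample lookup becomes a shift, not a division.
uint64_t nearest_by_div(uint64_t pos, uint64_t rate)  { return pos / rate; }
uint64_t nearest_by_shift(uint64_t pos, int log_rate) { return pos >> log_rate; }
```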

Finally, here is the graph I promised on Friday, compared to sampling rate. Tests were done on skorpios with sequence size of 512,000,000 over 25,000,000 times each and a constant seed:

Rank graph (sampling rate)

Notes: "Interleaved" is the normal interleaved implementation, "3-base" is the slightly more memory-efficient method that keeps track of only 3 bases, and "Count-Interleaved" is the version with counts interleaved into the rank blocks. Blue dots are versions with the tabled popcount, yellow dots have the bit-mangling popcount, and orange dots use the SSE4.2 hardware instruction.

It's a little hard to tell, but it looks like the count-interleaved implementation shows minor improvements in speed (about 5%) up until a sampling rate of 128, where it starts to get much slower.

To do:

  • Finish implementation of 2-level rank structure
  • Start on 32-bit implementation

07/13/10

Finished up the two-level rank structure. I ended up going with a top-level sample every 512 blocks instead of every 1023. There's little speed difference, it's easier to calculate, and it's not much of a space difference, since the top level takes up so little anyway. I also decided to go with bit shifts.

So now, any sampling factor that is input is rounded down to the nearest power of two. This way, I can use bit operations instead of divisions and multiplications, which are slightly slower. Running a few tests, I see about a 10% performance gain.

To do:

  • Start on 32-bit implementation

07/15/10

Whoops, I completely forgot to make a post yesterday, so I'll make a post for yesterday and today, today.

Yesterday, I managed to actually finish up my two-level implementation (it turns out there were a few bugs in certain cases, due to +1/-1 errors). I also implemented save/load functionality, so the structure itself can be saved. To use it, just call RankStructure::Save on a RankStructure that has already been initialized (by calling RankStructure::Init). Loading works the same way: just call RankStructure::Load with a filename supplied. Note that when loading a RankStructure, RankStructure::Init doesn't have to be called first. I'm not sure whether I should be taking in a file name as a string or just a stream pointer; readaligner's rank structure seems to take a pointer, so I might switch to that. The only problem is that I'd have to assume the file has been opened as binary.

Today, I managed to finish my 32-bit implementation, which runs much faster on my machine (the operating system is 32-bit). I also ran a bunch of rigorous tests on both the 32-bit and 64-bit versions, trying to fix all the bugs I could find. Both implementations are now pretty much valgrind-error free, and I'm fairly sure they both work correctly, too. One very annoying bug was in memory allocation, where I had the expression size << 3, with size a 32-bit unsigned integer. I later "generalized" this to size * BITS/8, which ended up blowing up at sizes around 2,048,000,000 characters. It turns out the multiplication is done first, which overflows a 32-bit integer and rolls over. Sigh...
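A reconstruction of that overflow for the record (unsigned arithmetic wraps, so it fails silently rather than crashing at the multiply):

```cpp
#include <cstdint>

const uint32_t BITS = 64;

// size * BITS is evaluated first, in 32-bit arithmetic, and wraps for
// large sequences; widening before the multiply fixes it.
uint64_t buggy_bytes(uint32_t size) { return size * BITS / 8; }
uint64_t fixed_bytes(uint32_t size) {
    return static_cast<uint64_t>(size) * BITS / 8;
}
```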

I also did some tests on both implementations. Initially, I found the 32-bit version to be faster on skorpios, which was very strange. However, I realized that this was because of the way I chose to represent the sampling rate (I chose to represent it in blocks, so a "sampling factor" of 1 is actually a sampling rate of either 64 or 32, depending on the implementation). So, the 64-bit implementation actually had twice the sampling rate given the same "sampling factor", which was what made it slower. Accounting for this, the 64-bit implementation is (rightly) faster.

Finally, I used the save functionality to test the memory size, just to make sure everything was going right. Using my two-level 64-bit implementation, I put in a 1,024,000,000 character sequence with a sampling factor of 1 (so sampling rate = 64) and saved it to a file. The file turned out to be 366.7 MB, which should be roughly the memory footprint of the structure. At 32 bits, with the same parameters (note sampling factor = 1, so sampling rate = 32), I get a file size of 488.8 MB. At sampling factor = 2 (sampling rate = 64), I get 366.7 MB again, which is exactly the same as the 64-bit implementation.

To do:

  • Not really sure, actually. I'm going to finalize the rank structure, then either...
  • Do some work on the Local Aligner, like optimize the FullAlignment methods
  • Read some papers/articles on GPGPUs and CUDA and also RNA folding pathways to decide a bit better on what I would do for my thesis.

07/16/10

Hopefully, I've finalized my rank structure implementation. Ran as many tests as I could think of on both versions and managed to fix a small bug. Hopefully, it's all done now.

Also did a little reading on CUDA and GPGPUs. I found one paper detailing a GPGPU-based Smith-Waterman implementation, which I briefly glanced through, and I also went through the CUDA developer guide.

Finally, had a long discussion with Daniel about his new BowtieK1Mapper and multithreading. Daniel's new implementation seems to modify the index a little during the mapping process (he "flips" it), which doesn't play well with multithreading, because my multithreading assumes the mapping function is constant and can be run independently. Long story short, we decided to hold off on modifying the class, since we'll be implementing a new FM index soon, which will no doubt involve a lot of changes (the least of which is probably getting rid of the NGS_FMIndex class, which just acts as a middleman for readaligner's FMIndex). I've also decided to hold off on committing my multithreading merge to the dev trunk, because it would introduce compile errors with BowtieK1Mapper in its current state.

To do:

  • More reading.
  • Talk to Chris about the FM index.

07/19/10

Started finalizing the rank structure. I've integrated the 32-bit and 64-bit versions using a CMake file. That took a stupidly long time because I've never been that great with CMake. I ended up reading the getting started guide to get it working. Also went through the documentation in my rank structure and made corrections here or there. Chris also suggested implementing a "memory budget" feature, but I'll have to get more details on that before I get started.

I've started reviewing how the FM index works, in preparation for building it. More details on that tomorrow, though.

To do:

  • Discuss with Chris about what to do next, get feedback, etc.

07/20/10

Had a long discussion with Chris today about the rank structure, and on implementing the FM index in the near future. Some important points:
  • Rank structure is nearly finished, just a few minor changes:
    • Should change sampling rate to reflect bases, not blocks of bases
    • Put in packed string support (use templates)
    • When taking in a regular string, assume I'm getting values of 0-3, instead of letters. Make a translate function to translate between letters and numbers.
    • Implement a static memory function that outputs how much memory a given rank structure requires (in preparation for the "memory budget")
    • Pull out some functions and hashes into a utils file
    • Try combining 32-bit and 64-bit implementations into one file to minimize redundant code
    • Look into using a "binary" bitstring instead of two bitstrings side-by-side for cleaner code.
  • For the FM index, I'll be:
    • Taking a look at readaligner's FM index building function and pulling it out, since we won't be touching on that (building is quite complicated as well)
    • Also implement a memory budget feature for the index (when getting positions)
    • Start implementing the FM index itself, assuming 4 bases with no N's.
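A possible shape for the translate function mentioned in the list above (the A=0, C=1, G=2, T=3 assignment is my assumption; the actual encoding isn't pinned down here):

```cpp
#include <cstdint>

// Map letters to the 0-3 values the rank structure will assume.
// N's are expected to be handled separately, so default to 0 here.
inline uint8_t translate(char base) {
    switch (base) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return 0;
    }
}
```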

Also, I revisited the 32-bit code and was actually wondering if it was significantly faster than the 64-bit rank implementation on 32-bit systems. Chris suggested I run some tests on antiparos, which probably has a 32-bit CPU. More on that later.

To do:

  • Merge the two implementations
  • Implement a memory function
  • Start looking at Daniel's packed string implementation. Most importantly, add an operator[] function.

07/21/10

Had a meeting with Chris and Anne today and talked about goals for the project. See Daniel's journal for the details.

Also ran some tests on antiparos using the 32-bit and 64-bit rank structures. Since antiparos's CPU is actually 32-bit, there was a huge difference in speed between the two implementations. Of course, the 32-bit version was faster, ranging from 20% to as much as 70% faster (the difference gets wider with higher sampling rates). Given this, I will be keeping both implementations, which I've also managed to merge. The only thing I don't like is that I need selective compilation in each of the rank functions, which makes the code look a little unwieldy. Fortunately, everything else was easy to merge by defining a constant.

I looked into Daniel's packed string class as well. I haven't actually started adding/modifying it yet, because I figure it'll be easier when I start merging the new index in; it's going to require a little CMake tweaking as well, because I want to give it a 64-bit implementation as well. Some changes I would like to make:

  • Add a 64-bit implementation and have it selectively compile (currently, the data_ member variable is a uint32_t; this should be uint64_t in the 64-bit version).
  • Add an operator[] method that returns the ith base; this shouldn't be too bad.
  • Build the string a little better, using a hash table instead of a switch/case.

Finally, I started looking into BWT builders. The BWT builder is in the incbwt folder (kind of obvious now that I think about it), so it's just a matter of extracting that folder and trimming down some classes. I think the rotational index is also in that folder, which is why there are so many files in it. Chris also suggested an implementation he had from Karkkainen (blockwise suffix sorting), which is what bowtie uses (I think?). This version can build the BWT string using much less memory, but the implementation may be buggy (though it may just be that it uses signed integers, which are too small). As well, Chris isn't sure whether the code is open-source, so he's looking into that, too.

To do:

  • Try fixing the Karkkainen implementation of the BWT builder.
  • If that fails, extract readaligner's BWT builder.

07/23/10

Started to look into the BWT builders. First, I tried using Karkkainen's builder, but it wouldn't compile on skorpios, even though it compiled fine on my machine (both use gcc 4.4.1). So I went back to readaligner's bwt, incbwt. I ended up taking the whole incbwt folder because they were pretty much all required classes. Building a BWT works through the class RLCSABuilder.

Initially, I had concerns over a FIXME-type comment in the actual RLCSA class (which RLCSABuilder builds), which said the program crashes on sequences > 2 GB. Looking through the code more carefully, though, I think RLCSABuilder builds a separate RLCSA for every chromosome (each of which is < 2 GB), and then merges them.

Also, I don't think using packed strings with RLCSABuilder is possible, because it requires that the sequence contain no '\0's (it uses '\0' as the end-of-sequence marker). It also works on char arrays, which is pretty much hardcoded, so using anything smaller isn't really possible without major changes.

So my plan now is to just build a BWT string, pack it, run it through my rank structure, and then see how that goes from there.

To do:

  • Make sure the BWT builder works!

07/26/10

Whoops, forgot to do a post on Friday. On Friday, I worked on getting the BWT string builder integrated. Readaligner's TextCollectionBuilder class specifies some options for an FM-index-compatible BWT string, so I just used those settings (documented in the code). I also had a discussion with Chris on matching N's and end markers ($'s). There are pretty much two options:
  1. Build a new rank structure that supports 3 bits (for 8 different characters)
  2. "Ignore" the N's and $'s, but note their positions. After alignment, do a post-processing step where we filter out results that cross an N or $.

The first option would be easier to do and would (later) provide support for wildcard matching. The second option is much faster and more memory-efficient: for the same memory footprint, my 2-bit rank structure is about twice as fast as the 3-bit structure.

Today, I started building the 3-bit rank structure, since Chris tells me we'll need it sometime in the future anyway and the 2-bit rank structure is still fresh in my mind. It looks like the structure works pretty much as desired, though I haven't done exhaustive bug tests yet. Also, I finished the BWT string builder integration (though not for multiple chromosomes), and I've successfully built a BWT string from chromosome 10 of the human genome, run the resulting string through my 2-bit rank structure, and gotten good ranks out of it, at least in the first 10 or 15 bases.

While typing this, I realized I forgot to change the save/load methods for the 3-bit rank structure, so I'll have to get on that tomorrow. Another thing I should do is build a 1-bit ("regular") rank structure. The problem with this structure is that I can't interleave the lower-level count blocks: I only need one 16-bit count per sample, as opposed to four different counts, so interleaving would be a big waste of memory. Instead, I'm just going to keep the counts separate from the rank structure.

Finally, for the initialization functions, I'm seriously considering using iterators instead of requiring an operator[]. A char* is already an iterator in and of itself, and it's just a matter of implementing an iterator for the packed string. This way it's much more flexible, and I might later be able to make a "streaming" class that loads from a file a little at a time.

To do:

  • Fix the save/load function of the 3-bit rank structure
  • Maybe add iterator support
  • Make a one-bit rank structure

07/27/10

Pretty much finished up what I wanted to do from yesterday. The 3-bit rank structure has its bugs fixed, I changed the initialization from a StringType and length to iterators for all the rank structures and finished (I hope) the 1-bit rank structure.

The change to iterators was actually surprisingly easy, because I had the rank blocks built separately from the count blocks (even though they're part of the same array). Hopefully, this will make everything more flexible. As a bonus, I also cleaned up the rank structure classes quite a bit, removing extraneous methods and updating the documentation.

The 1-bit rank structure wasn't too bad to code, since it's so simple. I had to get rid of the interleaved count blocks because they just weren't practical, but memory usage is quite good because I only have to store counts for 1's (Rank0(position) = position - Rank1(position)). I haven't done exhaustive testing on this class yet, so it might still have bugs.
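The identity that keeps the 1-bit structure small, in miniature over a single word (the LSB-first position convention here is my assumption):

```cpp
#include <cstdint>

// rank1 counts 1-bits in positions [0, i]; rank0 needs no stored
// counts at all, since every position is either a 0 or a 1.
int rank1_word(uint64_t bits, int i) {
    return __builtin_popcountll(bits & (~0ULL >> (63 - i)));
}
int rank0_word(uint64_t bits, int i) {
    return (i + 1) - rank1_word(bits, i);
}
```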

Finally, no speed tests yet, but I should have them up by tomorrow.

To do:

  • Test all the new classes for speed and bugs.

07/29/10

Yesterday, Chris and I met to discuss the next steps in building the FM index. Since the rank structures are done, we can pretty much get started on all the FM index functions. First off, we will start with implementing a count method, which simply counts the number of occurrences of a read in a reference. This method should be relatively simple (it can be found in the FM index paper). Then we will do a locate function, which will actually give the position of the match in the reference. The locate function requires an additional structure that stores positions, sampled every few positions in the original string; it's then just a matter of backtracking to the last sampled position, from which we can calculate the position of the match. Since sampling the original sequence requires another rank structure (1-bit this time), Chris argued that it might be better to sample positions in a different way, to get rid of that rank structure. The disadvantage of this approach is that it can't guarantee finding a position in a constant number of backtracks, but it does allow better sampling for the same amount of memory.
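A sketch of the planned count method: the classic backward search from the FM index paper, with a deliberately naive rank() standing in for the real rank structure (MiniFM and its layout are illustrative, not the project's classes):

```cpp
#include <cstdint>
#include <string>

struct MiniFM {
    std::string bwt;  // BWT of text + '$'
    uint64_t C[4];    // C[c] = # of characters in text+'$' smaller than base c

    static int code(char b) {  // A,C,G,T -> 0..3
        return b == 'A' ? 0 : b == 'C' ? 1 : b == 'G' ? 2 : 3;
    }
    // occurrences of base c in bwt[0, i) -- the rank structure's job
    uint64_t rank(int c, uint64_t i) const {
        uint64_t n = 0;
        for (uint64_t j = 0; j < i; ++j)
            n += (bwt[j] != '$' && code(bwt[j]) == c);
        return n;
    }
    // number of occurrences of read in the original text
    uint64_t count(const std::string& read) const {
        uint64_t sp = 0, ep = bwt.size();
        for (auto it = read.rbegin(); it != read.rend() && sp < ep; ++it) {
            int c = code(*it);
            sp = C[c] + rank(c, sp);
            ep = C[c] + rank(c, ep);
        }
        return ep > sp ? ep - sp : 0;
    }
};
```

For the text "ACGT", the BWT of "ACGT$" is "T$ACG" with C = {1, 2, 3, 4}, and count("GT") comes out to 1.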

Chris and I have also decided on a naming scheme, as follows:

  • 1-bit = BitSequence
  • 2-bit = DNASequence
  • 3-bit = EDNASequence (for "Enhanced DNA sequence")
  • Rank structures have "Static" prepended
  • Packed strings don't have "Static"

I also benchmarked the various rank structures, shown below. These benchmarks were done on skorpios in the /var/tmp/ folder. Times are reported over 25,000,000 calls to rank on a 512,000,000-base sequence with varied sampling rates. The graph plots time on the y-axis (milliseconds) against memory usage on the x-axis (bytes). As can be seen, the StaticEDNASequence is much slower than the StaticDNASequence at the same memory usage, but the good news is that it's not too much slower at the same sampling rate.

Rank graph

Today, I successfully implemented a count function using the StaticEDNASequence. I also tried to work out a nice hack for getting rid of the end markers when using the StaticDNASequence, but it doesn't seem possible. So right now, I'm going to build another data structure into the StaticDNASequence that keeps track of all the $'s and subtracts them from the A ranks. It might get complicated, but hopefully the speed difference won't be too big.

To do:

  • Build the new data structure into StaticDNASequence
  • Benchmark the FM index count method using both 3-bit ranks and 2-bit ranks, to get a good idea of what speed we may be able to achieve.

07/30/10

Started implementing the new StaticDNASequence. The build process is going to get a little ugly, but the ranks should still be relatively clean (which is what matters). To make it easier on myself, I'm just going to do one pass through the whole sequence first to find all the N's and build the new data structure, then build the rest of the rank structure (rank blocks + count blocks). This is kind of wasteful, since I could actually build the $ structure while building the rank structure (in a single pass), but that would be harder given my current code. So the build process will be a bit slower (which I think is okay...).

Finally, I did some benchmarks on the count function. Test parameters are:

  • read length = 35
  • reference = E. coli genome NC_010473, minus two characters (there was one Y and one R...) (length = 4,686,136 bases)
  • number of reads = 10,000,000, generated randomly
    • this randomly generated set was copied 10 times, for a total of 100,000,000 calls to count.
  • random seed = 1234
  • times are reported in milliseconds wall time.
  • Build version is 64-bit, Release mode (-O3)
  • Memory usage is just the memory used by the rank structure

Sampling rate (bases)   Memory usage (bytes)   Type    SSE 4.2?   Time (ms)
  64                       1,784,768           2-bit     No        27,212
 256                       1,339,168           2-bit     No        46,663
  64                       2,973,048           3-bit     No        31,454
 256                       2,081,848           3-bit     No        52,876
2048                       1,821,912           3-bit     No       156,832
  64                       1,784,768           2-bit     Yes       19,924
 256                       1,339,168           2-bit     Yes       33,002
  64                       2,973,048           3-bit     Yes       24,371
 256                       2,081,848           3-bit     Yes       38,524
2048                       1,821,912           3-bit     Yes       90,328

For a (very) rough comparison, saligner, using readaligner's FM index, takes 64.461 seconds to do an exact mapping of 2.7 million reads (each 35 bases long). That means it would take roughly 2,387 seconds to align 100,000,000 reads. However, this isn't a very good comparison, because saligner does a locate, which is more complex, has to read from an input file and write to an output file, and runs at 32 bits. Also, its rank structure might use a different sampling rate, though I believe it uses 256, which is why I included that test. So this means we have quite a bit of room to work with!

Another note is that for the 3-bit structure, it requires a sampling rate of 2048 to be roughly comparable in memory usage. However, at this rate, it's much slower than the 2-bit structure (roughly 5-6 times slower).

To do:

  • Finish the $ data structure implementation in the StaticDNASequence
Topic revision: r1 - 2010-08-04 - jayzhang
 