Difference: JaysJournal (1 vs. 73)

Revision 73 (2010-09-01) - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
May 2010 archive
Line: 158 to 158
 
  • Add text extraction support from a packed string (to do after block array has been made)
  • Add a BlockArray class
Added:
>
>

09/01/2010

Had a big discussion with Chris about what the FM index interface should look like and the workflow for building/loading/saving an index, so I thought I'd write it all down before I forget (a rough interface sketch follows the list):

  • Init() will now be Build(sampling rates ...) and take in parameters for the sampling rates, with defaults set. The method will build the index from scratch.
  • Load(file) will be Load(file, sampling rates ...), which works the same way as Build but loads from a file instead of building from scratch. If sampling rates aren't provided, Load will use the sampling rates specified in the save file; otherwise, it will override them if it can (the requested sampling rates must be multiples of the ones in the save, and larger).
  • Save(file) will pretty much do what it's always done, but it should save the structure in such a way that the samples are "skippable".
  • I'll have SetXSamplingRate methods at some point that will allow the locate and jump table sampling rates to be changed. Setting a rate to 0 will unload everything, while setting it to a different sampling rate will reload the structure. These methods will also take an optional file, which allows the structures to be reloaded directly from a file.
  • Note that the rank structures (StaticBitSequence, StaticDNASequence) won't be dynamically changeable, since those require the BWT to rebuild (I could rebuild from a file, but it's a lot of trouble!)
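Here's a minimal sketch of what that interface might end up looking like (class layout, parameter names, and defaults are my own placeholders, not the final code):

```cpp
#include <cstdint>
#include <string>

// Hypothetical shape of the FM index interface discussed above.
class FMIndex {
 public:
  // Build the index from scratch, with sampling rates defaulted but settable.
  bool Build(const std::string& sequence,
             uint32_t locate_sampling_rate = 32,
             uint32_t jump_table_sampling_rate = 8);

  // Load from a saved index. If sampling rates are given, they must be larger
  // multiples of the rates stored in the file; otherwise the saved rates are
  // used as-is.
  bool Load(const std::string& filename,
            uint32_t locate_sampling_rate = 0,    // 0 = take rate from file
            uint32_t jump_table_sampling_rate = 0);

  // Save the index so that individual samples are "skippable" on load.
  bool Save(const std::string& filename) const;

  // 0 unloads the structure; any other rate rebuilds it, optionally reloading
  // straight from a file instead of from the in-memory index.
  void SetLocateSamplingRate(uint32_t rate, const std::string& file = "");
  void SetJumpTableSamplingRate(uint32_t rate, const std::string& file = "");
};
```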
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"

Revision 72 (2010-08-24) - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
May 2010 archive
Line: 143 to 143
 
  • Implement a jump table within FM Index
  • Maybe implement multiple sequence support (Chris will get back to me on this)
Added:
>
>

08/23/2010

Chris gave me a list of things to do for the index as well, and I've been working on those the past few days:

  • Have options to make building the locate table, jump table, etc. optional
  • Add jumptable support
  • Add text extraction support (at least be able to extract from a packed representation)
  • Add a BlockArray class

Currently, I've got the jump table support and the optional build options done. I'm also working on some unit tests for the FM index. Finally, the bug from the last post (Rank0 sometimes fails on blocks with '$') has been fixed.

To do:

  • Finish up unit tests
  • Add text extraction support from a packed string (to do after block array has been made)
  • Add a BlockArray class
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"

Revision 71 (2010-08-18) - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
May 2010 archive
Line: 116 to 116
 
  • Implement a feature to load/save only the BWT string, instead of the whole index, which could change based on different memory usage profiles.
  • Get ready for integration?
Added:
>
>

08/18/10

I just realized I haven't updated this journal in quite a while. So here's a brief overview of what's been happening:

The FM Index is mostly done, but without multiple-reference support. Chris says this should be an easy fix and can be updated later. In a bunch of tests, the FM index seems quite fast when doing Locate (roughly 4x faster than bowtie). However, these results aren't the most reliable, because there's no post-processing going on, nor is there quality support. Some interesting points:

  • When doing Locate, the sampling rate of the locate structure (to store positions) matters quite a bit, so this should be as small as possible.
  • The StaticDNASequence sampling rate has some leeway in the beginning (in my tests, sampling rates from 64 to 256 bases were okay), but starts affecting the time significantly after that.
  • The StaticBitSequence sampling rate has pretty much no effect, even up to sampling rates of 1024! Unfortunately, changing this sampling rate also affects memory the least.
  • Using SSE4.2 (hardware popcount) doesn't have too much of an effect at low StaticDNASequence sampling rates (expected); at a sampling rate of 64 bases, the SSE-enabled version was about 4% faster, and the gain grows to about 20% at a sampling rate of 1024 bases.

Lately, I've been working with Daniel to integrate the new FM index into NGSA, which has taken most of the past two days. Right now, we have the BowtieMapper working (albeit with a lot of quick hacks, since multi-reference support isn't done yet), which maps reads with up to 3 mismatches and uses a jump table. Daniel is also going to work on the BWAMapper, which currently has some problems with unsigned integers and -1's. I'm also going to reorganize the structure of the NGSA library and make a better utilities file.

Finally, I ran some preliminary benchmarks between the new saligner and bowtie. The following were run on NC_010473.fasta, minus the 'R' and the 'Y' character in the genome, using the e_coli_2622382_1.fq readset. Tests were done on skorpios in /var/tmp/, using 64-bit versions of both aligners:

| Aligner | Match type | Flags | Memory usage (B, rough, within a few 1000 B) | Time (s) | Notes |
| bowtie | exact | -S -v 0 | 6,800,000 | 28.01 | |
| saligner | exact | N/A | 6,500,000 | 18.24 | locate sr = 16, StaticBitSequence sr = 512, StaticDNASequence sr = 128 |
| saligner | exact | N/A | 5,300,000 | 19.07 | locate sr = 32, StaticBitSequence sr = 512, StaticDNASequence sr = 128 |

Note the times might not be directly comparable because I used time for bowtie but the built-in timing function for saligner, since I haven't gotten the saving/loading integrated yet. Also, I didn't use valgrind --tool=massif to profile saligner, because there are some full-text references being kept in memory somewhere, which really raises the memory usage (I'll have to find and clear those later). The memory reported above is from the FMIndex::MemoryUsage() function, which only reports the theoretical usage, and I kept the memory usage for saligner on the low side to account for any extra memory that may be used for other things.

To do:

  • Reorganize directory structure
  • Fix a bug in the FM Index/StaticDNASequence, where Rank0 gives the wrong output in blocks with '$'
  • Implement a jump table within FM Index
  • Maybe implement multiple sequence support (Chris will get back to me on this)
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"

Revision 70 (2010-08-12) - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
May 2010 archive
Line: 88 to 88
 
  • Benchmarks! I might be able to integrate the index into saligner without too much trouble, so we can get some pretty accurate comparisons?
  • Implement saving/loading of the whole index. Currently, I support saving/loading of the rank structure, and the FM index is just a few more data structures so it shouldn't be too bad.
Changed:
<
<

08/11/10

>
>

08/10/10

 Implemented a save/load feature for the FM index. I'm also thinking of implementing a "partial" load feature, where only the BWT string is loaded, and the other data structures are still built. The reason for this is that the BWT string is the one that takes the most memory (and maybe time, too) to build and should be constant for all indexes, while the other structures will differ depending on sampling rates and memory restrictions. So, the BWT string can be passed around between machines (with different memory restrictions) easily, while the other data structures can't.

I also did a few preliminary benchmarks, and the times were not great on the Locate function. I think this might be because we don't implement Locate the "proper" way, which guarantees a successful locate within at most (sampling rate) backtracks. Following Chris' suggestion, I tried graphing the number of backtracks it takes before a successful location is found on a random readset, and here are the results:

Line: 107 to 107
 To do:
  • Make a "proper" locate structure and compare.
Added:
>
>

08/11/10

Finished implementing the "proper" locate structure, which guarantees a hit within (sampling rate) backtracks. I did a few small tests first, where I set the DNA rank structure sampling rate to a constant 64 bits. For aligning 2 million reads (at most 1 alignment each), the "proper" version outperforms the previous version by quite a bit, even under roughly the same memory usage (6.9 seconds at sr=64, memory=37MB, vs. 25.9 seconds at sr=16, memory=39MB). Even if I raise the sampling rate of the "proper" locate to 256 bases (memory usage = 34MB), it performs at comparable speeds.
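For reference, here's my reading of the "proper" locate idea as a small sketch (not the actual NGSA code; lf and rank_mark stand in for the index's LF mapping and a rank over the marked bit vector):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// BWT positions whose text position is a multiple of the sampling rate are
// marked in a bit vector; Locate walks LF until it hits a marked position,
// which is guaranteed within sampling_rate steps.
struct SampledLocate {
  uint32_t sampling_rate = 64;
  std::vector<bool> marked;                     // marked[i]: BWT position i is sampled
  std::vector<uint32_t> text_pos;               // sampled text position per marked rank
  std::function<uint32_t(uint32_t)> lf;         // LF mapping from the FM index
  std::function<uint32_t(uint32_t)> rank_mark;  // number of marked positions before i

  uint32_t Locate(uint32_t bwt_pos) const {
    uint32_t steps = 0;
    while (!marked[bwt_pos]) {  // at most sampling_rate iterations
      bwt_pos = lf(bwt_pos);    // step one character backwards in the text
      ++steps;
    }
    return text_pos[rank_mark(bwt_pos)] + steps;
  }
};
```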

To do:

  • Benchmark more thoroughly
  • Test more thoroughly
  • Implement a feature to load/save only the BWT string, instead of the whole index, which could change based on different memory usage profiles.
  • Get ready for integration?
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"

Revision 69 (2010-08-11) - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
May 2010 archive
Line: 88 to 88
 
  • Benchmarks! I might be able to integrate the index into saligner without too much trouble, so we can get some pretty accurate comparisons?
  • Implement saving/loading of the whole index. Currently, I support saving/loading of the rank structure, and the FM index is just a few more data structures so it shouldn't be too bad.
Added:
>
>

08/11/10

Implemented a save/load feature for the FM index. I'm also thinking of implementing a "partial" load feature, where only the BWT string is loaded, and the other data structures are still built. The reason for this is that the BWT string is the one that takes the most memory (and maybe time, too) to build and should be constant for all indexes, while the other structures will differ depending on sampling rates and memory restrictions. So, the BWT string can be passed around between machines (with different memory restrictions) easily, while the other data structures can't.

I also did a few preliminary benchmarks, and the times were not great on the Locate function. I think this might be because we don't implement Locate the "proper" way, which guarantees a successful locate within at most (sampling rate) backtracks. Following Chris' suggestion, I tried graphing the number of backtracks it takes before a successful location is found on a random readset, and here are the results:

  • Reference length = 65,751,813 bp (the E. coli genome, 4.6 Mbp, repeated multiple times)
  • Reads were randomly generated 10-base sequences, and only the first 10 alignments were "located"
  • Run at a sampling rate of 64 bases for the locate structure
  • Average distance to locate = 341.68, which isn't very good
  • In the graph, the x-axis is the distance it takes to locate and the y-axis is the number of alignments with that distance (log scale)
  • 855,150 locates performed in total

Locate buckets

To do:

  • Make a "proper" locate structure and compare.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"
Added:
>
>
META FILEATTACHMENT attr="h" comment="" date="1281556653" name="locate-buckets.png" path="locate-buckets.png" size="79889" user="jayzhang" version="1.1"

Revision 68 (2010-08-07) - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
May 2010 archive
Line: 77 to 77
 
  • Start doing a Locate method.
  • Maybe do some code cleanup
Added:
>
>

08/06/10

Yesterday and today, I did a bunch of cleanup on both the rank structure and FM index classes. I pulled some methods out of StaticDNASequence::Init so the code can be cleaner. I also rethought the naming scheme of a lot of variables to make the code a little more readable. For the FM index, I also pulled out methods from the FMIndex::Init method and overloaded the method to take in C++ strings (which are what saligner currently uses). I also added the building of the BWT string directly into the FM index class instead of having it built externally. Currently, only one sequence is supported, but later on, there will be multi-sequence support.

Chris suggested in an email that we prefetch ep ranks while doing both counts and locates, which might make it faster. I'll be discussing this with him later.

Finally, I finished the FMIndex::Locate function today. It works as expected, and I've tested it pretty thoroughly. Locate requires a bunch of LF mappings from each position in the BWT string to travel "backwards" until a position is reached that has been sampled (indexed). However, I can't actually get the specific character at a given position in the BWT to do the LF mapping. The FM index paper suggests doing an LF on both position i and i-1 for each character in the alphabet, since the only character whose LF value will differ is the one I'm currently on. I'm also thinking I could just implement a select function in my rank structure, which shouldn't be too hard.
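As a note to myself, the "try every character" trick looks roughly like this (my own illustration, not the NGSA implementation; Rank(c, i) is assumed to count occurrences of base c in BWT[0, i), and C[c] is the usual count of characters sorting before c):

```cpp
#include <cstdint>
#include <functional>

// The character at BWT position i is the unique c whose rank changes between
// i and i + 1, which also gives LF(i) directly.
struct LFHelper {
  std::function<uint32_t(int, uint32_t)> Rank;  // rank from the 2-bit structure
  uint32_t C[4];                                // FM-index C array for A,C,G,T

  uint32_t LF(uint32_t i) const {
    for (int c = 0; c < 4; ++c) {
      uint32_t before = Rank(c, i);
      uint32_t after = Rank(c, i + 1);
      if (after != before)       // the character at position i is c
        return C[c] + before;    // standard LF-mapping formula
    }
    return 0;  // only reached at the '$' position, which has no predecessor
  }
};
```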

To do:

  • Benchmarks! I might be able to integrate the index into saligner without too much trouble, so we can get some pretty accurate comparisons?
  • Implement saving/loading of the whole index. Currently, I support saving/loading of the rank structure, and the FM index is just a few more data structures so it shouldn't be too bad.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"

Revision 67 (2010-08-04) - jayzhang

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="NGSAlignerProject"
 May 2010 archive

June 2010 archive

Changed:
<
<

07/01/10

Played around with multithreading some more. I tried making a lightweight test file, where I just add a value a few billion (probably more) times, just to see whether the slowdown is really from the aligner. I noticed that there's also a pretty big jump in user time between 2 and 3 threads, so that bit might not be saligner's fault. However, the ratios were a bit more like what they should've been (though not quite). I forgot to record the numbers, but they weren't that crazy.

Also worked on the rank structure a little more. I can now build sequences greater than 64 bases in length (so as big as needed, with memory restrictions of course). I figured out how to align to 16 bytes with posix_memalign. I've almost completed the rank function as well, except I'm just getting a small bug with a boundary condition (I'm pretty sure I'm just missing a +1 somewhere).

I was thinking about the calculation needed for the rank structure some more. Currently, the calculation goes something like this (for example, for a C): (~second & first) >> (BITS-i-1). Chris told me that this takes something like three operations to do, since some operations that are independent can be done in parallel. I figure the following equation would also work for calculation: (~second & first) & (0xFFFF... >> BITS-i) (these two formulas aren't numerically equivalent, but they should give the same popcount result). I'm not sure if this will do anything, but it looks like I might be able to drop the operations to 2, if I can somehow calculate the mask fast. I'm going to have to play around with that a bit more.
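To keep the two variants straight, here's a quick toy illustration of the shift form versus a precomputed-mask form (my own code; the mask below keeps the same high-order bits as the shift under an MSB-first bit ordering, so the exact mask expression may differ from the one above depending on how the block is laid out):

```cpp
#include <cassert>
#include <cstdint>

constexpr int BITS = 64;

// Shift form: isolate the 'C' bits, then keep only positions 0..i.
inline int RankCShift(uint64_t first, uint64_t second, int i) {
  return __builtin_popcountll((~second & first) >> (BITS - i - 1));
}

// Mask form: the same bits kept via an AND against a precomputed mask table.
inline int RankCMask(uint64_t first, uint64_t second, int i,
                     const uint64_t* masks) {
  return __builtin_popcountll((~second & first) & masks[i]);
}

int main() {
  uint64_t masks[BITS];
  for (int i = 0; i < BITS; ++i) masks[i] = ~0ULL << (BITS - i - 1);
  uint64_t first = 0xDEADBEEFCAFEF00DULL, second = 0x0123456789ABCDEFULL;
  for (int i = 0; i < BITS; ++i)  // popcount-equivalent for every position i
    assert(RankCShift(first, second, i) == RankCMask(first, second, i, masks));
  return 0;
}
```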

To do:

  • Fix that bug with the rank.
  • Play around with the rank calculation.
  • Find out what's going on with multithreading.
  • Start implementing precalculated blocks.

07/02/10

Fixed the bug in my rank function, turns out I did miss a +1. So now I have a working rank structure that I can build and query for any length sequence. The current implementation uses two parallel uint64_t arrays to store the two (first and second) bitsequences. Before I start an implementation that stores precalculated values though, I want to do a few experiments with one array that holds each block sequentially.

I also did a few tests with the new calculation method ( ~second & first) & (0xFFFF...FF >> BITS-i ). I tried making an array of length BITS to store the precalculated masks for every i. That way might lead to fewer operations (not really sure on that point). Preliminary tests didn't show much difference, especially on -O3. The strange thing is, I actually compiled a minimal rank function using either of the two calculations as assembly, and at -O3, there wasn't even any difference in the assembly code! Strange, since the two aren't really numerically equivalent, just "popcount equivalent".

To do:

  • Explore sequential vs parallel blocks
  • Add precalculated blocks.

07/05/10

Finished an implementation of sequential blocks instead of storing the first and second blocks in two parallel arrays. Chris argued that this might make it faster because one set of two blocks can then be cached together through one cache line, since they're side by side (I think that's the gist of it).

I also did some benchmarks: on skorpios at -O3, using a test sequence of 1,280,000 characters, ranking from positions 1,230,000-1,280,000:

| Implementation | A (ms) | C (ms) | G (ms) | T (ms) |
| Parallel | 5164 | 4661 | 4661 | 4921 |
| Sequential | 5025 | 4793 | 4792 | 4858 |

It's interesting to note that on the parallel implementation, ranking C and G is actually faster than T! This is strange because the calculation for T involves an & operation on the two blocks, whereas C/G involves an & and a ~ operation. The times are more as expected under the sequential implementation, although T ranking should still be faster (it's got fewer operations!). What's more surprising is that the speeds for C/G are faster on parallel!

Using cachegrind, it seems that there are far fewer (about 75%) "D refs" when ranking C/G than A/T in the parallel implementation ("D refs" being data references, i.e. data memory accesses). Specifically, it seems that there are far fewer D writes occurring in the C/G cases (reads are fewer as well, but not as significantly as writes). Furthermore, under the sequential implementation, the D references for all bases are more on par with the A/T rank functions in the parallel implementation. It seems there might be some specific optimization in the parallel C/G case that the compiler is making. I kind of want to investigate this a bit further; ideally, I can somehow end up with the parallel C/G times together with the sequential A/T times.

To do:

  • Investigate above behaviour a little more.
  • Start implementing precomputed blocks.

07/06/10

Had a discussion with Chris about experimental protocols. Turns out my experiment yesterday pretty much all fits in the cache (which is 6 MB), so there wouldn't be that much of a benefit between sequential (interleaved) and parallel (normal). So we decided on a new protocol:
  • Use larger sequences (>=16,000,000 bases)
  • Use random access. More specifically, build an array of random numbers, then use those numbers to call random ranks. This should prevent streaming.
  • Try both hardware and software popcount.

Also, I finished implementing count blocks; currently, each 64-bit block has a count block (which is 32 bits), with 4 count blocks per rank block, one for each base.
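The layout I have in mind looks something like this (field names are illustrative, and the base encoding is one common choice rather than necessarily the one in the code; C and T match the formulas from the 07/01 entry):

```cpp
#include <cstdint>

// One 64-base block: two 64-bit bit-planes plus four 32-bit running counts,
// one per base, holding the number of occurrences before this block. A
// sentinel block with the final counts is assumed at the end of the array.
struct RankBlock {
  uint64_t first;      // low bit-plane of the 64 packed bases
  uint64_t second;     // high bit-plane of the 64 packed bases
  uint32_t counts[4];  // A, C, G, T occurrences before this block
};

// Bits marking positions of base c within the block.
inline uint64_t BaseBits(const RankBlock& b, int c) {
  switch (c) {
    case 0:  return ~b.first & ~b.second;  // A
    case 1:  return  b.first & ~b.second;  // C
    case 2:  return ~b.first &  b.second;  // G
    default: return  b.first &  b.second;  // T
  }
}

// Rank of base c among the first `pos` bases (positions stored MSB-first).
inline uint64_t Rank(const RankBlock* blocks, int c, uint64_t pos) {
  const RankBlock& b = blocks[pos / 64];
  uint64_t k = pos % 64;
  uint64_t in_block = (k == 0) ? 0 : BaseBits(b, c) >> (64 - k);
  return b.counts[c] + __builtin_popcountll(in_block);
}
```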

So here are the results, run on skorpios at -O3, with 40,000,000 rank calls each. Times are wall time:

| Sequence length (*10^6) | Normal (ms) | Interleaved (ms) | Normal + HW popcount (ms) | Interleaved + HW popcount (ms) |
| 1 | 792 | 750 | 244 | 237 |
| 2 | 864 | 861 | 265 | 250 |
| 4 | 924 | 935 | 274 | 255 |
| 8 | 1134 | 1063 | 449 | 363 |
| 16 | 1567 | 1460 | 752 | 639 |
| 64 | 3857 | 3018 | 1242 | 1034 |
| 256 | 4495 | 3910 | 1901 | 1358 |
| 2048 | 5332 | 4639 | 2915 | 1957 |

Note that the 2048 million times may not be quite as accurate. The test at 1024M segfaulted for all four implementations (later note: turns out this was a problem in the driver, where I used a signed int by accident).

To do:

  • Try out using a different count structure (only store 3/4 bases, and calculate the last base) and see the performance hit.
  • Try out interleaving the count structure as well.
  • Implement adjustable sampling rates.

07/07/10

Implemented another structure with interleaved counts, as well as a count structure with only 3 bases. Since skorpios is down right now, I just did some preliminary testing with my laptop (32-bit, Intel Core 2 Duo, no hardware popcount, 6 MB cache, 3.2 GB RAM).

Again, times are in wall time, compiled at -O3, single threaded and run with 50,000,000 ranks each.

| Sequence length (*10^6) | Normal (ms) | Interleaved (ms) | Interleaved + counts (ms) | Interleaved, 3-base (ms) |
| 1 | 1046 | 985 | 1105 | 1013 |
| 8 | 1382 | 1202 | 1208 | 1245 |
| 16 | 3276 | 2768 | 2137 | 2522 |
| 32 | 5059 | 4525 | 3556 | 4474 |
| 64 | 5708 | 5100 | 4269 | 5292 |
| 128 | 5901 | 5303 | 4617 | 5545 |
| 256 | 5985 | 5387 | 4834 | 5686 |
| 512 | 6111 | 5556 | 4989 | 5786 |
| 1024 | 7967 | 7392 | 6690 | 7653 |

The 1024M times are again a bit dubious. I fixed the segfault bug, but I think my computer was running out of memory, which would explain the big jump in time. Regardless, I'm going to test these again on skorpios when it gets back. It doesn't look like there's too much of a performance hit with the 3-base implementation, but if we stick with interleaved counts, there's really no point in going that route, since the 4 32-bit counts fit nicely into 2 64-bit "rank blocks"; using only 3 counts wouldn't save any memory, since those 32 bits would still be there.

I also experimented with some different popcount functions, found on the Wikipedia entry. So far, it seems like the current popcount function (which I stole from readaligner) is faster than the one proposed by Wikipedia. However, the current popcount function relies on a table of values, which might make it slower if there are a lot of cache misses on it, though it outperformed Wikipedia's function even at 1024M bases (I don't have the exact values; another thing to benchmark).
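For the record, the two software popcount flavours being compared look roughly like this (my own reconstruction; readaligner's table version may be structured differently, e.g. with 8-bit chunks):

```cpp
#include <cstdint>

// Table-based popcount: 16-bit lookup table, four lookups per 64-bit word.
static uint8_t popcount_table[1 << 16];

void InitPopcountTable() {
  popcount_table[0] = 0;
  for (uint32_t i = 1; i < (1u << 16); ++i)
    popcount_table[i] = popcount_table[i >> 1] + (i & 1);
}

inline int PopcountTable64(uint64_t x) {
  return popcount_table[x & 0xFFFF] + popcount_table[(x >> 16) & 0xFFFF] +
         popcount_table[(x >> 32) & 0xFFFF] + popcount_table[(x >> 48) & 0xFFFF];
}

// Bit-mangling ("SWAR") popcount from the Wikipedia article: no table, so no
// extra cache traffic.
inline int PopcountSWAR64(uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  return (x * 0x0101010101010101ULL) >> 56;
}
```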

Finally, I started implementing the sampling rates; there's currently a bug somewhere, but it shouldn't be too bad to fix.

PS: Before I forget, I realized I did actually lock the whole single end output processing section in my multithreaded driver (I originally thought I just locked the part where it writes to output, whereas now I realized that it locks before that, when it's processing the SamEntry into a string). I should experiment with locking it at a different place (though at this point, it's not that easy, unless I want to put locks in the sam file class itself.)

To do:

  • Benchmark above on skorpios.
  • Finish sampling rate implementation.
  • Get started on 32-bit implementation?

07/08/10

Skorpios wasn't back till late so I didn't really have a chance to benchmark. I did, however, finish my sampling rate implementation. Hopefully, there aren't any bugs. I also managed to fix all valgrind errors, so now everything runs (hopefully) well.

I also managed to finish an interleaved count implementation, which was much harder. In fact, I pretty much had to do a complete recode of the core build/rank functions. I need to run a few more tests on the implementation, but it should be running fine.

Since skorpios is back now, I'll probably do a bunch of experiments tomorrow. I should also make up a 3-base sampling rate implementation, just to test the cost; again, however, if we end up going with the count-interleaved version, the 3-base version really isn't very practical.

To do:

  • Benchmark!
  • 3-base implementation (for sampling-rate enabled, interleaved)
  • 32-bit implementation, look into two-level rank structure.

07/09/10

Finished up all the implementations for the rank structure (interleaved, 3-base, and interleaved counts, all with adjustable sampling rates) and benchmarked them. Rather than provide a huge table of all the benchmarks, I decided to show a graph:

Times are in milliseconds, tested on skorpios with a constant seed, and ranked 75,000,000 times per test. Squares are interleaved, diamonds are 3-base, and triangles are interleaved count implementations. The three different versions all use different popcount functions: blue is the function provided in readaligner, which uses a table; yellow is one provided by Wikipedia, which uses a bit-mangling method; and orange uses the hardware instruction.

Rank graph

In previous tests, I noted that the table version was faster than the bit-mangling version; it's interesting to note that in these implementations, the bit-mangling popcount is actually better. My thoughts are that once everything isn't able to be fit nicely into the cache, then the popcount table may encounter cache misses, which slows it down dramatically. It's also interesting to see that when using the tabular popcount, interleaving the counts is actually the slowest implementation. Again, this might be due to excessive cache misses from having to cache the table, and it's actually what led me to try the bit-mangling popcount one more time. It's nice to see that interleaving the counts has some effect, especially in the hardware popcount case, though.

Finally, I talked with Anne and Chris about a new project, possibly for my honours thesis, today. Chris proposed a GPGPU version of the Smith-Waterman aligner I vectorized earlier, or even GPGPU-capable succinct index. Anne also suggested that I look into the RNA folding pathway research currently going on as well, and I'll probably start reading some papers on that soon.

To do:

  • Look into both the GPGPU project and get more familiarized with the RNA folding pathway project
  • Start implementing a second level in the rank structure
  • Look into a 32-bit implementation

Edit: I realized I forgot to benchmark everything WRT sampling rate, so that's another item on the to-do list.

07/10/10

Had a small idea that I wanted to try out quickly. Tried replacing the popcount function in basics.h for NGSA, and it did actually turn out to be faster, at least with skorpios. Times were 84.29 seconds vs 79.75 seconds, for tabular popcount vs bit-mangling popcount. So maybe we should change the function in basics to the bit-mangling version?

07/12/10

Read over Daniel's presentation on rank structures and thought over how I'm going to implement the two-level structure today. As far as I can tell, Vigna's rank9 has no practical way of adjusting sampling rates, which is something we want. So, I think I'll stick with the regular two-level structure with a few modifications:
  • There is one "block" per 64 characters.
  • The top level will be 32-bit unsigned integers.
  • The bottom level will be made of 16-bit unsigned integers.
  • There will be one bottom level count every (sampling rate) blocks.
  • There will be one top level count every 1023 blocks (the maximum number of 64-base blocks whose counts still fit in 16 bits)

The implementation shouldn't be too bad. I've decided to stick with the interleaved counts implementation for now, where all the counts are interleaved with the sequence blocks. Adding a higher level would just mean adding more blocks, this time with a constant sampling rate of 1023. Ranks would just be the sum of the nearest low-level block and the nearest top-level block. I don't think I will interleave the top level blocks, since they're much too far apart to benefit anyway.

Another optimization I've been thinking about is to limit the sampling rate to powers of 2. That way, when finding the nearest blocks, I can just do bit shifts instead of costly divisions. I'm also thinking about using a sampling rate of 512 for the top-level blocks, so that I can use bit shifts there as well. The difference in space usage isn't very large: less than a megabyte (in theory) for a 4-billion base sequence (rough back-of-the-envelope calculation, multiplied by 4 for all four bases).
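Putting the pieces together, the two-level lookup with power-of-two rates would look something like this (member names made up; in the real structure the bottom-level counts are interleaved with the rank blocks rather than kept in a separate array):

```cpp
#include <cstdint>

// Bottom-level 16-bit counts are relative to the preceding top-level 32-bit
// count, and with power-of-two rates the divisions become shifts.
struct TwoLevelRank {
  const uint32_t* top;     // one count per base every 512 blocks (absolute)
  const uint16_t* bottom;  // one count per base every (1 << log2_sr) blocks,
                           // relative to the preceding top-level count
  uint32_t log2_sr;        // log2 of the bottom-level sampling rate, in blocks

  // Count of base c up to the nearest sampled block at or before block_index;
  // the remaining blocks and the in-block remainder are popcounted separately,
  // as in the earlier rank sketches.
  uint64_t CountAtSample(int c, uint64_t block_index) const {
    uint64_t top_idx = block_index >> 9;         // block_index / 512
    uint64_t bottom_idx = block_index >> log2_sr;
    return static_cast<uint64_t>(top[top_idx * 4 + c]) +
           bottom[bottom_idx * 4 + c];
  }
};
```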

Finally, here is the graph I promised on Friday, compared to sampling rate. Tests were done on skorpios with sequence size of 512,000,000 over 25,000,000 times each and a constant seed:

Rank graph (sampling rate)

Notes: "Interleaved" is the normal interleaved implementation, "3-base" is the slightly more memory-efficient method which keeps track of only 3 bases, and "Count-Interleaved" is the version with counts interleaved into the rank blocks. Blue dots are versions with the tabled popcount, yellow dots have the bit-mangling popcount, while orange dots use the SSE4.2 hardware instruction.

It's a little hard to tell, but it looks like the count interleaved implementation shows minor improvements in speed (about 5%) up until sampling rate=128, where it starts to get much slower.

To do:

  • Finish implementation of 2-level rank structure
  • Start on 32-bit implementation

07/13/10

Finished up the two-level rank structure. I ended up deciding to take a sample for the top level every 512 blocks instead of every 1023 blocks. There's a small speed difference, it's easier to calculate, and it's not much of a space difference, since the top level takes up so little anyway. I also decided to go with using bit shifts.

So now, any sampling factor that is passed in is rounded down to the nearest power of two. This way, I can use bit operations instead of divisions and multiplications, which are slightly slower. Running a few tests, I see about a 10% performance gain.

To do:

  • Start on 32-bit implementation

07/15/10

Whoops, I completely forgot to make a post yesterday, so I'll make a post for yesterday and today, today.

Yesterday, I managed to actually finish up my two-level implementation (turns out there were a few bugs in certain cases, due to +1/-1 errors). I also implemented save/load functionality, so that the structure itself can be saved. To use, just call RankStructure::Save on a RankStructure that has already been initialized (by calling RankStructure::Init). Loading works the same way, just call RankStructure::Load with a filename supplied. Note that when loading a RankStructure, RankStructure::Init doesn't have to be called first. I'm not sure whether I should be taking in a file name as a string or whether I should just take in a stream pointer; readaligner's rank structure seems to just take in a pointer, so I might switch to that. Only problem is, I have to assume that the file's been opened as binary.

Today, I managed to finish my 32-bit implementation, which runs much faster on my machine (operating system is 32 bits). I also did a bunch of rigorous tests on both the 32-bit and 64-bit versions of the code, trying to fix all the bugs I could find. Both implementations are now pretty much valgrind-error free, and I'm pretty sure they both work correctly, too. One very annoying bug I had was during memory allocation, where I had an expression, size << 3, where size was a 32-bit unsigned integer. I later corrected this to a more generalized size * BITS/8, which ended up blowing up at sizes of around 2,048,000,000 characters. Well, it turns out the multiplication step is done first, which, when done, makes it too big for a 32-bit integer and rolls over. Sigh...
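Reduced to a toy example (hypothetical values, same mechanism as the bug above): the multiplication happens in 32-bit arithmetic and silently wraps around, while promoting to 64 bits first keeps the size correct.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t BITS = 64;
  uint32_t size = 2048000000u;   // ~2 billion characters
  // size * BITS is evaluated first in 32-bit arithmetic and wraps around
  // (unsigned overflow), so the allocation size comes out far too small.
  uint32_t wrong = size * BITS / 8;
  // Promoting to 64 bits before multiplying avoids the wraparound.
  uint64_t right = static_cast<uint64_t>(size) * BITS / 8;
  std::printf("wrong: %u\nright: %llu\n", wrong,
              static_cast<unsigned long long>(right));
  return 0;
}
```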

I also did some tests on both implementations. Initially, I found the 32-bit version to be faster on skorpios, which was very strange. However, I realized that this was because of the way I chose to represent the sampling rate (I chose to represent it in blocks, so a "sampling factor" of 1 is actually a sampling rate of either 64 or 32, depending on the implementation). So, the 64-bit implementation actually had twice the sampling rate given the same "sampling factor", which was what made it slower. Accounting for this, the 64-bit implementation is (rightly) faster.

Finally, I used the save functionality to test the memory size, just to make sure everything was going right. Using my two-level 64-bit implementation, I put in a 1,024,000,000 character sequence with a sampling factor of 1 (so sampling rate = 64) and saved it to a file. The file turned out to be 366.7 MB, which should be roughly the memory footprint of the structure. At 32 bits, with the same parameters (note sampling factor = 1, so sampling rate = 32), I get a file size of 488.8 MB. At sampling factor = 2 (sampling rate = 64), I get 366.7 MB again, which is exactly the same as the 64-bit implementation.

To do:

  • Not really sure, actually. I'm going to finalize the rank structure, then either...
  • Do some work on the Local Aligner, like optimize the FullAlignment methods
  • Read some papers/articles on GPGPUs and CUDA and also RNA folding pathways to decide a bit better on what I would do for my thesis.

07/16/10

Hopefully, I've finalized my rank structure implementation. Ran as many tests as I could think of on both versions and managed to fix a small bug. Hopefully, it's all done now.

Also did a little reading on CUDA and using GPGPUs. I found one paper detailing a GPGPU-based Smith-Waterman implementation, which I briefly glanced through, and also went through the CUDA developer guide.

Finally, had a long discussion with Daniel about his new BowtieK1Mapper and multithreading. Daniel's new implementations seem to modify the index a little during the mapping process (he "flips" it), which doesn't go well with multithreading, because I assume the mapping function is constant and can be run independently. Long story short, we decided to hold off on modifications to the class, since we'll be implementing a new FM index soon, which will no doubt involve a lot of changes (the least of which is probably getting rid of the NGS_FMIndex class, which is just acting as a middleman for readaligner's FMIndex). I've also decided to hold off on committing my merge to the dev trunk for multithreading, because this would introduce compile errors with BowtieK1Mapper in its current state.

To do:

  • More reading.
  • Talk to Chris about the FM index.

07/19/10

Started finalizing the rank structure. I've integrated the 32-bit and 64-bit versions using a CMake file. That took a stupidly long time because I've never been that great with CMake. I ended up reading the getting started guide to get it working. Also went through the documentation in my rank structure and made corrections here or there. Chris also suggested implementing a "memory budget" feature, but I'll have to get more details on that before I get started.

I've started reviewing how the FM index works, in preparation for building it. More details on that tomorrow, though.

To do:

  • Discuss with Chris about what to do next, get feedback, etc.

07/20/10

Had a long discussion with Chris today about the rank structure, and on implementing the FM index in the near future. Some important points:
  • Rank structure is nearly finished, just a few minor changes:
    • Should change sampling rate to reflect bases, not blocks of bases
    • Put in packed string support (use templates)
    • When taking in a regular string, assume I'm getting values of 0-3, instead of letters. Make a translate function to translate between letters and numbers.
    • Implement a static memory function that outputs how much memory a given rank structure requires (in preparation for the "memory budget")
    • Pull out some functions and hashes into a utils file
    • Try combining 32-bit and 64-bit implementations into one file to minimize redundant code
    • Look into using a "binary" bitstring instead of two bitstrings side-by-side for cleaner code.
  • For the FM index, I'll be:
    • Taking a look at readaligner's FM index building function and pulling it out, since we won't be touching on that (building is quite complicated as well)
    • Also implement a memory budget feature for the index (when getting positions)
    • Start implementing the FM index itself, assuming 4 bases with no N's.

Also, I revisited the 32-bit code and was actually wondering if it was significantly faster than the 64-bit rank implementation on 32-bit systems. Chris suggested I run some tests on antiparos, which probably has a 32-bit CPU. More on that later.

To do:

  • Merge the two implementations
  • Implement a memory function
  • Start looking at Daniel's packed string implementation. Most importantly, add an operator[] function.

07/21/10

Had a meeting with Chris and Anne today and talked about goals for the project. See Daniel's journal for the details.

Also ran some tests on antiparos using the 32-bit and 64-bit rank structures. Since antiparos's CPU is actually 32-bit, there was a huge difference in speed between the two implementations. Of course, the 32-bit version was faster, ranging from 20% to as much as 70% faster (the difference gets wider at higher sampling rates). Given this, I will be keeping both implementations, which I've also managed to merge. The only thing I don't like is that I have to have a selective compilation in each of the rank functions, which makes the code look a little unwieldy. Fortunately, everything else was easy to merge by defining a constant.

I looked into Daniel's packed string class as well. I haven't actually started adding/modifying it yet, because I figure it'll be easier when I start merging the new index in; it's also going to require a little CMake tweaking, because I want to give it a 64-bit implementation too. Some changes I would like to make (a small sketch of the operator[] idea follows the list):

  • Add a 64-bit implementation and have it selectively compile (currently, the data_ member variable is a uint32_t; this should be uint64_t in the 64-bit build).
  • Add an operator[] method that returns the ith base; this shouldn't be too bad.
  • Build the string a little better, using a hash table instead of a switch/case.
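The operator[] part would be something like the following sketch (2 bits per base, data_ widened to 64-bit words; names follow the description above but this isn't the actual class):

```cpp
#include <cstdint>
#include <vector>

class PackedString {
 public:
  explicit PackedString(uint64_t length)
      : length_(length), data_((length + 31) / 32, 0) {}

  void Set(uint64_t i, uint8_t base) {            // base in 0..3
    uint64_t word = i / 32, shift = 2 * (i % 32);
    data_[word] &= ~(uint64_t(3) << shift);       // clear the old 2-bit slot
    data_[word] |= uint64_t(base & 3) << shift;   // write the new base
  }

  uint8_t operator[](uint64_t i) const {          // returns the base as 0..3
    return (data_[i / 32] >> (2 * (i % 32))) & 3;
  }

  uint64_t length() const { return length_; }

 private:
  uint64_t length_;
  std::vector<uint64_t> data_;                    // 32 bases per 64-bit word
};
```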

Finally, I started looking into BWT builders. The BWT builder is in the incbwt folder (kind of obvious now that I think about it), so it's just a matter of extracting that folder and trimming down on some classes. I think the rotational index is also in that folder, which is why there are so many files in it currently. Chris also suggested an implementation he had from Karkkainen (blockwise suffix sorting), which is what bowtie uses (I think?). This version is able to build the BWT string using much less memory, but the implementation may be buggy, though it may just be that it uses signed integers (too small). As well, Chris isn't sure whether the code is open-source or not, so he's looking into that, too.

To do:

  • Try fixing the Karkkainen implementation of the BWT builder.
  • If that fails, extract readaligner's BWT builder.

07/23/10

Started to look into the BWT builders. First, I tried using Karkkainen's builder, but it wouldn't compile on skorpios, even though it compiled fine on my machine (both use gcc 4.4.1). So I went back to readaligner's bwt, incbwt. I ended up taking the whole incbwt folder because they were pretty much all required classes. Building a BWT works through the class RLCSABuilder.

Initially, I had concerns over a FIXME type comment in the actual RLCSA class (which RLCSABuilder builds), which said the program crashes when you enter in a sequence > 2 GB. Looking through the code more carefully though, I think the RLCSABuilder class builds a separate RLCSA for every chromosome (each of which is < 2 GB), and then merges them.

Also, I don't think using packed strings with the RLCSABuilder is possible, because it requires that the sequence it's given has no '\0's (it uses '\0's as the end of sequence marker). Also, it uses char arrays, which is pretty much hardcoded into the code, so using anything smaller isn't really possible without some major changes.

So my plan now is to just build a BWT string, pack it, run it through my rank structure, and then see how that goes from there.

To do:

  • Make sure the BWT builder works!

07/26/10

Whoops, forgot to do a post on Friday. For Friday, I worked on getting the BWT string builder integrated. Readaligner's TextCollectionBuilder class specifies some options for an FM index compatible BWT string, so I just used those settings (documented in the code). Also had a discussion with Chris on matching N's and end markers ($'s). There are pretty much two options:
  1. Build a new rank structure that supports 3 bits (for 8 different characters)
  2. "Ignore" the N's and $'s, but note their positions. After alignment, do a post-processing step where we filter out results that cross an N or $.

The first option would be easier to do, and would provide support for wildcard matching (later). The second option is much faster and memory-efficient. For the same memory footprint, my 2-bit rank structure is about twice as fast as the 3-bit structure.

Today, I started building the 3-bit rank structure, since Chris tells me that we'll need it sometime in the future anyway and the 2-bit rank structure is still fresh in my mind. It looks like the structure works pretty much as desired, though I haven't done exhaustive bug tests yet. Also, I finished the BWT string builder integration (though not for multiple chromosomes), and I've successfully built a BWT string out of chromosome 10 of the human genome, run the resultant string through my 2-bit rank structure, and gotten good ranks out of it, at least for the first 10 or 15 bases.

While typing this, I realized I forgot to change the save/load methods for the 3-bit rank structure, so I'll have to get on that tomorrow. Another thing I should do is to build a 1-bit ('regular') rank structure. The problem with this structure is that I can't interleave the lower-level count blocks, since I only need one 16-bit count per sample, as opposed to four different counts. So, it would be a big waste of memory if they were interleaved. So, I'm just going to have the counts separate from the rank structure.

Finally, for the initialization functions, I'm really considering using iterators instead of requiring an operator[] to be present. char * is already an iterator in and of itself, and it's just a matter of implementing an iterator for the packed string. This way, it's much more flexible, and I might later be able to make a "streaming" class that loads from a file a little bit at a time.

To do:

  • Fix the save/load function of the 3-bit rank structure
  • Maybe add iterator support
  • Make a one-bit rank structure

07/27/10

Pretty much finished up what I wanted to do from yesterday. The 3-bit rank structure has its bugs fixed, I changed the initialization from a StringType and length to iterators for all the rank structures and finished (I hope) the 1-bit rank structure.

The change to iterators was actually surprisingly easy because I had the rank blocks built separately from the count blocks (even though they were part of the same array). Hopefully, this will make everything more flexible. As a bonus, I also cleaned up the rank structure classes quite a bit, removing extraneous methods and updating the documentation.

The 1-bit rank structure wasn't too bad to code, since it's so simple. I had to get rid of the interleaved count blocks because it just wasn't practical, but the memory usage is quite good because I only have to store counts for 1's (Rank0(position) = position - Rank1(position)). I haven't really done exhaustive testing for this class, so it might still have lots of bugs.

Finally, no speed tests yet, but I should have them up by tomorrow.

To do:

  • Test all the new classes for speed and bugs.

07/29/10

Yesterday, Chris and I met to discuss the next steps in building the FM index. Since the rank structures are done, we can pretty much get started on all the FM index functions. First off, we will start with implementing a count method, which simply counts the number of occurrences of a read in a reference. This method should be relatively simple (can be found in the FM index paper). Then, we will do a locate function, which will actually give the position of the sequence in the reference. This locate function requires an additional rank structure that stores positions, sampled at every few positions in the original string. Then, it's just a matter of backtracking to the last sampled position and from there, we can calculate the position of the match. Since sampling the original sequence requires another rank structure (1-bit this time), Chris argued that it might be better to sample positions in a different way to get rid of that rank structure. The disadvantage to this approach is that it can't guarantee finding a position in a constant number of backtracks, but it does allow for better sampling for the same amount of memory.
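For my own reference, the count method from the paper is just backward search over the rank structure. A sketch under the assumptions above (reads already translated to 0-3 values, Rank(c, i) counting occurrences of base c in BWT[0, i), and C[c] the number of characters sorting before base c, '$' included); not the actual NGSA code:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct FMCounter {
  std::function<uint64_t(int, uint64_t)> Rank;  // from the 2-bit rank structure
  uint64_t C[4];                                // FM-index C array for A,C,G,T
  uint64_t bwt_length;

  // Returns the number of occurrences of `read` in the reference.
  uint64_t Count(const std::vector<uint8_t>& read) const {
    uint64_t sp = 0, ep = bwt_length;  // current half-open BWT interval [sp, ep)
    for (int i = static_cast<int>(read.size()) - 1; i >= 0; --i) {
      int c = read[i];
      sp = C[c] + Rank(c, sp);         // extend the match by one character
      ep = C[c] + Rank(c, ep);
      if (sp >= ep) return 0;          // interval empty: no occurrences
    }
    return ep - sp;
  }
};
```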

Chris and I have also decided on a naming scheme, as follows:

  • 1-bit = BitSequence
  • 2-bit = DNASequence
  • 3-bit = EDNASequence (for "Enhanced DNA sequence")
  • Rank structures have "Static" prepended
  • Packed strings don't have "Static"

I also benchmarked the various rank structures, shown below. These benchmarks were done on skorpios in the /var/tmp/ folder. Times are reported over 25,000,000 calls to rank on a 512,000,000 base sequence with varied sampling rates. The graph reports times on the y axis (milliseconds) vs memory usage on the x axis (bytes). As can be seen, the StaticEDNASequence is much slower than StaticDNASequence given the same memory usage, but one good thing is that they're not too much slower given the same sampling rate.

Rank graph

Today, I successfully implemented a count function using the StaticEDNASequence. Also tried to work out a nice hack for getting rid of the end markers when using the StaticDNASequence, but it doesn't seem possible. So right now, I'm going to build in another data structure into the StaticDNASequence that keeps track of all the $'s and subtracts these from the A ranks. It might get complicated, but hopefully the speed difference won't be too big.

To do:

  • Build the new data structure into StaticDNASequence
  • Benchmark the FM index count method using both 3-bit ranks and 2-bit ranks, to get a good idea of what speed we may be able to achieve.
>
>
July 2010 archive
 

07/30/10

Changed:
<
<
Started to implement the new StaticDNASequence. The build process is going to get a little ugly, but the ranks should still be relatively clean (which is what matters). To make it easier on myself, I'm just going to do one pass through the whole sequence first to find all the N's and build the new data structure, then build the rest of the rank structure (rank blocks + count blocks). This is kind of wasteful because I can actually build the N structure while building the rank structure (for just one pass), but I think it might be harder, given my current code. So, the build process will be a bit slower (which I think is okay...).
>
>
Started to implement the new StaticDNASequence. The build process is going to get a little ugly, but the ranks should still be relatively clean (which is what matters). To make it easier on myself, I'm just going to do one pass through the whole sequence first to find all the N's and build the new data structure, then build the rest of the rank structure (rank blocks + count blocks). This is kind of wasteful because I can actually build the $ structure while building the rank structure (for just one pass), but I think it might be harder, given my current code. So, the build process will be a bit slower (which I think is okay...).
  Finally, I did some benchmarks on the count function. Test parameters are:
  • read length = 35
Changed:
<
<
  • reference = e coli genome NC_010473, minus two characters (there was one Y and one R...) (length = 4,753,080 bases)
>
>
  • reference = e coli genome NC_010473, minus two characters (there was one Y and one R...) (length = 4,686,136 bases)
 
  • number of reads = 10,000,000, generated randomly
    • this randomly generated set was copied 10 times, for a total of 100,000,000 calls to count.
  • random seed = 1234
Line: 325 to 36
 Another note is that for the 3-bit structure, it requires a sampling rate of 2048 to be roughly comparable in memory usage. However, at this rate, it's much slower than the 2-bit structure (roughly 5-6 times slower).

To do:

Changed:
<
<
  • Finish the N data structure implementation in the StaticDNASequence
>
>
  • Finish the $ data structure implementation in the StaticDNASequence

08/04/10

Spent all of yesterday and most of today finishing up the $ data structure for the StaticDNASequence class. I've finished testing it thoroughly (I hope). I've also integrated it with the FM Index and tested the count method to verify the output. Basically, I used the std::string::find method to look through a reference sequence (I used NC_010473, the E. coli genome) to "manually" count the occurrences of some read. Then, I used the actual FMIndex::Count method to count it and verified that they were the same. I tested a bunch of random sequences, and every one of them came out correct. So, I'm pretty sure that my rank structure and count methods are correct.
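The cross-check itself is simple enough to sketch (illustrative only; fm_count stands in for the value returned by FMIndex::Count):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Count occurrences of `read` in `reference` naively with std::string::find
// and compare against the index's answer.
void CheckCount(const std::string& reference, const std::string& read,
                uint64_t fm_count) {
  uint64_t naive = 0;
  for (std::size_t pos = reference.find(read); pos != std::string::npos;
       pos = reference.find(read, pos + 1)) {
    ++naive;  // counts overlapping occurrences as well
  }
  assert(naive == fm_count);
}
```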

Also did a few benchmarks using the new 2-bit "enhanced" version, repeating some benchmarks from last time. Test parameters:

  • read length = 35
  • reference = e coli genome NC_010473, minus two characters (there was one Y and one R...) (length = 4,686,136 bases)
  • number of reads = 100,000, generated randomly
    • this randomly generated set was copied 1000 times, for a total of 100,000,000 calls to count.
  • random seed = 1234
  • times are reported in milliseconds wall time.
  • Build version is 64-bit, Release mode (-O3)
  • Memory usage is just the memory used by the rank structure
  • There was one $ in the sequence, no N's

| Sampling rate (bases) | Memory usage (bytes) | Type | SSE 4.2? | Time (ms) | Ratio to 2-bit |
| 64 | 1,759,632 | 2-bit | No | 27,071 | 1 |
| 256 | 1,320,312 | 2-bit | No | 46,726 | 1 |
| 2048 | 1,192,176 | 2-bit | No | 137,583 | 1 |
| 64 | 1,761,936 | 2-bit "special" | No | 27,778 | 0.97 |
| 256 | 1,322,640 | 2-bit "special" | No | 47,574 | 0.98 |
| 2048 | 1,194,728 | 2-bit "special" | No | 138,833 | 0.99 |
| 64 | 2,931,176 | 3-bit | No | 31,513 | 0.86 |
| 256 | 2,052,536 | 3-bit | No | 53,110 | 0.88 |
| 2048 | 1,796,264 | 3-bit | No | 156,759 | 0.88 |
| 64 | 1,759,632 | 2-bit | Yes | 19,747 | 1 |
| 256 | 1,320,312 | 2-bit | Yes | 32,898 | 1 |
| 2048 | 1,192,176 | 2-bit | Yes | 74,230 | 1 |
| 64 | 1,761,936 | 2-bit "special" | Yes | 20,471 | 0.96 |
| 256 | 1,322,640 | 2-bit "special" | Yes | 33,809 | 0.97 |
| 2048 | 1,194,728 | 2-bit "special" | Yes | 76,269 | 0.99 |
| 64 | 2,931,176 | 3-bit | Yes | 24,855 | 0.79 |
| 256 | 2,052,536 | 3-bit | Yes | 37,847 | 0.87 |
| 2048 | 1,796,264 | 3-bit | Yes | 89,793 | 0.84 |

To do:

  • Start doing a Locate method.
  • Maybe do some code cleanup
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 66 (2010-07-30) - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 275 to 275
 To do:
  • Test all the new classes for speed and bugs.
Added:
>
>

07/29/10

Yesterday, Chris and I met to discuss the next steps in building the FM index. Since the rank structures are done, we can pretty much get started on all the FM index functions. First off, we will start with implementing a count method, which simply counts the number of occurences of a read in a reference. This method should be relatively simple (can be found in the FM index paper). Then, we will do a locate function, which will actually give the position of the sequence in the reference. This locate function requires an additional rank structure that stores positions, sampled at every few positions in the original string. Then, it's just a matter of backtracking to the last sampled position and from there, we can calculate the position of the match. Since sampling the original sequence requires another rank structure (1-bit this time), Chris argued that it might be better to sample positions in a different way to get rid of that rank structure. The disadvantage to this approach is that it can't guarantee finding a position in a constant number of backtracks, but it does allow for better sampling for the same amount of memory.

Chris and I have also decided on a naming scheme, as follows:

  • 1-bit = BitSequence
  • 2-bit = DNASequence
  • 3-bit = EDNASequence (for "Enhanced DNA sequence")
  • Rank structures have "Static" prepended
  • Packed strings don't have "Static"

I also benchmarked the various rank structures, shown below. These benchmarks were done on skorpios in the /var/tmp/ folder. Times are reported over 25,000,000 calls to rank on a 512,000,000 base sequence with varied sampling rates. The graph reports times on the y axis (milliseconds) vs memory usage on the x axis (bytes). As can be seen, the StaticEDNASequence is much slower than StaticDNASequence given the same memory usage, but one good thing is that they're not too much slower given the same sampling rate.

Rank graph

Today, I successfully implemented a count function using the StaticEDNASequence. Also tried to work out a nice hack for getting rid of the end markers when using the StaticDNASequence, but it doesn't seem possible. So right now, I'm going to build in another data structure into the StaticDNASequence that keeps track of all the $'s and subtracts these from the A ranks. It might get complicated, but hopefully the speed difference won't be too big.

To do:

  • Build the new data structure into StaticDNASequence
  • Benchmark the FM index count method using both 3-bit ranks and 2-bit ranks, to get a good idea of what speed we may be able to achieve.

07/30/10

Started to implement the new StaticDNASequence. The build process is going to get a little ugly, but the ranks should still be relatively clean (which is what matters). To make it easier on myself, I'm just going to do one pass through the whole sequence first to find all the N's and build the new data structure, then build the rest of the rank structure (rank blocks + count blocks). This is kind of wasteful because I can actually build the N structure while building the rank structure (for just one pass), but I think it might be harder, given my current code. So, the build process will be a bit slower (which I think is okay...).

Finally, I did some benchmarks on the count function. Test parameters are:

  • read length = 35
  • reference = e coli genome NC_010473, minus two characters (there was one Y and one R...) (length = 4,753,080 bases)
  • number of reads = 10,000,000, generated randomly
    • this randomly generated set was copied 10 times, for a total of 100,000,000 calls to count.
  • random seed = 1234
  • times are reported in milliseconds wall time.
  • Build version is 64-bit, Release mode (-O3)
  • Memory usage is just the memory used by the rank structure

| Sampling rate (bases) | Memory usage (bytes) | Type | SSE 4.2? | Time (ms) |
| 64 | 1,784,768 | 2-bit | No | 27,212 |
| 256 | 1,339,168 | 2-bit | No | 46,663 |
| 64 | 2,973,048 | 3-bit | No | 31,454 |
| 256 | 2,081,848 | 3-bit | No | 52,876 |
| 2048 | 1,821,912 | 3-bit | No | 156,832 |
| 64 | 1,784,768 | 2-bit | Yes | 19,924 |
| 256 | 1,339,168 | 2-bit | Yes | 33,002 |
| 64 | 2,973,048 | 3-bit | Yes | 24,371 |
| 256 | 2,081,848 | 3-bit | Yes | 38,524 |
| 2048 | 1,821,912 | 3-bit | Yes | 90,328 |

For a (very) rough comparison, saligner, using readaligner's FM index, takes 64.461 seconds to do an exact mapping of 2.7 million reads (each 35 bases long). That works out to roughly 2,387 seconds for 100,000,000 reads. However, it's not a very fair comparison, because that run does a locate (which is more complex than count), has to read from an input file and write to an output file, and is a 32-bit build. The rank structure might also use a different sampling rate, although I believe it uses 256, which is why I included that test. So, this means we have quite a bit of room to work with!

Another note: the 3-bit structure needs a sampling rate of 2048 to be roughly comparable in memory usage. At that rate, however, it's much slower than the 2-bit structure (roughly 5-6 times slower).

To do:

  • Finish the N data structure implementation in the StaticDNASequence
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
Added:
>
>
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"

Revision 65 - 2010-07-28 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 263 to 263
 
  • Maybe add iterator support
  • Make a one-bit rank structure
Added:
>
>

07/27/10

Pretty much finished up what I wanted to do from yesterday: the 3-bit rank structure has its bugs fixed, I changed the initialization from a StringType and length to iterators for all the rank structures, and I finished (I hope) the 1-bit rank structure.

The change to iterators was actually surprisingly easy, because I had the rank blocks build separately from the count blocks (even though they were part of the same array). Hopefully, this will make everything more flexible. As a bonus, I also cleaned up the rank structure classes quite a bit, removing extraneous methods and updating the documentation.

The 1-bit rank structure wasn't too bad to code, since it's so simple. I had to get rid of the interleaved count blocks because it just wasn't practical, but the memory usage is quite good because I only have to store counts for 1's (Rank0(position) = position - Rank1(position)). I haven't really done exhaustive testing for this class, so it might still have lots of bugs.
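For the record, the 1-bit structure is basically just sampled counts of 1s plus popcounts over the intervening words, with zero ranks derived for free; here's a toy sketch (the layout and names are illustrative, not the actual class):

#include <cstdint>
#include <vector>

struct BitRank {
  std::vector<uint64_t> bits;     // the packed bit sequence, 64 bits per block
  std::vector<uint64_t> samples;  // samples[j] = number of 1s in blocks [0, j*rate)
  uint64_t rate;                  // blocks per sample

  uint64_t Rank1(uint64_t pos) const {        // 1s in positions [0, pos)
    uint64_t block = pos / 64;
    uint64_t count = samples[block / rate];
    for (uint64_t b = (block / rate) * rate; b < block; ++b)
      count += __builtin_popcountll(bits[b]);
    uint64_t rem = pos % 64;
    if (rem)  // partial block: keep only the low bits
      count += __builtin_popcountll(bits[block] & ((1ULL << rem) - 1));
    return count;
  }
  // Zeros need no extra storage: whatever isn't a 1 is a 0.
  uint64_t Rank0(uint64_t pos) const { return pos - Rank1(pos); }
};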

Finally, no speed tests yet, but I should have them up by tomorrow.

To do:

  • Test all the new classes for speed and bugs.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 64 - 2010-07-27 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 245 to 245
 To do:
  • Make sure the BWT builder works!
Added:
>
>

07/26/10

Whoops, forgot to do a post on Friday. For Friday, I worked on getting the BWT string builder integrated. Readaligner's TextCollectionBuilder class specifies some options for an FM index compatible BWT string, so I just used those settings (documented in the code). Also had a discussion with Chris on matching N's and end markers ($'s). There are pretty much two options:
  1. Build a new rank structure that supports 3 bits (for 8 different characters)
  2. "Ignore" the N's and $'s, but note their positions. After alignment, do a post-processing step where we filter out results that cross an N or $.

The first option would be easier to do, and would provide support for wildcard matching (later). The second option is much faster and memory-efficient. For the same memory footprint, my 2-bit rank structure is about twice as fast as the 3-bit structure.

Today, I started building the 3-bit rank structure, since Chris tells me that we'll need it sometime in the future anyway and the 2-bit rank structure is still fresh in my mind. It looks like the structure works pretty much as desired, though I haven't done exhaustive bug tests yet. I also finished the BWT string builder integration (though not for multiple chromosomes): I successfully built a BWT string out of chromosome 10 of the human genome, ran the resulting string through my 2-bit rank structure and got good ranks out of it, at least for the first 10 or 15 bases.

While typing this, I realized I forgot to change the save/load methods for the 3-bit rank structure, so I'll have to get on that tomorrow. Another thing I should do is to build a 1-bit ('regular') rank structure. The problem with this structure is that I can't interleave the lower-level count blocks, since I only need one 16-bit count per sample, as opposed to four different counts. So, it would be a big waste of memory if they were interleaved. So, I'm just going to have the counts separate from the rank structure.

Finally, for the initialization functions, I'm really considering using iterators instead of requiring an operator[] to be present. char * is already an iterator in and of itself, and it's just a matter of implementing an iterator for the packed string. This way, it's much more flexible, and I might later be able to make a "streaming" class that loads from a file a little bit at a time.
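The iterator-based initialization would look something like this; a sketch only, with AppendToRankBlocks and FinalizeCountBlocks standing in for the real internals:

#include <cstdint>

// Hypothetical internals of the rank structure build.
void AppendToRankBlocks(uint8_t base);
void FinalizeCountBlocks();

// Anything that dereferences to an encoded base (0-3) works here, so a plain
// 'const char *' and a future packed-string iterator can both be passed in.
template <typename Iterator>
void Init(Iterator first, Iterator last) {
  for (Iterator it = first; it != last; ++it)
    AppendToRankBlocks(static_cast<uint8_t>(*it));
  FinalizeCountBlocks();
}

A char array would then just be passed as Init(seq, seq + length).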

To do:

  • Fix the save/load function of the 3-bit rank structure
  • Maybe add iterator support
  • Make a one-bit rank structure
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 63 - 2010-07-23 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 233 to 233
 
  • Try fixing the Karkkainen implementation of the BWT builder.
  • If that fails, extract readaligner's BWT builder.
Added:
>
>

07/23/10

Started to look into the BWT builders. First, I tried using Karkkainen's builder, but it wouldn't compile on skorpios, even though it compiled fine on my machine (both use gcc 4.4.1). So I went back to readaligner's bwt, incbwt. I ended up taking the whole incbwt folder because they were pretty much all required classes. Building a BWT works through the class RLCSABuilder.

Initially, I had concerns over a FIXME type comment in the actual RLCSA class (which RLCSABuilder builds), which said the program crashes when you enter in a sequence > 2 GB. Looking through the code more carefully though, I think the RLCSABuilder class builds a separate RLCSA for every chromosome (each of which is < 2 GB), and then merges them.

I also don't think using packed strings with the RLCSABuilder is possible, because it requires that the sequence it's given contain no '\0's (it uses '\0' as the end-of-sequence marker). On top of that, it uses char arrays, which are pretty much hardcoded into the code, so using anything smaller isn't really possible without some major changes.

So my plan now is to just build a BWT string, pack it, run it through my rank structure, and then see how that goes from there.

To do:

  • Make sure the BWT builder works!
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 62 - 2010-07-22 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 217 to 217
 
  • Implement a memory function
  • Start looking at Daniel's packed string implementation. Most importantly, add an operator[] function.
Added:
>
>

07/21/10

Had a meeting with Chris and Anne today and talked about goals for the project. See Daniel's journal for the details.

Also ran some tests on antiparos using the 32-bit and 64-bit rank structures. Since antiparos's CPU is actually 32-bit, there was a huge difference in speed between the two implementations. Of course, the 32-bit version was faster, ranging from 20% to as much as 70% faster (the difference gets wider with higher sampling rates). Given this, I will be keeping both implementations, which I've also managed to merge. The only thing I don't like is that I have to have selective compilation in each of the rank functions, which makes the code look a little unwieldy. Fortunately, everything else was easy to merge by defining a constant.

I looked into Daniel's packed string class as well. I haven't actually started adding/modifying it yet, because I figure it'll be easier when I start merging the new index in; it's going to require a little CMake tweaking as well, because I want to give it a 64-bit implementation as well. Some changes I would like to make:

  • Add a 64-bit implementation and have it selectively compile (currently, the data_ member variable is a uint32_t; this should be uint64_t in the 64-bit build).
  • Add an operator[] method that returns the ith base; this shouldn't be too bad (see the sketch after this list).
  • Build the string a little better, using a hash table instead of a switch/case.
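A rough idea of what that operator[] could look like, assuming 2-bit codes packed lowest-order-first into 64-bit words (the member name and layout are guesses, not Daniel's actual class):

#include <cstdint>
#include <vector>

class PackedString {
 public:
  // Return the ith base as a 0-3 code: 32 bases per 64-bit word, 2 bits each.
  uint8_t operator[](uint64_t i) const {
    return static_cast<uint8_t>((data_[i / 32] >> ((i % 32) * 2)) & 0x3);
  }
 private:
  std::vector<uint64_t> data_;  // would be 32-bit words in the 32-bit build
};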

Finally, I started looking into BWT builders. The BWT builder is in the incbwt folder (kind of obvious now that I think about it), so it's just a matter of extracting that folder and trimming down on some classes. I think the rotational index is also in that folder, which is why there are so many files in it currently. Chris also suggested an implementation he had from Karkkainen (blockwise suffix sorting), which is what bowtie uses (I think?). This version is able to build the BWT string using much less memory, but the implementation may be buggy, though it may just be that it uses signed integers (too small). As well, Chris isn't sure whether the code is open-source or not, so he's looking into that, too.

To do:

  • Try fixing the Karkkainen implementation of the BWT builder.
  • If that fails, extract readaligner's BWT builder.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 61 - 2010-07-21 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 195 to 195
 To do:
  • Discuss with Chris about what to do next, get feedback, etc.
Added:
>
>

07/20/10

Had a long discussion with Chris today about the rank structure, and on implementing the FM index in the near future. Some important points:
  • Rank structure is nearly finished, just a few minor changes:
    • Should change sampling rate to reflect bases, not blocks of bases
    • Put in packed string support (use templates)
    • When taking in a regular string, assume I'm getting values of 0-3 instead of letters. Make a translate function to convert between letters and numbers (a possible sketch follows this list).
    • Implement a static memory function that outputs how much memory a given rank structure requires (in preparation for the "memory budget")
    • Pull out some functions and hashes into a utils file
    • Try combining 32-bit and 64-bit implementations into one file to minimize redundant code
    • Look into using a "binary" bitstring instead of two bitstrings side-by-side for cleaner code.
  • For the FM index, I'll be:
    • Taking a look at readaligner's FM index building function and pulling it out, since we won't be touching on that (building is quite complicated as well)
    • Also implement a memory budget feature for the index (when getting positions)
    • Start implementing the FM index itself, assuming 4 bases with no N's.
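One possible shape for that translate function, assuming the usual A=0, C=1, G=2, T=3 encoding (the mapping and the handling of anything else are my assumptions):

#include <cstdint>

inline uint8_t BaseToCode(char base) {
  switch (base) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 0;  // placeholder for N/$ until they're handled properly
  }
}

inline char CodeToBase(uint8_t code) {
  static const char kBases[4] = {'A', 'C', 'G', 'T'};
  return kBases[code & 0x3];
}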

Also, I revisited the 32-bit code and was actually wondering if it was significantly faster than the 64-bit rank implementation on 32-bit systems. Chris suggested I run some tests on antiparos, which probably has a 32-bit CPU. More on that later.

To do:

  • Merge the two implementations
  • Implement a memory function
  • Start looking at Daniel's packed string implementation. Most importantly, add an operator[] function.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 60 - 2010-07-20 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 187 to 187
 
  • More reading.
  • Talk to Chris about the FM index.
Added:
>
>

07/19/10

Started finalizing the rank structure. I've integrated the 32-bit and 64-bit versions using a CMake file. That took a stupidly long time because I've never been that great with CMake. I ended up reading the getting started guide to get it working. Also went through the documentation in my rank structure and made corrections here or there. Chris also suggested implementing a "memory budget" feature, but I'll have to get more details on that before I get started.

I've started reviewing how the FM index works, in preparation for building it. More details on that tomorrow, though.

To do:

  • Discuss with Chris about what to do next, get feedback, etc.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 59 - 2010-07-17 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 144 to 144
 Finally, here is the graph I promised on Friday, compared to sampling rate. Tests were done on skorpios with sequence size of 512,000,000 over 25,000,000 times each and a constant seed:

Rank graph (sampling rate)

Added:
>
>
*Notes: "Interleaved" is the normal interleaved implementation, "3-base" is the slightly more memory-efficient method which keeps track of only 3 bases, and "Count-Interleaved" is the version with counts interleaved into the rank blocks. Blue dots are versions with the tabled popcount, yellow dots have the bit-mangling popcount, while orange dots use the SSE4.2 hardware instruction.
  It's a little hard to tell, but it looks like the count interleaved implementation shows minor improvements in speed (about 5%) up until sampling rate=128, where it starts to get much slower.
Line: 175 to 176
 
  • Do some work on the Local Aligner, like optimize the FullAlignment methods
  • Read some papers/articles on GPGPUs and CUDA and also RNA folding pathways to decide a bit better on what I would do for my thesis.
Added:
>
>

07/16/10

Hopefully, I've finalized my rank structure implementation. Ran as many tests as I could think of on both versions and managed to fix a small bug. Hopefully, it's all done now.

Also did a little reading on CUDA and using GPGPUs. I found one paper detailing a GPGPU-based Smith-Waterman implementation, which I briefly glanced through and also went through the =CUDA= developer guide.

Finally, had a long discussion with Daniel about his new BowtieK1Mapper and multithreading. Daniel's new implementation seems to modify the index a little during the mapping process (he "flips" it), which doesn't play well with multithreading, since threading assumes the mapping function is constant and can be run independently. Long story short, we decided to hold off on modifications to the class, since we'll be implementing a new FM index soon, which will no doubt involve a lot of changes (the least of which is probably getting rid of the NGS_FMIndex class, which just acts as a middleman for readaligner's FMIndex). I've also decided to hold off on committing my multithreading merge to the dev trunk, because it would introduce compile errors with BowtieK1Mapper in its current state.

To do:

  • More reading.
  • Talk to Chris about the FM index.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 58 - 2010-07-16 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 159 to 159
 To do:
  • Start on 32-bit implementation
Added:
>
>

07/15/10

Whoops, I completely forgot to make a post yesterday, so I'll make a post for yesterday and today, today.

Yesterday, I managed to actually finish up my two-level implementation (turns out there were a few off-by-one errors in certain cases). I also implemented save/load functionality, so that the structure itself can be saved. To use it, just call RankStructure::Save on a RankStructure that has already been initialized (by calling RankStructure::Init). Loading works the same way: just call RankStructure::Load with a filename. Note that when loading a RankStructure, RankStructure::Init doesn't have to be called first. I'm not sure whether I should take in a file name as a string or just take in a stream pointer; readaligner's rank structure seems to take a pointer, so I might switch to that. The only problem is that I'd have to assume the file has been opened in binary mode.

Today, I managed to finish my 32-bit implementation, which runs much faster on my machine (my operating system is 32-bit). I also did a bunch of rigorous tests on both the 32-bit and 64-bit versions of the code, trying to fix all the bugs I could find. Both implementations are now pretty much valgrind-error free, and I'm pretty sure they both work correctly, too. One very annoying bug was during memory allocation, where I had the expression size << 3, with size a 32-bit unsigned integer. I later changed this to the more general size * BITS / 8, which ended up blowing up at sizes of around 2,048,000,000 characters. It turns out the multiplication is evaluated first, overflows the 32-bit integer and wraps around. Sigh...
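The wrap-around is easy to reproduce in isolation (the BITS constant and the size here are just illustrative); promoting to 64 bits before the multiplication fixes it:

#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t BITS = 2;                    // illustrative bits per character
  uint32_t size = 2200000000U;                // ~2.2 billion characters
  uint32_t bad  = size * BITS / 8;            // size * BITS wraps around 32 bits first
  uint64_t good = static_cast<uint64_t>(size) * BITS / 8;  // promote, then multiply
  std::printf("bad = %u bytes, good = %llu bytes\n",
              bad, static_cast<unsigned long long>(good));
  return 0;
}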

I also did some tests on both implementations. Initially, I found the 32-bit version to be faster on skorpios, which was very strange. However, I realized that this was because of the way I chose to represent the sampling rate (I chose to represent it in blocks, so a "sampling factor" of 1 is actually a sampling rate of either 64 or 32, depending on the implementation). So, the 64-bit implementation actually had twice the sampling rate given the same "sampling factor", which was what made it slower. Accounting for this, the 64-bit implementation is (rightly) faster.

Finally, I used the save functionality to test the memory size, just to make sure everything was going right. Using my two-level 64-bit implementation, I put in a 1,024,000,000 character sequence with a sampling factor of 1 (so sampling rate = 64) and saved it to a file. The file turned out to be 366.7 MB, which should be roughly the memory footprint of the structure. At 32 bits, with the same parameters (note sampling factor = 1, so sampling rate = 32), I get a file size of 488.8 MB. At sampling factor = 2 (sampling rate = 64), I get 366.7 MB again, which is exactly the same as the 64-bit implementation.

To do:

  • Not really sure, actually. I'm going to finalize the rank structure, then either...
  • Do some work on the Local Aligner, like optimize the FullAlignment methods
  • Read some papers/articles on GPGPUs and CUDA and also RNA folding pathways to decide a bit better on what I would do for my thesis.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 57 - 2010-07-14 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 151 to 151
 
  • Finish implementation of 2-level rank structure
  • Start on 32-bit implementation
Added:
>
>

07/13/10

Finished up the two-level rank structure. I ended up deciding to take a top-level sample every 512 blocks instead of every 1023 blocks: the speed difference is small, it's easier to calculate, and it costs very little extra space, since the top level takes up so little anyway. I also decided to go with using bit shifts.

So now, any sampling factor that is passed in is rounded down to the nearest power of two. This way, I can use bit operations instead of divisions and multiplications, which are slightly slower. Running a few tests, I see about a 10% performance gain.
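Something like this is all it takes (illustrative names; the real class presumably stores the shift alongside its other parameters):

#include <cstdint>

// Round a requested sampling factor down to a power of two and remember the
// shift, so block-to-sample lookups become '>>' instead of '/'.
struct SamplingRate {
  uint32_t shift;

  explicit SamplingRate(uint32_t requested) : shift(0) {
    while ((2ULL << shift) <= requested) ++shift;   // largest power of two <= requested
  }
  uint32_t rate() const { return 1u << shift; }
  uint64_t SampleIndex(uint64_t block) const { return block >> shift; }
};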

To do:

  • Start on 32-bit implementation
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 56 - 2010-07-13 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 129 to 129
 

07/10/10

Had a small idea that I wanted to try out quickly. Tried replacing the popcount function in basics.h for NGSA, and it did actually turn out to be faster, at least with skorpios. Times were 84.29 seconds vs 79.75 seconds, for tabular popcount vs bit-mangling popcount. So maybe we should change the function in basics to the bit-mangling version?
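For reference, the bit-mangling version in question is the standard SWAR popcount from the Wikipedia Hamming weight article, along these lines:

#include <cstdint>

inline uint32_t PopCount64(uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);                              // 2-bit sums
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);   // 4-bit sums
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                             // 8-bit sums
  return static_cast<uint32_t>((x * 0x0101010101010101ULL) >> 56);        // add the bytes
}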
Added:
>
>

07/12/10

Read over Daniel's presentation on rank structures and thought over how I'm going to implement the two-level structure today. As far as I can tell, Vigna's rank9 has no practical way of adjusting sampling rates, which is something we want. So, I think I'll stick with the regular two-level structure with a few modifications:
  • There is one "block" per 64 characters.
  • The top level will be 32-bit unsigned integers.
  • The bottom level will be made of 16-bit unsigned integers.
  • There will be one bottom level count every (sampling rate) blocks.
  • There will be one top level count every 1023 blocks (1023 is the largest number of 64-base blocks whose total count still fits in 16 bits)

The implementation shouldn't be too bad. I've decided to stick with the interleaved counts implementation for now, where all the counts are interleaved with the sequence blocks. Adding a higher level would just mean adding more blocks, this time with a constant sampling rate of 1023. Ranks would just be the sum of the nearest low-level block and the nearest top-level block. I don't think I will interleave the top level blocks, since they're much too far apart to benefit anyway.
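Put together, a rank lookup in this scheme would look roughly like the sketch below. I'm using 512 blocks per top-level sample (the value I settled on in the 07/13 entry above) so that any power-of-two bottom-level rate divides it evenly and the two sample grids line up; only one base's bitplane is shown and all the names are illustrative.

#include <cstdint>
#include <vector>

struct TwoLevelRank {
  std::vector<uint64_t> blocks;    // 64 bases per block (one base's bitplane)
  std::vector<uint32_t> top;       // absolute count every kTopRate blocks
  std::vector<uint16_t> bottom;    // count since the last top sample, every 'rate' blocks
  uint32_t rate;                   // bottom-level sampling rate in blocks (power of two)
  static const uint32_t kTopRate = 512;

  uint64_t Rank(uint64_t pos) const {
    uint64_t block = pos / 64;
    uint64_t count = top[block / kTopRate] + bottom[block / rate];
    for (uint64_t b = (block / rate) * rate; b < block; ++b)
      count += __builtin_popcountll(blocks[b]);
    uint64_t rem = pos % 64;
    if (rem)
      count += __builtin_popcountll(blocks[block] & ((1ULL << rem) - 1));
    return count;
  }
};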

Another optimization I've been thinking about is to limit the sampling rate to powers of 2. That way, when finding the nearest blocks, I can just do bit shifts instead of costly divisions. I'm also thinking about using a sampling rate of 512 for the top-level blocks, so that I can use bit shifts as well. The difference in space usage isn't very large, less than a megabyte (in theory) for a 4-billion base sequence (calculation here, multiplied by 4 for all four bases).

Finally, here is the graph I promised on Friday, compared to sampling rate. Tests were done on skorpios with sequence size of 512,000,000 over 25,000,000 times each and a constant seed:

Rank graph (sampling rate)

It's a little hard to tell, but it looks like the count interleaved implementation shows minor improvements in speed (about 5%) up until sampling rate=128, where it starts to get much slower.

To do:

  • Finish implementation of 2-level rank structure
  • Start on 32-bit implementation
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
Added:
>
>
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"

Revision 55 - 2010-07-10 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 126 to 126
  Edit: I realized I forgot to benchmark everything WRT sampling rate, so that's another item on the to-do list.
Added:
>
>

07/10/10

Had a small idea that I wanted to try out quickly. Tried replacing the popcount function in basics.h for NGSA, and it did actually turn out to be faster, at least with skorpios. Times were 84.29 seconds vs 79.75 seconds, for tabular popcount vs bit-mangling popcount. So maybe we should change the function in basics to the bit-mangling version?
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"

Revision 54 - 2010-07-09 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 108 to 108
 
  • 3-base implementation (for sampling-rate enabled, interleaved)
  • 32-bit implementation, look into two-level rank structure.
Added:
>
>

07/09/10

Finished up all the implementations for the rank structure (interleaved, 3-base, and interleaved counts, all with adjustable sampling rates) and benchmarked them. Rather than provide a huge table of all the benchmarks, I decided to show a chart:

Times are in milliseconds, tested on skorpios with a constant seed and 75,000,000 rank calls per test. Squares are interleaved, diamonds are 3-base, and triangles are interleaved-count implementations. The three colours correspond to different popcount functions: blue is the function provided in readaligner, which uses a table; yellow is the one from Wikipedia, which uses a bit-mangling method; and orange uses the hardware instruction.

Rank graph

In previous tests, I noted that the table version was faster than the bit-mangling version; it's interesting to note that in these implementations, the bit-mangling popcount is actually better. My thoughts are that once everything isn't able to be fit nicely into the cache, then the popcount table may encounter cache misses, which slows it down dramatically. It's also interesting to see that when using the tabular popcount, interleaving the counts is actually the slowest implementation. Again, this might be due to excessive cache misses from having to cache the table, and it's actually what led me to try the bit-mangling popcount one more time. It's nice to see that interleaving the counts has some effect, especially in the hardware popcount case, though.

Finally, I talked with Anne and Chris about a new project, possibly for my honours thesis, today. Chris proposed a GPGPU version of the Smith-Waterman aligner I vectorized earlier, or even a GPGPU-capable succinct index. Anne also suggested that I look into the RNA folding pathway research currently going on, and I'll probably start reading some papers on that soon.

To do:

  • Look into both the GPGPU project and get more familiarized with the RNA folding pathway project
  • Start implementing a second level in the rank structure
  • Look into a 32-bit implementation

Edit: I realized I forgot to benchmark everything WRT sampling rate, so that's another item on the to-do list.

META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"

Revision 53 - 2010-07-09 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 96 to 96
 
  • Finish sampling rate implementation.
  • Get started on 32-bit implementation?
Added:
>
>

07/08/10

Skorpios wasn't back till late so I didn't really have a chance to benchmark. I did, however, finish my sampling rate implementation. Hopefully, there aren't any bugs. I also managed to fix all valgrind errors, so now everything runs (hopefully) well.

I also managed to finish an interleaved count implementation, which was much harder. In fact, I pretty much had to do a complete recode of the core build/rank functions. I need to run a few more tests on the implementation, but it should be running fine.

Since skorpios is back now, I'll probably do a bunch of experiments tomorrow. I should also make up a 3-base sampling rate implementation, just to test the cost; again, however, if we end up going with the count-interleaved version, the 3-base version really isn't very practical.

To do:

  • Benchmark!
  • 3-base implementation (for sampling-rate enabled, interleaved)
  • 32-bit implementation, look into two-level rank structure.

Revision 52 - 2010-07-08 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 43 to 43
 
  • Investigate above behaviour a little more.
  • Start implementing precomputed blocks.
Added:
>
>

07/06/10

Had a discussion with Chris about experimental protocols. Turns out my experiment yesterday pretty much all fits in the cache (which is 6 MB), so there wouldn't be much of a difference between the sequential (interleaved) and parallel (normal) layouts. So we decided on a new protocol:
  • Use larger sequences (>=16,000,000 bases)
  • Use random access. More specifically, build an array of random numbers, then use those numbers to call random ranks. This should prevent streaming.
  • Try both hardware and software popcount.

Also, I finished implementing count blocks; currently, each 64-bit block has a count block (which is 32 bits), with 4 count blocks per rank block, one for each base.

So here are the results, run on skorpios at -O3, with 40,000,000 rank calls each. Times are wall time:

| Sequence length (*10^6) | Normal (ms) | Interleaved | Normal + HW popcount | Interleaved + HW popcount |
| 1 | 792 | 750 | 244 | 237 |
| 2 | 864 | 861 | 265 | 250 |
| 4 | 924 | 935 | 274 | 255 |
| 8 | 1134 | 1063 | 449 | 363 |
| 16 | 1567 | 1460 | 752 | 639 |
| 64 | 3857 | 3018 | 1242 | 1034 |
| 256 | 4495 | 3910 | 1901 | 1358 |
| 2048 | 5332 | 4639 | 2915 | 1957 |
Note that the 2048 million times may not be quite as accurate. The test at 1024M segfaulted for all four implementations (later note: turns out this was a problem in the driver, where I used a signed int by accident).

To do:

  • Try out using a different count structure (only store 3/4 bases, and calculate the last base) and see the performance hit.
  • Try out interleaving the count structure as well.
  • Implement adjustable sampling rates.

07/07/10

Implemented another structure with interleaved counts, as well as a count structure with only 3 bases. Since skorpios is down right now, I just did some preliminary testing with my laptop (32-bit, Intel Core 2 Duo, no hardware popcount, 6 MB cache, 3.2 GB RAM).

Again, times are in wall time, compiled at -O3, single threaded and run with 50,000,000 ranks each.

| Sequence length (*10^6) | Normal (ms) | Interleaved | Interleaved + counts | Interleaved, 3-base |
| 1 | 1046 | 985 | 1105 | 1013 |
| 8 | 1382 | 1202 | 1208 | 1245 |
| 16 | 3276 | 2768 | 2137 | 2522 |
| 32 | 5059 | 4525 | 3556 | 4474 |
| 64 | 5708 | 5100 | 4269 | 5292 |
| 128 | 5901 | 5303 | 4617 | 5545 |
| 256 | 5985 | 5387 | 4834 | 5686 |
| 512 | 6111 | 5556 | 4989 | 5786 |
| 1024 | 7967 | 7392 | 6690 | 7653 |

The 1024M times are again a bit dubious; I fixed the segfault bug, but I think my computer was running out of memory, which would explain the big jump in time. Regardless, I'm going to test these again on skorpios when it gets back. It doesn't look like there's too much of a performance hit with the 3-base implementation, but if we stick with interleaved counts, there's really no point in going that route, since the 4 32-bit counts fit nicely into 2 64-bit "rank blocks". Using only 3 counts wouldn't do much for memory, since those 32 bits would still be there.

I also experimented with some different popcount functions, found on the Wikipedia entry. So far, the current popcount function (which I stole from readaligner) seems faster than the one proposed by Wikipedia. However, the current function relies on a table of values, which might make it slower if there are a lot of cache misses on it, though it outperformed Wikipedia's function even at 1024M bases (I don't have the exact values; another thing to benchmark).

Finally, I started implementing the sampling rates; there's currently a bug somewhere, but it shouldn't be too bad to fix.

PS: Before I forget, I realized I actually locked the whole single-end output processing section in my multithreaded driver (I originally thought I had only locked the part that writes to output, but the lock is actually taken earlier, when the SamEntry is processed into a string). I should experiment with locking at a different place (though at this point, that's not easy unless I want to put locks in the SAM file class itself).

To do:

  • Benchmark above on skorpios.
  • Finish sampling rate implementation.
  • Get started on 32-bit implementation?

Revision 51 - 2010-07-06 - jayzhang

Line: 1 to 1
 May 2010 archive
Changed:
<
<

06/01/10

Here are a few more times, this time with all bin and data files on the /tmp/ folder with times calculated from /usr/bin/time.

Using commit 5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa and saligner, with the 125000 readset (real time)

| Type | k | max_results | Unmapped (+1662 invalid) | Search Time (s) | Avg time/read (s) |
| Exact | - | 1 | 17732 | 4.727 | 0.000037816 |
| Exact | - | 10 | 17732 | 5.01 | 0.00004008 |
| Mismatch | 2 | 1 | 5466 | 8.251 | 0.000066008 |
| Mismatch | 2 | 10 | 5466 | 9.438 | 0.000075504 |

Using BWA with flags -l 99 -k 2 outputting to SAM:

| Type | k | max_results | Unmapped | Search Time (s) | Avg time/read (s) |
| Mismatch, I guess | 2? | ? | ? | 2.832 | 0.000022656 |

Using readaligner with flags -k 2 --sam --fastq:

| Type | k | max_results | Unmapped | Search Time (s) | Avg time/read (s) |
| Mismatch | 2 | 1 | ? | 6.662 | 0.000053296 |

Bowtie with version info:

64-bit
Built on sycamore.umiacs.umd.edu
Sat Apr 24 15:56:29 EDT 2010
Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
Options: -O3 -m64  -Wl,--hash-style=both
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

| Mode | Flags | Mapped | Unmapped | Search Time (s) | Average Search Time (s) |
| 0-Mismatch | -v 0 | 105606 | 19394 | 1.310 | 0.00001048 |
| 1-Mismatch | -v 1 | 115631 | 9369 | 1.850 | 0.0000148 |
| 2/3-Mismatch | -v 2 | 118356 | 6644 | 1.878 | 0.000015024 |
| 2/3-Mismatch | -v 3 | 119430 | 5570 | 3.441 | 0.000027528 |
| Seeded Quality | -n 0 | 112317 | 12683 | 1.478 | 0.000011824 |
| Seeded Quality | -n 1 | 117659 | 7341 | 1.679 | 0.000013432 |
| Seeded Quality | -n 2 | 118356 | 6644 | 1.974 | 0.000015792 |
| Seeded Quality | -n 3 | 118330 | 6670 | 2.560 | 0.00002048 |

Also had a meeting with Chris today, where he clarified some of his previous communications. Our first priority should now be to compare Bowtie's exact matching to saligner/readaligner exact matching. Daniel took on the job of writing a better locate function for exact matches. I will be looking for more areas to optimize, fixing some bugs, and writing a better saligner wrapper that actually takes arguments. Also, since mismatches are recorded no differently from matches in CIGARs, both mismatch and exact matches can use a fixed CIGAR (e.g. 35M), instead of us having to compare reads to references to generate our own.

To do:

  • In Exact and MismatchMappers, just make a cigar on the fly instead of passing stuff to SamEntry.
  • More low level optimizations.
  • Make a better saligner wrapper.
  • Fix a bug in cigar generation (ticket 53df76)

06/02/10

I did a bunch of minor changes to the IO files to optimize them a little further. Most of the changes were just reserving space for strings/vectors beforehand, fixing a few unneeded lines, etc. I also looked into ticket 53df76 (buggy CIGAR output on indels), and found that it's not really a bug with the CIGAR generator. The bug actually comes from using the LocalAligner after a gap alignment to generate the "dashes" that the CIGAR generator needs. However, the behaviour for LocalAligner is unpredictable when used in this context; sometimes it will clip the start or end of the read and throw in mismatches instead of just inserting the gaps (which is actually what it's supposed to do, given it's local). So, there actually aren't bugs in either the CIGAR generator or the LocalAligner. Unfortunately, we can't really fix the bug given the current situation, because there's no other way to put dashes into the aligned ref/reads. The only solution is to either figure out how IndelQuery does its CIGAR generation or write our own IndelQuery (which we might just do anyway). Since the indel support is lower priority, I'm going to table this bug for later.

Another optimization I did today was make a convenience method in SamEntry to generate an entry representing an exact match (as per Chris' suggestion yesterday). Because exact matches are so predictable, it saves us having to generate the CIGAR, and a bunch of the tags. We also don't need to extract the original reference from the index, since it's not really needed. Mismatches should also be similar, since mismatches aren't represented differently from matches in CIGARs. However, we do have an MD:Z tag that supports mismatches (like BWA), but I'm not sure whether this is needed at all.

Finally, I added support for arguments in saligner, where we can specify reference, reads and output for convenience, so we don't have to recompile every time we need to use a different read file.

Anyways, I decided to run some tests on my newest changes using the 2622382 reads file just for completeness and for more accuracy, since I didn't do any tests on that file yesterday, either.

Under commit c6f55e33aa732c8952d1f56fa4c1fe6aa3875677, with the 2622382 reads file and Release build:

| Type | k | max_results | Unmapped (+20957 invalid) | Search Time (s) | Avg time/read (s) |
| Exact | - | 1 | 350717 | 86.634 | 0.000033036 |
| Exact | - | 10 | 350717 | 95.99 | 0.000036604 |

And with Bowtie (same version info as above):

| Mode | Flags | Mapped | Unmapped | Search Time (s) | Average Search Time (s) |
| 0-Mismatch | -v 0 | 2250708 | 371674 | 27.341 | 0.000010426 |

To do:

  • Do some profiling again...I haven't done that in a while and there's been a lot of changes to the code.
  • Start reading up on FM index.

06/03/10

Worked on more optimizations, this time guided by Valgrind. From KCacheGrind, I found that about 10-15% of our cycles were being used up in IO, especially making/writing SAM entries. So I went in and did a few more minor optimizations, such as using ostringstream instead of stringstream. Preliminary results (tested on my machine) shows it runs something like 10% faster now.

Up until now, I've really only been testing exact matches with saligner over the 125,000 reads file. While doing one final test on the 2M read file (to verify output and do a preliminary speed test), I noticed a lot of disk activity after about 40k reads! I'm guessing this is what cache misses look like (it could also just be reading from the reads file, I'm not really sure; I'll have to confirm), so I'm going to have to do some more profiling with Callgrind. Granted, my system doesn't have as much memory as skorpios, but this might be one of the areas we can improve on.

Other than that, I think it might be a good idea to start reading up on FM indexes, because I really can't do much else without knowing how the aligner works. So tomorrow, I might start on my reading (probably following in Daniel's footsteps).

To do:

  • Check if we're really getting lots of cache misses.
  • Start on my reading!

Update: Ran a cachegrind over 2M reads, and didn't get very many cache misses (something like <1%), so I guess it was actually reading from input? I'll have to check out the log a bit more tomorrow. On a side note, the profiling took about 2.5 hours with 2M reads on my machine (don't even know the time because the built-in timer went negative), so it's not a very practical thing to do...

06/04/10

I feel like I've done all the minor optimizations on the I/O classes at this point, and probably won't be improving speeds much. So I spent the day reading up on FM indexes using the same paper Daniel listed a while back. I now know how the BWT works and how count functions work on the index. I haven't really gotten through locate yet, but it shouldn't be too bad.

With my new-found knowledge, I took another look at the exact matching implementation in readaligner and now I feel like I get it a bit more. I tried to look up some papers on wavelet trees (I think that's what wv_tree stands for) and the rank functions, since that's pretty much the bottleneck but I couldn't really find much stuff on it (not that I tried that hard).

So for later:

  • Find some literature on wavelet trees and take a look at the rank function some more
  • Do some more reading...

06/07/10

Chris suggested finding the ratios between locating with 1 mismatch vs doing an exact locate for both bowtie and saligner. So here it is:

Run on skorpios with saligner commit 5f7f77021a321eab0019825bec36969209e707b6 and using the time command.

| Aligner | Exact time (s) | 1-mismatch time (s) | Mismatch:Exact ratio |
| saligner, max_results = 1 | 79.397 | 107.989 | 1.360 |
| saligner, max_results = 10 | 86.937 | 127.223 | 1.46 |
| bowtie (-v 0/1) | 27.115 | 29.966 | 1.105 |
| readaligner (-k 0/1) | 68.646 | 87.841 | 1.280 |
Note: readaligner returns 1 max result, I believe.

Afternoon update: Well, I managed to shave another 5 or so seconds off saligner. Doing a brief benchmark, with same conditions as above: saligner, max_results = 1: 73.58s for exact alignment.

It's a lot closer to readaligner now. I ended up changing all the ostringstreams and ofstreams to sprintf and fprintf, which made it quite a bit faster. I'm debating whether I should change the reading code to cstdio as well, but that might get more complicated, since I don't know the actual size of each line and the C input functions all require a char array buffer. Our saligner also does a bit more than readaligner; for example, our FASTQ file processing is much more robust and follows the FASTQ spec better.

Chris also mentioned that something else I can do is to replace the rank structure. He mentioned Daniel was looking at a 64-bit implementation, so that might be something to look at? I'll need to get more details later.

For tomorrow:

  • A few more minor optimizations.
  • Take a look at bowtie source code, maybe?

06/08/10

Decided to do one last benchmark with 32-bit bowtie instead of 64-bit bowtie, which might be a more accurate comparison, given that readaligner doesn't really take advantage of 64-bit registers (I don't think...). Anyways, here are the results:

| Aligner | Exact time (s) | 1-mismatch time (s) | Mismatch:Exact ratio |
| bowtie (-m32 compiled) (-v 0/1) | 41.526 | 46.473 | 1.119 |

Also had a meeting with Daniel and Chris today (see Daniel's journal, June 8 entry for details).

After the meeting, I did a bit of tinkering with the FASTQ reader. After some discussion with Daniel, we decided it might be better to trade robustness for speed in the reader, as most FASTQ output these days is more standardized (for example, no wrap-around for sequence or quality strings). By taking all those checks out, I was able to get another roughly 10-15% increase in speed. I also started doing some work on converting our C++-style strings to C strings. I started with the FASTQ reader again, but it's going to end up being a huge change that'll affect almost all of our classes. At the same time, I think I should also start looking into using regular arrays instead of vectors where applicable.

To do:

  • Pretty much refactoring (converting C++ strings to C strings).

06/09/10

Experimented a bunch with char arrays vs C++ strings. I did a quick and dirty replacement of strings with char arrays in the FastqEntry and FastqFile classes, and then fixed all the compile errors that came with it, but there actually wasn't too big a difference. In fact, I couldn't really see a difference at all. Granted, I did do a pretty poor job of doing the replacement (just wanted results), so it might not have been as optimized as it could be, but it seems that changing over wouldn't make that much of a difference.

I also benchmarked my changes from yesterday (swapping out the Fastq read function). It seems saligner is now faster than readaligner (at least in exact matching). Anyways, here are the times:

| Aligner | Exact time (s) | 1-mismatch time (s) | Mismatch:Exact ratio |
| saligner, max_results=1 | 64.461 | 94.434 | 1.465 |

The mismatch:exact ratio is still a bit off from bowtie's and readaligner's, but that might also be because mismatches still unnecessarily go through the CIGAR generator (which I should fix sometime).

There was quite a bit of discussion on some points discussed yesterday, detailed in Daniel's journal, June 10 entry.

Finally, I spent the rest of my time converting strings to char arrays, though I'm still not sure if it's making a difference. Regardless, it should be more future-proof.

There's also one point I'd like to look at in the ExactMapper, though I doubt it has much of a performance impact: when the total number of results for an alignment is greater than max_results, it seems we do a Locate on all the results, not just the first max_results. This should be a pretty easy fix.

To do:

  • More converting
  • Look at fix mentioned above.

06/10/10

Bahhhhh spent all day converting strings to c-strings. I've decided to convert only things having to do with sequences and qualities to c-strings. To make it easier to see, I put up some typedefs:
#define NGSA_READLEN 128
typedef char SequenceBase;
typedef char QualityBase;
typedef SequenceBase SequenceString[NGSA_READLEN+1];
typedef QualityBase QualityString[NGSA_READLEN+1];
I'm about 75% done, I think. I finished the I/O classes and the drivers, as well as half the mappers; I just need to convert the rest, and then start on the pair-end matcher and some other stuff. I also optimized some of the lower-level algorithms (such as ngsa::ReverseComplement) to make better use of char arrays. I hope the end result is a speed-up, or at least not a slow-down! Some parts still aren't very efficient, because other methods still take in C++ strings, so I end up having to do a conversion (which takes O(n) time). I also think strlen is O(n), versus O(1) for std::string::size() (since std::string can just update its size whenever it grows or shrinks), so I should avoid calling strlen too often.

To do:

  • More conversion...

06/11/10

Finally finished converting almost all the NGSA library classes to c-strings, except the local aligners. I'm going to forgo the aligners for now because I'm planning on doing some more optimization on them later, so I'll have to change the code there anyway. I'm still not sure whether all my changes work, though, because I still can't seem to compile! It seems there's a problem with my common header, where I store all the constants and typedefs. For some reason, when it gets to linking all the libraries, it keeps giving me errors in whatever headers single_end_aligner.cc includes that also use my typedefs, saying that it can't find the definitions. I spent a long time on Google looking up forward declaring headers and typedefs, etc., but I still can't find a solution...maybe I'll have to ask Chris on Monday or Tuesday.

I also did a few benchmarks on a bunch of string vs. char array functions, and I've developed the following guidelines for any future compatibility issues or conversions:

  • Converting a string to char array using std::string::c_str() or std::string::data() are negligibly fast, so don't worry about doing it!
  • There's no difference as far as I can see between std::string::data() and std::string::c_str() in terms of speed.
  • Converting a char array to string (using std::string(const char *) or std::string::operator=(const char *)) is very slow (O(n)), avoid as much as possible.
  • Using strncpy with some pointer math and appending a NULL terminator is much faster than string::substr (something like 4-5x faster); see the snippet after this list.
  • std::string::operator[] is a bit slower than a char array's [] operator as well, probably because the string does bounds checking (not sure on this). Even calling std::string::c_str() and then using the [] operator is faster (though not by much)
  • Also, be wary of the string's operator=(const char *) assignment, which sometimes masks a char array --> string conversion. Try being as explicit as possible when setting a string equal to a char array (by using the constructor), to avoid confusion
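The substr replacement mentioned above looks like this (a small helper of my own for illustration, not an existing NGSA function):

#include <cstring>

// Copy 'len' characters of 'src' starting at 'start' into 'dest' and
// terminate it ourselves, since strncpy won't if it copies exactly 'len' chars.
void Substr(const char *src, size_t start, size_t len, char *dest) {
  std::strncpy(dest, src + start, len);
  dest[len] = '\0';
}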

Finally, I think I might take Chris up on his offer for taking a break to study for my MCAT. I'll see how far I get on the weekend, but I might take Monday off for an extra study day. I guess we'll see after the weekend.

To do:

  • Finish the conversion by getting stuff to compile! (I really hope saligner ends up being faster).

06/18/10

Back from my MCAT!

Finally got my code to compile for the C-string conversion. Now it's off to fix bugs and segmentation faults! I managed to clear up all the segmentation faults during the actual alignment process, but the output's still not quite right: my new build seems to miss most of the reads that are alignable with an old build. Also, I get a segmentation fault during the index building process, which is really strange because I haven't even touched that code; no new code is even called until after the index has been built! Maybe I'll have to run the build through the debugger...

I also spent a large portion of the day experimenting with an idea I had last night. When calculating the rank value for a position i, instead of always going to the previous pre-calculated position and doing popcounts up until i is reached, I modified the function so it goes to the nearest pre-calculated position and either scans up or down from there. According to callgrind, popcount gets called only about two-thirds as often on the 125000 read dataset (170,862,051 vs. the original 254,181,103 calls). Unfortunately, this only results in a very small speed increase, something like 3-5%, which is strange because I expected the gain to be a lot more.
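The change amounts to something like this (a stand-alone sketch; sample[j] holds the precomputed rank at block j*rate, and I assume a next sample always exists):

#include <cstdint>

uint64_t RankNearest(const uint64_t *bits, const uint64_t *sample,
                     uint64_t rate, uint64_t i) {
  uint64_t block = i / 64;
  uint64_t prev = (block / rate) * rate;   // sampled block at or before 'block'
  uint64_t next = prev + rate;             // sampled block after 'block'
  uint64_t count;
  if (block - prev <= next - block) {      // previous sample is closer: scan up
    count = sample[prev / rate];
    for (uint64_t b = prev; b < block; ++b) count += __builtin_popcountll(bits[b]);
  } else {                                 // next sample is closer: scan down
    count = sample[next / rate];
    for (uint64_t b = block; b < next; ++b) count -= __builtin_popcountll(bits[b]);
  }
  uint64_t rem = i % 64;
  if (rem) count += __builtin_popcountll(bits[block] & ((1ULL << rem) - 1));
  return count;
}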

Edit: Just noticed Daniel's started to implement a different rank structure, so my change might not even be relevant :(.

Finally, I noticed Daniel's made a lot of changes...not looking forward to the huge merge later...

To do:

  • Fix bugs in the new C string version.

06/22/10

I realized I forgot to make a post yesterday, so here it is:

Worked on moving over from C strings still. I managed to fix all the segmentation faults, even with the index-building step. Also fixed most of the bugs that arose from the conversion. Now, the exact aligner returns exactly the same output. The OneMismatchMapper doesn't quite do it, but I'm hoping the merge will take care of that.

Also worked on merging the branches, which might take a bit of time. There's so many new changes!

Entry for today: Had a meeting with Anne and Chris today (see Daniel's journal for notes). Anne and Chris have been discussing giving me a bigger aspect of the project to work on, which is great.

Chris suggested three things from the top of his head:

  • Work on building the index (which currently requires something like 12 GB for the human genome) in a memory-efficient fashion
  • Build an FM index from scratch, which would require things like a rank structure, etc.
  • Work on RNA Seq stuff (intron support, splice graph, etc)

So far, option 2 seems the most interesting to me, but all of them are pretty good!

Anyways, managed to finish merging everything. Mismatches and Exact matching work fine so far. The jumpstart mapper doesn't work (segfaults somewhere), but I think that might be something to do with the jump table not saving to the right place. Daniel also made some bug fixes, so I think I'll have to do some final (minor) merges, and then I'll be done the conversion!

To do:

  • Think about some bigger parts of the project I might like to do.
  • Finish up the C-string conversion (hopefully, done by tomorrow).

06/23/10

FINALLY finished porting everything over to char arrays. I'm pretty sure everything's solid: I do as much as possible to prevent buffer overflows, and the output looks the same as the old aligner with saligner and most mappers. Speed-wise, it's pretty similar; I really can't tell if there's even been an improvement. I guess this port was more about future-proofing (with packed strings) than anything.

Anyways, some outstanding bugs:

  • MutantMapper output isn't quite the same as the previous aligner. I think there are two different bugs:
    • CIGAR strings on some reads add up to 36 on 35 nucleotide reads, which is strange...for example, one string will show up as "33M3S", where the old aligner shows "33M2S".
    • For reverse strand reads, some positions seem to be 1 base less than the old aligner. I'm really not sure why...
  • KmerMapper output also isn't quite the same:
    • CIGAR strings for some reads also add up to 36, just like MutantMapper
    • Regardless of forward/reverse strand, some alignment positions are off, but by more than 1. This also screws up the CIGAR string.

The similarities between the above two cases seem to point to something they have in common going wrong. They both use a LocalAligner as the final step of their alignment, so I'm thinking there might be something wonky going on there. I haven't really touched the LocalAligner yet, because I want to make some changes to it later on anyway, so I might as well put that off. Also, these two mappers won't be used much for the time being (I don't think), so I'll just set this aside for now...

Chris also suggested I take a look at threading the aligner in the meantime, while he's still thinking up something for me to do. He suggested looking into Boost threads, which I did. From what I saw, the case for them is mostly along the lines of "boost threads are easier to use" and "why not use boost threads?", so I think I'll just stick with those. Now I just have to figure out how to add the library to CMake for building. The actual threading looks kind of simple, since I can just have each thread work on a different read. The only shared objects the aligner threads would have are the input/output classes (SamFile and FastqFile) as far as I can see, and I think locking them up won't be too hard.

To do:

  • Push my latest changes after doing some final robustness tests (I'm kind of nervous...don't want to break anything!)
  • Add multithreading support!

06/24/10

Did some last-minute verifying for my C-string conversions (still a bit nervous about it!), and it all looks fine, aside from the KmerMapper and MutantMapper, detailed above. So I finally pushed the changes into dev.

Also began looking at multi-threading. It doesn't look too hard, and I've verified that the only places requiring locks should be with FastqFile and SamFile. Also, the two mappers, MismatchMapper and GapMapper, which use readaligner's query classes (MismatchQuery and IndelQuery respectively), might need locks when querying, because these seem to have shared variables. This might not be such a good plan because the bulk of the calculation is in the query segment, so it might as well be single threaded. To make matters worse, none of the mapper classes are copyable, so making copies of the mappers for each thread isn't an option. Nonetheless, this shouldn't be too big of a problem, especially if we're building a new index, since we probably won't be using the query classes much anyway.

On another note, I tried integrating the Boost.Thread libraries in, but I really don't know how with CMake. I tried extracting the Thread library using boost's bcp tool, and then adding that library into our build but I got a lot of compile errors; I think the thread libraries are compiled specially with bjam. So I'll have to ask Chris about that tomorrow.

To do:

  • Integrate the Boost.Thread library.
  • Finish up multi-threading.

06/25/10

Finally got the Boost.Thread library to compile with NGSA. Interesting thing...to test if linking worked, I did an #include <boost/thread/thread.hpp> in the saligner code, but after all our other includes. When I tried to compile, I would get compile errors from within Boost. But, when I moved the include statement to the top of the include list, it was perfectly fine! None of the other NGSA classes include the boost thread library either, so that was quite strange. I guess there's yet more incentive to follow the Google guidelines for includes?

Anyways, I've added options in the CMakeLists.txt file for configuring multithreading, and also SSE support. For threading, I've made a new driver called MultithreadedDriver (I'm going to change this to ThreadedDriver), which takes in another driver. It will then create x number of threads, each calling the driver's Run method. If the user wants no multithreading, then ThreadedDriver just calls the Run method of the driver it gets without making any new threads.
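
Roughly, the wrapper could look like the sketch below; Driver, Run and the constructor signature are stand-ins, not the real NGSA interfaces.
#include <boost/bind.hpp>
#include <boost/thread/thread.hpp>

// Minimal sketch of the wrapper idea; error handling is omitted.
class Driver {
 public:
  virtual ~Driver() {}
  virtual void Run() = 0;
};

class ThreadedDriver : public Driver {
 public:
  ThreadedDriver(Driver *inner, int num_threads)
      : inner_(inner), num_threads_(num_threads) {}

  virtual void Run() {
    if (num_threads_ <= 1) {
      // No extra threads requested: just run the wrapped driver directly.
      inner_->Run();
      return;
    }
    boost::thread_group workers;
    for (int i = 0; i < num_threads_; ++i) {
      // Each worker shares the same driver, so the driver's I/O must be locked.
      workers.create_thread(boost::bind(&Driver::Run, inner_));
    }
    workers.join_all();
  }

 private:
  Driver *inner_;
  int num_threads_;
};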

To do:

  • Finish up threading.

06/26/10

It works! The multi-threaded driver works, at least with a minimal CMakeLists.txt and no bells and whistles. For some reason, I had a bit more trouble linking, and the test file I was able to compile on Friday wouldn't compile, despite pulling it directly from the website! So I had to reinstall Boost and it seems to work now...I feel like I've spent more time trying to get this thing to build than actually coding it!

I noticed something strange while running the threaded driver: it seems that when I specify 2 threads, it doesn't use both cores of my CPU, but when I specify >=3 threads, it seems to start utilizing both cores.

The threaded driver does work though; at least, it aligns all the 125k reads fine. I still have to lock up the driver files (single end driver and pair end driver), because I'm still running into a minor race condition when approaching the end of the file. Also, I still have to add the alignment status reporting into the threaded driver (e.g. report how many reads were aligned, not aligned, failed, etc). Finally, I need to verify the output of my threaded aligner, which I'm not really sure how to do; it seems to me that the threaded aligner might print reads out of order, so a simple diff won't suffice...

To do:

  • Add the "bells and whistles" to threaded driver.
  • Lock up driver classes to avoid race conditions.
  • Verify output.
  • Benchmark.

06/29/10

Finally got the multi-threaded aligner working, with pretty much the same functionality as before. I'm pretty sure I solved all the race conditions (with the exception of the two Query classes from readaligner), but I still don't really know how to compare the output to the old version, since everything is out of order. For now, I just took a quick glance through the output to check that every line lines up at the end properly (since I'm using the exact aligner and each read is 35nt, every line should have pretty much the same length, given 8 spaces for each tab). It looks pretty good, but even going through the file like that is a huge task.

I realized that saligner wasn't running at high priority when I was doing tests yesterday; running it at high priority made it utilize both CPU cores on my machine. Preliminary tests on my machine show that running with two threads drops the runtime to roughly 51-52% of the single-threaded runtime, so there's not too much overhead from thread switching. Adding more threads does slow it down a little.

Finally, I got a new project from Chris. I'm to build a new rank structure he proposed, which should utilize the cache more efficiently and does away with the wavelet tree. Basically, we will be using two arrays for the popcount, so that we can accurately represent the four bases (as opposed to one array, which would only give us two values). The implementation will involve:

  1. Implementing the structure for one block of 64-bits
  2. Implementing the structure for multiple blocks of 64-bits (there's two ways to store the blocks, parallel or serial, so I should experiment in this step or step 3.)
  3. Have multiple blocks of 64-bits, with counts for every block (there's multiple ways to store counts, so I should try experimenting here)
  4. Adjusting the sampling rate to every 32, 64, 128, 256, or 512 bits.

To do:

  • Finish up the multi-threaded implementation: do some code inlining, check everything over, test on skorpios and merge.
  • Start on doing the rank structure!

06/30/2010

Finished up the multi-threading implementation. I've still yet to verify the output (I'm kind of putting it off...). I also did some tests on skorpios as follows:

Tested on commit 68bab30017bd18dd03f0689760111f229d595238, running everything from the /var/tmp/ directory without nice. Tested with the single NC_010473.fasta genome, with the 2622382 readset, using regular exact matching (no jump-start):

| Threads | Real time (s) | User time (s) | System time (s) | Ratio | Theoretical |
| 1 (with multithreading off) | 64.389 | 60.936 | 1.592 | 1 | 1 |
| 1 (multithreading on) | 66.336 | 62.616 | 1.456 | 1.03 | 1 |
| 2 | 37.990 | 63.536 | 2.588 | 0.59 | 0.5 |
| 3 | 31.849 | 73.085 | 3.504 | 0.49 | 0.33 |
| 4 | 26.918 | 74.169 | 4.580 | 0.42 | 0.25 |
| 5 | 31.418 | 73.381 | 5.412 | 0.49 | 0.2 |

Same as above, but using OneMismatchMapper:

| Threads | Real time (s) | User time (s) | System time (s) | Ratio | Theoretical |
| 1 (with multithreading off) | 80.925 | 79.753 | 1.092 | 1.00 | 1.00 |
| 1 (multithreading on) | 80.896 | 79.521 | 1.216 | 1.00 | 1.00 |
| 2 | 44.288 | 82.521 | 2.028 | 0.55 | 0.50 |
| 3 | 36.341 | 93.194 | 2.760 | 0.45 | 0.33 |
| 4 | 31.941 | 93.630 | 4.344 | 0.42 | 0.25 |
| 5 | 32.349 | 93.550 | 5.100 | 0.49 | 0.20 |

Notes:

  • Didn't go past 5 threads, because apparently skorpios is actually an i5 processor (not an i7), without hyperthreading, so there's not that much point.
  • Ratio was calculated comparing user times with the first entry (saligner compiled with multithreading set to OFF)
  • There seems to be a slight delay at the end of the alignment, when saligner finishes (all console output is done) but doesn't actually exit. This wait lasts 2-3 seconds regardless of the threading configuration and might be one factor in the poor ratios (though not a big one; even when this is taken into account, the difference is minimal, less than 0.05)
  • User time seems to make a big jump between 2 threads and 3 threads for both matchers. Maybe this is a clue?

Shifting gears, I've also started to implement the new rank structure, as discussed yesterday. I've completed step 1 and can now build one 64-bit block from a 64-character sequence of A/T/G/C. I'm using Daniel's representation for now, where A=0, C=1, G=2, T=3. It's interesting to note that one rank operation is slightly faster than the other three, so it may be advantageous to change the alphabet assignment so that the fastest operation goes to whichever base is most common? Determining the most common base is probably too slow and/or the speedup probably wouldn't be worth it, though.
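
For reference, a minimal single-block sketch of what step 1 might look like, assuming the low bit of each 2-bit code goes into one 64-bit word and the high bit into another, with __builtin_popcountll (a GCC/Clang builtin) standing in for whatever popcount gets used; this is illustrative, not the actual NGSA code:
#include <stdint.h>

// One 64-base block. Rank(c, i) counts occurrences of base c in positions [0, i].
struct RankBlock {
  uint64_t first;   // bit j = low bit of the code at position j
  uint64_t second;  // bit j = high bit of the code at position j

  void Build(const char *seq) {  // seq: 64 characters of A/C/G/T
    first = second = 0;
    for (int j = 0; j < 64; ++j) {
      uint64_t code = 0;  // 'A'
      switch (seq[j]) {
        case 'C': code = 1; break;
        case 'G': code = 2; break;
        case 'T': code = 3; break;
      }
      first  |= (code & 1) << j;
      second |= (code >> 1) << j;
    }
  }

  int Rank(char base, int i) const {  // i in [0, 63]
    uint64_t mask = (i == 63) ? ~0ULL : ((1ULL << (i + 1)) - 1);
    uint64_t bits;
    switch (base) {
      case 'A': bits = ~first & ~second; break;
      case 'C': bits =  first & ~second; break;
      case 'G': bits = ~first &  second; break;
      default:  bits =  first &  second; break;  // 'T'
    }
    return __builtin_popcountll(bits & mask);
  }
};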

To do:

  • Rank structures!
  • Figure out why the multi-threaded implementation isn't working as well as hoped on skorpios.
>
>
June 2010 archive
 

07/01/10

Played around with multithreading some more. I tried making a light-weight test file, where I just add a value a few billion (probably more) times, just to see if it's really the aligner. I noticed that there also is a pretty big jump in user time between 2 and 3 threads, so that bit might not be saligner's fault. However, the ratios were a bit more like what they should've been (though not quite). I forgot to record down the numbers, but they weren't that crazy.
Line: 339 to 25
 
  • Explore sequential vs parallel blocks
  • Add precalculated blocks.
Added:
>
>

07/05/10

Finished an implementation of sequential blocks instead of storing the first and second blocks in two parallel arrays. Chris argued that this might make it faster because one set of two blocks can then be cached together through one cache line, since they're side by side (I think that's the gist of it).
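
Just to make the two layouts concrete, here's a sketch (names and types are mine, not NGSA's):
#include <stdint.h>
#include <cstddef>
#include <vector>

// "Parallel": the first- and second-bit planes live in two separate arrays,
// so block k touches first[k] and second[k], which sit far apart in memory.
struct ParallelLayout {
  std::vector<uint64_t> first;
  std::vector<uint64_t> second;
};

// "Sequential": the two planes of block k are interleaved side by side, so one
// 16-byte pair (and usually one cache line) holds both words for a block.
struct SequentialLayout {
  std::vector<uint64_t> planes;  // planes[2*k] = first, planes[2*k+1] = second

  uint64_t First(std::size_t k) const { return planes[2 * k]; }
  uint64_t Second(std::size_t k) const { return planes[2 * k + 1]; }
};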

I also did some benchmarks: On skorpios at O3, using a test sequence of 1280000 characters, ranking from position 1230000-1280000:

| Implementation | A (ms) | C (ms) | G (ms) | T (ms) |
| Parallel | 5164 | 4661 | 4661 | 4921 |
| Sequential | 5025 | 4793 | 4792 | 4858 |

It's interesting to note that on the parallel implementation, ranking C and G is actually faster than T! This is strange because the calculation for T involves an & operation on the two blocks, whereas C/G involves an & and a ~ operation. The times are more as expected under the sequential implementation, although T ranking should still be faster (it's got fewer operations!). What's more surprising is that the speeds for C/G are faster on parallel!

Using cachegrind, it seems that there are far fewer (about 75%) "D refs" when ranking C/G than A/T in the parallel implementation (it turns out "D refs" are main memory references). Specifically, it seems that there are far fewer D writes occurring in the C/G cases (reads are fewer as well, but not as significantly as writes). Furthermore, under the sequential implementation, the D references for all bases are more on par with the A/T rank functions in the parallel implementation. It seems there might be some specific optimization the compiler is making in the parallel C/G case. I kind of want to investigate this a bit further; ideally, I'd end up with the parallel implementation's C/G times alongside the sequential implementation's A/T times.

To do:

  • Investigate above behaviour a little more.
  • Start implementing precomputed blocks.

Revision 502010-07-03 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 330 to 330
 
  • Find out what's going on with multithreading.
  • Start implementing precalculated blocks.
Added:
>
>

07/02/10

Fixed the bug in my rank function, turns out I did miss a +1. So now I have a working rank structure that I can build and query for any length sequence. The current implementation uses two parallel uint64_t arrays to store the two (first and second) bitsequences. Before I start an implementation that stores precalculated values though, I want to do a few experiments with one array that holds each block sequentially.

I also did a few tests with the new calculation method ( ~second & first) & (0xFFFF...FF >> BITS-i ). I tried making an array of length BITS to store the precalculated masks for every i. That way might lead to fewer operations (not really sure on that point). Preliminary tests didn't show much difference, especially on -O3. The strange thing is, I actually compiled a minimal rank function using either of the two calculations as assembly, and at -O3, there wasn't even any difference in the assembly code! Strange, since the two aren't really numerically equivalent, just "popcount equivalent".
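
A small sketch of the two "popcount equivalent" forms, under the assumption that first holds the low bit, second holds the high bit, and position 0 maps to the most significant bit; the exact off-by-ones depend on the real code's conventions:
#include <stdint.h>

// Shift form: keep positions 0..i by shifting them down, then popcount.
inline int RankC_Shift(uint64_t first, uint64_t second, int i) {
  uint64_t c_bits = ~second & first;                 // 1 wherever the base is C
  return __builtin_popcountll(c_bits >> (63 - i));
}

// Mask form: keep positions 0..i with a mask (which could come from a
// precalculated table of 64 masks). The intermediate value differs from the
// shift form, but the popcount is identical.
inline int RankC_Mask(uint64_t first, uint64_t second, int i) {
  uint64_t c_bits = ~second & first;
  uint64_t keep = ~0ULL << (63 - i);
  return __builtin_popcountll(c_bits & keep);
}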

To do:

  • Explore sequential vs parallel blocks
  • Add precalculated blocks.

Revision 492010-07-02 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 317 to 317
 
  • Rank structures!
  • Figure out why the multi-threaded implementation isn't working as well as hoped on skorpios.
Added:
>
>

07/01/10

Played around with multithreading some more. I tried making a light-weight test file, where I just add a value a few billion (probably more) times, just to see if it's really the aligner. I noticed that there also is a pretty big jump in user time between 2 and 3 threads, so that bit might not be saligner's fault. However, the ratios were a bit more like what they should've been (though not quite). I forgot to record down the numbers, but they weren't that crazy.

Also worked on the rank structure a little more. I can now build sequences greater than 64 bases in length (so as big as needed, with memory restrictions of course). I figured out how to align to 16 bytes with posix_memalign. I've almost completed the rank function as well, except I'm just getting a small bug with a boundary condition (I'm pretty sure I'm just missing a +1 somewhere).

I was thinking about the calculation needed for the rank structure some more. Currently, the calculation goes something like this (for example, for a C): (~second & first) >> (BITS-i-1). Chris told me that this takes something like three operations to do, since some operations that are independent can be done in parallel. I figure the following equation would also work for calculation: (~second & first) & (0xFFFF... >> BITS-i) (these two formulas aren't numerically equivalent, but they should give the same popcount result). I'm not sure if this will do anything, but it looks like I might be able to drop the operations to 2, if I can somehow calculate the mask fast. I'm going to have to play around with that a bit more.

To do:

  • Fix that bug with the rank.
  • Play around with the rank calculation.
  • Find out what's going on with multithreading.
  • Start implementing precalculated blocks.

Revision 482010-07-01 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 283 to 283
 
  • Finish up the multi-threaded implementation: do some code inlining, check everything over, test on skorpios and merge.
  • Start on doing the rank structure!
Added:
>
>

06/30/2010

Finished up the multi-threading implementation. I've still yet to verify the output (I'm kind of putting it off...). I also did some tests on skorpios as follows:

Tested on commit 68bab30017bd18dd03f0689760111f229d595238, running everything from the /var/tmp/ directory without nice. Tested with the single NC_010473.fasta genome, with the 2622382 readset, using regular exact matching (no jump-start):

| Threads | Real time (s) | User time (s) | System time (s) | Ratio | Theoretical |
| 1 (with multithreading off) | 64.389 | 60.936 | 1.592 | 1 | 1 |
| 1 (multithreading on) | 66.336 | 62.616 | 1.456 | 1.03 | 1 |
| 2 | 37.990 | 63.536 | 2.588 | 0.59 | 0.5 |
| 3 | 31.849 | 73.085 | 3.504 | 0.49 | 0.33 |
| 4 | 26.918 | 74.169 | 4.580 | 0.42 | 0.25 |
| 5 | 31.418 | 73.381 | 5.412 | 0.49 | 0.2 |

Same as above, but using OneMismatchMapper:

| Threads | Real time (s) | User time (s) | System time (s) | Ratio | Theoretical |
| 1 (with multithreading off) | 80.925 | 79.753 | 1.092 | 1.00 | 1.00 |
| 1 (multithreading on) | 80.896 | 79.521 | 1.216 | 1.00 | 1.00 |
| 2 | 44.288 | 82.521 | 2.028 | 0.55 | 0.50 |
| 3 | 36.341 | 93.194 | 2.760 | 0.45 | 0.33 |
| 4 | 31.941 | 93.630 | 4.344 | 0.42 | 0.25 |
| 5 | 32.349 | 93.550 | 5.100 | 0.49 | 0.20 |

Notes:

  • Didn't go past 5 threads, because apparently skorpios is actually an i5 processor (not an i7), without hyperthreading, so there's not that much point.
  • Ratio was calculated comparing user times with the first entry (saligner compiled with multithreading set to OFF)
  • There seems to be a slight delay at the end of the alignment, when saligner finishes (all console output is done) but doesn't actually exit. This wait lasts 2-3 seconds regardless of the threading configuration and might be one factor in the poor ratios (though not a big one; even when this is taken into account, the difference is minimal, less than 0.05)
  • User time seems to make a big jump between 2 threads and 3 threads for both matchers. Maybe this is a clue?

Shifting gears, I've also started to implement the new rank structure, as discussed yesterday. I've completed step 1 and can now build one 64-bit block from a 64-character sequence of A/T/G/C. I'm using Daniel's representation for now, where A=0, C=1, G=2, T=3. It's interesting to note that one rank operation is slightly faster than the other three, so it may be advantageous to change the alphabet assignment so that the fastest operation goes to whichever base is most common? Determining the most common base is probably too slow and/or the speedup probably wouldn't be worth it, though.

To do:

  • Rank structures!
  • Figure out why the multi-threaded implementation isn't working as well as hoped on skorpios.

Revision 472010-06-30 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 267 to 267
 
  • Verify output.
  • Benchmark.
Added:
>
>

06/29/10

Finally got the multi-threaded aligner working, with pretty much the same functionality as before. I'm pretty sure I solved all the race conditions (with the exception of the two Query classes from readaligner), but I still don't really know how to compare the output to the old version, since everything is out of order. For now, I just took a quick glance through the output to check that every line lines up at the end properly (since I'm using the exact aligner and each read is 35nt, every line should have pretty much the same length, given 8 spaces for each tab). It looks pretty good, but even going through the file like that is a huge task.

I realized that saligner wasn't running at high priority when I was doing tests yesterday; running it at high priority made it utilize both CPU cores on my machine. Preliminary tests on my machine show that running with two threads drops the runtime to roughly 51-52% of the single-threaded runtime, so there's not too much overhead from thread switching. Adding more threads does slow it down a little.

Finally, I got a new project from Chris. I'm to build a new rank structure he proposed, which should utilize the cache more efficiently and does away with the wavelet tree. Basically, we will be using two arrays for the popcount, so that we can accurately represent the four bases (as opposed to one array, which would only give us two values). The implementation will involve:

  1. Implementing the structure for one block of 64-bits
  2. Implementing the structure for multiple blocks of 64-bits (there's two ways to store the blocks, parallel or serial, so I should experiment in this step or step 3.)
  3. Have multiple blocks of 64-bits, with counts for every block (there's multiple ways to store counts, so I should try experimenting here)
  4. Adjusting the sampling rate to every 32, 64, 128, 256, or 512 bits.

To do:

  • Finish up the multi-threaded implementation: do some code inlining, check everything over, test on skorpios and merge.
  • Start on doing the rank structure!

Revision 462010-06-29 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 254 to 254
 To do:
  • Finish up threading.
Added:
>
>

06/26/10

It works! The multi-threaded driver works, at least with a minimal CMakeLists.txt and no bells and whistles. For some reason, I had a bit more trouble linking, and the test file I was able to compile on Friday wouldn't compile, despite pulling it directly from the website! So I had to reinstall Boost and it seems to work now...I feel like I've spent more time trying to get this thing to build than actually coding it!

I noticed something strange while running the threaded driver: it seems that when I specify 2 threads, it doesn't use both cores of my CPU, but when I specify >=3 threads, it seems to start utilizing both cores.

The threaded driver does work though; at least, it aligns all the 125k reads fine. I still have to lock up the driver files (single end driver and pair end driver), because I'm still running into a minor race condition when approaching the end of the file. Also, I still have to add the alignment status reporting into the threaded driver (e.g. report how many reads were aligned, not aligned, failed, etc). Finally, I need to verify the output of my threaded aligner, which I'm not really sure how to do; it seems to me that the threaded aligner might print reads out of order, so a simple diff won't suffice...

To do:

  • Add the "bells and whistles" to threaded driver.
  • Lock up driver classes to avoid race conditions.
  • Verify output.
  • Benchmark.

Revision 452010-06-27 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 246 to 246
 
  • Integrate the Boost.Thread library.
  • Finish up multi-threading.
Added:
>
>

06/25/10

Finally got the Boost.Thread library to compile with NGSA. Interesting thing...to test if linking worked, I did an #include <boost/thread/thread.hpp> in the saligner code, but after all our other includes. When I tried to compile, I would get compile errors from within Boost. But, when I moved the include statement to the top of the include list, it was perfectly fine! None of the other NGSA classes include the boost thread library either, so that was quite strange. I guess there's yet more incentive to follow the Google guidelines for includes?

Anyways, I've added options in the CMakeLists.txt file for configuring multithreading, and also SSE support. For threading, I've made a new driver called MultithreadedDriver (I'm going to change this to ThreadedDriver), which takes in another driver. It will then create x number of threads, each calling the driver's Run method. If the user wants no multithreading, then ThreadedDriver just calls the Run method of the driver it gets without making any new threads.

To do:

  • Finish up threading.

Revision 442010-06-25 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 235 to 235
 
  • Push my latest changes after doing some final robustness tests (I'm kind of nervous...don't want to break anything!)
  • Add multithreading support!
Added:
>
>

06/24/10

Did some last-minute verifying for my C-string conversions (still a bit nervous about it!), and it all looks fine, aside from the KmerMapper and MutantMapper, detailed above. So I finally pushed the changes into dev.

Also began looking at multi-threading. It doesn't look too hard, and I've verified that the only places requiring locks should be with FastqFile and SamFile. Also, the two mappers, MismatchMapper and GapMapper, which use readaligner's query classes (MismatchQuery and IndelQuery respectively), might need locks when querying, because these seem to have shared variables. This might not be such a good plan because the bulk of the calculation is in the query segment, so it might as well be single threaded. To make matters worse, none of the mapper classes are copyable, so making copies of the mappers for each thread isn't an option. Nonetheless, this shouldn't be too big of a problem, especially if we're building a new index, since we probably won't be using the query classes much anyway.

On another note, I tried integrating the Boost.Thread libraries in, but I really don't know how with CMake. I tried extracting the Thread library using boost's bcp tool, and then adding that library into our build but I got a lot of compile errors; I think the thread libraries are compiled specially with bjam. So I'll have to ask Chris about that tomorrow.

To do:

  • Integrate the Boost.Thread library.
  • Finish up multi-threading.

Revision 432010-06-24 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 216 to 216
 
  • Think about some bigger parts of the project I might like to do.
  • Finish up the C-string conversion (hopefully, done by tomorrow).
Added:
>
>

06/23/2010

FINALLY finished porting everything over to char arrays. I'm pretty sure everything's solid: I do as much as possible to prevent buffer overflows, and the output looks the same as the old aligner using saligner and most mappers. Speedwise, it's pretty similar; I really can't tell if there's even been an improvement. I guess this port was to make it more future-proof (with packed strings) than anything.

Anyways, some outstanding bugs:

  • MutantMapper output isn't quite the same as the previous aligner. I think there are two different bugs:
    • CIGAR strings on some reads add up to 36 on 35 nucleotide reads, which is strange...for example, one string will show up as "33M3S", where the old aligner shows "33M2S".
    • For reverse strand reads, some positions seem to be 1 base less than the old aligner. I'm really not sure why...
  • KmerMapper output also isn't quite the same:
    • CIGAR strings for some reads also add up to 36, just like MutantMapper
    • Regardless of forward/reverse strand, some alignment positions are off, but by more than 1. This also screws up the CIGAR string.

The similarities of the above two cases seem to point to something going wrong in a component they share. They both use a LocalAligner as the final step of their alignment, so I'm thinking there might be something wonky going on there. I haven't really touched the LocalAligner yet, because I want to make some changes to it later on anyway, so I might as well put that off. Also, these two mappers won't be used too much for the time being (I don't think), so I think I'll just put that off for now...

Chris also suggested I take a look at threading the aligner in the meantime, while he's still thinking up something for me to do. He suggested looking into Boost threads, which I did. I don't know...from what I saw, the arguments seem to be more along the lines of "boost threads are easier to use" and "why not use boost threads?", so I think I'll just stick with Boost threads. Now I just have to figure out how to add the library to CMake for building. The actual threading looks kind of simple, since I can just have each thread work on a different read. The only shared objects the two aligner threads would have are the input/output classes (SamFile and FastqFile) as far as I can see, and I think locking them up won't be too hard.

To do:

  • Push my latest changes after doing some final robustness tests (I'm kind of nervous...don't want to break anything!)
  • Add multithreading support!

Revision 422010-06-23 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 200 to 200
  Also worked on merging the branches, which might take a bit of time. There's so many new changes!
Added:
>
>
Entry for today: Had a meeting with Anne and Chris today (see Daniel's journal for notes). Anne and Chris have been discussing giving me a bigger aspect of the project to work on, which is great.

Chris suggested three things from the top of his head:

  • Work on building the index (which currently requires something like 12 GB for the human genome) in a memory-efficient fashion
  • Build an FM index from scratch, which would require things like a rank structure, etc.
  • Work on RNA Seq stuff (intron support, splice graph, etc)

So far, option 2 seems the most interesting to me, but all of them are pretty good!

Anyways, managed to finish merging everything. Mismatches and Exact matching work fine so far. The jumpstart mapper doesn't work (segfaults somewhere), but I think that might be something to do with the jump table not saving to the right place. Daniel also made some bug fixes, so I think I'll have to do some final (minor) merges, and then I'll be done the conversion!

To do:

  • Think about some bigger parts of the project I might like to do.
  • Finish up the C-string conversion (hopefully, done by tomorrow).

Revision 412010-06-22 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 193 to 193
 To do:
  • Fix bugs in the new C string version.
Added:
>
>

06/22/10

I realized I forgot to make a post yesterday, so here it is:

Worked on moving over from C strings still. I managed to fix all the segmentation faults, even with the index-building step. Also fixed most of the bugs that arose from the conversion. Now, the exact aligner returns exactly the same output. The OneMismatchMapper doesn't quite do it, but I'm hoping the merge will take care of that.

Also worked on merging the branches, which might take a bit of time. There's so many new changes!

Revision 402010-06-19 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 179 to 179
 To do:
  • Finish the conversion by getting stuff to compile! (I really hope saligner ends up being faster).
Added:
>
>

06/18/10

Back from my MCAT!

Finally got my code to compile for the C string conversion. Now it's off to fix bugs and segmentation faults! I managed to clear up all the segmentation faults during the actual alignment process, but the output's still not quite right. My new build seems to miss most of the reads that are alignable with an old build. Also, I get a segmentation fault during the index building process, which is really strange because I haven't even touched the code for that. No new code is even called until after the index has been built! Maybe I'll have to run the build through the debugger...

I also spent a large portion of the day experimenting with an idea I had last night. When calculating the rank value for a position i, instead of going to the next smallest pre-calculated position and doing popcounts up until i is reached, I modified the function so it goes to the nearest pre-calculated position and either counts up or down from there. According to callgrind, popcount only gets called about 60% as often on the 125000 read dataset (170,862,051 vs the original 254,181,103 calls). Unfortunately, this only results in a very small speed increase, something like 3-5%, which is strange because I expected the gain to be a lot larger.
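
A toy version of the nearest-checkpoint idea, over a plain character array rather than the packed BWT and with made-up names, just to show the up-or-down scan:
#include <cstddef>
#include <string>
#include <vector>

// Precomputed rank samples every K positions; rank(c, i) is answered from the
// nearest sample, scanning forward or backward, so at most K/2 characters are
// touched instead of up to K.
class SampledRank {
 public:
  SampledRank(const std::string &text, char c, std::size_t K)
      : text_(text), c_(c), K_(K) {
    std::size_t count = 0;
    for (std::size_t pos = 0; pos <= text.size(); ++pos) {
      if (pos % K == 0) samples_.push_back(count);  // count of c in [0, pos)
      if (pos < text.size() && text[pos] == c) ++count;
    }
  }

  // Number of occurrences of c in positions [0, i).
  std::size_t Rank(std::size_t i) const {
    std::size_t block = (i + K_ / 2) / K_;            // nearest sample index
    if (block >= samples_.size()) block = samples_.size() - 1;
    std::size_t pos = block * K_;
    std::size_t count = samples_[block];
    while (pos < i) { if (text_[pos] == c_) ++count; ++pos; }  // scan forward
    while (pos > i) { --pos; if (text_[pos] == c_) --count; }  // or backward
    return count;
  }

 private:
  std::string text_;
  char c_;
  std::size_t K_;
  std::vector<std::size_t> samples_;  // samples_[b] = rank at position b*K
};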

Edit: Just noticed Daniel's started to implement a different rank structure, so my change might not even be relevant :(.

Finally, I noticed Daniel's made a lot of changes...not looking forward to the huge merge later...

To do:

  • Fix bugs in the new C string version.

Revision 392010-06-12 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 163 to 163
 To do:
  • More conversion...
Added:
>
>

06/11/10

Finally finished converting almost all the NGSA library classes to c-strings, except the local aligners. I'm going to forgo the aligners for now because I'm planning on doing some more optimization on them later, so I'll have to change the code there anyway. I'm still not sure whether all my changes work, though, because I still can't seem to compile! It seems there's a problem with my common header, where I store all the constants and typedefs. For some reason, when it gets to linking all the libraries, it keeps giving me errors in whatever headers single_end_aligner.cc includes that also use my typedefs, saying that it can't find the definitions. I spent a long time on Google looking up forward declaring headers and typedefs, etc., but I still can't find a solution...maybe I'll have to ask Chris on Monday or Tuesday.

I also did a few benchmarks on a bunch of string vs char array functions, and I've developed the following guidelines for any future compatibility issues or conversions:

  • Converting a string to a char array using std::string::c_str() or std::string::data() is negligibly cheap, so don't worry about doing it!
  • There's no difference as far as I can see between std::string::data() and std::string::c_str() in terms of speed.
  • Converting a char array to string (using std::string(const char *) or std::string::operator=(const char *)) is very slow (O(n)), avoid as much as possible.
  • Using strncpy with some pointer math and appending a NULL is much faster than string::substr (something like 4-5x faster); see the sketch after this list.
  • std::string::operator[] is a bit slower than char array's [] operators as well, probably because the string does bounds checking (not sure on this). Even doing std::string::c_str(), then using the [] operator is faster (though not by much)
  • Also, be wary of the string's operator=(const char *) assignment, which sometimes masks a char array --> string conversion. Try being as explicit as possible when setting a string equal to a char array (by using the constructor), to avoid confusion
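
A sketch of the strncpy guideline above (FastSubstr is a made-up helper; the caller has to guarantee the source actually has start+len characters and the destination has room for len+1):
#include <cstring>

// Copy `len` characters starting at `start` out of a C string, instead of
// going through std::string::substr.
void FastSubstr(const char *src, std::size_t start, std::size_t len,
                char *dest) {
  std::strncpy(dest, src + start, len);  // pointer math instead of substr
  dest[len] = '\0';                      // strncpy won't terminate it for us
}

// Usage: caller supplies a buffer big enough for len + 1 characters.
// char buf[16];
// FastSubstr("ACGTACGTACGT", 4, 4, buf);   // buf now holds "ACGT"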

Finally, I think I might take Chris up on his offer for taking a break to study for my MCAT. I'll see how far I get on the weekend, but I might take Monday off for an extra study day. I guess we'll see after the weekend.

To do:

  • Finish the conversion by getting stuff to compile! (I really hope saligner ends up being faster).

Revision 382010-06-11 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 149 to 149
 
  • More converting
  • Look at fix mentioned above.
Added:
>
>

06/10/10

Bahhhhh spent all day converting strings to c-strings. I've decided to convert only things having to do with sequences and qualities to c-strings. To make it easier to see, I put up some typedefs:
#define NGSA_READLEN 128
typedef char SequenceBase;
typedef char QualityBase;
typedef SequenceBase SequenceString[NGSA_READLEN+1];
typedef QualityBase QualityString[NGSA_READLEN+1];
I'm about 75% done, I think. I finished the I/O classes and the drivers, as well as half the mappers. Just need to convert the rest, and then start on the pair-end matcher, and some other stuff. Also optimized some of the lower-level algorithms (such as ngsa::ReverseComplement) to make better use of char arrays. I hope the end result is a speed up, or at least doesn't result in a slow-down! Some parts still aren't very efficient, because there are other methods that still take in C++ strings, where I end up having to do a conversion (which takes O(n) time). I think strlen is also O(n), versus O(1) for std::string::size (since std::string can just update the size every time the string grows/shrinks), so I should make sure to avoid calling strlen too often.
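
As an illustration of the char-array style (this is not necessarily how ngsa::ReverseComplement is actually written), an in-place reverse complement that takes the length as a parameter so it never needs strlen in the hot path:
#include <cstddef>

inline char Complement(char base) {
  switch (base) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
  }
}

// Reverse-complements seq[0..length) in place; no null terminator needed.
void ReverseComplementInPlace(char *seq, std::size_t length) {
  if (length == 0) return;
  for (std::size_t i = 0, j = length - 1; i < j; ++i, --j) {
    char tmp = Complement(seq[i]);
    seq[i] = Complement(seq[j]);
    seq[j] = tmp;
  }
  if (length % 2 == 1) {                  // middle base of an odd-length read
    seq[length / 2] = Complement(seq[length / 2]);
  }
}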

To do:

  • More conversion...

Revision 372010-06-10 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 135 to 135
 I also benchmarked my changes from yesterday (swapping out the Fastq read function). It seems saligner is now faster than readaligner (at least in exact matching). Anyways, here are the times:

| Aligner | Exact time (s) | 1-mismatch time (s) | Mismatch:Exact ratio |
Changed:
<
<
| saligner max_results=10 | 64.461 | 94.434 | 1.465 |
>
>
| saligner, max_results=1 | 64.461 | 94.434 | 1.465 |
  The mismatch:exact ratio is still a bit off from bowtie's and readaligner's, but that might also be because mismatches still unnecessarily go through the CIGAR generator (which I should fix sometime).

Revision 362010-06-10 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 139 to 139
  The mismatch:exact ratio is still a bit off from bowtie's and readaligner's, but that might also be because mismatches still unnecessarily go through the CIGAR generator (which I should fix sometime).
Added:
>
>
There was quite a bit of discussion on some points discussed yesterday, detailed in Daniel's journal, June 10 entry.

Finally, spent the rest of my time converting strings to char arrays, though I'm still not sure if it's doing anything. Regardless, it should be more future-proof.

There's also one point I'd like to look at in the ExactMapper, though I doubt it has too much of a performance impact. When the total number of results for an alignment is greater than max_results, it seems we do a Locate on all the results, not only max_results. This should be a pretty easy fix.

To do:

  • More converting
  • Look at fix mentioned above.

Revision 352010-06-09 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 129 to 129
 To do:
  • Pretty much refactoring (converting C++ strings to C strings).
Added:
>
>

06/09/10

Experimented a bunch with char arrays vs C++ strings. I did a quick and dirty replacement of strings with char arrays in the FastqEntry and FastqFile classes, and then fixed all the compile errors that came with it, but there actually wasn't too big a difference. In fact, I couldn't really see a difference at all. Granted, I did do a pretty poor job of doing the replacement (just wanted results), so it might not have been as optimized as it could be, but it seems that changing over wouldn't make that much of a difference.

I also benchmarked my changes from yesterday (swapping out the Fastq read function). It seems saligner is now faster than readaligner (at least in exact matching). Anyways, here are the times:

| Aligner | Exact time (s) | 1-mismatch time (s) | Mismatch:Exact ratio |
| saligner max_results=10 | 64.461 | 94.434 | 1.465 |

The mismatch:exact ratio is still a bit off from bowtie's and readaligner's, but that might also be because mismatches still unnecessarily go through the CIGAR generator (which I should fix sometime).

Revision 342010-06-09 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 122 to 122
 
| Aligner | Exact time (s) | 1-mismatch time (s) | Mismatch:Exact ratio |
| bowtie (-m32 compiled) (-v 0/1) | 41.526 | 46.473 | 1.119 |
Added:
>
>
Also had a meeting with Daniel and Chris today (see Daniel's journal, June 8 entry for details).

After the meeting, I did a bit of tinkering with the FASTQ reader. After some discussion with Daniel, we decided it might be better to trade robustness for speed in the reader, as most FASTQ output these days is more standardized (for example, no wrap-around for sequence or quality strings). So by taking all those checks out, I was able to get another roughly 10-15% increase in speed. I also started doing some work on converting our C++-style strings to C-strings. I started with the FASTQ reader again, but it's going to end up being a huge change that'll affect almost all of our classes. At the same time, I think I should also start looking into using regular arrays instead of vectors where applicable.
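
A minimal sketch of the stripped-down reader idea, assuming every record is exactly four lines (@name, sequence, +, qualities) with no wrap-around; names are made up and all spec-level validation is skipped:
#include <iostream>
#include <string>

struct FastqRecord {
  std::string name, sequence, qualities;
};

// Returns false at end of input (or on a truncated record).
bool ReadRecord(std::istream &in, FastqRecord *rec) {
  std::string plus_line;
  if (!std::getline(in, rec->name)) return false;       // "@read_id ..."
  if (!std::getline(in, rec->sequence)) return false;    // bases, one line only
  if (!std::getline(in, plus_line)) return false;        // "+" separator, ignored
  if (!std::getline(in, rec->qualities)) return false;   // quality string
  if (!rec->name.empty() && rec->name[0] == '@') rec->name.erase(0, 1);
  return true;
}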

To do:

  • Pretty much refactoring (converting C++ strings to C strings).

Revision 332010-06-08 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 116 to 116
 
  • A few more minor optimizations.
  • Take a look at bowtie source code, maybe?
Added:
>
>

06/08/10

Decided to do one last benchmark with 32-bit bowtie instead of 64-bit bowtie, which might be a more accurate comparison, given that readaligner doesn't really take advantage of 64-bit registers (I don't think...). Anyways, here are the results:

| Aligner | Exact time (s) | 1-mismatch time (s) | Mismatch:Exact ratio |
| bowtie (-m32 compiled) (-v 0/1) | 41.526 | 46.473 | 1.119 |

Revision 322010-06-08 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 105 to 105
 
| readaligner (-k 0/1) | 68.646 | 87.841 | 1.280 |
Note: readaligner returns 1 max result, I believe.
Added:
>
>
Afternoon update: Well, I managed to shave another 5 or so seconds off saligner. Doing a brief benchmark, with same conditions as above: saligner, max_results = 1: 73.58s for exact alignment.

It's a lot closer to readaligner now. I ended up changing all the ostringstreams and ofstreams to sprintf and fprintf, which made it quite a bit faster. I'm debating whether I should change the reading code to cstdio as well, but it might get a bit more complicated, since I don't know the actual size of each line and the C input functions all require a char array buffer. Our saligner also does a bit more than readaligner; for example, our FASTQ file processing is much more robust and follows the FASTQ spec better.
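
Illustrative only: a simplified SAM-style line written straight through fprintf instead of being assembled in an ostringstream (the field list is a toy subset, not the real SamFile output):
#include <cstdio>

void WriteAlignmentLine(std::FILE *out, const char *read_name, int flag,
                        const char *ref_name, long position, int mapq,
                        const char *cigar, const char *seq, const char *qual) {
  // One formatted write per entry; unused SAM columns are left as "*"/0.
  std::fprintf(out, "%s\t%d\t%s\t%ld\t%d\t%s\t*\t0\t0\t%s\t%s\n",
               read_name, flag, ref_name, position, mapq, cigar, seq, qual);
}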

Chris also mentioned that something else I can do is to replace the rank structure. He mentioned Daniel was looking at a 64-bit implementation, so that might be something to look at? I'll need to get more details later.

For tomorrow:

  • A few more minor optimizations.
  • Take a look at bowtie source code, maybe?

Revision 312010-06-07 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 94 to 94
 
  • Find some literature on wavelet trees and take a look at the rank function some more
  • Do some more reading...
Added:
>
>

06/07/10

Chris suggested finding the ratios between locating with 1 mismatch vs doing an exact locate for both bowtie and saligner. So here it is:

Run on skorpios with saligner commit 5f7f77021a321eab0019825bec36969209e707b6 and using the time command.

| Aligner | Exact time (s) | 1-mismatch time (s) | Mismatch:Exact ratio |
| saligner, max_results = 1 | 79.397 | 107.989 | 1.360 |
| saligner, max_results = 10 | 86.937 | 127.223 | 1.46 |
| bowtie (-v 0/1) | 27.115 | 29.966 | 1.105 |
| readaligner (-k 0/1) | 68.646 | 87.841 | 1.280 |
Note: readaligner returns 1 max result, I believe.

Revision 302010-06-05 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 85 to 85
  Update: Ran a cachegrind over 2M reads, and didn't get very many cache misses (something like <1%), so I guess it was actually reading from input? I'll have to check out the log a bit more tomorrow. On a side note, the profiling took about 2.5 hours with 2M reads on my machine (don't even know the time because the built-in timer went negative), so it's not a very practical thing to do...
Added:
>
>

06/04/10

I feel like I've done all the minor optimizations on the I/O classes at this point, and probably won't be improving speeds much. So I spent the day reading up on FM indexes using the same paper Daniel listed a while back. I now know how the BWT works and how count functions work on the index. I haven't really gotten through locate yet, but it shouldn't be too bad.

With my new-found knowledge, I took another look at the exact matching implementation in readaligner and now I feel like I get it a bit more. I tried to look up some papers on wavelet trees (I think that's what wv_tree stands for) and the rank functions, since that's pretty much the bottleneck but I couldn't really find much stuff on it (not that I tried that hard).
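
For reference, the textbook backward-search count loop looks something like this, assuming C[c] (number of BWT characters smaller than c) and occ(c, i) (occurrences of c in BWT[0, i)) are already available; this is the generic algorithm, not readaligner's actual interface:
#include <string>
#include <vector>

// Returns the number of occurrences of `pattern` in the indexed text.
template <typename OccFn>
long Count(const std::vector<long> &C, OccFn occ, long bwt_size,
           const std::string &pattern) {
  long sp = 0, ep = bwt_size;                        // current suffix range
  for (long k = static_cast<long>(pattern.size()) - 1; k >= 0; --k) {
    unsigned char c = static_cast<unsigned char>(pattern[k]);
    sp = C[c] + occ(c, sp);                          // extend the match by one
    ep = C[c] + occ(c, ep);                          //   character, right to left
    if (sp >= ep) return 0;                          // pattern does not occur
  }
  return ep - sp;                                    // number of matches
}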

So for later:

  • Find some literature on wavelet trees and take a look at the rank function some more
  • Do some more reading...

Revision 292010-06-04 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 73 to 73
 
  • Start reading up on FM index.

06/03/10

Changed:
<
<
Worked on more optimizations, this time guided by Valgrind. From KCacheGrind, I found that about 10-15% of our cycles were being used up in IO, especially making/writing SAM entries. So I went in and did a few more minor optimizations, such as using ostringstream instead of stringstream. Preliminary results (tested on my machine) shows it runs something like 10% faster now.
>
>
Worked on more optimizations, this time guided by Valgrind. From KCacheGrind, I found that about 10-15% of our cycles were being used up in IO, especially making/writing SAM entries. So I went in and did a few more minor optimizations, such as using ostringstream instead of stringstream. Preliminary results (tested on my machine) shows it runs something like 10% faster now.
Up until now, I've really only been testing exact matches with saligner over the 125,000 reads file. While doing one final test on the 2M read file (to verify output and to do a preliminary speed test), I noticed a lot of disk activity after about 40k reads! I'm guessing this is what cache misses look like (it could just be it reading from the reads file, I'm not really sure on this, have to confirm), so I'm going to have to do some more profiling with Callgrind. Granted, my system doesn't have as much memory as skorpios, but this might be one of the areas we can improve on.

Revision 282010-06-04 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 83 to 83
 
  • Check if we're really getting lots of cache misses.
  • Start on my reading!
Added:
>
>
Update: Ran a cachegrind over 2M reads, and didn't get very many cache misses (something like <1%), so I guess it was actually reading from input? I'll have to check out the log a bit more tomorrow. On a side note, the profiling took about 2.5 hours with 2M reads on my machine (don't even know the time because the built-in timer went negative), so it's not a very practical thing to do...

Revision 272010-06-04 - jayzhang

Line: 1 to 1
 May 2010 archive

06/01/10

Line: 72 to 72
 
  • Do some profiling again...I haven't done that in a while and there's been a lot of changes to the code.
  • Start reading up on FM index.
Added:
>
>

06/03/10

Worked on more optimizations, this time guided by Valgrind. From KCacheGrind, I found that about 10-15% of our cycles were being used up in IO, especially making/writing SAM entries. So I went in and did a few more minor optimizations, such as using ostringstream instead of stringstream. Preliminary results (tested on my machine) shows it runs something like 10% faster now.

Up until now, I've really only been testing exact matches with saligner over the 125,000 reads file. While doing one final test on the 2M read file (to verify output and to do a preliminary speed test), I noticed a lot of disk activity after about 40k reads! I'm guessing this is what cache misses look like (it could just be it reading from the reads file, I'm not really sure on this, have to confirm), so I'm going to have to do some more profiling with Callgrind. Granted, my system doesn't have as much memory as skorpios, but this might be one of the areas we can improve on.

Other than that, I think it might be a good idea to start reading up on FM indexes, because I really can't do much else without knowing how the aligner works. So tomorrow, I might start on my reading (probably following in Daniel's footsteps).

To do:

  • Check if we're really getting lots of cache misses.
  • Start on my reading!

Revision 262010-06-02 - jayzhang

Line: 1 to 1
Changed:
<
<

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
  • Had a meeting with Daniel and Chris where we discussed my first task:
    • First, I should make or beef up the existing test suite.
    • Then, I'm to make a vectorized version of the Smith-Waterman aligner that Daniel has working.
    • I should also familiarize myself with existing tools (e.g. Bowtie, BWA, Mosaik, and SamTools).
  • I ended up spending a lot of time reading through the various guides Daniel posted, as well as the Git guide and the Google C++ style guide to get myself more familiarized with what I should be doing and how I should be doing it.
  • I also spent some time reading up on the Smith-Waterman algorithm and reading up on sequence alignment in general to get myself a little more familiar with the project.
  • Checked out Daniel's existing code, especially local_aligner.cc bit (which I'm assuming my vectorized SW will be based a lot off of). I can understand the majority of the code, although I'm a little iffy on some of the gap penalty calculations.

So after all that, I think I'm finally ready to do a little coding. Today, I'll mostly be working on the test suite and trying to get that out of the way.

End-of-the-day post:
I've looked through the local aligner source code in a bit more detail and I think I know my way around it pretty well now. I'm still a little unclear on the free gap policies (I understand the premise and a little of how it works, but I still need to wrap my head around some of the cases in the algorithm), but Chris told me not to worry too much about it, as it's not a priority for now.

I've managed to modify Daniel's test/correctness_test.cc so that it works on the existing code, and all the test cases pass. The test cases Daniel wrote were pretty complete but only tested the ScoreOnlyAlignment method in the local aligner. So next up will be writing some test cases for the FullAlignment, which may be a little harder since it doesn't just output numbers. I've written a few tests so far, and I think I may have found one or two bugs (but I'm going to have to look into that some more to see if it was my test case or the code that was failing). So tomorrow, I will mostly be doing...

  • Finishing up the test cases for the FullAlignment method.
  • Fixing any bugs that might come up along the way.
  • If I get a chance...maybe even start looking into the vectorized methods!

05/07/10

Finished writing the test cases for the local aligner and corrected a very minor bug I found along the way. I mostly used Daniel's test cases for the ScoreOnlyAlignment method, but added the FullAlignment method to the tests. However, I didn't manage to get middle-of-sequence mismatch and gap tests working, since there are so many different ways the alignment can turn out, and it depends very heavily on the scoring method (and the values of the different penalties) as well.

Chris also gave me the source code for a vectorized Smith-Waterman function (from the SHRiMP library) and I spent the rest of the afternoon looking at that and trying to understand it. I found the documentation for the SSE2 functions on the Intel website (link to come later; I don't have it handy right now). From my understanding (and Chris' explanation), the function goes through the matrix using a diagonal vector ("positive slope") operating on 8 x 16-bit values, and calculates the highest score from that vector. It uses the affine gap penalty method to calculate scores. From my discussion with Chris, I should now be working on:

  • Integrating the vectorized Smith-Waterman function into the NGSA library (I can either try using the source code given or make my own).
  • Configuring CMake so that it detects support for SSE2 (and maybe SSE3) so that it will compile either the vectorized Smith-Waterman or non-vectorized Smith-Waterman, depending on the machine.
  • Take a look at the banded Smith-Waterman function and try integrating that, too, perhaps. This function only checks a "strip" of cells going down the diagonal of the score matrix, since in practice, alignments aren't as useful when there are large variations from a full match.
  • Finally, in the distant future, start thinking about the free gap algorithms and vectorizing them.

05/10/10

I spent the morning trying to integrate the SHRiMP vectorized Smith-Waterman code into a new local aligner class, but I couldn't get it to pass the test cases and kept getting segmentation faults, etc. So, I decided it was more trouble than it was worth and to try building my own version of the vectorized Smith-Waterman directly. I also had a lot of trouble getting CMake to do the right thing, but I later figured out it was because I forgot to use the -msse2 -msse3 compiler flags.

I'm about a quarter complete on a score-only vectorized version for the local aligner, and I've got a basic game plan laid out for completing it; I've also found that the best way to understand some code is to try to code it yourself! I now have a much firmer grasp on the whole SHRiMP aligner and on why it does some of the things it does (for example, one question I had was why it stored values for b_gap, but no a_gap array). I think by tomorrow, I can:

  • Get the score-only vectorized Smith-Waterman finished and hopefully tested out.
  • Time permitting, get started on the full alignment with traceback.
  • Read up on CMake a bit more; it's still giving me a strange amount of trouble...

05/11/10

Today felt pretty pointless. I managed to finish my score-only Smith-Waterman alignment, but it ended up failing a bunch of tests. So I took a look at the SHRiMP code again and made a few changes based on that code to iron out the bugs. In the end, my code ended up looking almost exactly like the SHRiMP code, and yet was still buggy, so I ended up just copying and pasting the code into my version. I decided to cut and paste the SHRiMP code into my own class for a couple of reasons:
  • I'll have to use that code again, but modified when I do the full alignment (with traceback).
  • The SHRiMP code required a call to initialization, which was troublesome and not really required given the way the NGSA LocalAligner works.
  • Since SHRiMP was in C, I couldn't wrap their functions in namespaces.
I'll still have to make sure if my decision was correct with Chris, though.

The good news is, I got all the tests to pass on the score-only alignment. Now all I have to do is get a full alignment working. I also want to try to optimize for the linear gap penalty method, since that case is much simpler than the affine gap penalty. Although the linear case is just affine with a gap open penalty set to 0, I think I can do away with some of the vectors in the calculation, and so make it a little faster. That'll be a low priority, though. So for tomorrow, I think I'll be able to finish:

  • The full Smith-Waterman alignment (vectorized)
  • Hopefully get that to pass all the tests (i.e. make the behaviour the same as the non-vectorized version)
  • Take a look at optimizing for linear gap penalty if I have time.

05/12/10

Chris told me that the full alignment was a lower priority, since the score-only alignment is used much more often. So, I've tried optimizing for linear gap penalty instead. I managed to improve performance by not doing a few calculations required in the affine and also improve the memory usage slightly (by not having to use one genome-length-size array).

I did a few preliminary tests, using a randomly generated genome/read sequences of length 1000 each. Over 1000 iterations, my results were:

  • 38.836576365 seconds using affine gap penalty
  • 28.907083597 seconds using linear gap penalty
So I get a roughly 25% decrease in calculation time. Tests using more/fewer iterations and using different genome/read lengths show more or less the same amount.

I also finished a very preliminary version of the banded algorithm (non-vectorized), though I haven't tested or even compiled it yet. It's basically just a recoding of the original Smith-Waterman aligner in that it goes through the matrix left-to-right and top-to-bottom, but instead of starting from x=0 and going to x=read length, it starts and ends based on boundaries defined by two diagonals going through the matrix. I'm not too happy about this arrangement right now, since if I were to do a full alignment, I would have to keep the whole matrix in memory, which is a waste since I'm only calculating a small subset of it. Another method I've been thinking of was to "flip" the matrix, so it's a diamond shape and start calculating in vertical lines. I think this would be the way to go for the vectorized version, but I have no idea how to do that right now!

I do remember doing a "flip" in CS320, but I think it was with a square, whereas this matrix could be rectangular (shouldn't be that much different). I should look up my old notes later on, too.

For tomorrow, I will:

  • Finish up my banded non-vectorized algorithm and test it
  • Try coming up with a vectorized version (may involve having to "flip" the matrix).

05/13/10

Had a discussion with Chris about implementing the banded local aligner. He told me that I have actually been doing my non-vectorized version right and I should go left-to-right, top-to-bottom through the array with artificial bounds instead of from 0 to max x. He also told me that I can pretty much assume the matrix will be square, since we will be passing output from the index into the aligner, and the index will find a reference that is roughly the same size as the read. So, that solves the slope issue. He also told me a general way to traverse the matrix with a vector: instead of going left-to-right, then starting at the beginning, the vector will follow a right-down-right-down pattern (so it will "bounce").

I came in today and realized my method of finding the bounding lines on the banded non-vectorized version was pretty complicated; instead of finding the equations for the two bounding lines given some distance of separation k from the diagonal (which involves finding the slope and y-intercept and has square roots in it), I now take the diagonal line and calculate all cells that are a horizontal distance k from it. This is much easier, and works out to pretty much the same thing. It has the added advantage of letting me take in a horizontal k value for my band radius, which lets me guarantee a correct score whenever the number of gaps is <= k.
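
To make the band bounds concrete, here's a simplified score-only banded Smith-Waterman with a linear gap penalty and placeholder scores (match +2, mismatch -1, gap -1); it's a sketch of the technique, not the NGSA implementation:
#include <algorithm>
#include <string>
#include <vector>

int BandedScore(const std::string &ref, const std::string &read, int k) {
  const int kMatch = 2, kMismatch = -1, kGap = -1;
  int n = static_cast<int>(ref.size()), m = static_cast<int>(read.size());
  std::vector<int> prev(m + 1, 0), curr(m + 1, 0);
  int best = 0;
  for (int y = 1; y <= n; ++y) {
    // Band: only visit columns within horizontal distance k of the diagonal x == y.
    int x_lo = std::max(1, y - k);
    int x_hi = std::min(m, y + k);
    std::fill(curr.begin(), curr.end(), 0);
    for (int x = x_lo; x <= x_hi; ++x) {
      int diag = prev[x - 1] + (ref[y - 1] == read[x - 1] ? kMatch : kMismatch);
      int up   = prev[x] + kGap;        // gap in the read
      int left = curr[x - 1] + kGap;    // gap in the reference
      curr[x] = std::max(0, std::max(diag, std::max(up, left)));
      best = std::max(best, curr[x]);
    }
    prev.swap(curr);
  }
  return best;
}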

I managed to finish up the non-vectorized version as well as make up a (I think) pretty comprehensive test suite for it. In the end, I passed all my own tests, though I did miss a "-1" in one line of code that had me banging my head for a while. I also managed to improve the memory efficiency of the score-only local alignment function in general based on a few tricks I learned while combing through the vectorized Smith-Waterman code, though I've only implemented them in my banded aligner for now. Improvements include:

  • Keeping only one read-sized row of scores instead of 2.
  • Keeping only one read-sized row to keep track of gaps in the x-axis instead of 2 rows.
  • Having just one int to keep track of the last match, instead of 2 read-sized rows.
  • Having just one int to keep track of the last gap in y, instead of 2 read-sized rows.

To do:

  • Start tackling the vectorized banded algorithm! That'll be pretty challenging and probably keep me busy all day.
  • (Much later) Add in my improvements to the original local aligner.
  • (Still much later) Polish up the different aligners and have CMake compile either the vectorized or non-vectorized versions of the aligners based on CPU support.

05/14/10

I finished the vectorized banded implementation with all test cases passing! I really thought it would take more time and I was surprised when all the test cases passed. The implementation works only with square matrices, and uses 16x 8-bit vectors, so it should be quite a speed-up from the regular banded implementation. The strange thing with 8-bit vectors is that the instruction set is missing a few things compared to 16-bit vectors. For example, there's no insert or extract function, so I ended up casting the __m128i 's into an int8_t array and inserting/extracting manually; I'm not really sure if that's a good idea, though. Also, there was no max function for signed integers, so I had to write my own for that as well.
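
One common way to emulate the missing signed 8-bit max with SSE2 only (not necessarily the workaround used here): compare, then blend with and/andnot.
#include <emmintrin.h>  // SSE2; _mm_max_epi8 itself only arrives with SSE4.1

static inline __m128i Max_epi8_sse2(__m128i a, __m128i b) {
  __m128i gt = _mm_cmpgt_epi8(a, b);                  // 0xFF in lanes where a > b
  return _mm_or_si128(_mm_and_si128(gt, a),           // keep a in those lanes
                      _mm_andnot_si128(gt, b));       // keep b everywhere else
}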

For Monday:

  • Edit the CMakeList.txt files so that it compiles vector/non-vector versions based whether SSE2 is available or not.
  • Polish up all the code I've written and merge.
  • Add in my improvements to the original local aligner.

05/17/10

I've integrated all the different aligners I've made together into one package. So now, there are two different local aligners, banded and un-banded, each of which have two different implementations, vectorized and non-vectorized. Which version of those two aligners are compiled depends on whether the target machine supports at least SSE2. I haven't really figured out how to get CMake to automatically detect whether SSE2 is supported or not, but instead I've added a variable to the CMakeLists.txt to be set manually.

I also cleaned up my implementations and optimized the existing local aligner implementation (only the ScoreOnlyAlignment method) to be more memory-efficient and faster. All public interfaces for both vectorized/non-vectorized versions of the aligners now match in both input and output and should be more-or-less equivalent in usage (with the exception of the free gap bits in the regular local aligner).

I'm not really sure what else to do for now, so I'll probably have to ask Chris/Daniel tomorrow.

05/18/10

Whoops, I realized I forgot to make an update yesterday. I merged all my changes into the main dev trunk and had a meeting with Anne and Daniel today. My next task is to profile the baligner using the Valgrind suite. I spent the rest of the day looking at how the baligner works and trying to get some working output out.

05/19/10

Did some performance tests on my local aligner implementations. Using my latest commit (5aff5ddf561ee005fb96fa01082479a53364d475), over 1,000 randomly generated sequence pairs of length 500:

Aligner | Avg Time (s) | Total Time (s)
Normal Local Aligner | 0.01465 | 14.65
Banded Local Aligner | 0.0009 | 0.9
Normal Vectorized Local | 0.00389 | 3.89
Banded Vectorized Local | 0.00029 | 0.29

Note these were not done on a Release build, because the Release build messes up the vectorized banded aligner for some reason (it outputs all 0s). I'll have to look into why a Release build gives wrong results while a regular build does not. I think it might be going wrong in my "custom" signed max function.

I also started profiling the baligner, after making sure all the output was correct. Since aligning the 125,000 reads Daniel usually uses would have been way too slow, I used the e_coli_1000_1.fq reads file as a test. I ran memcheck first and fixed all the memory leaks that arose. I also ran callgrind over baligner and found a significant bottleneck in the gap aligner, where a constant 256*256 matrix was being initialized and re-initialized over and over across all the reads, when it only needed to be initialized once (a generic sketch of that kind of fix follows the table below). After the fix, the speed increase was significant:

Test | Time (s)
Before fix, 1,000 reads | 20.26
Before fix, 125,000 reads | 401.71
After fix, 1,000 reads | 2.72
After fix, 125,000 reads | 71.98
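The gap-aligner code itself isn't shown here, but the fix follows the usual pattern of hoisting loop-invariant setup out of the per-read loop; a hypothetical sketch of the shape of the change (names and table contents are made up for illustration):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: the constant table is built once, outside the per-read
// loop, instead of being re-initialized for every read.
void AlignAll(const std::vector<std::string>& reads) {
    std::vector<int> table(256 * 256);
    for (int i = 0; i < 256; ++i)               // initialized exactly once
        for (int j = 0; j < 256; ++j)
            table[i * 256 + j] = (i == j ? 0 : 1);  // stand-in for the real constants

    for (size_t r = 0; r < reads.size(); ++r) {
        // ... per-read alignment work reads 'table' but never rebuilds it ...
        (void)reads[r];
    }
}
```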

For tomorrow:

  • Find out why the vectorized banded local aligner isn't working on a Release build and fix it.
  • Benchmark the local aligner using Release builds.
  • Do more profiling of the baligner.

Note to self: SSE 4.1 seems to have the _mm_max_epi8 and _mm_extract_epi8 functions that I needed! I might be able to fix the "Release" build error using that instruction set.
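A minimal sketch of how that could be guarded, assuming the compiler defines __SSE4_1__ when that instruction set is enabled (GCC does with -msse4.1); _mm_extract_epi8 would sit behind the same guard, and the SSE2 branch falls back to the hand-rolled max:

```cpp
#include <emmintrin.h>   // SSE2
#ifdef __SSE4_1__
#include <smmintrin.h>   // SSE4.1: _mm_max_epi8, _mm_extract_epi8
#endif

static inline __m128i max_epi8_portable(__m128i a, __m128i b) {
#ifdef __SSE4_1__
    return _mm_max_epi8(a, b);                   // native signed-byte max on SSE4.1
#else
    const __m128i m = _mm_cmpgt_epi8(a, b);      // SSE2 fallback: compare + blend
    return _mm_or_si128(_mm_and_si128(m, a), _mm_andnot_si128(m, b));
#endif
}
```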

05/20/10

I managed to fix my vectorized banded aligner for Release mode. It turns out I calculated the value in one of my helper functions but forgot to return it; I'm surprised it even worked in Debug mode! Anyway, here are the results for commit 5ab570185f6ba90e100f19ae10d938d7d813f08b, compiled in Release mode and doing 10,000 random pairs of sequences of length 500 each.

Aligner | Vectorized avg time (s) | Regular avg time (s) | Regular:Vectorized ratio
Normal Local Aligner | 0.000213 | 0.003323 | 15.6
Banded Local Aligner | 0.000023 | 0.00024 | 10.4

So in the end, there's quite a big difference between the regular and vectorized versions. It's interesting that the improvement in the normal vectorized aligner is greater than in the banded one, even though the banded aligner does twice as many calculations per vector. It's probably due to me having to code my own max function in the banded aligner.

Afternoon update:
I did some more optimization (it's strangely fun!) of the banded case and managed to speed it up considerably. Following Chris' suggestions, I changed everything over to unsigned 8-bit integers (uint8_t) instead of signed (I'd never noticed the unsigned saturated subtract in SSE2, so I didn't know I could do this). I also reworked a few other things, like my "_mm_insert_epi8" workaround, and made the horizontal max calculation a bit faster (also following Chris' suggestion; a sketch of that reduction follows the table). Times are now (Release builds, 10,000 runs of 500nt each):

Aligner | Vectorized avg time (s) | Regular avg time (s) | Regular:Vectorized ratio
Normal Local Aligner | 0.000215 | 0.003326 | 15.5
Banded Local Aligner | 0.000007 | 0.000241 | 34.4
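On the horizontal max: unlike the signed case, _mm_max_epu8 does exist in SSE2, so the max over 16 unsigned bytes can be done as a log2 reduction against shifted copies of the vector. A sketch of the idea, not necessarily exactly what was implemented:

```cpp
#include <emmintrin.h>
#include <stdint.h>

// Horizontal max of 16 unsigned bytes: repeatedly max the vector against a
// byte-shifted copy of itself until the maximum ends up in the lowest byte.
static inline uint8_t hmax_epu8(__m128i v) {
    v = _mm_max_epu8(v, _mm_srli_si128(v, 8));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 4));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 2));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 1));
    return static_cast<uint8_t>(_mm_cvtsi128_si32(v) & 0xFF);
}
```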

I realized an alignment size of 500nt isn't that realistic, since the vectorized banded version can only handle 255nt before it starts suffering from integer rollovers.

Anyways, I've also been profiling the baligner some more and trying to figure out why it's roughly 2x slower than the readaligner. I thought I had a breakthrough when I noticed baligner called Query::align() twice as many times as readaligner (for aligning reverse sequences), when Query already dealt with reverse alignments, but it turned out Daniel had already accounted for that. So, I'm pretty much back to square one!

To do:

  • More profiling to figure out what's wrong; maybe play with report, k, and iteration limit values.
  • Try making CMake compile different implementations of the local aligner based on CPU features.

05/21/10

Okay, I've been staring at the computer screen for way too long. I tried to figure out what's different between our baligner and the readaligner that makes readaligner so much faster, but I just can't seem to find it. Here are some of my findings:
  • On 125,000 reads, readaligner runs in about 9-10 seconds, while baligner runs in between 15-16 seconds.
  • If I comment out the lines that actually do the alignment (the call to Query::align()), readaligner takes <1 second, while baligner takes about 3 seconds. This means baligner has a lot more overhead when starting and finishing (i.e. not actually "aligning"), but this still doesn't account for the 6-7 second difference.
  • On 1000 reads, baligner and readaligner call Query::align() the same amount of times (if I use the Query class's own reverse functionality and comment out Daniel's). However, it seems calls after that start to diverge, but only slightly. For example, the call after Query::align(), MismatchQuery::firstStep(), gets called 3634 times on baligner, but only 3630 times on readaligner. It looks like this small difference is propagated downward through the calls, and ends in a difference of about 1000 more calls in baligner in a low-level function. This small difference in the beginning might be the source of the problem, since the increase in function calls could potentially get huge in 125,000 reads. However, I haven't profiled that yet, since it would take so long. Perhaps this is a good next step?
  • baligner is much more complicated than readaligner in terms of class structure, so this might be giving us some overhead. Also, I've noticed vectors are used a lot when passing information around; this might be a source of overhead, especially since the vectors seem to be returned by value instead of via a pointer or reference. This could be a future point of optimization, but I doubt it would save very much time.

I think the most promising area to look into right now is still the Query::align() area, and why it's calling MismatchQuery::firstStep() slightly more than the readaligner. I think I'm going to run a profile comparing 125,000 reads on both aligners, just to confirm if it is that portion of the code that's giving us the problem. Also, it might just be a bunch of small optimization problems and overhead from using vectors, etc. that could be resulting in the performance loss, although I kind of doubt it.

Todo:

  • More profiling!

05/25/10

I did another profile on the baligner (now called saligner), and it looks like Query::align() is called the same number of times now. However, MismatchQuery::firstStep(), and all subsequent functions, weren't called the same number of times, which was strange, since its only caller is Query::align(). After a lot more investigation, I finally found the problem:

In readaligner, they instantiate a single Pattern object for each read. In our saligner, we instantiate one Pattern per read per max-mismatch value. When the Pattern is checked, it's first checked "as-is" and then reversed and checked again. Since readaligner's Pattern object is kept between k values, it checks the forward and reverse strands in a different order than saligner does. If a match is found, the alignment function terminates.

Unfortunately, this "discrepancy" isn't really a slowdown, since it's dependent on the given read set. For example, if all reads given match on the forward strand, then our saligner would be "faster". But, since I've finally found the error (or lack thereof), we can finally start to optimize.

To do:

  • Optimize!
  • Read up on policy classes (Chris mentioned this briefly but said it was a low priority, so I just wanted to jot it down before I forgot about it).

05/26/10

Spent a lot of time just thinking... Daniel and I were revisiting the class design for the aligner, and we encountered some design issues with the Mapper and Index classes. Specifically, at this point, the Mapper classes feel too dependent on the specific Index implementation, a problem arising from using external libraries (readaligner). For example, I wanted to move the NGS_FMIndex::LocateWithMismatches and NGS_FMIndex::LocateWithGaps methods into their respective Mapper classes, but I found it troublesome, since the LocateWithX methods wrap the Query::align() method. The Query class is a readaligner class and must take in TextCollection objects to be instantiated, and those objects are produced only by readaligner's FM index, so if the Mappers were to instantiate their own Query classes, they would essentially be tied to that one FM index implementation.

Daniel and I have also decided to put an option in the Mapper classes for the user to choose the maximum number of results they want. This is a minor optimization that would allow us to terminate the alignment function once a set amount of results have been reached, rather than going through with the whole alignment. This might also allow us to use fixed-length arrays instead of vectors (or at least reserve the space in the vector beforehand to prevent reallocs).

Another area I investigated today was the cost of abstraction in our Index classes. I tried unabstracting NGS_FMIndex and comparing runtimes. On 125k reads, I get about 9.8-10.1 seconds for both implementations, so the abstraction cost isn't really an issue there. I just realized that the Query objects are also abstracted, and those methods are called much more often, so that might also be a good area to investigate.

Finally, I just realized that strings can reserve space, too. I think this might be a good area to improve on, as I'm seeing a lot of calls to malloc from both string and vector. Considering both of these could get quite large in some cases, reserving memory beforehand may be a good idea so we don't get frequent reallocs.
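As a reminder of the idea (the actual call sites in the IO code aren't shown here, and the function below is made up for illustration), reserving up front turns many reallocations into one:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: reserve capacity before a growth loop so that
// push_back/append don't trigger repeated reallocs as the containers grow.
void CollectSequence(size_t expected_length, std::string* seq,
                     std::vector<int>* positions) {
    seq->reserve(expected_length);         // one allocation instead of many
    positions->reserve(expected_length);
    for (size_t i = 0; i < expected_length; ++i) {
        seq->push_back('A');               // stand-in for the real per-base work
        positions->push_back(static_cast<int>(i));
    }
}
```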

To do:

  • Discuss class design with Daniel and come up with a finalized design we're both happy with. This may or may not mean changing the current design (which is good, except for the bit with Index and Mapper).
  • Look into unabstracting the Query class and see how that fares. If it's a big improvement, look into policy classes.
  • Add an option to limit number of results.
  • Move instantiation of Query objects into the Mappers (if Daniel agrees).

05/27/10

Moved NGS_FMIndex::Locate and NGS_FMIndex::LocateWithMismatches into the ExactMapper and MismatchMapper classes respectively. I also "added" multimap functionality to both of these classes. I put quotes around that because it was already there to begin with (especially in MismatchMapper), just with slightly different behaviour.

In ExactMapper, I didn't really get a speed improvement. This was pretty much expected, because the alignment function it uses is part of TextCollection (as opposed to the other two mappers, which use the Query class). That method, TextCollection::locate, searches the entire index for all occurrences of the sequence rather than just the first x occurrences. So adding a max results parameter pretty much just prevents overflow matches from being outputted, which wasn't that slow to begin with.

MismatchMapper gave me a better result, though the news is bittersweet. With multimaps turned off (max_results set to 1), I get about a 10-20% speed improvement over the old mapper (all numbers are preliminary; I haven't actually tested on Skorpios yet). However, with any sort of multimapping, even with max_results = 2, I get a pretty significant slowdown relative to the old mapper (set to the same max_results value). This isn't really a regression, though; the old MismatchMapper code actually had different behaviour for multimaps. In cases where alignments were found at a lower mismatch value, the old mapper would not continue searching at higher mismatch values, even if there were fewer alignments than max_results. The new MismatchMapper actually continues searching until it has reached its maximum mismatch value or has found the maximum number of results. This behaviour is more in line with readaligner, although I'm considering pushing it back to the old behaviour because it makes a little more sense to me (and is much faster at high max_results). I should discuss this with Daniel/Chris.

While testing the new MismatchMapper, I noticed something interesting: if I increase the max_results value (it's called report in readaligner) for both saligner and readaligner, I get about the same 2-4 second difference in runtime, no matter what value I use. Since increasing max_results effectively increases the number of alignment steps, the fact that the time difference is pretty much constant supports my previous conjecture that it's the pre/post-processing (parsing of FASTQ/FASTA files, outputting to SAM format, etc.) of our aligner that's leading to the time difference. So, again, that should be an area worth investigating.

To do:

  • Discuss class design with Daniel and come up with a finalized design we're both happy with. This may or may not mean changing the current design, which is good except for the bit with Index and Mapper (haven't really done this today...).
  • Look into unabstracting the Query class and see how that fares. If it's a big improvement, look into policy classes (also hasn't been visited at all)
  • Add a max_results option to the GapMapper and move NGS_FMIndex::LocateWithGaps into GapMapper::Map.
  • Find more areas where reserving space for strings and vectors would be useful.
  • Benchmark the new implementations vs the old on Skorpios.
  • Find other areas to optimize.

05/28/10

Did some more cleanup today. I finished changing the GapMapper over to the new method (see yesterday's post). One point of concern: when I first tested out gapped alignments, aligning 125,000 reads took about 71 seconds, but today the aligner takes 129.64 seconds for 125,000 reads, even with k=0 (which is pretty much "exact matching"). It's probably nothing, though, since my previous encounters with the gap mappers were long ago, before the huge refactoring job Daniel did; the test was most likely under different parameters.

Preliminary testing on my own machine shows that saligner, using the GapMapper, is slightly faster than the old version (about 5-10% faster). Further tests need to be done on Skorpios to confirm. Another interesting point is that the GapMapper actually uses the LocalAligner to output to SAM format. However, it uses LocalAligner::FullAlignment functions to do so, which have not been optimized at all. I can also see GapMapper benefiting from using the BandedLocalAligner instead, since gaps tend to be small (BandedLocalAligner supports up to 15 gaps). It might also be worth vectorizing the FullAlignment function; currently, only ScoreOnlyAlignment is vectorized.

I also came across an interesting result today. I noticed that at k=0, GapMapper, MismatchMapper and ExactMapper all produce the same output (no difference whatsoever using diff), even though ExactMapper is faster than the other two. In fact, ExactMapper is something like 16x faster than GapMapper at k=0 on my machine! So, I think an interesting optimization would be to have ExactMapper do an initial alignment and use the more complex mappers only after it has come back negative. This is in line with existing behaviour anyway (both GapMapper and MismatchMapper try the lowest k value, k=0, first before moving on to progressively higher k values). With data that is mostly exact, this could lead to a serious improvement, since most reads wouldn't even need to touch the more complex alignment functions. This does raise the question of whether the more complex mappers should start at k=0 at all, since in most cases ExactMapper will be there to cover it. I'm not sure whether I should have Gap/MismatchMapper call ExactMapper directly within their Map functions, or whether I should let the user decide to call ExactMapper first. I'm currently leaning towards the latter option, since, as implied by their names, Mismatch/GapMapper are there to find alignments with gaps or mismatches, not exact matches. So, I'm proposing to start both mappers at k=1 instead of k=0, and let ExactMapper handle the k=0 case (but the user has to do that explicitly). Finally, one more point to note is that even if those mappers start at k=1, they still find exact results, which leads me to think that Query::align checks k=0 as well as k=1 when I call it with k=1 (and so on for higher k). This might be an area of optimization worth looking into, since there would be a lot of overlap as we go higher in k; however, it might not actually be possible to optimize, since I'm not familiar enough with the Query::align functions to know.
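A rough sketch of the proposed flow, with the caller running ExactMapper first and falling back to the k >= 1 mapper only on failure. The result type and Map signatures here are placeholders for illustration, not the real Mapper interfaces:

```cpp
#include <string>
#include <vector>

// Illustration only: result type and mapper shells are stand-ins.
struct Hit { long position; int mismatches; };

struct ExactMapperSketch {
    bool Map(const std::string& read, std::vector<Hit>* hits) {
        (void)read; (void)hits;
        return false;  // placeholder: the real mapper would query the index at k = 0
    }
};

struct MismatchMapperSketch {
    bool Map(const std::string& read, std::vector<Hit>* hits) {
        (void)read; (void)hits;
        return false;  // placeholder: the real mapper would search k = 1 and up
    }
};

// Proposed flow: cheap exact pass first, expensive mismatch pass only if it fails.
bool MapRead(const std::string& read, ExactMapperSketch& exact,
             MismatchMapperSketch& mismatch, std::vector<Hit>* hits) {
    if (exact.Map(read, hits)) return true;
    return mismatch.Map(read, hits);
}
```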

Finally, Daniel and I finished our discussion on class hierarchy and I think we're both pretty happy with the design. I've sent an email off to Chris for the finalized version (available on the discussion page of the Twiki).

To do:

  • Add in a new mapper, SequentialMapper, which allows the user to pass in two mappers. SequentialMapper will then map using the first mapper, and use the second mapper only if the first mapper fails. This should introduce more flexibility in the aligner.
  • Look into unabstracting the Query class and seeing how it fares (still haven't touched this yet).
  • Benchmark on Skorpios
  • Find more areas where reserving space will benefit (haven't touched this either)
  • Make a vectorized LocalAligner::FullAlignment function.
  • Make a BandedLocalAligner::FullAlignment function (both vectorized and non-vectorized).

05/31/10

Merged in my changes and did a few benchmarks on Skorpios. I'm not really expecting much of a speed increase, since most of the changes were just refactoring the code and making the classes more cohesive. Here are my results with commit 6e1407b164c7dacb84d857ad57c24ba10eee5f65 (new) compared to commit 5b2b316746ca115f88cc315a21b298c6d3cbedcd (old):

Alignment Type | k | max_results | Old time/read (s) | New time/read (s) | Old:New ratio
Exact | - | 1 | 4.384e-05 | 3.84267e-05 | 1.14
Exact | - | 10 | 4.37333e-05 | 4.088e-05 | 1.07
Mismatch | 2 | 1 | 7.56267e-05 | 6.68267e-05 | 1.13
Mismatch | 2 | 10 | 8.016e-05 | 7.528e-05 | 1.06
Gap | 2 | 1 | 8.8936e-04 | 8.79333e-04 | 1.01
Gap | 2 | 10 | 9.348e-04 | 9.36373e-04 | 0.998

Tests were done using the provided E. coli genome (NC_010473.fasta) on 375,000 reads (e_coli_125000_1.fq copied 3 times). Tested using the single-end aligner (saligner) release build, times taken from saligner output.

So there were slight speed improvements in the exact and mismatch mappers and a negligible speed difference in the gap mapper. The speed improvements are probably from some minor fixes I made during the cleanup, but I haven't really touched the readaligner Query classes, which take up much more time in GapMapper than in the other two mappers. Also, the more calls to the readaligner align functions in the Exact/Mismatch mappers, the less speed improvement we see.

I spent a while verifying the output between the two versions match, and for the most part they do (tested Exact, Mismatch and Gap mappers at max_results = 1, and max_results = 10 at k = 2).

Finally, I revisited the LocalAligner and tried to make a more efficient FullAlignment function by using arrays instead of vectors and speeding up the recursive Traceback function (it had a bunch of strings being passed by value). I ended up failing, getting segmentation faults everywhere, but I think that was just because I got my columns and rows mixed up in my matrices. Anyways, more of that tomorrow, when I have a fresher mind.
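For next time, the usual way to avoid the by-value string copies in a recursive traceback is to pass the growing strings by reference and append to them. A tiny sketch of that pattern, with made-up names rather than the actual Traceback code:

```cpp
#include <string>

// Hypothetical traceback skeleton: the aligned strings are passed by reference
// and appended to, instead of being copied at every recursive call.
void TracebackSketch(int i, int j, std::string& aligned_ref, std::string& aligned_read) {
    if (i == 0 || j == 0) return;    // reached the matrix border
    // ... inspect the DP matrix here to choose diagonal / up / left ...
    aligned_ref.push_back('A');      // stand-in for the emitted reference character
    aligned_read.push_back('A');     // stand-in for the emitted read character
    TracebackSketch(i - 1, j - 1, aligned_ref, aligned_read);
}
```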

To do:

  • I'm going to table the SequentialMapper thing for now, since it's not very high priority. Chris also thought it might be a good idea to look into why MismatchMapper and GapMapper are so slow at k=0, instead of just replacing them with ExactMapper (ideally, ExactMapper shouldn't even be needed).
  • Unabstracting Query - it's not very necessary on second look, since I usually use the specific Query I need instead of the general Query object anyway.
  • Find more areas where reserving space will benefit (still not touched)
  • Improve the AlignerInterface::FullAlignment functions.
  • Benchmark BWA on skorpios.

06/01/10

Here are a few more times, this time with all bin and data files in the /tmp/ folder, with times calculated from /usr/bin/time.

Using commit 5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa and saligner, with the 125,000 readset (real time):

Type | k | max_results | Unmapped (+1662 invalid) | Search Time (s) | Avg time/read (s)
Exact | - | 1 | 17732 | 4.727 | 0.000037816
Exact | - | 10 | 17732 | 5.01 | 0.0004008
Mismatch | 2 | 1 | 5466 | 8.251 | 0.000066008
Mismatch | 2 | 10 | 5466 | 9.438 | 0.000075504

Using BWA with flags -l 99 -k 2, outputting to SAM:

Type | k | max_results | Unmapped | Search Time (s) | Avg time/read (s)
Mismatch, I guess | 2? | ? | ? | 2.832 | 0.000022656

Using readaligner with flags -k 2 --sam --fastq:

Type | k | max_results | Unmapped | Search Time (s) | Avg time/read (s)
Mismatch | 2 | 1 | ? | 6.662 | 0.000053296

Bowtie with version info:

64-bit
Built on sycamore.umiacs.umd.edu
Sat Apr 24 15:56:29 EDT 2010
Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
Options: -O3 -m64 -Wl,--hash-style=both
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

Mode | Flags | Mapped | Unmapped | Search Time (s) | Average Search Time (s)
0-Mismatch | -v 0 | 105606 | 19394 | 1.310 | 0.00001048
1-Mismatch | -v 1 | 115631 | 9369 | 1.850 | 0.0000148
2/3-Mismatch | -v 2 | 118356 | 6644 | 1.878 | 0.000015024
2/3-Mismatch | -v 3 | 119430 | 5570 | 3.441 | 0.000027528
Seeded Quality | -n 0 | 112317 | 12683 | 1.478 | 0.000011824
Seeded Quality | -n 1 | 117659 | 7341 | 1.679 | 0.000013432
Seeded Quality | -n 2 | 118356 | 6644 | 1.974 | 0.000015792
Seeded Quality | -n 3 | 118330 | 6670 | 2.560 | 0.00002048

Also had a meeting with Chris today, where he clarified some of his previous communications. Our first priority should now be to compare Bowtie's exact matching to saligner or readaligner exact matching. Daniel took on the job of writing a better locate function for exact matches. I will be looking for more areas to optimize, and will also fix some bugs and write a better saligner wrapper that actually takes in arguments. Also, since mismatches in CIGARs are recorded no differently from matches, both mismatch and exact alignments can have their own predefined CIGAR (e.g. 35M), instead of having to compare reads to references to generate our own CIGAR.

To do:

  • In Exact and MismatchMappers, just make a CIGAR on the fly instead of passing stuff to SamEntry.
  • More low-level optimizations.
  • Make a better saligner wrapper.
  • Fix a bug in CIGAR generation (ticket 53df76).

06/02/10

I did a bunch of minor changes to the IO files to optimize them a little further. Most of the changes were just reserving space for strings/vectors beforehand, fixing a few unneeded lines, etc. I also looked into ticket 53df76 (buggy CIGAR output on indels), and found that it's not really a bug with the CIGAR generator. The bug actually comes from using the LocalAligner after a gap alignment to generate the "dashes" that the CIGAR generator needs. However, the behaviour for LocalAligner is unpredictable when used in this context; sometimes it will clip the start or end of the read and throw in mismatches instead of just inserting the gaps (which is actually what it's supposed to do, given it's local). So, there actually aren't bugs in either the CIGAR generator or the LocalAligner. Unfortunately, we can't really fix the bug given the current situation, because there's no other way to put dashes into the aligned ref/reads. The only solution is to either figure out how IndelQuery does its CIGAR generation or write our own IndelQuery (which we might just do anyway). Since the indel support is lower priority, I'm going to table this bug for later.

Another optimization I did today was make a convenience method in SamEntry to generate an entry representing an exact match (as per Chris' suggestion yesterday). Because exact matches are so predictable, it saves us having to generate the CIGAR, and a bunch of the tags. We also don't need to extract the original reference from the index, since it's not really needed. Mismatches should also be similar, since mismatches aren't represented differently from matches in CIGARs. However, we do have an MD:Z tag that supports mismatches (like BWA), but I'm not sure whether this is needed at all.
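Since a full-length exact (or mismatch-only) alignment covers the whole read with no indels, the CIGAR can be written straight from the read length; a tiny sketch of that idea, with a made-up helper name rather than the real SamEntry interface:

```cpp
#include <string>

// Hypothetical helper: an exact or mismatch-only full-length alignment of an
// n-base read always has the CIGAR "<n>M", e.g. "35M" for a 35-base read.
std::string ExactMatchCigar(const std::string& read) {
    return std::to_string(read.size()) + "M";
}
```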

Finally, I added support for command-line arguments in saligner, so we can specify the reference, reads, and output for convenience and don't have to recompile every time we need to use a different read file.

Anyways, I decided to run some tests on my newest changes using the 2622382 reads file just for completeness and for more accuracy, since I didn't do any tests on that file yesterday, either.

Under commit c6f55e33aa732c8952d1f56fa4c1fe6aa3875677, with the 2822382 reads file and Release build:

Type | k | max_results | Unmapped (+20957 invalid) | Search Time (s) | Avg time/read (s)
Exact | - | 1 | 350717 | 86.634 | 0.000033036
Exact | - | 10 | 350717 | 95.99 | 0.000036604

And with Bowtie (same version info as above):

Mode | Flags | Mapped | Unmapped | Search Time (s) | Average Search Time (s)
0-Mismatch | -v 0 | 2250708 | 371674 | 27.341 | 0.000010426

To do:

  • Do some profiling again...I haven't done that in a while and there's been a lot of changes to the code.
  • Start reading up on FM index.

Revision 252010-06-01 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 286 to 286
 
Seeded Quality -n 2 118356 6644 1.974 0.000015792
Seeded Quality -n 3 118330 6670 2.560 0.00002048
Changed:
<
<
Note: I'm not really sure BWA is actually working right, because the output file looks way too small. More on that later.
>
>
Also had a meeting with Chris today, where he clarified some of his previous communications. Our first priority should now be to compare Bowtie exact matching functions to saligner or readaligner exact matching. Daniel took on a job of writing a better locate function for exact matches. I will be looking for more areas to optimize and also fix some bugs and write a better saligner wrapper that actually takes in arguments. Also, since mismatches in cigars are recorded no differently from matches, both mismatch and exact can have their own defined cigar (e.g. 35M), instead of having to compare reads to references to generate our own cigar.

To do:

  • In Exact and MismatchMappers, just make a cigar on the fly instead of passing stuff to SamEntry.
  • More low level optimizations.
  • Make a better saligner wrapper.
  • Fix a bug in cigar generation (ticket 53df76)
 

Revision 242010-06-01 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 247 to 247
 
  • Improve the AlignerInterface::FullAlignment functions.
  • Benchmark BWA on skorpios.
Added:
>
>

06/01/10

Here are a few more times, this time with all bin and data files on the /tmp/ folder with times calculated from /usr/bin/time.

Using commit 5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa and saligner, with the 125000 readset (real time)

Type k max_results Unmapped (+1662 invalid) Search Time (s) Avg time/read (s)
Exact - 1 17732 4.727 0.000037816
Exact - 10 17732 5.01 0.0004008
Mismatch 2 1 5466 8.251 0.000066008
Mismatch 2 10 5466 9.438 0.000075504

Using BWA with flags -l 99 -k 2 outputting to SAM:

Type k max_results Unmapped Search Time (s) Avg time/read (s)
Mismatch, I guess 2? ? ? 2.832 0.000022656

Using readaligner with flags -k 2 --sam --fastq:

Type k max_results Unmapped Search Time (s) Avg time/read (s)
Mismatch 2 1 ? 6.662 0.000053296

Bowtie with version info:

64-bit Built on sycamore.umiacs.umd.edu Sat Apr 24 15:56:29 EDT 2010 Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-46) Options: -O3 -m64  -Wl,--hash-style=both Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

Mode Flags Mapped Unmapped Search Time (s) Average Search Time (s)
0-Mismatch -v 0 105606 19394 1.310 0.00001048
1-Mismatch -v 1 115631 9369 1.850 0.0000148
2/3-Mismatch -v 2 118356 6644 1.878 0.000015024
2/3 Mismatch -v 3 119430 5570 3.441 0.000027528
Seeded Quality -n 0 112317 12683 1.478 0.000011824
Seeded Quality -n 1 117659 7341 1.679 0.000013432
Seeded Quality -n 2 118356 6644 1.974 0.000015792
Seeded Quality -n 3 118330 6670 2.560 0.00002048

Note: I'm not really sure BWA is actually working right, because the output file looks way too small. More on that later.

Revision 232010-06-01 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 236 to 236
  So there were slight speed improvements to exact and mismatch mappers and a negligible speed different in the gap mapper. The speed improvements are probably from some minor fixes I made during the cleanup, but I haven't really touched the readaligner Query classes, which takes up much more time in GapMapper than the other two mappers. Also, the more calls to the readaligner align functions in Exact/Mismatch mappers, the less speed improvement we see.
Added:
>
>
I spent a while verifying the output between the two versions match, and for the most part they do (tested Exact, Mismatch and Gap mappers at max_results = 1, and max_results = 10 at k = 2).

Finally, I revisited the LocalAligner and tried to make a more efficient FullAlignment function by using arrays instead of vectors and speeding up the recursive Traceback function (it had a bunch of string-passing by value). I ended up failing, getting segmentation faults everywhere, but I think it was just because of me getting my columns and rows mixed up in my matrices. Anyways, more of that tomorrow, when I have a fresher mind.

To do:

  • I'm going to table the SequentialMapper thing for now, since it's not very high priority. Chris also thought it might be a good idea to look into why MismatchMapper and GapMapper are so slow at k=0, instead of just replacing them with ExactMapper (ideally, ExactMapper shouldn't even be needed).
  • Unabstracting Query - it's not very necessary on second look, since I usually use the specific Query I need instead of the general Query object anyway.
  • Find more areas where reserving space will benefit (still not touched)
  • Improve the AlignerInterface::FullAlignment functions.
  • Benchmark BWA on skorpios.

Revision 222010-05-31 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 221 to 221
 
  • Make a vectorized LocalAligner::FullAlignment function.
  • Make a BandedLocalAligner::FullAlignment function (both vectorized and non-vectorized).
Added:
>
>

05/31/10

Merged in my changes and did a few benchmarks on Skorpios. I'm not really expecting much of a speed increase, since most of the changes were just refactoring the code and making the classes more cohesive. Here are my results with commit 6e1407b164c7dacb84d857ad57c24ba10eee5f65 (new) compared to commit 5b2b316746ca115f88cc315a21b298c6d3cbedcd (old):

Alignment Type k max_results Old time/read (s) New time/read (s) Old:New ratio
Exact - 1 4.384e-05 3.84267e-05 1.14
Exact - 10 4.37333e-05 4.088e-05 1.07
Mismatch 2 1 7.56267e-05 6.68267e-05 1.13
Mismatch 2 10 8.016e-05 7.528e-05 1.06
Gap 2 1 8.8936e-04 8.79333e-04 1.01
Gap 2 10 9.348e-04 9.36373e-04 0.998

Tests were done using the provided E. coli genome (NC_010473.fasta) on 375,000 reads (e_coli_125000_1.fq copied 3 times). Tested using the single-end aligner (saligner) release build, times taken from saligner output.

So there were slight speed improvements to exact and mismatch mappers and a negligible speed different in the gap mapper. The speed improvements are probably from some minor fixes I made during the cleanup, but I haven't really touched the readaligner Query classes, which takes up much more time in GapMapper than the other two mappers. Also, the more calls to the readaligner align functions in Exact/Mismatch mappers, the less speed improvement we see.

Revision 212010-05-29 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 204 to 204
 
  • Benchmark the new implementations vs the old on Skorpios.
  • Find other areas to optimize.
Added:
>
>

05/28/10

Did some more cleanup today. I finished changing the GapMapper over to the new method (see yesterday's post). One point of concern: I saw when I first tested out gapped alignments that aligning 125,000 reads took about 71 seconds. However, I was testing today and the aligner now takes 129.64 seconds to align 125,000 reads, even with k=0 (which is pretty much "exact matching"). It's probably nothing, though, since my previous encounters with the gap mappers were long ago, before the huge refactoring job Daniel did. The test was most likely under different parameters.

Preliminary testing on my own machine shows that saligner, using the GapMapper, is slightly faster than the old version (about 5-10% faster). Further tests need to be done on Skorpios to confirm. Another interesting point is that the GapMapper actually uses the LocalAligner to output to SAM format. However, it uses LocalAligner::FullAlignment functions to do so, which have not been optimized at all. I can also see GapMapper benefiting from using the BandedLocalAligner instead, since gaps tend to be small (BandedLocalAligner supports up to 15 gaps). It might also be worth vectorizing the FullAlignment function; currently, only ScoreOnlyAlignment is vectorized.

I also came across an interesting result, today. I noticed that at k=0, GapMapper, MismatchMapper and ExactMapper all have the same output (no difference whatsoever using diff), even though ExactMapper is faster than the other two. In fact, ExactMapper is something like 16x faster than GapMapper at k=0 on my machine! So, I think an interesting optimization would be to have ExactMapper do an initial alignment and using the more complex mappers only after it has returned negative. This follows through with existing behaviour anyway (both GapMapper and MismatchMapper will try the lowest k value, k=0, first before moving on to progressively higher k values). With data that is mostly exact, this could lead to a serious improvement, since most reads won't even need to touch the more complex alignment functions. This does lead to a question of whether or not the more complex mappers should even start at k=0 at all, since in most cases, ExactMapper will be there to cover it. The question is, I'm not sure if I should have Gap/MismatchMapper call ExactMapper directly within their Map functions, or whether I should let the user decide to call ExactMapper first. I'm currently leaning towards the latter option, since, as implied by their names, Mismatch/GapMapper are there to find alignments with gaps or mismatches, not to find alignments with exact matches. So, I'm proposing to start both mappers at k=1 instead of k=0, and let ExactMapper handle the k=0 condition (but the user has to do that explicitly). Finally, one more point to note is that even if those mappers start at k=1, they still find exact results, which leads me to think that Query::align checks for k=0 and k=1 when I call it for k=1 (and etc. for higher k). This might be an area of optimization worth looking into, since there would be a lot of overlap as we go higher in k value. However, it might not actually be possible to optimize this; I'm not familiar enough with the Query::align functions to know.

Finally, Daniel and I finished our discussion on class hierarchy and I think we're both pretty happy with the design. I've sent an email off to Chris for the finalized version (available on the discussion page of the Twiki).

To do:

  • Add in a new mapper, SequentialMapper, which allows the user to pass in two mappers. SequentialMapper will then map using the first mapper, and use the second mapper only if the first mapper fails. This should introduce more flexibility in the aligner.
  • Look into unabstracting the Query class and seeing how it fares (still haven't touched this yet).
  • Benchmark on Skorpios
  • Find more areas where reserving space will benefit (haven't touched this either)
  • Make a vectorized LocalAligner::FullAlignment function.
  • Make a BandedLocalAligner::FullAlignment function (both vectorized and non-vectorized).

Revision 202010-05-28 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 187 to 187
 
  • Add an option to limit number of results.
  • Move instantiation of Query objects into the Mapper 's (if Daniel agrees)
Added:
>
>

05/27/10

Moved the NGS_FMIndex::Locate and NGS_FMIndex::LocateWithMismatches into the ExactMapper and MismatchMapper class respectively. I also "added" in multimap functionality into both these classes. I put quotes around it because it was already there to begin with (especially in MismatchMapper), but it was slightly different behaviour.

In ExactMapper, I didn't really get a speed improvement. This was pretty much expected because the alignment function it uses is part of TextCollection (as opposed to the other two mappers, which use the Query class). The method, TextCollection::locate, pretty much searches the entire index for all occurrences of the sequence, not for the first x occurences. So, adding a max results parameter pretty much just prevents overflow matches from being outputted, which wasn't that slow to begin with.

MismatchMapper gave me a better result, though the news is bittersweet. With multimaps turned off (setting max_results to 1), I get about a 10-20% speed improvement over the old mapper (all numbers are preliminary, I haven't actually tested on Skorpios yet). However, with any sort of multimapping, even with max_results = 2, I get pretty significant slowdown WRT the old mapper (set to the same max_results value). This isn't really a regression, though; the old MismatchMapper code actually had different behaviour for multimaps. In cases where alignments are found at a lower mismatch value, the old mapper would not continue searching at higher mismatch values, even if there are fewer alignments than max_results. The new MismatchMapper actually continues searching until it's reached its maximum mismatch value or it's found the maximum number of results. This behaviour is more in line with readaligner, although I'm considering pushing it back to the old behaviour because it makes a little more sense to me (and is much faster at high max_results). I should discuss with Daniel/Chris about this.

While testing the new MismatchMapper, I noticed something interesting: if I increase the max_results value (it's report in readaligner) for both saligner and readaligner, I get about the same 2-4 second difference in runtime, no matter what value I use. Since increasing max_results effectively increases the number of alignment steps, the fact that the time difference is pretty much constant supports my previous conjecture that it's the pre/post-processing (parsing of FASTQ/FASTA files, outputting to SAM format, etc) of our aligner that's leading to the time difference. So, again, that should be an area worth investigating.

To do:

  • Discuss class design with Daniel and come up with a finalized design we're both happy with. This may or may not mean changing the current design (which is good, except for the bit with Index and Mapper (haven't really done it today...)
  • Look into unabstracting the Query class and see how that fares. If it's a big improvement, look into policy classes (also hasn't been visited at all)
  • Add a max_results option to the GapMapper and move NGS_FMIndex::LocateWithGaps into GapMapper::Map.
  • Find more areas where reserving space for strings and vectors would be useful.
  • Benchmark the new implementations vs the old on Skorpios.
  • Find other areas to optimize.

Revision 192010-05-27 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 172 to 172
 
  • Optimize!
  • Read up on policy classes (Chris mentioned this briefly but said it was a low priority, so I just wanted to jot it down before I forgot about it).
Added:
>
>

05/26/10

Spent a lot of time just thinking...Daniel and I were revisiting the class design for the aligner, and we encountered some design issues with the Mapper and Index classes. Specifically, at this point, the Mapper classes feel too dependent on the specific Index implementation, a problem arising from using external libraries (readaligner). For example, I wanted to move the NGS_FMIndex::LocateWithMismatches and NGS_FMIndex::LocateWithGaps methods into their respective Mapper classes, but I found it troublesome, since the LocateWithX methods encapsulate the Query::align() method. The Query class is a readaligner class and must take in TextCollection objects to be instantiated. Those objects are outputted from the FM Index used in readaligner only, so if the Mappers were to instantiate their own Query classes, they would essentially have to be used with only the FM Index implementation.

Daniel and I have also decided to put an option in the Mapper classes for the user to choose the maximum number of results they want. This is a minor optimization that would allow us to terminate the alignment function once a set amount of results have been reached, rather than going through with the whole alignment. This might also allow us to use fixed-length arrays instead of vectors (or at least reserve the space in the vector beforehand to prevent reallocs).

Another area I've investigated today was the cost of abstraction in our Index classes. I tried unabstracting the NGS_FMIndex and comparing runtimes. On 125k reads, I get about 9.8-10.1 seconds for both implementations, so the abstraction cost isn't really an issue. I just realized that Query objects are also abstracted those methods are called much more often. This might also be a good area to investigate.

Finally, I just realized that string 's can reserve space, too. I think this might be a good area to improve on, as I'm seeing a lot of calls to malloc from both string and vector. Considering both of these could get quite large in some cases, reserving memory beforehand may be a good idea so we don't get frequent realloc 's.

To do:

  • Discuss class design with Daniel and come up with a finalized design we're both happy with. This may or may not mean changing the current design (which is good, except for the bit with Index and Mapper.
  • Look into unabstracting the Query class and see how that fares. If it's a big improvement, look into policy classes.
  • Add an option to limit number of results.
  • Move instantiation of Query objects into the Mapper 's (if Daniel agrees)

Revision 182010-05-26 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 161 to 161
 Todo:
  • More profiling!
Added:
>
>

05/25/10

I did another profile on the baligner (now called saligner), and it looks like Query::align() is called the same number of times now. However, MismatchQuery::firstStep(), and all subsequent functions, weren't called the same number of times, which was strange, since it's only caller is Query::align(). After a lot more investigation, I finally found the problem:

In readaligner, they instantiate a single Pattern object for each read. In our saligner, we instantiate one Pattern per read per max mismatch value. When the Pattern is checked, it's first checked "as-is", and then reversed and checked again. Since readaligner's Pattern object is kept between k values, it will check the forward and reverse strands in different order than saligner. If a match is found, the alignment function terminates.

Unfortunately, this "discrepancy" isn't really a slowdown, since it's dependent on the given read set. For example, if all reads given match on the forward strand, then our saligner would be "faster". But, since I've finally found the error (or lack thereof), we can finally start to optimize.

To do:

  • Optimize!
  • Read up on policy classes (Chris mentioned this briefly but said it was a low priority, so I just wanted to jot it down before I forgot about it).

Revision 172010-05-22 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 149 to 149
 
  • More profiling to figure out what's wrong; maybe play with report, k, and iteration limit values.
  • Try making CMake compile different implementations of the local aligner based on CPU features.
Added:
>
>

05/21/10

Okay, I've been staring at the computer screen for way too long. I tried to figure out what's different between our baligner and the readaligner and makes readaligner so much faster, but I just can't seem to find it. Here are some of my findings:
  • On 125,000 reads, readaligner runs in about 9-10 seconds, while baligner runs in between 15-16 seconds.
  • If I comment out the lines that actually do the alignment (the call to Query::align()), readaligner takes <1 second, while baligner takes about 3 seconds. This means baligner has a lot more overhead when starting and finishing (i.e. not actually "aligning"), but this still doesn't account for the 6-7 second difference.
  • On 1000 reads, baligner and readaligner call Query::align() the same amount of times (if I use the Query class's own reverse functionality and comment out Daniel's). However, it seems calls after that start to diverge, but only slightly. For example, the call after Query::align(), MismatchQuery::firstStep(), gets called 3634 times on baligner, but only 3630 times on readaligner. It looks like this small difference is propagated downward through the calls, and ends in a difference of about 1000 more calls in baligner in a low-level function. This small difference in the beginning might be the source of the problem, since the increase in function calls could potentially get huge in 125,000 reads. However, I haven't profiled that yet, since it would take so long. Perhaps this is a good next step?
  • baligner is much more complicated than readaligner in terms of class structure, so this might be giving us some overhead. Also, I've noticed vectors are used a lot when passing information around; this might be a source of overhead, especially since just the vectors seem to be returned, instead of a pointer to a vector. This could be a future point of optimization, but I doubt it would save on very much time.

I think the most promising area to look into right now is still the Query::align() area, and why it's calling MismatchQuery::firstStep() slightly more than the readaligner. I think I'm going to run a profile comparing 125,000 reads on both aligners, just to confirm if it is that portion of the code that's giving us the problem. Also, it might just be a bunch of small optimization problems and overhead from using vectors, etc. that could be resulting in the performance loss, although I kind of doubt it.

Todo:

  • More profiling!

Revision 162010-05-21 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 134 to 134
  So in the end, there's quite a big difference between the regular and vectorized versions. It's interesting to see the improvement in the normal vectorized aligner is greater than the banded aligner, even though the banded aligner does twice as many calculations per vector. It's probably due to me having to code in my own max function in the banded aligner.
Added:
>
>
Afternoon update:
I did some more optimization (it's strangely fun!) of the banded case and managed to speed it up considerably. Following Chris' suggestions, I changed everything over to unsigned int8_t 's instead of signed (never noticed the unsigned saturated subtract function in SSE2, so I didn't know I could do this). I also changed some things over, like my "_mm_insert_epi8" workaround, and made the horizontal max calculation a bit faster (also following Chris' suggestion). Times are now: (using Release builds, 10,000 runs of 500nt each):

Aligner Vectorized avg time (s) Regular avg time (s) Regular:Vectorized ratio
Normal Local Aligner 0.000215 0.003326 15.5
Banded Local Aligner 0.000007 0.000241 34.4

I realized an alignment size of 500nt isn't that realistic, since the vectorized banded version can only handle 255nt before it starts suffering from integer rollovers.

Anyways, I've also been profiling the baligner some more and trying to figure out why it's roughly 2x slower than the readaligner. I thought I had a breakthrough when I noticed baligner called Query::align() twice as many times as readaligner (for aligning reverse sequences), when Query already dealt with reverse alignments, but it turned out Daniel had already accounted for that. So, I'm pretty much back to square one!

To do:

  • More profiling to figure out what's wrong; maybe play with report, k, and iteration limit values.
  • Try making CMake compile different implementations of the local aligner based on CPU features.

Revision 152010-05-20 - jayzhang

Line: 1 to 1
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 100 to 100
 Whoops, I realized I forgot to make an update yesterday. I merged all my changes into the main dev trunk and had a meeting with Anne and Daniel today. My next task is to profile the baligner using the Valgrind suite. I spent the rest of the day looking at how the baligner works and trying to get some working output out.

05/19/10

Changed:
<
<
Did some performance tests on my local aligner implementations. Using my latest commit (5aff5ddf561ee005fb96fa01082479a53364d475), over 1,000 randomly generated sequence pairs of length 500:
>
>
Did some performance tests on my local aligner implementations. Using my latest commit (5aff5ddf561ee005fb96fa01082479a53364d475), over 1,000 randomly generated sequence pairs of length 500:
 
Aligner Avg Time (s) Total Time (s)
Normal Local Aligner 0.01465 14.65
Line: 125 to 125
  Note to self: SSE 4.1 seems to have the _mm_max_epi8 and _mm_extract_epi8 functions that I needed! I might be able to fix the "Release" build error using that instruction set.
Added:
>
>

05/20/10

I managed to fix my vectorized banded aligner for Release mode. It turns out I calculated the value in one of my helper functions, but forgot to return it; I'm surprised it even worked in Debug mode! Anyway, here's the results for commit 5ab570185f6ba90e100f19ae10d938d7d813f08b, compiled under Release mode and doing 10,000 random pairs of sequences of length 500 each.

Aligner Vectorized avg time (s) Regular avg time (s) Regular:Vectorized ratio
Normal Local Aligner 0.000213 0.003323 15.6
Banded Local Aligner 0.000023 0.00024 10.4

So in the end, there's quite a big difference between the regular and vectorized versions. It's interesting to see the improvement in the normal vectorized aligner is greater than the banded aligner, even though the banded aligner does twice as many calculations per vector. It's probably due to me having to code in my own max function in the banded aligner.

Revision 142010-05-20 - jayzhang

Line: 1 to 1
Deleted:
<
<
META TOPICPARENT name="NGSAlignerProject"
 

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
Line: 111 to 110
  Note these were not done on a Release build because the Release build messes up the vectorized banded aligner for some reason (outputs all 0). I'll have to look into why building as "Release" build is wrong but a regular build is not. I think it might be going wrong in my "custom" signed max function.
Changed:
<
<
I also started profiling the baligner, after making sure all the output was correct. Since aligning the 125000 reads Daniel usually uses would have been way too slow, I used the e_coli_1000_1.fq reads file as a test. I ran memcheck first and fixed all the memory leaks that arose. I also ran callgrind over baligner and I found a significant bottleneck in the gap aligner, where a constant 256*256 matrix was being initialized and reinitialized multiple times over all the reads, when it only needed to be initialized once. After the fix, the speed increase was significant:
>
>
I also started profiling the baligner, after making sure all the output was correct. Since aligning the 125000 reads Daniel usually uses would have been way too slow, I used the e_coli_1000_1.fq reads file as a test. I ran memcheck first and fixed all the memory leaks that arose. I also ran callgrind over baligner and I found a significant bottleneck in the gap aligner, where a constant 256*256 matrix was being initialized and reinitialized multiple times over all the reads, when it only needed to be initialized once. After the fix, the speed increase was significant:
 
Test Time (s)
Before fix, 1,000 reads 20.26
Line: 124 to 123
 
  • Benchmark the local aligner using Release builds.
  • Do more profiling of the baligner.
Added:
>
>
Note to self: SSE 4.1 seems to have the _mm_max_epi8 and _mm_extract_epi8 functions that I needed! I might be able to fix the "Release" build error using that instruction set.

Revision 132010-05-20 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 100 to 100
 

05/18/10

Whoops, I realized I forgot to make an update yesterday. I merged all my changes into the main dev trunk and had a meeting with Anne and Daniel today. My next task is to profile the baligner using the Valgrind suite. I spent the rest of the day looking at how the baligner works and trying to get some working output out.
Added:
>
>

05/19/10

Did some performance tests on my local aligner implementations. Using my latest commit (5aff5ddf561ee005fb96fa01082479a53364d475), over 1,000 randomly generated sequence pairs of length 500:

Aligner Avg Time (s) Total Time (s)
Normal Local Aligner 0.01465 14.65
Banded Local Aligner 0.0009 0.9
Normal Vectorized Local 0.00389 3.89
Banded Vectorized Local 0.00029 0.29

Note these were not done on a Release build because the Release build messes up the vectorized banded aligner for some reason (outputs all 0). I'll have to look into why building as "Release" build is wrong but a regular build is not. I think it might be going wrong in my "custom" signed max function.

I also started profiling the baligner, after making sure all the output was correct. Since aligning the 125000 reads Daniel usually uses would have been way too slow, I used the e_coli_1000_1.fq reads file as a test. I ran memcheck first and fixed all the memory leaks that arose. I also ran callgrind over baligner and I found a significant bottleneck in the gap aligner, where a constant 256*256 matrix was being initialized and reinitialized multiple times over all the reads, when it only needed to be initialized once. After the fix, the speed increase was significant:

| Test | Time (s) |
| Before fix, 1,000 reads | 20.26 |
| Before fix, 125,000 reads | 401.71 |
| After fix, 1,000 reads | 2.72 |
| After fix, 125,000 reads | 71.98 |
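For the record, the shape of the fix looks roughly like the sketch below (class and constant names are made up; the actual NGSA gap aligner differs): the constant matrix gets built once, in the constructor, instead of once per read.

<verbatim>
#include <string>

// Illustrative only: hoist the constant 256x256 score matrix out of the
// per-read path so it is initialized exactly once.
class GapAligner {
 public:
  GapAligner() { InitScoreMatrix(); }  // built once, up front

  int Align(const std::string& read, const std::string& reference) {
    // ... per-read work uses score_matrix_ but never re-initializes it ...
    return 0;
  }

 private:
  void InitScoreMatrix() {
    for (int i = 0; i < 256; ++i)
      for (int j = 0; j < 256; ++j)
        score_matrix_[i][j] = (i == j) ? kMatch : kMismatch;
  }

  static const int kMatch = 2;      // made-up values
  static const int kMismatch = -1;
  int score_matrix_[256][256];
};
</verbatim>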

For tomorrow:

  • Find out why the vectorized banded local aligner isn't working on a Release build and fix it.
  • Benchmark the local aligner using Release builds.
  • Do more profiling of the baligner.

Revision 122010-05-19 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 97 to 97
  I'm not really sure what else to do for now, so I'll probably have to ask Chris/Daniel tomorrow.
Added:
>
>

05/18/10

Whoops, I realized I forgot to make an update yesterday. I merged all my changes into the main dev trunk and had a meeting with Anne and Daniel today. My next task is to profile the baligner using the Valgrind suite. I spent the rest of the day looking at how the baligner works and trying to get some working output out.

Revision 112010-05-18 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 90 to 90
 
  • Polish up all the code I've written and merge.
  • Add in my improvements to the original local aligner.
Added:
>
>

05/17/10

I've integrated all the different aligners I've made into one package. So now there are two different local aligners, banded and un-banded, each of which has two implementations, vectorized and non-vectorized. Which implementation of each aligner gets compiled depends on whether the target machine supports at least SSE2. I haven't figured out how to get CMake to detect SSE2 support automatically, so for now I've added a variable to the CMakeLists.txt that has to be set manually.

I also cleaned up my implementations and optimized the existing local aligner implementation (only the ScoreOnlyAlignment method) to be more memory-efficient and faster. All public interfaces for both vectorized/non-vectorized versions of the aligners now match in both input and output and should be more-or-less equivalent in usage (with the exception of the free gap bits in the regular local aligner).

I'm not really sure what else to do for now, so I'll probably have to ask Chris/Daniel tomorrow.

Revision 102010-05-15 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 82 to 82
 
  • (Much later) Add in my improvements to the original local aligner.
  • (Still much later) Polish up the different aligners and have CMake compile either the vectorized or non-vectorized versions of the aligners based on CPU support.
Added:
>
>

05/14/10

I finished the vectorized banded implementation with all test cases passing! I really thought it would take more time and I was surprised when all the test cases passed. The implementation works only with square matrices and uses vectors of sixteen 8-bit values, so it should be quite a speed-up over the regular banded implementation. The strange thing with 8-bit vectors is that the instruction set is missing a few things compared to 16-bit vectors. For example, there's no insert or extract function, so I ended up casting the __m128i values into an int8_t array and inserting/extracting manually; I'm not really sure whether that's a good idea, though. Also, there was no max function for signed integers, so I had to write my own for that as well.
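The manual insert/extract part looks roughly like this (an illustrative sketch, not the actual code): spill the vector to a 16-byte buffer, touch the lane, and reload.

<verbatim>
#include <emmintrin.h>  // SSE2
#include <stdint.h>

// SSE2 has no 8-bit extract/insert, so go through a small buffer instead.
static inline int8_t ExtractEpi8(__m128i v, int lane) {
  int8_t buf[16];
  _mm_storeu_si128(reinterpret_cast<__m128i*>(buf), v);
  return buf[lane];
}

static inline __m128i InsertEpi8(__m128i v, int8_t value, int lane) {
  int8_t buf[16];
  _mm_storeu_si128(reinterpret_cast<__m128i*>(buf), v);
  buf[lane] = value;
  return _mm_loadu_si128(reinterpret_cast<__m128i*>(buf));
}
</verbatim>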

For Monday:

  • Edit the CMakeList.txt files so that it compiles vector/non-vector versions based whether SSE2 is available or not.
  • Polish up all the code I've written and merge.
  • Add in my improvements to the original local aligner.

Revision 92010-05-14 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 66 to 66
 
  • Finish up my banded non-vectorized algorithm and test it
  • Try coming up with a vectorized version (may involve having to "flip" the matrix).
Added:
>
>

05/13/10

Had a discussion with Chris about implementing the banded local aligner. He told me that I have actually been doing my non-vectorized version right: I should go left-to-right, top-to-bottom through the array with artificial bounds instead of from 0 to the maximum x. He also told me that I can pretty much assume the matrix will be square, since we will be passing output from the index into the aligner, and the index will find a reference that is roughly the same size as the read. So that solves the slope issue. He also described a general way to traverse the matrix with a vector: instead of going left-to-right and then restarting at the beginning of the next row, the vector follows a right-down-right-down pattern (so it "bounces").

I came in today and realized my method of finding the bounding lines for the banded non-vectorized version was pretty complicated: instead of finding the equations of the two bounding lines at some distance k from the diagonal (which involves finding the slope and y-intercept and has square roots in it), I now take the diagonal line and compute all cells within a horizontal distance k of it. This is much easier and works out to pretty much the same thing. It has the added advantage of letting me take a horizontal k value as my band radius, which lets me guarantee a correct score as long as the number of gaps is <= k.
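In code, the per-row bounds come out to something like this (a minimal sketch assuming a square matrix and 0-based indices; not the actual implementation):

<verbatim>
#include <algorithm>

// For row i, only the columns within horizontal distance k of the main
// diagonal are computed; cells outside the band are treated as scoring 0.
void BandBounds(int i, int k, int read_length, int* first_col, int* last_col) {
  *first_col = std::max(0, i - k);
  *last_col  = std::min(read_length - 1, i + k);
}
</verbatim>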

I managed to finish up the non-vectorized version as well as make up an (I think) pretty comprehensive test suite for it. In the end, I passed all my own tests, though I did miss a "-1" in one line of code that had me banging my head for a while. I also managed to improve the memory efficiency of the score-only local alignment function in general, based on a few tricks I learned while combing through the vectorized Smith-Waterman code, though I've only implemented them in my banded aligner for now. Improvements include (rough sketch after the list):

  • Keeping only one read-sized row of scores instead of 2.
  • Keeping only one read-sized row to keep track of gaps in the x-axis instead of 2 rows.
  • Having just one int to keep track of the last match, instead of 2 read-sized rows.
  • Having just one int to keep track of the last gap in y, instead of 2 read-sized rows.
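Roughly, that layout looks like the sketch below, shown on a plain (un-banded) score-only affine aligner with made-up scoring constants; it's only meant to illustrate which state is row-sized and which is a single int, not to reproduce the NGSA code.

<verbatim>
#include <algorithm>
#include <climits>
#include <string>
#include <vector>

// Score-only affine-gap Smith-Waterman using one score row, one gap row,
// and two scalars (previous diagonal score, current horizontal-gap score).
int AffineScoreOnly(const std::string& read, const std::string& genome) {
  const int kMatch = 2, kMismatch = -1, kGapOpen = 3, kGapExtend = 1;
  std::vector<int> h(read.size() + 1, 0);            // one row of scores
  std::vector<int> e(read.size() + 1, INT_MIN / 2);  // one row of vertical-gap scores
  int best = 0;
  for (size_t i = 1; i <= genome.size(); ++i) {
    int diag = 0;         // single int: H[i-1][j-1]
    int f = INT_MIN / 2;  // single int: best score ending in a horizontal gap
    for (size_t j = 1; j <= read.size(); ++j) {
      e[j] = std::max(e[j] - kGapExtend, h[j] - kGapOpen);    // extend or open a vertical gap
      f    = std::max(f - kGapExtend, h[j - 1] - kGapOpen);   // extend or open a horizontal gap
      int up = h[j];                                          // H[i-1][j] before overwrite
      int score = diag + (genome[i - 1] == read[j - 1] ? kMatch : kMismatch);
      score = std::max(std::max(score, e[j]), std::max(f, 0)); // clamp at 0 (local alignment)
      diag = up;
      h[j] = score;
      best = std::max(best, score);
    }
  }
  return best;
}
</verbatim>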

To do:

  • Start tackling the vectorized banded algorithm! That'll be pretty challenging and probably keep me busy all day.
  • (Much later) Add in my improvements to the original local aligner.
  • (Still much later) Polish up the different aligners and have CMake compile either the vectorized or non-vectorized versions of the aligners based on CPU support.

Revision 82010-05-13 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 58 to 58
 
  • 28.907083597 seconds using linear gap penalty
So I get roughly a 25% decrease in calculation time. Tests using more/fewer iterations and different genome/read lengths show more or less the same improvement.
Changed:
<
<
Now it's on to doing the "banded" algorithm. I'm going to try making a non-vectorized version first, then move on to vectorized. I'm not really sure how I'm going to test it though, since the results are going to be a little bit different...
>
>
I also finished a very preliminary version of the banded algorithm (non-vectorized), though I haven't tested or even compiled it yet. It's basically just a recoding of the original Smith-Waterman aligner in that it goes through the matrix left-to-right and top-to-bottom, but instead of starting from x=0 and going to x=read length, it starts and ends at boundaries defined by two diagonals running through the matrix. I'm not too happy about this arrangement right now, since if I were to do a full alignment, I would have to keep the whole matrix in memory, which is a waste given that I'm only calculating a small subset of it. Another method I've been thinking of is to "flip" the matrix so it's a diamond shape and calculate in vertical lines. I think this would be the way to go for the vectorized version, but I have no idea how to do that right now!

I do remember doing a "flip" in CS320, but I think it was with a square, whereas this matrix could be rectangular (shouldn't be that much different). I should look up my old notes later on, too.

For tomorrow, I will:

  • Finish up my banded non-vectorized algorithm and test it
  • Try coming up with a vectorized version (may involve having to "flip" the matrix).
 

Revision 72010-05-12 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 50 to 50
 
  • Hopefully get that to pass all the tests (i.e. make the behaviour the same as the non-vectorized version)
  • Take a look at optimizing for linear gap penalty if I have time.
Added:
>
>

05/12/10

Chris told me that the full alignment was a lower priority, since the score-only alignment is used much more often. So I've tried optimizing for the linear gap penalty instead. I managed to improve performance by skipping a few calculations that are only needed in the affine case, and also improved the memory usage slightly (by dropping one genome-length array).
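A minimal sketch of the simplification, with made-up scoring constants (not the NGSA defaults): with a linear penalty there is no separate gap-open state to carry around, so a single score row plus two scalars is enough.

<verbatim>
#include <algorithm>
#include <string>
#include <vector>

// Score-only Smith-Waterman with a linear gap penalty: no E/F gap arrays at all.
int LinearGapScoreOnly(const std::string& read, const std::string& genome) {
  const int kMatch = 2, kMismatch = -1, kGap = 2;
  std::vector<int> row(read.size() + 1, 0);  // one read-sized row of scores
  int best = 0;
  for (size_t i = 1; i <= genome.size(); ++i) {
    int diag = 0;                            // H[i-1][0]
    for (size_t j = 1; j <= read.size(); ++j) {
      int up = row[j];                       // H[i-1][j] before overwrite
      int score = diag + (genome[i - 1] == read[j - 1] ? kMatch : kMismatch);
      score = std::max(score, up - kGap);            // gap in one sequence
      score = std::max(score, row[j - 1] - kGap);    // gap in the other
      score = std::max(score, 0);                    // local alignment floor
      diag = up;
      row[j] = score;
      best = std::max(best, score);
    }
  }
  return best;
}
</verbatim>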

I did a few preliminary tests, using a randomly generated genome/read sequences of length 1000 each. Over 1000 iterations, my results were:

  • 38.836576365 seconds using affine gap penalty
  • 28.907083597 seconds using linear gap penalty
So I get roughly a 25% decrease in calculation time. Tests using more/fewer iterations and different genome/read lengths show more or less the same improvement.

Now it's on to doing the "banded" algorithm. I'm going to try making a non-vectorized version first, then move on to vectorized. I'm not really sure how I'm going to test it though, since the results are going to be a little bit different...

Revision 62010-05-12 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 39 to 39
 
  • Read up on CMake a bit more; it's still giving me a strange amount of trouble...

05/11/10

Changed:
<
<
Today felt pretty pointless. I managed to finish my score-only Smith-Waterman alignment, but it ended up failing a bunch of tests. So I took a look at the SHRiMP code again and made a few changes based on that code to iron out the bugs. In the end, my code ended up looking almost exactly like the SHRiMP code, and yet was still buggy, so I ended up just copying and pasting the code into my version. I decided to cut and paste the SHRiMP code into my own class for a couple of reasons:
>
>
Today felt pretty pointless. I managed to finish my score-only Smith-Waterman alignment, but it ended up failing a bunch of tests. So I took a look at the SHRiMP code again and made a few changes based on that code to iron out the bugs. In the end, my code ended up looking almost exactly like the SHRiMP code, and yet was still buggy, so I ended up just copying and pasting the code into my version. I decided to cut and paste the SHRiMP code into my own class for a couple of reasons:
 
  • I'll have to use that code again, but modified when I do the full alignment (with traceback).
Changed:
<
<
  • The SHRiMP code required a call to initialization, which was troublesome and not really required given the way the NGSA LocalAligner works.
>
>
  • The SHRiMP code required a call to initialization, which was troublesome and not really required given the way the NGSA LocalAligner works.
 
  • Since SHRiMP was in C, I couldn't wrap their functions in namespaces.
I'll still have to check with Chris whether that decision was the right one, though.

Revision 52010-05-12 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 38 to 38
 
  • Time permitting, get started on the full alignment with traceback.
  • Read up on CMake a bit more; it's still giving me a strange amount of trouble...
Added:
>
>

05/11/10

Today felt pretty pointless. I managed to finish my score-only Smith-Waterman alignment, but it ended up failing a bunch of tests. So I took a look at the SHRiMP code again and made a few changes based on that code to iron out the bugs. In the end, my code ended up looking almost exactly like the SHRiMP code, and yet was still buggy, so I ended up just copying and pasting the code into my version. I decided to cut and paste the SHRiMP code into my own class for a couple of reasons:
  • I'll have to use that code again, but modified when I do the full alignment (with traceback).
  • The SHRiMP code required a call to initialization, which was troublesome and not really required given the way the NGSA LocalAligner works.
  • Since SHRiMP was in C, I couldn't wrap their functions in namespaces.
I'll still have to check with Chris whether that decision was the right one, though.

The good news is, I got all the tests to pass on the score-only alignment. Now all I have to do is get a full alignment working. I also want to try to optimize for the linear gap penalty method, since that case is much simpler than the affine gap penalty. Although the linear case is just affine with a gap open penalty set to 0, I think I can do away with some of the vectors in the calculation, and so make it a little faster. That'll be a low priority, though. So for tomorrow, I think I'll be able to finish:

  • The full Smith-Waterman alignment (vectorized)
  • Hopefully get that to pass all the tests (i.e. make the behaviour the same as the non-vectorized version)
  • Take a look at optimizing for linear gap penalty if I have time.

Revision 42010-05-11 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 30 to 30
 
  • Take a look at the banded Smith-Waterman function and try integrating that, too, perhaps. This function only checks a "strip" of cells going down the diagonal of the score matrix, since in practice, alignments aren't as useful when there are large variations from a full match.
  • Finally, in the distant future, start thinking about the free gap algorithms and vectorizing them.
Added:
>
>

05/10/10

I spent the morning trying to integrate the SHRiMP vectorized Smith-Waterman code into a new local aligner class, but I couldn't get it to pass the test cases and kept getting segmentation faults, etc. So I decided it was more trouble than it was worth and that I'd try building my own version of the vectorized Smith-Waterman directly. I also had a lot of trouble getting CMake to do the right thing, but I later figured out it was because I forgot to use the -msse2 -msse3 compiler flags.

I'm about a quarter of the way through a score-only vectorized version of the local aligner, and I've got a basic game plan laid out for completing it; I've also found that the best way to understand some code is to try to code it yourself! I now have a much firmer grasp on the whole SHRiMP aligner and on why it does some of the things it does (for example, one question I had was why it stored values for b_gap, but no a_gap array). I think by tomorrow, I can:

  • Get the score-only vectorized Smith-Waterman finished and hopefully tested out.
  • Time permitting, get started on the full alignment with traceback.
  • Read up on CMake a bit more; it's still giving me a strange amount of trouble...

Revision 32010-05-08 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 21 to 21
 
  • Fixing any bugs that might come up along the way.
  • If I get a chance...maybe even start looking into the vectorized methods!
Added:
>
>

05/07/10

Finished writing the test cases for the local aligner and corrected a very minor bug I found along the way. I mostly used Daniel's test cases for the ScoreOnlyAlignment method, but added the FullAlignment method to the tests. However, I didn't manage to get middle-of-sequence mismatch and gap tests working, since there are so many different ways the alignment can turn out, and the result depends very heavily on the scoring method (and the values of the different penalties) as well.

Chris also gave me the source code for a vectorized Smith-Waterman function (from the SHRiMP library) and I spent the rest of the afternoon looking at that and trying to understand it. I found the documentation for the SSE2 functions on the Intel website (link to come later; I don't have it at hand right now). From my understanding (and Chris' explanation), the function goes through the matrix using a diagonal vector ("positive slope") operating on eight 16-bit values at a time, and calculates the highest score from that vector. It uses the affine gap penalty method to calculate scores (a rough sketch of one such vector update is below the list). From my discussion with Chris, I should now be working on:

  • Integrating the vectorized Smith-Waterman function into the NGSA library (I can either try using the source code given or make my own).
  • Configuring CMake so that it detects support for SSE2 (and maybe SSE3) so that it will compile either the vectorized Smith-Waterman or non-vectorized Smith-Waterman, depending on the machine.
  • Take a look at the banded Smith-Waterman function and try integrating that, too, perhaps. This function only checks a "strip" of cells going down the diagonal of the score matrix, since in practice, alignments aren't as useful when there are large variations from a full match.
  • Finally, in the distant future, start thinking about the free gap algorithms and vectorizing them.
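Not the SHRiMP code itself, but a rough sketch of what a single update of that diagonal vector could look like with SSE2 (eight 16-bit lanes per __m128i; the variable names and gap-penalty convention are my own assumptions):

<verbatim>
#include <emmintrin.h>  // SSE2

// One anti-diagonal, affine-gap update for 8 cells at once. diag/up/left hold
// H values from the two previous anti-diagonals; e_up and f_left carry the
// running gap scores; match_scores holds +match/-mismatch per lane.
// Saturating 16-bit arithmetic keeps the scores from wrapping around.
static inline __m128i UpdateAntiDiagonal(__m128i diag, __m128i up, __m128i left,
                                         __m128i match_scores,
                                         __m128i* e_up, __m128i* f_left,
                                         __m128i gap_open, __m128i gap_extend) {
  // E = max(E - extend, H_up - open): best score ending in a gap in one sequence.
  *e_up = _mm_max_epi16(_mm_subs_epi16(*e_up, gap_extend),
                        _mm_subs_epi16(up, gap_open));
  // F = max(F - extend, H_left - open): best score ending in a gap in the other.
  *f_left = _mm_max_epi16(_mm_subs_epi16(*f_left, gap_extend),
                          _mm_subs_epi16(left, gap_open));
  // H = max(0, diag + match/mismatch, E, F); the 0 makes the alignment local.
  __m128i h = _mm_adds_epi16(diag, match_scores);
  h = _mm_max_epi16(h, *e_up);
  h = _mm_max_epi16(h, *f_left);
  return _mm_max_epi16(h, _mm_setzero_si128());
}
</verbatim>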

Revision 22010-05-07 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
Line: 13 to 13
  So after all that, I think I'm finally ready to do a little coding. Today, I'll mostly be working on the test suite and trying to get that out of the way.
Added:
>
>
End-of-the-day post:
I've looked through the local aligner source code in a bit more detail and I think I know my way around it pretty well now. I'm still a little unclear on the free gap policies (I understand the premise and a little of how it works, but I still need to wrap my head around some of the cases in the algorithm), but Chris told me not to worry too much about it, as it's not a priority for now.

I've managed to modify Daniel's test/correctness_test.cc so that it works on the existing code, and all the test cases pass. The test cases Daniel wrote were pretty complete but only tested the ScoreOnlyAlignment method in the local aligner. So next up will be writing some test cases for the FullAlignment, which may be a little harder since it doesn't just output numbers. I've written a few tests so far, and I think I may have found one or two bugs (but I'm going to have to look into that some more to see if it was my test case or the code that was failing). So tomorrow, I will mostly be doing...

  • Finishing up the test cases for the FullAlignment method.
  • Fixing any bugs that might come up along the way.
  • If I get a chance...maybe even start looking into the vectorized methods!

Revision 12010-05-06 - jayzhang

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="NGSAlignerProject"

05/06/10

I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
  • Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
  • Had a meeting with Daniel and Chris where we discussed my first task:
    • First, I should make or beef up the existing test suite.
    • Then, I'm to make a vectorized version of the Smith-Waterman aligner that Daniel has working.
    • I should also familiarize myself with existing tools (e.g. Bowtie, BWA, Mosaik, and SamTools).
  • I ended up spending a lot of time reading through the various guides Daniel posted, as well as the Git guide and the Google C++ style guide to get myself more familiarized with what I should be doing and how I should be doing it.
  • I also spent some time reading up on the Smith-Waterman algorithm and reading up on sequence alignment in general to get myself a little more familiar with the project.
  • Checked out Daniel's existing code, especially the local_aligner.cc bit (which I'm assuming my vectorized SW will be based largely on). I can understand the majority of the code, although I'm a little iffy on some of the gap penalty calculations.

So after all that, I think I'm finally ready to do a little coding. Today, I'll mostly be working on the test suite and trying to get that out of the way.

 