08/18/10
I just realized I haven't updated this journal in quite a while, so here's a brief overview of what's been happening:
The FM Index is mostly done, but without multiple-reference support. Chris says this should be an easy fix that can be added later. In a bunch of tests, the FM Index seems quite fast when doing Locate (roughly 4x faster than bowtie). However, these results aren't the most reliable, because there's no post-processing going into it, nor is there quality support. Some interesting points:
- When doing Locate, the sampling rate of the locate structure (which stores positions) matters quite a bit, so it should be as small as possible.
- The StaticDNASequence sampling rate has some leeway at the low end (in my tests, 64 to 256 bases were okay), but starts affecting the time significantly after that.
- The StaticBitSequence sampling rate has pretty much no effect on time, even up to sampling rates of 1024! Unfortunately, changing this sampling rate also affects memory the least.
- Using SSE4.2 (hardware popcount) doesn't have much of an effect at low StaticDNASequence sampling rates (as expected): at a sampling rate of 64 bases, the SSE-enabled version was about 4% faster, rising to about 20% faster at a sampling rate of 1024 bases.
Lately, I've been working with Daniel to integrate the new FM Index into NGSA, which has taken most of the past two days. Right now, we have the BowtieMapper working (albeit with a lot of quick hacks, since multi-reference support isn't done yet), which maps reads with up to 3 mismatches and uses a jump table. Daniel is also going to work on the BWAMapper, which currently has some problems with unsigned integers and -1's. I'm also going to reorganize the structure of the NGSA library and make a better utilities file.
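The exact BWAMapper problem isn't recorded here, so the following is only a sketch of the general kind of trouble that unsigned integers and -1 cause in 64-bit C++ builds, along with two common ways around it. The names kNotFound, WidenedEqualsMinusOne, IsNotFound, and SaInterval are hypothetical and not taken from NGSA.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical named sentinel for "no position"; the real code may differ.
constexpr std::uint32_t kNotFound = std::numeric_limits<std::uint32_t>::max();

// Pitfall: a 32-bit "-1" widened to 64 bits is zero-extended, so it no
// longer compares equal to a 64-bit "-1".
inline bool WidenedEqualsMinusOne(std::uint32_t pos) {
  const std::uint64_t wide = pos;                 // 0x00000000FFFFFFFF
  return wide == static_cast<std::uint64_t>(-1);  // 0xFFFFFFFFFFFFFFFF
}

// Safer: compare against the named sentinel in the same type.
inline bool IsNotFound(std::uint32_t pos) { return pos == kNotFound; }

// Half-open suffix-array interval [lo, hi). An empty interval is lo == hi,
// so there is no need for an "hi = lo - 1" that wraps around when lo is 0.
struct SaInterval {
  std::uint32_t lo;
  std::uint32_t hi;
  bool Empty() const { return lo >= hi; }
  std::uint32_t Size() const { return Empty() ? 0 : hi - lo; }
};
```

Note that a check like `pos < 0` on an unsigned value can never be true, which is another way a -1 sentinel silently stops working.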
Finally, I ran some preliminary benchmarks between the new saligner and bowtie. The following were run on NC_010473.fasta, minus the 'R' and 'Y' characters in the genome, using the e_coli_2622382_1.fq readset. Tests were done on skorpios in /var/tmp/, using 64-bit versions of both aligners:
| Aligner | Match Type | Flags | Memory Usage (B, rough, within a few thousand) | Time (s) | Notes |
|---|---|---|---|---|---|
| bowtie | exact | -S -v 0 | 6,800,000 | 28.01 | |
| saligner | exact | N/A | 6,500,000 | 18.24 | locate sr = 16, StaticBitSequence sr = 512, StaticDNASequence sr = 128 |
| saligner | exact | N/A | 5,300,000 | 19.07 | locate sr = 32, StaticBitSequence sr = 512, StaticDNASequence sr = 128 |
Note that the times might not be directly comparable, because I used time for bowtie but the built-in timing function for saligner, since I haven't gotten the saving/loading integrated yet. Also, I didn't use valgrind --tool=massif to profile saligner, because there are some full-text references being kept in memory somewhere, which really raises the memory usage (I'll have to find and clear those later). The memory reported above is from the FM Index's FMIndex::MemoryUsage() function, which only reports the theoretical usage, and I kept the memory usage for saligner on the low side to account for any extra memory that may be used for other things.
To do:
- Reorganize directory structure
- Fix a bug in the FM Index/StaticDNASequence, where Rank0 gives the wrong output in blocks with '$'
- Implement a jump table within FM Index
- Maybe implement multiple sequence support (Chris will get back to me on this)
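For the jump table item, here is a rough C++ sketch of one common way such a table can work: precompute, for every possible k-mer, the suffix-array interval of suffixes that start with it, so a search can look up the interval for the last k bases of a read instead of doing the first k backward-search steps. This is not the planned NGSA design. The names JumpTable, SaInterval, BaseCode, and EncodeKmer are hypothetical, and the suffix array is built naively by sorting, which only suits tiny examples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Half-open suffix-array interval [lo, hi); lo == hi means "no occurrences".
struct SaInterval {
  std::uint32_t lo = 0;
  std::uint32_t hi = 0;
};

// Maps A, C, G, T to 0..3 and anything else to -1.
inline int BaseCode(char c) {
  switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
  }
}

// Packs s[pos, pos + k) into 2 bits per base. Returns false if the k-mer
// runs past the end of s or contains a non-ACGT character.
inline bool EncodeKmer(const std::string& s, std::size_t pos, std::size_t k,
                       std::uint32_t* code) {
  if (pos + k > s.size()) return false;
  std::uint32_t value = 0;
  for (std::size_t i = 0; i < k; ++i) {
    const int b = BaseCode(s[pos + i]);
    if (b < 0) return false;
    value = (value << 2) | static_cast<std::uint32_t>(b);
  }
  *code = value;
  return true;
}

// Hypothetical jump table with 4^k entries, one per k-mer.
class JumpTable {
 public:
  JumpTable(const std::string& text, std::size_t k)
      : k_(k), table_(std::size_t(1) << (2 * k)) {
    assert(k >= 1 && k <= 12);
    // Naive suffix array: sort suffix start positions lexicographically.
    std::vector<std::uint32_t> sa(text.size());
    for (std::size_t i = 0; i < sa.size(); ++i) {
      sa[i] = static_cast<std::uint32_t>(i);
    }
    std::sort(sa.begin(), sa.end(),
              [&text](std::uint32_t a, std::uint32_t b) {
                return text.compare(a, std::string::npos,
                                    text, b, std::string::npos) < 0;
              });
    // Suffixes sharing a k-mer prefix are contiguous in the suffix array,
    // so one pass is enough to record each interval.
    for (std::size_t rank = 0; rank < sa.size(); ++rank) {
      std::uint32_t code = 0;
      if (!EncodeKmer(text, sa[rank], k_, &code)) continue;
      SaInterval& iv = table_[code];
      if (iv.lo == iv.hi) iv.lo = static_cast<std::uint32_t>(rank);
      iv.hi = static_cast<std::uint32_t>(rank + 1);
    }
  }

  // Interval for the last k bases of `read`; empty if the read is shorter
  // than k or its last k bases contain a non-ACGT character.
  SaInterval Lookup(const std::string& read) const {
    std::uint32_t code = 0;
    if (read.size() < k_ ||
        !EncodeKmer(read, read.size() - k_, k_, &code)) {
      return SaInterval();
    }
    return table_[code];
  }

 private:
  std::size_t k_;
  std::vector<SaInterval> table_;
};
```

A backward search would then start from Lookup(read) and continue with the remaining read.size() - k bases. The table costs 4^k intervals of memory, so k has to stay small.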