Difference: JaysJournal (70 vs. 71)

Revision 71 - 2010-08-18 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
May 2010 archive
Line: 116 to 116
 
  • Implement a feature to load/save only the BWT string, instead of the whole index, which could change based on different memory usage profiles.
  • Get ready for integration?
Added:

08/18/10

I just realized I haven't updated this journal in quite a while. So here's a brief overview of what's been happening:

The FM Index is mostly done, but without multiple-reference support. Chris says this should be an easy fix and can be added later. In a bunch of tests, the FM index seems quite fast when doing Locate (roughly 4x faster than bowtie). However, these results aren't the most reliable, because there's no post-processing going into it, nor is there quality support. Some interesting points:

  • When doing Locate, the sampling rate of the locate structure (to store positions) matters quite a bit, so this should be as small as possible.
  • The StaticDNASequence sampling rate has some leeway at the low end (in my tests, sampling rates from 64 to 256 bases were okay), but starts affecting the time significantly after that.
  • The StaticBitSequence sampling rate has pretty much no effect, even up to sampling rates of 1024! Unfortunately, changing this sampling rate also affects memory the least.
  • Using SSE4.2 (hardware popcount) doesn't have much of an effect at low StaticDNASequence sampling rates (expected); at a sampling rate of 64 bases, the SSE-enabled version was about 4% faster, but the gain goes up to about 20% at a sampling rate of 1024 bases.
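The trade-offs in the list above can be illustrated with a toy FM index. This is only a sketch in Python, not the NGSA code: the class TinyFMIndex and the parameters rank_sr and locate_sr are hypothetical stand-ins for the StaticDNASequence and locate sampling rates, and the suffix array is built naively. It shows why a larger rank sampling rate means more scanning per rank query, and why a larger locate sampling rate means more LF-mapping steps per reported position.

```python
class TinyFMIndex:
    """Toy FM index with two tunable sampling rates (all names hypothetical)."""

    def __init__(self, text, rank_sr=4, locate_sr=4):
        self.text = text + "$"  # '$' sorts before A, C, G, T in ASCII
        n = len(self.text)
        # Naive suffix array construction; fine for a toy example only.
        sa = sorted(range(n), key=lambda i: self.text[i:])
        self.bwt = "".join(self.text[i - 1] for i in sa)
        self.rank_sr = rank_sr
        # C[c]: number of characters in the text strictly smaller than c.
        self.C = {}
        total = 0
        for c in sorted(set(self.text)):
            self.C[c] = total
            total += self.text.count(c)
        # Rank checkpoints: per-character counts before BWT positions
        # 0, rank_sr, 2 * rank_sr, ... (memory shrinks as rank_sr grows).
        self.checkpoints = []
        counts = {c: 0 for c in self.C}
        for i, ch in enumerate(self.bwt):
            if i % rank_sr == 0:
                self.checkpoints.append(dict(counts))
            counts[ch] += 1
        # Locate samples: keep only suffix array entries whose text position
        # is a multiple of locate_sr (position 0 is always kept).
        self.sa_samples = {row: pos for row, pos in enumerate(sa)
                           if pos % locate_sr == 0}

    def rank(self, c, i):
        """Occurrences of c in bwt[0:i]: one checkpoint plus a short scan."""
        block = min(i // self.rank_sr, len(self.checkpoints) - 1)
        start = block * self.rank_sr
        # The scan length is bounded by rank_sr; this is the part that a
        # hardware popcount would speed up on a bit-packed sequence.
        return self.checkpoints[block][c] + self.bwt.count(c, start, i)

    def backward_search(self, pattern):
        """Return the half-open suffix array range [lo, hi) of the pattern."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.C:
                return 0, 0
            lo = self.C[c] + self.rank(c, lo)
            hi = self.C[c] + self.rank(c, hi)
            if lo >= hi:
                return 0, 0
        return lo, hi

    def locate(self, pattern):
        """Text positions of all exact occurrences of the pattern."""
        lo, hi = self.backward_search(pattern)
        positions = []
        for row in range(lo, hi):
            steps = 0
            # Walk backwards through the text with LF-mapping until a
            # sampled row is reached; fewer samples means a longer walk.
            while row not in self.sa_samples:
                c = self.bwt[row]
                row = self.C[c] + self.rank(c, row)
                steps += 1
            positions.append(self.sa_samples[row] + steps)
        return sorted(positions)
```

For example, TinyFMIndex("GATTACAGATTACA", rank_sr=4, locate_sr=3).locate("ATTA") walks at most locate_sr - 1 LF steps per hit, and each step costs one rank query that scans at most rank_sr characters, which is consistent with the locate sampling rate mattering most.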

Lately, I've been working with Daniel to integrate the new FM index into NGSA, which has taken most of the past two days. Right now, we have the BowtieMapper working (albeit with a lot of quick hacks, since multi-reference support isn't done yet); it maps reads with up to 3 mismatches and uses a jump table. Daniel is also going to work on the BWAMapper, which currently has some problems with unsigned integers and -1's. I'm also going to reorganize the structure of the NGSA library and make a better utilities file.
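As a rough illustration of mapping with a bounded number of mismatches over an FM index, here is a minimal backtracking sketch in Python. It is not the BowtieMapper: the function names build_index and search_with_mismatches are hypothetical, the index is built naively, rank is computed by plain counting, and there is none of the pruning or quality handling a real mapper would need.

```python
def build_index(text):
    """Naive toy index: suffix array, BWT, and first row per character."""
    text += "$"  # '$' sorts before A, C, G, T in ASCII
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    first = {}
    for row, i in enumerate(sa):
        # First row whose suffix starts with this character, which equals
        # the number of smaller characters in the text.
        first.setdefault(text[i], row)
    return sa, bwt, first


def search_with_mismatches(pattern, sa, bwt, first, max_mismatches):
    """Positions where pattern occurs with at most max_mismatches substitutions."""
    hits = set()

    def extend(i, lo, hi, budget):
        if lo >= hi:
            return  # empty suffix array range: dead branch
        if i < 0:
            hits.update(sa[lo:hi])  # whole pattern consumed
            return
        for c in "ACGT":
            if c not in first:
                continue
            cost = 0 if c == pattern[i] else 1
            if cost > budget:
                continue
            # One backward-search step with character c (unsampled rank).
            new_lo = first[c] + bwt.count(c, 0, lo)
            new_hi = first[c] + bwt.count(c, 0, hi)
            extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(pattern) - 1, 0, len(bwt), max_mismatches)
    return sorted(hits)
```

With max_mismatches set to 0 this reduces to ordinary backward search; each extra allowed mismatch multiplies the number of branches explored, which is why real mappers add pruning on top of this basic scheme.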

Finally, I ran some preliminary benchmarks between the new saligner and bowtie. The following were run on NC_010473.fasta, minus the 'R' and 'Y' characters in the genome, using the e_coli_2622382_1.fq readset. Tests were done on skorpios in /var/tmp/, using 64-bit versions of both aligners:

| Aligner | Match Type | Flags | Memory Usage (B; rough, within a few 1000 B) | Time (s) | Notes |
| bowtie | exact | -S -v 0 | 6,800,000 | 28.01 | |
| saligner | exact | N/A | 6,500,000 | 18.24 | locate sr = 16, StaticBitSequence sr = 512, StaticDNASequence sr = 128 |
| saligner | exact | N/A | 5,300,000 | 19.07 | locate sr = 32, StaticBitSequence sr = 512, StaticDNASequence sr = 128 |

Note that the times might not be directly comparable, because I used the time command for bowtie but the built-in timing function for saligner, since I haven't gotten the saving/loading integrated yet. Also, I didn't use valgrind --tool=massif to profile saligner, because there are some full-text references being kept in memory somewhere, which is really raising the memory usage (I'll have to find and clear those later). The memory reported above is from the FM Index's FMIndex::MemoryUsage() function, which only reports the theoretical usage, and I kept the memory usage for saligner on the low side, to account for any extra memory that may be used for other things.

To do:

  • Reorganize directory structure
  • Fix a bug in the FM Index/StaticDNASequence, where Rank0 gives the wrong output in blocks with '$'
  • Implement a jump table within FM Index
  • Maybe implement multiple sequence support (Chris will get back to me on this)
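
For the jump table item, one possible shape is sketched below in Python. It assumes "jump table" means a precomputed lookup from every k-mer to its suffix array range, so that backward search can skip the last k characters of a read; the journal does not spell out the design, so this is an assumption, and build_index, backward_search, build_jump_table and search_with_jump_table are hypothetical names.

```python
from itertools import product


def build_index(text):
    """Naive toy index: suffix array, BWT, and first row per character."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    first = {}
    for row, i in enumerate(sa):
        first.setdefault(text[i], row)
    return sa, bwt, first


def backward_search(pattern, bwt, first, lo=0, hi=None):
    """Extend the range [lo, hi) leftwards by pattern; (0, 0) means no match."""
    if hi is None:
        hi = len(bwt)
    for c in reversed(pattern):
        if lo >= hi or c not in first:
            return 0, 0
        lo, hi = first[c] + bwt.count(c, 0, lo), first[c] + bwt.count(c, 0, hi)
    return (lo, hi) if lo < hi else (0, 0)


def build_jump_table(bwt, first, k):
    """Precompute the suffix array range of every k-mer that occurs."""
    table = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        lo, hi = backward_search(kmer, bwt, first)
        if lo < hi:
            table[kmer] = (lo, hi)
    return table


def search_with_jump_table(pattern, bwt, first, table, k):
    """Exact search that replaces the first k backward steps with one lookup."""
    if len(pattern) < k:
        return backward_search(pattern, bwt, first)
    lo, hi = table.get(pattern[-k:], (0, 0))
    return backward_search(pattern[:-k], bwt, first, lo, hi)
```

The table has at most 4^k entries, so k trades a fixed amount of memory for k skipped rank steps per read.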
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"
 