Meeting with Chris, some things to do and their status:
Besides the meeting, also did a lot of housekeeping for the project, and setup for Jay
Up next:
Integrating the readaligner was a major headache, so rather than struggling to adapt it to our current structure, we decided to give a major overhaul of directory organization:
include
into lib
(separating them was causing headaches when integrating external libraries)
include
directory in the build directory, containing public interfaces via header files
With this reorganization, the readaligner library at least compiles... but I have to yet to actually call functions from it. Thus, up next:
Integrated readaligner, starting to construct a minimal aligner using classes from the software:
Next up:
Still attempting to unabstract the index:
TextCollection
as an index
FMIndex
directly, but it directly uses another sublibrary, and the compiler crashed when I tried, so I still to accessing it indirectly via TextCollection
locate
or count
functions LF
function and GetPositions
(which is not locate... unfortunately) and a few stacks, which I assume may corresponding to something similar to ep
and sp
FMIndex.cpp
that doesn't seem to be used in the alignments (as commenting it out doesn't break anything) count
I've seen in other indices
Next up then:
Search
function to create a locate
function, and integrate it into my existing interface
Integrated the FMIndex
from readaligner:
FMIndex
class directly, I decided to use it via TextCollection
& TextCollectionBuilder
FMIndex
or TextCollection
had standard locate
or count
functions, so I had to go in and piece together a few functions to get what I believe is a working locate
NGS_FMIndex
class that inherits my own IndexInterface
on top of the slightly modified FMIndex
bacterial_aligner
Extract
, so an original copy of the genome no longers needs to be maintained extract
methods
name
, to retrieve the reference name of the indexed genome
I have also taken some steps in reducing the sheer length of bacterial_aligner
:
FastaFile
class
SamEntry
creation methods into the SamEntry
class as generator methods
I also figured out some errors that KmerIndex
can run into, and added somewhat informative error messages if anyone ever hits them.
I believe that I should finally return to tackling the "inexact match" function:
locate
function
Query
classes use to achieve the same effect
Rather than building an inexact aligner from scratch, I've taken the suggestion of Chris, and started to directly integrate readaligner's Query
classes
maligner
which used readaligner
classes directly
maligner
, I'm now attempting to reproduce the desired inexact matching produced in readaligner
maligner
unable to do allow any mismatches
maligner
on readaligner
with correct output
Up next:
maligner
that prevents it from allowing inexact matches
Debugged maligner, traced readaligner, ready to fully integrate Query
classes to handle mismatches:
couts
throughout the Query
classes and made some sense of the flow of control Up next then:
Query
classes to work with existing classes and functions
Integrated the MismatchQuery
class:
LocateWithMismatches
function
LocateWithMismatches
takes in a string sequence, an integer value (say k) and returns all positions that the sequence is found in the index with at most k mismatches k
values would make the run time unbearable
Asking Chris what to tackle next... although I feel that a rather horrific organization may be required some time...
Did some rudimentary performance testing and comparisons:
k
):
k | unmapped | time (s) |
---|---|---|
0 | 17660 | 15.7 |
1 | 7886 | 18.6 |
2 | 5449 | 21.4 |
3 | 4468 | 31.7 |
4 | 3974 | 178.3 |
name | time (s) | notes |
---|---|---|
baligner | 20.9 | k = 2 , 5449 mismatches (no indels) |
readaligner | 9.7 | k = 2 (no indels) |
bwa | 14.4 | k = 2 , also handles indels |
bowtie | 4.1 | unknown k , 6644 mismatches, assumed not to handle indels |
I should also note that I've fixed the SAM output to be tview-friendly (I was missing a header).
From these results, much can be improved upon, so up next:
Starting integrating the IndelQuery
from Makinen et al's readaligner:
readaligner
runs k = 0
, the speed is considerably slower compared to MismatchQuery
k
larger than 0...
readaligner
MismatchQuery
as a template, I integrated IndelQuery
LocateWithGaps
function in NGS_FMIndex
Next up then:
Integrated indels into the bacterial aligner:
Up next:
Meeting with Anne and debriefed Jay about his upcoming task:
baligner
baligner
After this, we will direct our efforts to optimization, and only after do we consider RNA-Seq specific strategies, and finally colourspace alignments.
Begun to refactor baligner
:
pairend
resolution functions into a new MatchMaker
class
ticgits
Up next:
mapping
functions from baligner
Pulled out all non-main() functions out of baligner
:
mapper
directory in lib/ngsa
that contains all the mapper functions previously in baligner
Next up:
baligner
Refactored out large chunks of baligner
logic into driver
classes:
Next up:
Tried a few things before the cold settled in and knocked me out...
valgrind
running out leros
via ssh
(it doesn't work on cygwin)
KCacheGrind
on leros
cygwin
(it's supposedly possible...)
Next up:
ticgit
...
Cold improved, but I believe the CS server went down, taking out my email, the git repo, and the wiki for a good while...
KCacheGrind
on my windows via this projectcallgrind
... everything but the Call Graph works...
Next up:
ticgit
... hopefully when the repo connects again and the tickets sync up...
See Jay's Journal...
Besides that, I have been terribly unproductive lately, by a combination of this cold and being somewhat lost as what to do... Seeing as how Jay's been hard at work with low level implementation specific optimizations, I guess I should tackle some higher level algorithmic optimizations.
Up next:
Wanted to implement a true extract()
function:
PizzaChili
indexes...
C
array)
C
(that keeping an odd cumulative count...) and L
(L for Last... which is really the BWT of the text) array, best described in in section 4.1 of Navarro's review (although this is done in the context of a suffix array... not BWT)
LF()
was, and exactly how it does this with the C
array and Occ
(whose details I don't quite know... but I know it has to do with rank... so I'll probably have to know one day if I'm replacing a rank structure)
LF()
, BWT
, and C
, I've realized that extract()
is not all that straight forward... extract
was implemented in the Alphabet Friendly FM Index extract
of the original FM Index... in hopes that these elusive functions are unique to the Alphabet Friendly FM Index...
Next up:
extract()
function... after which I hope to do more with it.
skorpios
(all time data is taken from program output...): NC_010473.fasta
e_coli_125000_1.fq
(unmodified)
--time -S
(to diplay time and output in SAM)
./bowtie --version
):
64-bit Built on skorpios Mon May 31 14:32:21 PDT 2010 Compiler: gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux) Options: -O3 Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
Mode | Flags | Mapped | Unmapped | Search Time | Average Search Time![]() |
|
---|---|---|---|---|---|---|
0-Mismatch | -v 0 |
105606 | 19394 | 19s | 0.000152s | |
0-Mismatch | -v 0 |
2250708 | 371674 | 406s | 0.000155s | with 2622382 set |
1-Mismatch | -v 1 |
115631 | 9369 | 20s | 0.00016s | |
2/3-Mismatch | -v 2 |
118356 | 6644 | 20s | 0.00016s | |
2/3 Mismatch | -v 3 |
119430 | 5570 | 21s | 0.000168s | |
Seeded Quality | -n 0 |
112317 | 12683 | 20s | 0.00016s | |
Seeded Quality | -n 1 |
117659 | 7341 | 20s | 0.00016s | |
Seeded Quality | -n 2 |
118356 | 6644 | 20s | 0.00016s | |
Seeded Quality | -n 3 |
118330 | 6670 | 21s | 0.000168s | |
Seeded Quality | -n 3 |
2488915 | 133467 | 422s | 0.000161s | with 2622382 set |
Some notes:
bowtie
doesn't support gapped alignments
bowtie
doesn't supported any value greater than 3 (i.e. -v 4
and -n 4
are not implemented)
KmerMapper
), as they are not quality-wary, and align the entire read
KmerMapper
, as it first establishes a list of seeds, then aligns using these seeds
For a quick comparison, saligner
(commit 5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa
), using MismatchMapper
with various k
settings on skorpios
with the same datset (Invalids
are reads with N
):
k | Mapped | Unmapped + Invalid | Search Time | Average Search Time |
---|---|---|---|---|
0 | 105606 | 17732 + 1662 | 5.22s | 0.00004176s |
1 | 115424 | 7914 + 1662 | 6.21s | 0.00004968s |
2 | 117872 | 5466 + 1662 | 8.31s | 0.00006648s |
3 | 118854 | 4484 + 1662 | 15.06s | 0.00006608s |
4 | 119349 | 3989 + 1662 | 116.61s | 0.00093288s |
One potential reason that we currently outperform bowtie
so much could be the report
setting. In the past, I set report
to 10, and Jay has recently dropped it to 1. By dropping it to one, we no longer detect multimaps, but I believe this is a fair comparison with bowtie, which also doesn't report multimaps.
However, bwa
does multimaps, gaps and mismatches, simultaneously, so we'll have to adjust our program to run on equal ground when doing those comparisons.
One odd note... after it prints all the information, the program won't actually terminate until about 20 seconds after on skorpios... so there is a large constant (about 20 seconds) discrepency between actual runtime and reported runtime. For example, with k = 2
:
real 0m26.654s user 0m7.940s sys 0m0.360s
...And for k = 4
real 2m14.341s user 1m54.987s sys 0m0.504s
Is this the problem with the program? Or something unique to running programs on skorpios
via SSH? It doesn't seem to occur when I run saligner
locally (which takes 15.61 seconds).
real 0m16.159s user 0m15.077s sys 0m0.656s
By the timing of the pause, I would suspect that it's the destructors... but that doesn't make much sense either.
Finally, all my builds and test files should be openly accessible in /ubc/cs/research/condon/people/jujubix/
in case anyone wants to double check