05/03/10
Meeting with Chris, some things to do and their status:
- vectorized SW (assigned to Jay)
- readaligner index integration (assigned to Daniel)
- substitute rank structure
- integrate query classes (strategies)
- memory-time trade off with pre-built kmer index
- cmake refactoring
- RNA-Seq (later)
- splice graph
- determine input format and minimum required
- output of spliced reads in SAM (check TopHat)
- colourspace (much later)
- colour <--> letter space converter
- git (done)
- tools directory re-organization (done)
- each tool in separate directory
- refactoring (to do...)
Besides the meeting, also did a lot of housekeeping for the project, and setup for Jay
- cleaned repo (merge all branches to dev, and delete everything else)
- setup a working version of ticgit (well... Chris did this, but had some troubles getting it to work on my end too)
- lots of small guides for wiki
Up next:
05/04/10
Integrating the readaligner was a major headache, so rather than struggling to adapt it to our current structure, we decided to give a major overhaul of directory organization:
- merged
include
into lib
(separating them was causing headaches when integrating external libraries)
- lots of new cmake code to:
- generate an
include
directory in the build directory, containing public interfaces via header files
- copy data files automatically for tests and tools
- discovered that my gcc-3.4.4 was extremely "robust" (well... sloppy is more like it) compared to the gcc-4.3.4 that the lab was using
- this may explain why Chris always had errors while my compiler was happy with the sloppy code
- I've also discovered that the newer compiler does a better job with optimization, cutting run time to a third
With this reorganization, the readaligner library at least compiles... but I have to yet to actually call functions from it. Thus, up next:
- integrating readaligner into the aligner
05/05/10
Integrated readaligner, starting to construct a minimal aligner using classes from the software:
- The basic flow:
- Obtain reference as string
- Build forward index with string
- Reverse string
- Build reverse index
- Initialize output writer
- Initialize query (aligner)
- Feed reads into query
- The query initialization takes the outer writer as an argument
- Thus, the actual output operation is abstracted away complete
- This is slightly problematic, as the SAM output they currently use isn't quite complete
- I also don't quite understand it handles mismatches
- The original executable handles mismatches
- I have tried repeatedly and failed to reproduce this, even after copying and pasting blocks of code...
Next up:
- Reproduce mismatches with the minimal aligner
- Unabstract the index, and the query...
- Then integrate back into bacterial aligner
05/06/10
Still attempting to unabstract the index:
- I have failed and given up on reproducing mismatches
- Instead I have turned to unabstracting the the index
- Instead of relying on the Query classes to output SAM entries, I completely ignoring them and using
TextCollection
as an index
- I tried using
FMIndex
directly, but it directly uses another sublibrary, and the compiler crashed when I tried, so I still to accessing it indirectly via TextCollection
- Unlike the classes in PizzaChili, I cannot seem to find any clear
locate
or count
functions
- Instead, the query relies purely on using the
LF
function and GetPositions
(which is not locate... unfortunately) and a few stacks, which I assume may corresponding to something similar to ep
and sp
- I have found a private function in
FMIndex.cpp
that doesn't seem to be used in the alignments (as commenting it out doesn't break anything)
- Oddly, this function looks almost identical to
count
I've seen in other indices
Next up then:
- Modifying this
Search
function to create a locate
function, and integrate it into my existing interface
- Possibly adapt the Query classes to handle inexact alignments
05/07/10
Integrated the
FMIndex
from readaligner:
- After discovering it was more trouble than it was worth to use the unabstracted
FMIndex
class directly, I decided to use it via TextCollection
& TextCollectionBuilder
- However, neither
FMIndex
or TextCollection
had standard locate
or count
functions, so I had to go in and piece together a few functions to get what I believe is a working locate
- I have thus built a
NGS_FMIndex
class that inherits my own IndexInterface
on top of the slightly modified FMIndex
- As before, I can now use it interchangeably with other indices in my
bacterial_aligner
- I've also made is indices now support two new functions
-
Extract
, so an original copy of the genome no longers needs to be maintained
- Currently, I simply keep a copy of the original genome inside the index object, but later this may use true succinct index
extract
methods
-
name
, to retrieve the reference name of the indexed genome
- Saves the trouble of passing this value separately
- Assumes that a single index has a single reference_name... this may not be true later
I have also taken some steps in reducing the sheer length of
bacterial_aligner
:
- Pulled out some reference reading methods into a new
FastaFile
class
- Pulled out some
SamEntry
creation methods into the SamEntry
class as generator methods
I also figured out some errors that
KmerIndex
can run into, and added somewhat informative error messages if anyone ever hits them.
I believe that I should finally return to tackling the "inexact match" function:
- Either write a variation of the
locate
function
- ...or figure out the ingenuity behind the huge branching looping pathway that the
Query
classes use to achieve the same effect
05/10/10
Rather than building an inexact aligner from scratch, I've taken the suggestion of Chris, and started to directly integrate readaligner's
Query
classes
- I've recovered an older version of
maligner
which used readaligner
classes directly
- With this
maligner
, I'm now attempting to reproduce the desired inexact matching produced in readaligner
- I assume there are currently some bugs that make
maligner
unable to do allow any mismatches
- I have confirmed that bugs both exist in the building of the index and the actual alignment using the index
- The former was confirmed and resolved, by running indices built in
maligner
on readaligner
with correct output
- The latter is to come
Up next:
- Find the bug in the alignment portion of
maligner
that prevents it from allowing inexact matches
05/11/10
Debugged maligner, traced readaligner, ready to fully integrate
Query
classes to handle mismatches:
- The "bug" was really an initialization problem, and solved by initializing a certain "limit" to a very high number (so the recursion wouldn't terminate prematurely)
- I have placed
couts
throughout the Query
classes and made some sense of the flow of control
- The simplest perfect match case seems simple enough
- I haven't quite understood yet how mismatches are allows (how they're disallowed is simple though)
- Nonetheless, I have determined the exact lines responsible for the final output
- While I'd love to simply have it return a usable object/value, I cannot for a few reasons
- It's not the last step of the function calls
- It's in the last function on a very large stack of function calls, so returning would involve changing a lot of signatures to propagate the value
Up next then:
- Modify
Query
classes to work with existing classes and functions
- I plan to introduce a "buffer" class member that will be written to at the aforementioned "output line"
- After the true final call, I just have to return the value in this buffer
- In theory then, I only have to modify the output line, and the signature and return line of the first function
05/12/10
Integrated the
MismatchQuery
class:
- I overrode the "output" line in the original class, and replaced it with a line that fills a vector
- A separate call to the object returns the now filled vector of positions
- These two steps are encapsulated in the
LocateWithMismatches
function
-
LocateWithMismatches
takes in a string sequence, an integer value (say k) and returns all positions that the sequence is found in the index with at most k mismatches
- This also means that setting k = 1 will return all perfect matches (k = 0) + all matches with 1 mismatch
- You can imagine how this gets slightly out of hand with higher mismatch values
- Before that even occurs though, the sheer amount of recursion required by high
k
values would make the run time unbearable
Asking Chris what to tackle next... although I feel that a rather horrific organization may be required some time...
05/13/10
Did some rudimentary performance testing and comparisons:
- The data:
- the first 125000 reads extracted from s_1_2_sequence.txt obtained from here
- The E.coli genome provided from the same link
- The results (index building time is excluded):
- Altering the allowed mismatches (
k
):
-
- Compared with other aligners on the same dataset (125000 reads)
name |
time (s) |
notes |
baligner |
20.9 |
k = 2 , 5449 mismatches (no indels) |
readaligner |
9.7 |
k = 2 (no indels) |
bwa |
14.4 |
k = 2 , also handles indels |
bowtie |
4.1 |
unknown k , 6644 mismatches, assumed not to handle indels |
I should also note that I've fixed the SAM output to be tview-friendly (I was missing a header).
From these results, much can be improved upon, so up next:
- Add indel support
- Find bottlenecks and optimize...?
05/14/10
Starting integrating the
IndelQuery
from Makinen et al's readaligner:
- Ran some gapped
readaligner
runs
- Even at
k = 0
, the speed is considerably slower compared to MismatchQuery
- The speed becomes quite unbearable at any
k
larger than 0...
- Mismatches and gaps are currently mutually exclusive in
readaligner
- Using the integrated
MismatchQuery
as a template, I integrated IndelQuery
- I encapsulated it all in a
LocateWithGaps
function in NGS_FMIndex
- It compiled fine...
- But it didn't work... (it returned nothing...)
Next up then:
- Figuring out why indels aren't working.
05/17/10
Integrated indels into the bacterial aligner:
- The error was previously caused by an uninitialized matrix (oddly, it works in Makinen's readaligner fine... with or without the initialization call)
- Currently, the basic flow is as follows:
- Aligned with up to 2 mismatches
- If missed, attempt a gapped locate with up to 2 gaps
- This slightly change has increased the run time to 2 minutes (from the previous 20 seconds)...
Up next:
- Abstracting everything but the main function out of bacterial aligner
- Some small tasks I've ticketed on ticgit (Jay could do these... actually...)
- Determining bottlenecks in bacterial aligner
05/18/10
Meeting with Anne and debriefed Jay about his upcoming task:
- Jay will use Valgrind
to profile the baligner
- Subsequent data will be used for optimization to reduce run time
- I will refactor
baligner
- Will reduce code bulk
- Will comment and clarify existing classes
After this, we will direct our efforts to optimization, and only after do we consider RNA-Seq specific strategies, and finally colourspace alignments.
05/19/10
Begun to refactor
baligner
:
- Pulled out
pairend
resolution functions into a new MatchMaker
class
- Logged a lot of small errors and improvements as
ticgits
Up next:
- Pulling out the
mapping
functions from baligner
- Adding to and resolving the current list of bugs/improvements
05/20/10
Pulled out all non-main() functions out of
baligner
:
- Created a new
mapper
directory in lib/ngsa
that contains all the mapper functions previously in baligner
Next up:
- Pulling out more stuff from
baligner
- After the above step, build minimal mode-specific aligners
- e.g. single end, pair end
- The goal is to be able to build an aligner in 20 lines
05/21/10
Refactored out large chunks of
baligner
logic into
driver
classes:
- You can now literally create an aligner with less than 20 lines of code (excluding import and cout statements)
Next up:
- Debug and document
- Profile and optimize
05/24/10
Tried a few things before the cold settled in and knocked me out...
- Got
valgrind
running out leros
via ssh
(it doesn't work on cygwin)
- Couldn't install nor run
KCacheGrind
on leros
- Attempted and failed to get it running on
cygwin
(it's supposedly possible...)
Next up:
- Tackle tickets left on
ticgit
...
05/25/10
Cold improved, but I believe the CS server went down, taking out my email, the git repo, and the wiki for a good while...
- Managed to install
KCacheGrind
on my windows via this project
- Generated and displayed a file from
callgrind
... everything but the Call Graph works...
Next up:
- Tackle tickets left on
ticgit
... hopefully when the repo connects again and the tickets sync up...
05/26/10
See
Jay's Journal
...
- Discussion about class hierarchy
Besides that, I have been terribly unproductive lately, by a combination of this cold and being somewhat lost as what to do... Seeing as how Jay's been hard at work with low level implementation specific optimizations, I guess I should tackle some higher level algorithmic optimizations.
Up next:
- Forget about those darn ticgits... I've been telling myself to do them for days to no avail...
- Create an uncompressing extract() function, then compare speed, and consider if keeping an uncompressed copy of the genome is worth the speed-memory trade off
- Confirm the speed differences between our implementation and bowtie, then see what they're doing that makes all the difference
- Kickstart locate... first by avoiding a number of bases to simulate the effect
- If a dramatic improvement is seen, implement the pre-computed kmer table
- Try a difference rank structure
- To accommodate almost all of the above steps, buckle down, stare at some code, and read up on full-succinct indices...
05/27/10
Wanted to implement a true
extract()
function:
- Stared at code from
PizzaChili
indexes...
- Couldn't understand any of the notations (such as the
C
array)
- Started reading various papers
- Started with this
paper's relatively simple description of the FMIndex in Section 2, leading me to the need to understand suffix arrays and BWT... which are highly related... in an odd twisted backwards kind of way I'm trying to wrap my head around
- Wikipedia's entries on suffix arrays
and BWT
were a somewhat lacking in depth, but served as a good entry
- This
site provided a more pictorial intro to suffix arrays, and its searching algorithm (defined best in Fig. 4 of Navarro's review
... supplemented by notation specified in the original paper
, about the greater than symbols between strings in section 2)
- Next was to understand the backward search in a search using the suffix array with
C
(that keeping an odd cumulative count...) and L
(L for Last... which is really the BWT of the text) array, best described in in section 4.1 of Navarro's review (although this is done in the context of a suffix array... not BWT)
- Finally understood what
LF()
was, and exactly how it does this with the C
array and Occ
(whose details I don't quite know... but I know it has to do with rank... so I'll probably have to know one day if I'm replacing a rank structure)
- Unfortunately, armed with new knowledge of
LF()
, BWT
, and C
, I've realized that extract()
is not all that straight forward...
- When in doubt, copy others... so I returned to check out how
extract
was implemented in the Alphabet Friendly FM Index
- It uses a few functions that still elude me... so more reading may be in order.
- Alternatively, I'll also check out the
extract
of the original FM Index... in hopes that these elusive functions are unique to the Alphabet Friendly FM Index...
Next up:
- Continue reading papers and code, until I understand the FM Index enough to make a simple
extract()
function... after which I hope to do more with it.
05/31/10
- Finished up reading on FM Indexes, mainly section two of Ferragina and Manzini's original paper
- I will ignore the other sections of the paper for now... since they talk about compression, and make little sense to me...
- Start to check out the code in bowtie, which lead me to discover 4 distinct modes of alignment, and some performance testing done on
skorpios
(all time data is taken from program output...):
- Reference:
NC_010473.fasta
- Read:
e_coli_125000_1.fq
(unmodified)
- Default flags:
--time -S
(to diplay time and output in SAM)
- Build (displayed via
./bowtie --version
):
64-bit
Built on skorpios
Mon May 31 14:32:21 PDT 2010
Compiler: gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux)
Options: -O3
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
Mode |
Flags |
Mapped |
Unmapped |
Search Time |
Average Search Time |
0-Mismatch |
-v 0 |
105606 |
19394 |
19s |
0.000152s |
0-Mismatch |
-v 0 |
2250708 |
371674 |
406s |
0.000155s |
with 2622382 set |
1-Mismatch |
-v 1 |
115631 |
9369 |
20s |
0.00016s |
2/3-Mismatch |
-v 2 |
118356 |
6644 |
20s |
0.00016s |
2/3 Mismatch |
-v 3 |
119430 |
5570 |
21s |
0.000168s |
Seeded Quality |
-n 0 |
112317 |
12683 |
20s |
0.00016s |
Seeded Quality |
-n 1 |
117659 |
7341 |
20s |
0.00016s |
Seeded Quality |
-n 2 |
118356 |
6644 |
20s |
0.00016s |
Seeded Quality |
-n 3 |
118330 |
6670 |
21s |
0.000168s |
Seeded Quality |
-n 3 |
2488915 |
133467 |
422s |
0.000161s |
with 2622382 set |
Some notes:
-
bowtie
doesn't support gapped alignments
-
bowtie
doesn't supported any value greater than 3 (i.e. -v 4
and -n 4
are not implemented)
- Mismatch modes are similar to FM-based mappers (i.e. everything but
KmerMapper
), as they are not quality-wary, and align the entire read
- Seeded Quality mode, (aka Maq-like mode) is somewhat similar to our
KmerMapper
, as it first establishes a list of seeds, then aligns using these seeds
- I find it quite odd how the time is nearly constant... maybe a larger dataset will shed some new light (UPDATE: ran a few trials with the large dataset, same average times)
For a quick comparison,
saligner
(commit
5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa
), using
MismatchMapper
with various
k
settings on
skorpios
with the same datset (
Invalids
are reads with
N
):
One potential reason that we currently outperform
bowtie
so much could be the
report
setting. In the past, I set
report
to 10, and Jay has recently dropped it to 1. By dropping it to one, we no longer detect multimaps, but I believe this is a fair comparison with bowtie, which also doesn't report multimaps.
However,
bwa
does multimaps, gaps and mismatches, simultaneously, so we'll have to adjust our program to run on equal ground when doing those comparisons.
One odd note... after it prints all the information, the program won't actually terminate until about 20 seconds after on skorpios... so there is a large constant (about 20 seconds) discrepency between actual runtime and reported runtime. For example, with
k = 2
:
real 0m26.654s
user 0m7.940s
sys 0m0.360s
...And for
k = 4
real 2m14.341s
user 1m54.987s
sys 0m0.504s
Is this the problem with the program? Or something unique to running programs on
skorpios
via SSH? It doesn't seem to occur when I run
saligner
locally (which takes 15.61 seconds).
real 0m16.159s
user 0m15.077s
sys 0m0.656s
By the timing of the pause, I would suspect that it's the destructors... but that doesn't make much sense either.
Finally, all my builds and test files should be openly accessible in
/ubc/cs/research/condon/people/jujubix/
in case anyone wants to double check