DanielMayArchive < BETA

Tags: view all tags

05/03/10

Meeting with Chris, some things to do and their status:

vectorized SW (assigned to Jay)
- testing and integration
readaligner index integration (assigned to Daniel)
- substitute rank structure
- integrate query classes (strategies)
- memory-time trade off with pre-built kmer index
- cmake refactoring
RNA-Seq (later)
- splice graph
  - determine input format and minimum required
  - output of spliced reads in SAM (check TopHat)
colourspace (much later)
- colour <--> letter space converter

git (done)
- reorganize repo
tools directory re-organization (done)
- each tool in separate directory
refactoring (to do...)

Besides the meeting, also did a lot of housekeeping for the project, and setup for Jay

cleaned repo (merge all branches to dev, and delete everything else)
setup a working version of ticgit (well... Chris did this, but had some troubles getting it to work on my end too)
lots of small guides for wiki

Up next:

integrating readaligner

05/04/10

Integrating the readaligner was a major headache, so rather than struggling to adapt it to our current structure, we decided to give a major overhaul of directory organization:

merged include into lib (separating them was causing headaches when integrating external libraries)
lots of new cmake code to:
- generate an include directory in the build directory, containing public interfaces via header files
- copy data files automatically for tests and tools
discovered that my gcc-3.4.4 was extremely "robust" (well... sloppy is more like it) compared to the gcc-4.3.4 that the lab was using
- this may explain why Chris always had errors while my compiler was happy with the sloppy code
- I've also discovered that the newer compiler does a better job with optimization, cutting run time to a third

With this reorganization, the readaligner library at least compiles... but I have to yet to actually call functions from it. Thus, up next:

integrating readaligner into the aligner

05/05/10

Integrated readaligner, starting to construct a minimal aligner using classes from the software:

The basic flow:
- Obtain reference as string
- Build forward index with string
- Reverse string
- Build reverse index
- Initialize output writer
- Initialize query (aligner)
- Feed reads into query
The query initialization takes the outer writer as an argument
- Thus, the actual output operation is abstracted away complete
- This is slightly problematic, as the SAM output they currently use isn't quite complete
I also don't quite understand it handles mismatches
- The original executable handles mismatches
- I have tried repeatedly and failed to reproduce this, even after copying and pasting blocks of code...

Next up:

Reproduce mismatches with the minimal aligner
Unabstract the index, and the query...
Then integrate back into bacterial aligner

05/06/10

Still attempting to unabstract the index:

I have failed and given up on reproducing mismatches
Instead I have turned to unabstracting the the index
- Instead of relying on the Query classes to output SAM entries, I completely ignoring them and using TextCollection as an index
- I tried using FMIndex directly, but it directly uses another sublibrary, and the compiler crashed when I tried, so I still to accessing it indirectly via TextCollection
- Unlike the classes in PizzaChili, I cannot seem to find any clear locate or count functions
  - Instead, the query relies purely on using the LF function and GetPositions (which is not locate... unfortunately) and a few stacks, which I assume may corresponding to something similar to ep and sp
- I have found a private function in FMIndex.cpp that doesn't seem to be used in the alignments (as commenting it out doesn't break anything)
  - Oddly, this function looks almost identical to count I've seen in other indices

Next up then:

Modifying this Search function to create a locate function, and integrate it into my existing interface
Possibly adapt the Query classes to handle inexact alignments

05/07/10

Integrated the FMIndex from readaligner:

After discovering it was more trouble than it was worth to use the unabstracted FMIndex class directly, I decided to use it via TextCollection & TextCollectionBuilder
However, neither FMIndex or TextCollection had standard locate or count functions, so I had to go in and piece together a few functions to get what I believe is a working locate
I have thus built a NGS_FMIndex class that inherits my own IndexInterface on top of the slightly modified FMIndex
- As before, I can now use it interchangeably with other indices in my bacterial_aligner
I've also made is indices now support two new functions
- Extract, so an original copy of the genome no longers needs to be maintained
  - Currently, I simply keep a copy of the original genome inside the index object, but later this may use true succinct index extract methods
- name, to retrieve the reference name of the indexed genome
  - Saves the trouble of passing this value separately
  - Assumes that a single index has a single reference_name... this may not be true later

I have also taken some steps in reducing the sheer length of bacterial_aligner:

Pulled out some reference reading methods into a new FastaFile class
Pulled out some SamEntry creation methods into the SamEntry class as generator methods

I also figured out some errors that KmerIndex can run into, and added somewhat informative error messages if anyone ever hits them.

I believe that I should finally return to tackling the "inexact match" function:

Either write a variation of the locate function
...or figure out the ingenuity behind the huge branching looping pathway that the Query classes use to achieve the same effect

05/10/10

Rather than building an inexact aligner from scratch, I've taken the suggestion of Chris, and started to directly integrate readaligner's Query classes

I've recovered an older version of maligner which used readaligner classes directly
With this maligner, I'm now attempting to reproduce the desired inexact matching produced in readaligner
- I assume there are currently some bugs that make maligner unable to do allow any mismatches
I have confirmed that bugs both exist in the building of the index and the actual alignment using the index
- The former was confirmed and resolved, by running indices built in maligner on readaligner with correct output
- The latter is to come

Up next:

Find the bug in the alignment portion of maligner that prevents it from allowing inexact matches

05/11/10

Debugged maligner, traced readaligner, ready to fully integrate Query classes to handle mismatches:

The "bug" was really an initialization problem, and solved by initializing a certain "limit" to a very high number (so the recursion wouldn't terminate prematurely)
I have placed couts throughout the Query classes and made some sense of the flow of control
- The simplest perfect match case seems simple enough
- I haven't quite understood yet how mismatches are allows (how they're disallowed is simple though)
Nonetheless, I have determined the exact lines responsible for the final output
- While I'd love to simply have it return a usable object/value, I cannot for a few reasons
  - It's not the last step of the function calls
  - It's in the last function on a very large stack of function calls, so returning would involve changing a lot of signatures to propagate the value

Up next then:

Modify Query classes to work with existing classes and functions
- I plan to introduce a "buffer" class member that will be written to at the aforementioned "output line"
- After the true final call, I just have to return the value in this buffer
- In theory then, I only have to modify the output line, and the signature and return line of the first function

05/12/10

Integrated the MismatchQuery class:

I overrode the "output" line in the original class, and replaced it with a line that fills a vector
- A separate call to the object returns the now filled vector of positions
- These two steps are encapsulated in the LocateWithMismatches function
LocateWithMismatches takes in a string sequence, an integer value (say k) and returns all positions that the sequence is found in the index with at most k mismatches
- This also means that setting k = 1 will return all perfect matches (k = 0) + all matches with 1 mismatch
- You can imagine how this gets slightly out of hand with higher mismatch values
  - Before that even occurs though, the sheer amount of recursion required by high k values would make the run time unbearable

Asking Chris what to tackle next... although I feel that a rather horrific organization may be required some time...

05/13/10

Did some rudimentary performance testing and comparisons:

The data:
- the first 125000 reads extracted from s_1_2_sequence.txt obtained from here
- The E.coli genome provided from the same link
The results (index building time is excluded):
- Altering the allowed mismatches (k):

k	unmapped	time (s)
0	17660	15.7
1	7886	18.6
2	5449	21.4
3	4468	31.7
4	3974	178.3

- Compared with other aligners on the same dataset (125000 reads)

name	time (s)	notes
baligner	20.9	`k = 2`, 5449 mismatches (no indels)
readaligner	9.7	`k = 2` (no indels)
bwa	14.4	`k = 2`, also handles indels
bowtie	4.1	unknown `k`, 6644 mismatches, assumed not to handle indels

I should also note that I've fixed the SAM output to be tview-friendly (I was missing a header).

From these results, much can be improved upon, so up next:

Add indel support
Find bottlenecks and optimize...?

05/14/10

Starting integrating the IndelQuery from Makinen et al's readaligner:

Ran some gapped readaligner runs
- Even at k = 0, the speed is considerably slower compared to MismatchQuery
- The speed becomes quite unbearable at any k larger than 0...
- Mismatches and gaps are currently mutually exclusive in readaligner
Using the integrated MismatchQuery as a template, I integrated IndelQuery
- I encapsulated it all in a LocateWithGaps function in NGS_FMIndex
- It compiled fine...
- But it didn't work... (it returned nothing...)

Next up then:

Figuring out why indels aren't working.

05/17/10

Integrated indels into the bacterial aligner:

The error was previously caused by an uninitialized matrix (oddly, it works in Makinen's readaligner fine... with or without the initialization call)
Currently, the basic flow is as follows:
- Aligned with up to 2 mismatches
  - If missed, attempt a gapped locate with up to 2 gaps
This slightly change has increased the run time to 2 minutes (from the previous 20 seconds)...

Up next:

Abstracting everything but the main function out of bacterial aligner
Some small tasks I've ticketed on ticgit (Jay could do these... actually...)
Determining bottlenecks in bacterial aligner

05/18/10

Meeting with Anne and debriefed Jay about his upcoming task:

Jay will use Valgrind to profile the baligner
- Subsequent data will be used for optimization to reduce run time
I will refactor baligner
- Will reduce code bulk
- Will comment and clarify existing classes

After this, we will direct our efforts to optimization, and only after do we consider RNA-Seq specific strategies, and finally colourspace alignments.

05/19/10

Begun to refactor baligner:

Pulled out pairend resolution functions into a new MatchMaker class
Logged a lot of small errors and improvements as ticgits

Up next:

Pulling out the mapping functions from baligner
Adding to and resolving the current list of bugs/improvements

05/20/10

Pulled out all non-main() functions out of baligner:

Created a new mapper directory in lib/ngsa that contains all the mapper functions previously in baligner

Next up:

Pulling out more stuff from baligner
After the above step, build minimal mode-specific aligners
- e.g. single end, pair end
- The goal is to be able to build an aligner in 20 lines

05/21/10

Refactored out large chunks of baligner logic into driver classes:

You can now literally create an aligner with less than 20 lines of code (excluding import and cout statements)

Next up:

Debug and document
Profile and optimize

05/24/10

Tried a few things before the cold settled in and knocked me out...

Got valgrind running out leros via ssh (it doesn't work on cygwin)
Couldn't install nor run KCacheGrind on leros
- Attempted and failed to get it running on cygwin (it's supposedly possible...)

Next up:

Tackle tickets left on ticgit...

05/25/10

Cold improved, but I believe the CS server went down, taking out my email, the git repo, and the wiki for a good while...

Managed to install KCacheGrind on my windows via this project
- Generated and displayed a file from callgrind... everything but the Call Graph works...

Next up:

Tackle tickets left on ticgit... hopefully when the repo connects again and the tickets sync up...

05/26/10

See Jay's Journal...

Discussion about class hierarchy

Besides that, I have been terribly unproductive lately, by a combination of this cold and being somewhat lost as what to do... Seeing as how Jay's been hard at work with low level implementation specific optimizations, I guess I should tackle some higher level algorithmic optimizations.

Up next:

Forget about those darn ticgits... I've been telling myself to do them for days to no avail...
Create an uncompressing extract() function, then compare speed, and consider if keeping an uncompressed copy of the genome is worth the speed-memory trade off
Confirm the speed differences between our implementation and bowtie, then see what they're doing that makes all the difference
Kickstart locate... first by avoiding a number of bases to simulate the effect
- If a dramatic improvement is seen, implement the pre-computed kmer table
Try a difference rank structure
To accommodate almost all of the above steps, buckle down, stare at some code, and read up on full-succinct indices...

05/27/10

Wanted to implement a true extract() function:

Stared at code from PizzaChili indexes...
Couldn't understand any of the notations (such as the C array)
Started reading various papers
- Started with this paper's relatively simple description of the FMIndex in Section 2, leading me to the need to understand suffix arrays and BWT... which are highly related... in an odd twisted backwards kind of way I'm trying to wrap my head around
- Wikipedia's entries on suffix arrays and BWT were a somewhat lacking in depth, but served as a good entry
- This site provided a more pictorial intro to suffix arrays, and its searching algorithm (defined best in Fig. 4 of Navarro's review... supplemented by notation specified in the original paper, about the greater than symbols between strings in section 2)
- Next was to understand the backward search in a search using the suffix array with C (that keeping an odd cumulative count...) and L (L for Last... which is really the BWT of the text) array, best described in in section 4.1 of Navarro's review (although this is done in the context of a suffix array... not BWT)
- Finally understood what LF() was, and exactly how it does this with the C array and Occ (whose details I don't quite know... but I know it has to do with rank... so I'll probably have to know one day if I'm replacing a rank structure)
Unfortunately, armed with new knowledge of LF(), BWT, and C, I've realized that extract() is not all that straight forward...
- When in doubt, copy others... so I returned to check out how extract was implemented in the Alphabet Friendly FM Index
  - It uses a few functions that still elude me... so more reading may be in order.
  - Alternatively, I'll also check out the extract of the original FM Index... in hopes that these elusive functions are unique to the Alphabet Friendly FM Index...

Next up:

Continue reading papers and code, until I understand the FM Index enough to make a simple extract() function... after which I hope to do more with it.

05/31/10

Finished up reading on FM Indexes, mainly section two of Ferragina and Manzini's original paper
- I will ignore the other sections of the paper for now... since they talk about compression, and make little sense to me...
Start to check out the code in bowtie, which lead me to discover 4 distinct modes of alignment, and some performance testing done on skorpios (all time data is taken from program output...):
- Reference: NC_010473.fasta
- Read: e_coli_125000_1.fq (unmodified)
- Default flags: --time -S (to diplay time and output in SAM)
- Build (displayed via ./bowtie --version):

64-bit
Built on skorpios
Mon May 31 14:32:21 PDT 2010
Compiler: gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux) 
Options: -O3  
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

Mode	Flags	Mapped	Unmapped	Search Time	Average Search Time
0-Mismatch	`-v 0`	105606	19394	19s	0.000152s
0-Mismatch	`-v 0`	2250708	371674	406s	0.000155s	with 2622382 set
1-Mismatch	`-v 1`	115631	9369	20s	0.00016s
2/3-Mismatch	`-v 2`	118356	6644	20s	0.00016s
2/3 Mismatch	`-v 3`	119430	5570	21s	0.000168s
Seeded Quality	`-n 0`	112317	12683	20s	0.00016s
Seeded Quality	`-n 1`	117659	7341	20s	0.00016s
Seeded Quality	`-n 2`	118356	6644	20s	0.00016s
Seeded Quality	`-n 3`	118330	6670	21s	0.000168s
Seeded Quality	`-n 3`	2488915	133467	422s	0.000161s	with 2622382 set

Some notes:

bowtie doesn't support gapped alignments
bowtie doesn't supported any value greater than 3 (i.e. -v 4 and -n 4 are not implemented)
Mismatch modes are similar to FM-based mappers (i.e. everything but KmerMapper), as they are not quality-wary, and align the entire read
Seeded Quality mode, (aka Maq-like mode) is somewhat similar to our KmerMapper, as it first establishes a list of seeds, then aligns using these seeds
I find it quite odd how the time is nearly constant... maybe a larger dataset will shed some new light (UPDATE: ran a few trials with the large dataset, same average times)

For a quick comparison, saligner (commit 5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa), using MismatchMapper with various k settings on skorpios with the same datset (Invalids are reads with N):

k	Mapped	Unmapped + Invalid	Search Time	Average Search Time
0	105606	17732 + 1662	5.22s	0.00004176s
1	115424	7914 + 1662	6.21s	0.00004968s
2	117872	5466 + 1662	8.31s	0.00006648s
3	118854	4484 + 1662	15.06s	0.00006608s
4	119349	3989 + 1662	116.61s	0.00093288s

One potential reason that we currently outperform bowtie so much could be the report setting. In the past, I set report to 10, and Jay has recently dropped it to 1. By dropping it to one, we no longer detect multimaps, but I believe this is a fair comparison with bowtie, which also doesn't report multimaps.

However, bwa does multimaps, gaps and mismatches, simultaneously, so we'll have to adjust our program to run on equal ground when doing those comparisons.

One odd note... after it prints all the information, the program won't actually terminate until about 20 seconds after on skorpios... so there is a large constant (about 20 seconds) discrepency between actual runtime and reported runtime. For example, with k = 2:

real   0m26.654s
user   0m7.940s
sys   0m0.360s

...And for k = 4

real   2m14.341s
user   1m54.987s
sys   0m0.504s

Is this the problem with the program? Or something unique to running programs on skorpios via SSH? It doesn't seem to occur when I run saligner locally (which takes 15.61 seconds).

real    0m16.159s
user    0m15.077s
sys     0m0.656s

By the timing of the pause, I would suspect that it's the destructors... but that doesn't make much sense either.

Finally, all my builds and test files should be openly accessible in /ubc/cs/research/condon/people/jujubix/ in case anyone wants to double check

Raw edit | More topic actions

Topic revision: r1 - 2010-06-02 - jujubix