03/01/10

Well, I was planning on doing another large self-expository entry for Vigna's select9, but due to certain circumstances (including but not limited to the fact that I have a dislike for it), I will spare the reader and myself from the ordeal. Instead, to facilitate tomorrow's meeting with Anne and Chris and the discussion about my talk, I'll jot down a rough outline of what I imagine will be included in the talk (March 16):

Before that though, I should mention I'm planning on doing the majority of the presentation on the whiteboard, since I believe it's easier for the audience to follow, easier for me to draw and explain details (both main and aside) on the fly, and it lengthens what would otherwise be a short and potentially confusing slide show. Also, the intended audience is at the undergraduate level... unfortunately, also limited by my own level.

  • Self-introduction, as I can only think of 4 people in those meetings who actually know why I'm there, let alone who I am... (mention sequence aligner)
  • General intro: Indexes, and their direct application to bioinformatics (i.e. sequence data) (segue from sequence aligner)
  • Relevance: Attribute the success of modern index implementations to select/rank
    • ? Currently, I still personally don't know why or how select/rank are used, and no paper I've read has actually made the connection/usage explicit
  • Definition: Explain what functions rank and select are
    • Small example (a sketch follows this outline)
  • Trivial solutions: describe/demo two trivial solutions; the point to convey is the trade-off between time and space
    • Linear count, no additional memory, linear time
    • Hash table, O(n) additional memory, constant time
  • Basic solution with o(n) additional memory + constant time, which attempts to strike a balance in the trade-off
    • rank (Jacobson)
      • 2 layer dictionary + 1 global lookup table
      • Access method + speed
      • Memory analysis
      • ? How long, how detailed?
    • select (Clark)
      • "similar" to rank
      • 2 layer-ish dictionary + some lookup...
      • As you can tell, I actually don't understand this well, at all...
    • Questionable asymptotic advantage in implementation/large data set
  • ? I'd like to very briefly mention a few efficient 32-bit implementations that address the weakness of the basic solution, but this'll require some more reading (any suggestions?)
  • Broadword implementation with 0.25n additional memory + constant time
    • Explain "broadword", or SWAR
      • Small example with parallel adding
    • List 3 goals of implementation
      • Address 64-bit
      • Decrease memory usage
      • Increase speed
        • (note trade-off contradiction)
    • rank9
      • classic 2-level dictionary...
      • ...but interleaved (explain cache advantage)
      • discards global table for sideways addition or population count (memory advantage)
        • ? Example of sideways addition? (will be required to explain details of select; see the sketch after this outline)
      • space and time requirement analysis (25% space + constant)
    • select9
      • built on top of rank9 structure
      • introduces 4 new layers on top (lower two also interleaved)
      • ? explain how/why it works? (actually uses same code as sideways addition, thus the need to explain it before)
      • space and time requirement analysis (I can't actually do this well yet... besides quoting numbers: 25% + 37.5% space, constant)
    • simple
      • ? segue from select9's algorithm, if presented
      • rank9 independent select implementation
      • 2 dictionary layers + linear search
      • space and time requirement (12.5% to 25%, linear/nearly constant?)
  • Results and comparison (slideshow with diagrams from paper)
    • ? Do I need details of some other implementations? Can't think of much else to do here, besides pointing and reading...
  • Sum up the advantages (?) of Vigna's implementations, and the differences from classic/other implementations that lead to the observed results
  • Questions and comments...? (although I'll also field these at any time during the presentation)
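
Since a couple of the items above call for small examples, here's a self-contained C++ sketch (my own illustration, not taken from any of the papers) of the rank and select definitions on a single machine word, plus the broadword sideways addition that rank9 uses in place of a global lookup table:

#include <stdint.h>
#include <iostream>

// rank1(b, i): number of 1-bits in b at positions [0, i)
// select1(b, j): position of the j-th 1-bit (1-based), or -1
// Naive O(word length) loops, purely to pin down the definitions.
int rank1(uint64_t b, int i) {
  int count = 0;
  for (int p = 0; p < i; ++p)
    count += (b >> p) & 1;
  return count;
}

int select1(uint64_t b, int j) {
  for (int p = 0; p < 64; ++p)
    if ((b >> p) & 1)
      if (--j == 0)
        return p;
  return -1;
}

// Broadword (SWAR) sideways addition: sums bit pairs, then nibbles,
// then bytes, all in parallel within one 64-bit word.
int popcount64(uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);                            // 2-bit sums
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);  // 4-bit sums
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit sums
  return (x * 0x0101010101010101ULL) >> 56;                              // add up all bytes
}

int main() {
  uint64_t b = 0x16;  // binary 10110, position 0 being the lowest bit
  std::cout << rank1(b, 4) << "\n";    // 2: ones at positions 1 and 2
  std::cout << select1(b, 2) << "\n";  // 2: the second 1-bit sits at position 2
  std::cout << popcount64(b) << "\n";  // 3
}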

And that's a very rough draft off the top of my head... I guess it was good I jotted this all down ahead of time, since going through this in detail then commenting on it on the spot during the meeting would probably be less effective.

03/02/10

Met with Anne and Chris today for our bi-weekly supervisor meeting, mainly went over the talk layout above. Some notes:

  • Concentrate on rank, and either mention select in passing or not at all
  • Read Jacobson's paper on the original rank and select algorithms, which will hopefully give me a bit more insight into how they're used
  • Do not dive into detail, but do convey why the basic rank is o(n)

Besides that, I'll start fleshing out the details of the talk over the coming days, and hopefully return to coding afterwards.

03/03/10

Fleshing out the talk, some points:

  • Read over the first part of Jacobson's paper, and was slightly more convinced that:
    • rank and select are actually somewhat useful
    • I should forget explaining how they're used in full-text indexes
  • Found this nice paper by Navarro and friends giving a few neat ways to improve the basic algorithm
  • Started to scribble examples on paper, I now realize this may take a bit more time (and whiteboard pens) than I first thought...

03/04/10

Started to add details to the previous list, including figures and diagrams to draw on the board.

In other news, there was no meeting with Sohrab today, as he's off in La Jolla. Last week, the meeting was also canceled as Ryan and Rodrigo were in Florida, and the week before that Sohrab was off to the Olympics.

03/05/10

Finished the rough draft of my talk notes, and I also whipped up a quick abstract and sent it to Monir. Now to make a few presentation slides, run through the presentation, and time it.

03/08/10

Ran through it completely twice in my head, while madly scribbling figures, in about three hours, with lots of stops and rests. Spent some time learning the Beamer class and made the slide show components in LaTeX (hey... neat wiki link); fortunately, LaTeX itself I didn't have to spend time learning. Tomorrow, I will talk to myself out loud in my room while waving my hands at an imaginary whiteboard, time it, and hopefully put this talk aside until presentation day.

03/09/10

Talk preparations are finished, and although shaky for the moment, it runs in just barely under an hour. Since I'd rather not be cutting it this close, I'll probably put a few more things onto slides at the start and end of the presentation.

Putting the talk aside for now then, I'll be resuming coding tomorrow, with the following things:

  • Score-only implementation of Smith-Waterman alignment with affine gap penalties (a sketch of the idea follows this list)
  • A correctness test suite
  • A performance test suite
  • An abstract class to serve as a general interface for all aligners
  • Figure out how to get CMake working...
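
For the first item, here's a minimal score-only sketch of the affine-gap recurrences I'll be aiming for (the standard Gotoh-style recurrences rolled into a single row; my own illustration, not the actual ngsa code, and the parameter values are placeholders):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Score-only local alignment with affine gap penalties: keeping one
// rolling row of H (best score) and F (vertical gap) makes the memory
// O(|b|) instead of O(|a| * |b|), since no traceback is needed.
// Gap penalties are negative values (e.g. gap_open = -2, gap_extend = -1).
int ScoreOnlyAlign(const std::string& a, const std::string& b,
                   int match, int mismatch, int gap_open, int gap_extend) {
  const int kMinusInf = -(1 << 28);
  std::vector<int> h(b.size() + 1, 0);          // H from the previous row
  std::vector<int> f(b.size() + 1, kMinusInf);  // F: gap in b, per column
  int best = 0;
  for (std::size_t i = 1; i <= a.size(); ++i) {
    int diag = 0;       // H[i-1][j-1]
    int e = kMinusInf;  // E: gap in a, within the current row
    for (std::size_t j = 1; j <= b.size(); ++j) {
      e = std::max(e + gap_extend, h[j - 1] + gap_open + gap_extend);
      f[j] = std::max(f[j] + gap_extend, h[j] + gap_open + gap_extend);
      int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
      int cur = std::max(std::max(0, diag + s), std::max(e, f[j]));
      diag = h[j];  // save H[i-1][j] before overwriting it
      h[j] = cur;
      best = std::max(best, cur);
    }
  }
  return best;  // e.g. ScoreOnlyAlign("ACGT", "ACGT", 1, -3, -2, -1) == 4
}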

03/10/10

Resumed coding, had to re-familiarize myself with git, and I'll have to buckle down and spend some time figuring out CMake, sooner rather than later... Today's progress:

  • Created a new branch local_affine
  • Changed the local aligner to use affine gap penalties
  • Committed the changes to the new branch
  • Pushed the new branch to my public repo

Next up...

  • Figure out CMake... maybe by forcing myself to get gtest up and running
  • Create the abstract class interface I keep telling myself to do...
  • Once gtest and the interface are done, use them to create the two test suites...

03/11/10

Tweaked and created CMakeLists files all over the place, and successfully got gtest integrated into CMake. Not as horrific as I thought it would be (makefiles scare me to death, so CMake wasn't looking any better), and at the very least I can now get CMake working properly, albeit with lots of trial and error. Some notes:

  • Created a new branch gtest-cmake (Oh dear... I just realized, I have very inconsistent hyphen/underscore usage when naming branches...)
  • Modified lots of CMakeLists files and restructured some directories to match gtest's hardcoded import pathnames
  • Committed changes and pushed new branch to my public repo

Did another commit, with the following changes and problems:

  • Used file(GLOB ...) subcommand to select files to add to a library, replacing a really long string of filenames
  • Unfortunately, this didn't work quite as well in the ngsa folder, due to the files being in subdirectories
  • Finally, the test executable created requires an external FASTQ file to test some I/O operations, but out-of-source builds don't seem to carry this file over (well, yes, in the first place, the test shouldn't need to rely on an external file... and I may address that later, by dynamically writing then deleting the file... maybe)

Third and last commit for today, notes:

  • Created a new branch aligner-interface (I seem to prefer hyphens...)
  • Created an abstract class aligner_interface.h in the include/ngsa directory
  • Modified local_aligner.h to inherit the interface
  • Modified the test file to ensure the inheritance was successful

Last note: I realize I'm always branching off the latest branch, as the features I'm currently implementing actually build on top of one another (so rather than features, I guess the more fundamental problem is that I'm still implementing the basic foundation). I imagine that once everything becomes more stable, I'll branch off the stable trunk, instead of this long chain of branches.

Next up, in the following order:

  • Performance test suite
  • Correctness test suite
  • Mass copy and paste boilerplate text to all files...? (I swear there's probably a smarter way to do this)

PS: No meeting with Sohrab today, as I received no email notification, so I assume he's still away in La Jolla.

03/13/10

Started a basic class for performance testing, details as follows:

  • Branched, committed, and pushed performance-test
  • Created file performance_test.cc, which contains:
    • a Timer class, based off Boost's timer
    • a RandomSequenceGenerator class, based off JAligner's class
    • a main that feeds my aligner random sequences, and returns the total and average time

Some problems I ran into:

  • Whilst testing, I often found that the elapsed time would be 0 seconds, even when the actual program would run for several seconds.
    • It turns out that the Boost timer, using C++'s clock function, reports CPU time, not real time (to the 1/1000th of a second on my computer, which explains the 0s); a small demonstration follows this list
    • Incidentally, C++'s time function reports real time (though only to the second, so not very precise...)
    • I used UNIX's time to confirm this
  • Random sequences take non-negligible time to generate, so the reported times do not reflect pure alignment time, as they include the sequence generation time. This may not be a huge concern, though, if we only care about the relative times of different implementations, keeping the generator (and seed value) the same.
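
Here's a minimal demonstration of the clock() behavior (my own sketch): a program that sleeps consumes wall-clock time but almost no CPU time, so a clock()-based timer, like Boost's, reports roughly zero.

#include <ctime>
#include <iostream>
#include <unistd.h>  // for sleep()

int main() {
  std::clock_t c0 = std::clock();     // CPU time, what boost::timer wraps
  std::time_t t0 = std::time(NULL);   // real (wall-clock) time, 1s resolution

  sleep(2);  // burns wall time, but the process uses almost no CPU

  std::cout << "CPU time:  "
            << double(std::clock() - c0) / CLOCKS_PER_SEC << "s\n";  // ~0s
  std::cout << "real time: "
            << std::difftime(std::time(NULL), t0) << "s\n";          // ~2s
}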

Next up,

  • Correctness test suite

And also, the talk has been postponed to March 30.

03/15/10

Started a basic correctness test suite using Google Test. Branched, committed, and pushed correctness-test; the following tests are included (a sketch of what one such case might look like follows the list):

  • Blank sequence
  • Perfect matching sequences
  • Mismatches at start
  • Mismatches at end
  • Mismatches in middle
  • Mismatches at both ends
  • Gaps at start
  • Gaps at end
  • Gaps in middle
  • Gaps at both ends
  • Random reciprocal tests (i.e. flip sequence order of input)
  • Confirm results between different implementations (added 03/16/10)
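
To make the list concrete, here's a sketch of two of these cases (the AlignScore function and the scores are hypothetical stand-ins, not the actual test code; the real suite links in LocalAligner and the gtest_main library):

#include <gtest/gtest.h>
#include <string>

// Trivial stand-in so the sketch compiles on its own; the real suite
// calls into the actual LocalAligner instead.
int AlignScore(const std::string& a, const std::string& b) {
  return a == b ? static_cast<int>(a.size()) : 0;
}

// Perfectly matching sequences: with match = +1, the score should be
// the sequence length.
TEST(LocalAlignerTest, PerfectMatch) {
  EXPECT_EQ(4, AlignScore("ACGT", "ACGT"));
}

// Reciprocal test: flipping the order of the inputs should not change
// the score.
TEST(LocalAlignerTest, Reciprocal) {
  EXPECT_EQ(AlignScore("ACGTA", "ACGAA"), AlignScore("ACGAA", "ACGTA"));
}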

Any more cases that I should add?

Also, I've thought about mixing the sets of mismatch and gap tests... in a Cartesian product sort of way. But I'm wondering if that's really required...

Now, using this test suite, I think I've actually found a bug in my own implementation, which I will now attempt to resolve. Either that, or rework the test suite...

Thus, next up:

  • Debug LocalAligner, or fix correctness_test...
  • Optimize LocalAligner using the performance_test

Speaking of which, at Chris's suggestion, I've modified performance_test so it now populates a vector with random sequences first, then quickly accesses these pre-computed sequences for the alignments, dramatically reducing the reported average alignment times.
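
In sketch form, the pattern looks like this (the names and the trivial aligner are placeholders of my own, not the actual performance_test code):

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

std::string RandomSequence(int length) {
  static const char kBases[] = "ACGT";
  std::string s(length, 'A');
  for (int i = 0; i < length; ++i) s[i] = kBases[std::rand() % 4];
  return s;
}

int Align(const std::string& a, const std::string& b) { return 0; }  // placeholder

int main() {
  const int kPairs = 50000, kLength = 100;
  std::srand(42);  // fixed seed keeps runs comparable across implementations

  // Generate all sequence pairs up front...
  std::vector<std::pair<std::string, std::string> > pairs;
  pairs.reserve(kPairs);
  for (int i = 0; i < kPairs; ++i)
    pairs.push_back(std::make_pair(RandomSequence(kLength), RandomSequence(kLength)));

  // ...so the timed loop measures only the alignments.
  std::clock_t start = std::clock();
  for (size_t i = 0; i < pairs.size(); ++i)
    Align(pairs[i].first, pairs[i].second);
  double total = double(std::clock() - start) / CLOCKS_PER_SEC;
  std::cout << "total: " << total << "s, average: " << total / kPairs << "s\n";
}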

Update:

A bug definitely exists in the ScoreOnlyAlignment function, as the FullAlignment function passes the correctness test with no errors. Merely staring at the code is not shedding any light on the error, so I'll spend some time tomorrow printing out matrices and tracking the bug down.

03/16/10

Fixed the bug by swapping two variables (literally... swap x and y, which took an hour to figure out).

As previously mentioned, I assume my next goal would be to investigate factors that affect the alignment performance. Scouring my mailbox, I found that Chris previously suggested looking into the following changes:

  • Setting up a scoring matrix
    • This may be an actual matrix or individual penalty values
  • Making scoring parameters...
    • const class members
    • static class members
    • constructor-initialized typed class member
    • constructor-initialized generic class member
  • Presence and absence of traceback
  • Using different data structures, such as VLAs, dynamic arrays, or vectors
  • Dynamic vs. static allocation

Now, assuming my performance and correctness test suites are actually adequate, I should probably start some systematic way of implementing and testing these...

03/17/10

MICB324 midterm in 36 hours, work will resume soon afterwards.

03/19/10

Midterm was a test of rote memory... but putting that aside.

Did some tests, with the following results (by aligning 50000 pairs of random sequences 100 bases in length):

  • Penalty values
    • static const: 0.00059656s
    • const: 0.00062844s
    • user defined: 0.00063468s
    • user defined generic: would not compile...
  • Matrix data struct
    • VLA: 0.00059376s
    • dynamic array: 0.00103158s
    • vectors: 0.00988438s

From these rough results it seems that explicitly typed user-defined parameters shouldn't be too bad, and using vectors for the matrix is a very bad idea...

I didn't test a few that I thought were obvious, such as the inclusion/absence of traceback, and reducing the memory requirement for the traceback matrix.

In response to the question Chris asked in the email: the "Confirm results between different implementations" test does indeed use random instances, whose lengths can be specified when calling the correctness test program. Thus, running millions of cases with very long random sequences is theoretically possible...

Finally... I accidentally typed git push --all instead of git push --all public... and so everything has accidentally been sent to my origin repo, namely, the blessed repository. So some cleanup is probably required. Apologies...

Next up...

  • Implement a version of the algorithm that gives a free gap in the middle (so indels can be accounted for, I assume)
  • Try out some vectorized SW code
  • Get CMake to detect CPU architecture and compile appropriately

03/22/10

Implemented a "Free single gap" version of the linear-penalty Smith-Waterman algorithm, after much discussion with (and learning from) Chris.

Rather than harp on the implementation details, I hope the following score matrix demonstrates the new implementation...

Given match = +1, mismatch = -3, and a split (or k) at the fifth character of the longer reference (x axis)...

Edit: fixed quite a few bugs, I think it's correct now... at least more so than before.

    A A A A X X X X X X X X A A A A
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1
A 0 1 2 2 2 0 0 0 0 0 0 0 0 2 2 2 2
A 0 1 2 3 3 0 0 0 0 0 0 0 0 3 3 3 3
A 0 1 2 3 4 1 1 1 1 1 1 1 1 4 4 4 4
A 0 1 2 3 4 1 1 1 1 1 1 1 1 5 5 5 5
A 0 1 2 3 4 1 1 1 1 1 1 1 1 5 6 6 6
A 0 1 2 3 4 1 1 1 1 1 1 1 1 5 6 7 7
A 0 1 2 3 4 1 1 1 1 1 1 1 1 5 6 7 8

The final max score is +8, compared to a truncated +4 in the normal SW implementation.

I've also confirmed that the free gap only applies to one sequence:

    A A A A X X X X X X X X A A A A
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1
A 0 1 2 2 2 0 0 0 0 0 0 0 0 2 2 2 2
A 0 1 2 3 3 0 0 0 0 0 0 0 0 3 3 3 3
A 0 1 2 3 4 1 1 1 1 1 1 1 1 4 4 4 4
Y 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
A 0 1 1 1 1 0 0 0 0 0 0 0 0 2 2 2 2
A 0 1 2 2 2 0 0 0 0 0 0 0 0 2 3 3 3
A 0 1 2 3 3 0 0 0 0 0 0 0 0 3 3 4 4
A 0 1 2 3 4 1 1 1 1 1 1 1 1 4 4 4 5

In this case, 8 matches + 1 mismatch = 8 * 1 + 1 * -3 = 5

I assume this is fine (for now)... from what I've discussed with Chris...
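
For the record, here's a sketch of the recurrence behind the matrices above (my own reconstruction, not the actual code; the gap penalty value is an assumption, since only match and mismatch were given):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
  const std::string ref = "AAAAXXXXXXXXAAAA";  // x axis
  const std::string read = "AAAAAAAA";         // y axis
  const int kMatch = 1, kMismatch = -3, kGap = -4;  // kGap is assumed
  const int kSplit = 5;  // the split at the fifth reference character

  std::vector<std::vector<int> > score(read.size() + 1,
                                       std::vector<int>(ref.size() + 1, 0));
  int best = 0;
  for (std::size_t y = 1; y <= read.size(); ++y) {
    // Best previous-row score at or before the split: jumping from there
    // to any column past the split is the one free gap.
    int before_split = 0;
    for (int i = 0; i <= kSplit; ++i)
      before_split = std::max(before_split, score[y - 1][i]);

    for (std::size_t x = 1; x <= ref.size(); ++x) {
      int sub = (ref[x - 1] == read[y - 1]) ? kMatch : kMismatch;
      int s = std::max(0, score[y - 1][x - 1] + sub);  // diagonal, local floor
      s = std::max(s, std::max(score[y - 1][x], score[y][x - 1]) + kGap);
      if (x > static_cast<std::size_t>(kSplit))
        s = std::max(s, before_split + sub);  // land here after the free gap
      best = std::max(best, score[y][x] = s);
    }
  }
  std::cout << best << "\n";  // 8, the final max score of the first example
}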

I will next attempt to apply the same change to affine gap penalties...

03/23/10

Extended "free gap" idea with affine gap penalty local alignments, although I'm quite skeptical about its correctness.

Given the original recurrence relation for the affine gap model as defined on the fourth page of this document, I simply added three new cases as follows:

score[x][y] = max {
  // original four cases go here, roughly corresponding to:
  1. Match (or a mismatch, depending on the sequence)
  2. Gap in Y (either continuing an existing one, or opening a new one)
  3. Gap in X (either continuing an existing one, or opening a new one)
  4. 0 (to make alignments local when the score becomes negative)

  // and my three new cases
  if (x >= k) {  // where k is the split
    5. max_{i = 0 to k} score[i][y - 1] + score for matching chars reference[x] & read[y]
    6. max_{i = 0 to k} gap_y[i][y] + gap extension penalty
    7. max_{i = 0 to k} score[i][y - 1] + gap open + gap extension penalty
  }
}

I assume my cases correspond to:

  • 5. Giving a free gap, then continuing to align with matches
  • 6. Giving a free gap inside an already existing gap...
  • 7. Giving a free gap, then opening a gap right after...

For some reason, after typing this, I have a feeling cases 6 and 7 aren't required...

I've also fixed a bug so:

  • it no longer makes ALL gaps after the split free
  • it will only make one free gap, with a starting position that occurs prior (non-inclusive) to the split index

Now, I think it'd be more efficient to ask Chris about the correctness of this implementation in person tomorrow, rather than attempting to figure this out myself. If this looks good, I'll probably move onto looking at some vectorized code.

Also went to the BCCRC to see Anamaria's talk on her SNP caller, which was much more interesting and understandable to a programmer than the BCCRC cancer talks I've previously attended. Sohrab has also requested a small progress report for the Thursday meeting, which I need to consult Chris about...

03/24/10

Met with Chris today, and discussed the following:

  • Overall plan and progress, which I'll convey to Sohrab tomorrow
  • The questionable correctness of my affine free gap aligner
  • Kmer indexes, which I will work on next

We decided to start on indexes, so we could get a working aligner before we started optimizing, which can be an endless rabbit hole...

Thus, I've whipped up a quick program that does the following:

  • Reads in a FASTA file containing a reference sequence (e.g. bacterial genome)
  • For each k-length subsequence (i.e. k-mer):
    • Find the corresponding kmer_value... analogous to a hash value, unique to each subsequence
    • Hash the starting position of the k-mer at table[kmer_value]

Rather than worrying about class structure and includes, I decided to make this program in a single main.

I should also note that the hash table is a vector of lists, which is rather inefficient and can be improved upon.
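
In sketch form, the core of it looks something like this (my own illustration; the 2-bit packing is an assumption about how kmer_value is computed):

#include <iostream>
#include <list>
#include <string>
#include <vector>

int main() {
  const std::string genome = "ACGTACGTGG";  // stand-in for the FASTA sequence
  const int k = 4;

  // kmer_value: 2 bits per base, so the table has 4^k buckets.
  std::vector<std::list<int> > table(1 << (2 * k));
  for (size_t pos = 0; pos + k <= genome.size(); ++pos) {
    int value = 0;
    for (int i = 0; i < k; ++i) {
      int base = 0;  // 'A'
      switch (genome[pos + i]) {
        case 'C': base = 1; break;
        case 'G': base = 2; break;
        case 'T': base = 3; break;
      }
      value = (value << 2) | base;
    }
    table[value].push_back(pos);  // hash the k-mer's starting position
  }

  // Seed lookup for "ACGT" (00 01 10 11 = 0x1B): prints 0 and 4.
  const std::list<int>& hits = table[0x1B];
  for (std::list<int>::const_iterator it = hits.begin(); it != hits.end(); ++it)
    std::cout << *it << " ";
}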

Thus, up next:

  • Make kmer index into an actual class
  • Transform the hash table from a vector of lists to a vector of positions (a sketch follows this list), i.e.:
    • Concatenate all the result lists from the original hash table into one large array
    • Make the hash table a vector of ints, each pointing to the position in the array where its original list begins
  • Stare at the various pieces of code that Chris sent me when bored...
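
For the hash table transformation, here's the sketch I have in mind (my own, assuming a cumulative-offset layout):

#include <list>
#include <vector>

// Flatten a vector of lists into one big position array plus an offset
// per bucket; bucket v's positions then live in
// positions[offsets[v] .. offsets[v + 1]).
void Flatten(const std::vector<std::list<int> >& table,
             std::vector<int>* positions, std::vector<int>* offsets) {
  positions->clear();
  offsets->assign(table.size() + 1, 0);
  for (size_t v = 0; v < table.size(); ++v) {
    (*offsets)[v] = positions->size();  // where bucket v begins
    positions->insert(positions->end(), table[v].begin(), table[v].end());
  }
  (*offsets)[table.size()] = positions->size();  // sentinel: total count
}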

03/25/10

Sohrab was sick this morning, so the meeting that was happening today has now been postponed to next week.

As planned yesterday, I made the kmer index into an actual class. However, Chris told me to hold off on transforming the vector of lists into a vector of array positions for now.

Instead, I've been told to create a basic aligner with all the current components, and my progress with that is as follows:

  • Extracted the FASTA E. coli genome sequence and FASTQ reads provided by bowtie and placed them in /tools/ngsa_align/input; the files are:
    • NC_008253.fna: a 4.9 million base pair genome
    • e_coli_1000.fa & e_coli_1000.fq: 1000 FASTQ/FASTA reads
    • e_coli_1000.raw: 1000 read sequences
    • e_coli_1000_1.fa, e_coli_1000_1.fq, e_coli_1000_2.fa, e_coli_1000_2.fq: 1000 paired-end FASTQ/FASTA reads
    • e_coli_10000snp.fa & e_coli_10000snp.fq: 10000 FASTQ/FASTA reads (maybe with single nucleotide polymorphisms...?)
  • Created a bacterial_aligner.cc in the /tools/ngsa_align directory, which does the following:
    • Opens the genome file (abridged for now)
    • Reads the entire thing into a very long string (yeah... not the prettiest method, but the simplest)
    • Creates an index using this very long string
    • Opens and reads the 1000 entry FASTQ file, and for every read:
      • Using the index, finds seed positions in the genome
      • If seeds are found, aligns the full read to an equal-length subsequence of the genome starting at each seed position (there can be multiple)
      • If the alignment is above some threshold (I'm using perfect score for the moment), return the position and alignment

Next up, I'll have it actually write the alignment into SAM output, and attempt to display it with tview. And once that gets working, I'll use the full unabridged genome with the 10000 entry FASTQ file and pray.

I've also merged this latest branch with a previous performance testing branch, since I wanted to get user-defined aligner values. The experience was a bit... scary to say the least, as I didn't commit before the entire operation, and stash wasn't all that clear... I hope I haven't lost anything during the entire process.

On that note, as the aligner constructor now takes 4 values, I have no clue how to get it working with gtest fixtures, so I've commented out a lot of things to keep the compiler happy...

03/26/10

Finished a rudimentary aligner that does the following:

  • Takes a FASTA reference and makes a k-mer index
  • Reads a FASTQ file and aligns to all seeds
  • Records the best alignment for each FASTQ read into a SAM file

Unfortunately, the SAM output doesn't work with tview, for reasons not exactly known, but I suspect it has to do with the fact that my CIGAR is incorrect. Also, SAM seems designed for paired-end reads, so I'm still figuring out how to make it happy with single-end reads.

Concerning CIGAR: it is indeed an encoding of the alignment that replaces the actual alignment, so the SAM file only needs to keep the original read sequence. Some issues have come up:

  • The standard CIGAR (used by Mosaik), only has M, I, and D:
    • M for matches and mismatches
    • I for insertions in the read
    • D for deletions in the read
      • All three can be easily generated given the pair of aligned sequences (see the sketch after this list)
  • The extended CIGAR, described in detail on page 2 here (in everything made by Heng Li), introduces new terms N, S, H, P, corresponding to:
    • N for long gaps in the read (intron)
    • S for soft clipped (free ends of a SW alignment), query sequence remains the same
    • H for hard clipped, query sequence clipped (for reads with two ends aligning to different locations, thus the query sequence becomes two separate reads, complementarily hard clipped)
    • P for padding, which shows how reads align relative to each other with insertions (where reference is gaps)
      • The first three can only be generated during an alignment
      • The last one requires multiple sequence alignments, or something similar...
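
For the basic M/I/D case, here's a sketch of the generation step (my own illustration, with '-' marking gaps in the aligned pair):

#include <iostream>
#include <sstream>
#include <string>

// Generate a basic M/I/D CIGAR from a pair of gapped, aligned sequences.
std::string Cigar(const std::string& ref, const std::string& read) {
  std::ostringstream out;
  char prev = 0;
  int run = 0;
  for (size_t i = 0; i < ref.size(); ++i) {
    char op = 'M';                      // match or mismatch
    if (ref[i] == '-') op = 'I';        // insertion in the read
    else if (read[i] == '-') op = 'D';  // deletion in the read
    if (op != prev && run > 0) {
      out << run << prev;
      run = 0;
    }
    prev = op;
    ++run;
  }
  if (run > 0) out << run << prev;
  return out.str();
}

int main() {
  // ref:  ACGT-ACC
  // read: ACGTTA-C
  std::cout << Cigar("ACGT-ACC", "ACGTTA-C") << "\n";  // 4M1I1M1D1M
}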

I have located the large block of BWA code that generates the extended CIGAR string, and I believe it's integrated into the aligner code. If the need arises, I'll see if I can get myself to stare at that less-than-friendly code sometime later...

03/29/10

Final talk prep night, so no coding tonight.

Although I did think of a way to figure out some extended CIGAR terms after an alignment has been done.

03/30/10

Gave talk today, which I guess went smoothly.

No coding today; instead, the day was spent playing with samtools to figure out how to fix my output file. Results as follows:

  • Samtools doesn't work with Cygwin, for reasons unknown
  • It works happily from the Windows command line, so I'll have to settle for that for now
  • Incorrect CIGAR strings cause parsing errors and premature termination
    • Specifically, the length inferred by the CIGAR encoding must match the length of the query sequence
  • You must index the reference sequence and the sorted bam file before displaying with tview
  • Doing the above, I was able to display something with tview
  • Unfortunately, the reference sequence displayed is truncated prematurely, and none of the alignments appear

I suspect it has to do with the flags, which will be the next thing I test tomorrow. I'll start reverse engineering the working example files to try and find the minimum set of values required for everything to work.

03/31/10

I think I've characterized all the odd quirks that samtools wants in a SAM file for it to display properly in tview (a minimal record consistent with these follows the list):

  • tview assumes positions start at 1, not 0 (so I had to increment everything during the output stage)
  • the name of the reference in the read and the reference file itself have to be identical
  • the number of characters in the sequence field has to be identical to the number of characters in the quality field
  • the number of characters in the sequence field must equal the sum of specific CIGAR numbers
    • number of M's + number of I's + number of S's == length of sequence (or else samtools will crash)
    • H, D, and N do not contribute to the above summation
  • certain bits in the flag, when set, cause the read NOT to be displayed in tview, mainly:
    • 0x4 (I unfortunately picked this value in my original implementation, causing all reads not to be displayed)
    • 0x100
    • 0x200
    • 0x400
  • all other flags, when individually set, have no observable effect on display in tview
  • in stretches of the reference with no reads, tview will simply display the nucleotide as N (which I mistook for a bug previously, complicated by the fact that there were no reads displayed in the first place...)
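
Putting it all together, a minimal single-end record consistent with the quirks above would look roughly like this (tab-separated; the names and values are illustrative, not from my actual output):

read0001  0  NC_008253  1  255  10M  *  0  0  ACGTACGTAC  IIIIIIIIII

Here the flag is 0 (so 0x4 is unset and the read displays), POS is 1-based, the reference name matches the one in the FASTA file, and the CIGAR's 10M agrees with the 10-character sequence and quality fields.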

Taking all this into account, I've fixed the bugs I mentioned previously, and I now have a working bacterial aligner on my public repo.

For once, I'm quite lost as what to do next...
