Well... it feels more journal than research... but I guess it gets the job done.

08/03/10

Continuing to implement the bwa mapper:

  • Resolved all known bugs that were causing erroneous results
    • It turned out to be a few independent, minor bugs that caused the dramatic effects
    • Confirmed that results are positionally correct up to 3 mismatches
      • Unable to confirm that CIGARs are correct
  • Starting to add CIGAR and MD generation for the aligner
    • After some thought, I've decided to have each item on the stack keep a copy of the growing CIGAR
      • It is the simplest solution, which allows supporting accurate results for both mismatches and indels
      • It is relatively straightforward to implement
      • The only major downside is that stack entries have increased in size, from 12 bytes each to 268 bytes (see the sketch after this list)
        • The largest size I've seen the stack take is around 6000 entries, which is an increase from about 72kB to about 1.6MB
        • The 256 byte increase is from adding two 128 (max NGSA read length) char arrays
          • This is to include the worst case scenario, where every position is a mismatch, requiring 128 unique chars
          • The minimal CIGAR actually only requires 2 chars, and the minimum MD requires only 1
          • Allowing up to 3 differences, the maximum CIGAR and MD would need around 6
          • Knowing this, it is possible to create 24 byte stack entries, at the expense of limiting the number of differences allowable by the aligner...
      • If the size becomes a significant issue in terms of speed, I may spend a bit more time later trying another implementation
        • My original idea was to keep 2 CIGAR strings (one per direction) and, given how the BFS proceeds, keep the best CIGAR by constantly overwriting them
        • Of course, this idea gets a lot more tricky if you factor in indels
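
A rough sketch of what one of these enlarged stack entries might look like (field names are my own illustration, not the actual class members; the original 12 bytes are assumed to be three 4-byte values):

static const int kMaxReadLength = 128;  // max NGSA read length

struct BwaStackEntry {
  unsigned int sp;               // suffix array interval start (assumed)
  unsigned int ep;               // suffix array interval end (assumed)
  int num_differences;           // mismatches/gaps used so far (assumed)
  char cigar[kMaxReadLength];    // growing CIGAR, worst case one char per base
  char md[kMaxReadLength];       // growing MD, worst case one char per base
};
// sizeof(BwaStackEntry) = 3 * 4 + 2 * 128 = 268 bytes, so a 6000-entry stack
// is roughly 1.6MB instead of the original 72kB.

Shrinking the two buffers down to a handful of chars, as noted above, would bring entries back to around 24 bytes at the cost of capping the number of allowed differences.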

Next up:

  • Add CIGAR and MD generation to bwa mapper
  • Test correctness of mapper
  • Kmer, PE, half-index

08/04/10

Implemented CIGAR and MD generation methods in the bwa mapper:

  • Some unexpected challenges came up:
    • Alignment occurs backwards... consequently, CIGAR and MD generation were required to be backwards
      • For the reverse complement strand, simply reversing the entire output at the end yields the correct result
      • For the reversed forward strand, I was required to reverse the individual numbers, instead of the entire string
        • (as the reverse alignment on a reversed sequence is forwards... which I realized after writing the backwards generating function)
    • Speaking of "numbers", the growing CIGAR and MD strings were char arrays, so "numbers" were really ASCII codes
      • This made incrementing and carrying the numbers trickier than I first thought... especially since they were backwards (see the sketch after this list)
    • The actual generation was best characterized as a state machine, which I drew out and hand tested
      • Integrating everything caused the code to grow by about 30% in line count...
      • I'm hoping this doesn't affect the runtime... although my hopes are slim (I can compare with a previous commit)
  • I've confirmed correctness with mismatches
  • I've yet to confirm correctness with indels
  • Finally, I need to implement a new SamEntry method that takes in CIGAR and MD strings
    • To compute the MOSAIK quality, I'll also need a copy of the reference... which I'll generate using the read and MD
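
The fiddly part with the ASCII-coded numbers looks roughly like this (a sketch only, assuming the least significant digit is written first because the string is built backwards; the caller has to leave room for one extra char):

// Increment a run-length count stored as reversed ASCII digits, i.e. the
// least significant digit sits at digits[0]. Returns the new number of digit
// chars, which grows by one when a carry ripples past the last digit.
int IncrementReversedAsciiCount(char* digits, int num_digits) {
  for (int i = 0; i < num_digits; ++i) {
    if (digits[i] < '9') {   // no carry needed, just bump this digit
      ++digits[i];
      return num_digits;
    }
    digits[i] = '0';         // '9' rolls over to '0' and the carry continues
  }
  digits[num_digits] = '1';  // carry fell off the end: append a new digit
  return num_digits + 1;
}

For example, {'9','1'} (19 reversed) becomes {'0','2'} (20 reversed), and {'9'} becomes {'0','1'}.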

Next up:

  • Integrate CIGAR and MD into SamEntry class
  • Confirm correctness of bwa mapper with indels
  • PE, Kmer, half-index

08/05/10

Finished integrating CIGAR and MD into entire aligner:

  • What was supposed to be a straightforward exercise took much more time and effort than I thought...
  • As mentioned yesterday, to generate MOSAIK mapping qualities in the SAM entry, I was required to decode the MD and CIGAR
    • It struck me halfway through implementation that an aligned read and reference, with dashes denoting gaps, were required
    • Decoding the CIGAR was relatively straightforward, since it included terms for both insertion and deletion
      • During decoding, I simply treated insertions the same as normal matches
      • Deletions denoted gaps, and were handled by inserting dashes
    • Decoding the MD turned out to be a nightmare, simply due to the fact that there are only terms for insertions but none for deletions
      • Mismatches, insertions and matches were relatively straightforward
      • Deletions... could not be detected from the MD, and the CIGAR had to be used
        • My first naive implementation simply inserted dashes into the decoded reference
          • This is wrong, as it would cause everything behind it to be shifted out of alignment
        • Replacing a subsequence of the decoded sequence gave a better solution
          • But this would yield the incorrect sequence when mismatches occurred immediately before or after a gap
        • The final solution required the complete decoding of the CIGAR, construction of a boolean array to mark gaps, decoding the MD, and building the decoded reference one char at a time using everything (the CIGAR half of this is sketched after this list)
          • At the expense of performance... I'm betting
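
As a sketch of the CIGAR half of that final solution (my own illustration, not the SamEntry code): expand the CIGAR one op per alignment column, turning deletions into dashes in the read and treating insertions like matches, as described above. The columns whose op is 'D' are exactly the gap positions the MD decoding then has to work around.

#include <cctype>
#include <string>
#include <vector>

// Expand a CIGAR such as "5M1D3M" into one op char per alignment column and
// build the dashed read: deletions become '-' in the read, while 'M' and 'I'
// both consume a read base (insertions are treated like matches here).
void ExpandCigar(const std::string& cigar, const std::string& read,
                 std::vector<char>* ops, std::string* dashed_read) {
  size_t read_pos = 0;
  size_t i = 0;
  while (i < cigar.size()) {
    int run = 0;
    while (i < cigar.size() && isdigit(cigar[i])) {  // parse the run length
      run = run * 10 + (cigar[i] - '0');
      ++i;
    }
    char op = cigar[i++];                            // 'M', 'I' or 'D'
    for (int k = 0; k < run; ++k) {
      ops->push_back(op);
      if (op == 'D')
        dashed_read->push_back('-');                 // gap in the read
      else
        dashed_read->push_back(read[read_pos++]);    // 'M' and 'I' use a base
    }
  }
}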

Next up:

  • Test bwa mapper
  • PE, Kmer, half-index

08/06/10

Finished and pushed BWA mapper:

  • Created an indel_test.fastq file to confirm correctness of general indel combinations
  • Caught quite a few little bugs and a few more major ones
    • I should note that there is a certain value called the indel_edge_buffer_ that prevents the algorithm from opening a gap too close to the edge of the read
      • This unfortunately means that it will sometimes do odd alignments
        • The same occurs in BWA, but a post processing step corrects it
        • My aligner has no post processing, so these odd alignments will occur, for example...
Odd alignment due to "do not open gap zone"
AAAAAAAAAAACAAAA
AAAAAAAAAA-AAAAA
XXXXX000000XXXXX <-- X denotes the 5 base "do not open gap zone" at both ends

Better alignment
AAAAAAAAAAACAAAA
AAAAAAAAAAA-AAAA <-- gap occurs within zone... not allowed, but can be caught in post processing
XXXXX000000XXXXX
      • Shrinking the no gap zone will yield the correct alignment... but I suspect the BFS algorithm will suffer a bit
        • This is because I believe the main purpose of this "zone" was to limit gap openings at start of the alignment, to limit the potential search space
        • It also prevents mismatches at the end from being lopped off as indels
    • I should also note that the CIGARs produced by BWA often do not match up to mine
      • This is due to the fact that my CIGARs prefer mismatches before gaps, which is reversed in BWA
      • When presented with highly repetitive regions, my CIGARs will insert indels near the end of the runs, while BWA inserts indels at the start of these runs
      • This isn't a difference in algorithm, but a difference in how CIGARs are generated
        • I make them during the BFS search, which prefers matches > mismatches > gaps
        • bwa creates the CIGAR via a SW-like alignment after the algorithm finishes using the reference and starting positions of hits... I assume
        • UPDATE: on second thought, you can't generalize it that simply... due to the fact that reverse alignments occur with reverse indices on reversed strands... which sort of mess up which way is "forwards"... but the main point is that my CIGARs don't always match up to BWA CIGARs when indels are involved... but I believe mine are still equally valid
saligner prefers mismatches before gaps
AAAAAG-GAAAAA <- G/C mismatch is created, followed by a gap 
AAAAACCCAAAAA

VS.

bwa prefers gaps before mismatches
AAAAA-GGAAAAA <- gap is created first, then followed by mismatch 
AAAAACCCAAAAA

and...

saligner will place gaps at the end of repetitive runs 
AAAAACCCCCCC--AAAA
AAAAACCCCCCCCCAAAA

VS.

bwa will place gaps at the start of repetitive runs
AAAAA--CCCCCCCAAAA
AAAAACCCCCCCCCAAAA

  • Removed obsolete mapper and aligners, reasons as follows:
    • bowtie_k1_mapper: included in bowtie_mapper
    • exact_mapper: included in bowtie_mapper
    • gap_mapper: slow Makinen implementation, replaced by bwa_mapper
    • mismatch_mapper: slow Makinen implementation, replaced by bowtie_mapper
    • jump_start_exact_mapper: included in bowtie_mapper
    • mutant_mapper: replaced by quality_mapper
    • one_mismatch_mapper: replaced by bowtie_mapper
  • There are currently 4 mappers remaining
    • bowtie_mapper: very fast and allows up to 3 mismatches, FM index based, DFS
    • quality_mapper: slight variation of the bowtie mapper, performs DFS favouring mismatches at low quality positions
    • bwa_mapper: slow, but allows arbitrary mismatches and indels, FM index based, BFS (DFS + indels = very inefficient)
    • kmer_mapper: hash-based, out of date... broken?
  • Did some preliminary time trials with the new bwa class
    • Takes 5 times as long as the bowtie_mapper at equivalent settings (3 mismatches, no gaps)
    • Takes about a quarter of the time of our very old gap_mapper at equal settings (opening 1 gap)
      • I will admit the old gap mapper had to use the local aligner to compute the final CIGAR...
  • More formal time trials against BWA and Makinen's readaligner and bowtie in --best mode would be beneficial...
    • ...but I'd like to try these after our custom index comes in, hopefully before my talk scheduled for 08/25/10

Next up:

  • Hash-based mappers
  • PE and half-index

08/09/10

Redirecting efforts to hash-based mappers:

  • Our current hash based system (Kmer mapper and index) was developed 4 months ago
  • From what I've read and seen these few months, I've decided to completely discard those classes...
  • I'm planning on adopting algorithms from 2 aligners, MAQ and BFAST (which are supposedly similar...)
    • MAQ hashes the reads; it was written by the same author as BWA, and was previously a very popular aligner which was also used by our clients
    • BFAST hashes the reference, and from its papers, I've been convinced that it has advantages over many other implementations
  • I've started reading up on some literature to get a feel for BFAST, which is based off BLAT, which is based off BLAST
    • For some perspective, my previous implementation was more similar to a very simplistic BLAST... so there's much room for improvement
  • After going through the papers, I hope to make sense of the code, create a class based off BFAST, then do the same for MAQ

Next up:

  • Hash-based methods
  • PE and half index... which I'm starting to feel I may not actually get to

08/10/10

As per Chris' request, I've documented BwaMapper, BowtieMapper and QualityMapper extensively in their code...

I've also glanced a bit at the documentation I have for the paired-end classes... and they seem sufficient for me, but may be cryptic for others. I'll probably supplement them tomorrow.

Next up:

  • Hash-based methods
  • PE and half index... which I'm starting to feel I may not actually get to

08/11/10

Supplemented the documentation in the paired end match maker class, and briefly read through yesterday's documentation and applied minor grammar and typo edits...

Also went through the BLAST, BLAT and BFAST papers mentioned two days ago:

  • Rather than illuminating what the algorithms entailed, the experience proved much more beneficial to my talk
    • ...as it gave a nice history of algorithm evolution
    • ...and a brief comparison (with nice tables and charts) of the speed and accuracy of the top aligners at each paper's date of publication
      • It showed that BWA and bowtie are indeed the fastest aligners... but also the worst in terms of accuracy
      • BFAST and MAQ are slower, but offer much greater accuracy

Next up:

  • Go through the Supplementary Materials for BFAST, which I hope may go a bit more in depth into the implementation and reasoning behind the algorithm
  • Assimilate BFAST code and create mapper class

08/12/10

Finished BFAST's highly informative supplementary materials and attempted to trace BFAST...

  • BFAST doesn't compile on my computer, but does on skorpios...
  • After some quick traces, the complexity is slightly overwhelming...

I've redirected my efforts to implementing a Maq-like mapper first:

  • Skimmed MAQ's paper and picked out and made sense of the implementation details
  • Started tracing MAQ

Next up:

  • Trace MAQ and implement a MAQ-like mapper
    • Worst-case scenario, implement something similar to MAQ and BFAST based on the description I've come to understand... although this would probably be exceedingly inefficient

08/13/10

Traced MAQ... and was appalled by the idea... but here it is nonetheless:

  • load reads, encode reads (they are encoded into this... mind-boggling recursive structure, but I believe it's just 2 bit encoded)
  • create filter patterns (12 by default)
  • for each set of filters (3 filters per set)
    • create an index from the reads, sort the index
    • for every reference in genome
      • for every word (32 chars) in the reference
        • try and find hit
        • if hit, add to candidate list
      • if the candidate list is full, or the reference is finished, process the list
        • perform pseudo-SW, keep good, toss bad
  • process list of good alignments, transform to SAM, sort by position

It can apparently handle 2 million reads against the human genome in a day...

The actual implementation is more than I can pick apart at the moment, so I opted to trace BFAST, which consists of a few phases:

  • build genome index
  • run reads against index to extract CAL (candidate alignment positions)
  • perform full SW for each CAL
  • return results

I'm in the midst of tracing the first phase:

  • load pre-processed reference into memory
    • pre-processing is 2 bit encoding, and parsing multiple contigs and extracting info for each
  • for each contig in reference
    • extract repeating kmer (discard if invalid, i.e. has N or lower case; a small encoding sketch follows this list)
      • record start position and contig of kmer
    • sort positions by alphabetical order of kmer
      • recursive merge sort, with shell sort base case
  • hash the sorted list
  • to be continued...
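
A small sketch of that kmer extraction/encoding step (my illustration of the idea, not BFAST's code; assuming A=0, C=1, G=2, T=3 packed 2 bits per base):

#include <stdint.h>

// Packs a kmer (k <= 32) into a 64-bit value, rejecting kmers that contain
// N or lower-case (masked) bases. Returns false if the kmer is invalid.
bool EncodeKmer(const char* seq, int k, uint64_t* value) {
  uint64_t v = 0;
  for (int i = 0; i < k; ++i) {
    int code;
    switch (seq[i]) {            // upper case only; anything else is invalid
      case 'A': code = 0; break;
      case 'C': code = 1; break;
      case 'G': code = 2; break;
      case 'T': code = 3; break;
      default:  return false;    // N, lower case, etc. -> discard this kmer
    }
    v = (v << 2) | code;
  }
  *value = v;
  return true;
}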

The basic reasoning behind the index is this:

  • if the kmer is too short, each kmer will obtain too many hits, which is bad for efficiency
  • if a kmer is too long, a lot of kmers will have no positions, which is a waste of space
    • the solution BFAST uses is to use longer kmers, and bin multiple kmers together
    • this decreases the chance of returning an empty hit, and when multiple hits are encountered, a quick binary search for the unique kmer within the bin can quickly locate the kmer in question

Next up:

  • Finish tracing BFAST
  • Combine the elements of the two hash-based mappers to create re-usable classes and piece them together... somehow...
    • I'm starting to believe that I'll probably implement my own from scratch based on theory rather than copy their classes

08/16/10

Continuing to trace BFAST.

The first index phase:

  • load pre-processed reference into memory
    • pre-processing is 2 bit encoding, and parsing multiple contigs and extracting info for each
  • for each contig in reference
    • extract repeating kmer (say... length 18) (discard if invalid, i.e. has N or lower case)
      • record start position and contig of kmer
      • NOTE: To fit a large index into memory, this step can alternatively record these value pairs to different files, depending on the kmer's prefix (i.e. generate kmer value using first 2 chars, sorting them into 16 files)
    • sort positions by alphabetical order of kmer
      • recursive merge sort, with shell sort base case
      • NOTE: In multi-file mode, sort each index separately. Appending all sorted indices will theoretically result in a single sorted index...
  • hash the sorted list (or lists... in multi index mode) of positions in a hash table of length 4^(length of hash)
    • for a counter i from 0 to the length of the genome
      • using the mask, position and the reference, generate a hash value (similar to, but not our standard kmer value...?)
        • NOTE: hashes come out in increasing values, due to the list being sorted
        • If the hash is identical to the last one, ignore it
        • If the hash is different than the last one, record i, using the hash value as the index
    • iterating over the somewhat filled hash table from the last position to the first
      • fill all empty entries with the last encountered valid value
        • the last region between the end of the table and the last valid value will be populated by invalid values, the rest of the table will have valid values (a rough sketch of this back-fill follows this list)
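
The back-fill over the hash table looks roughly like this (an illustration only; kEmpty marks slots that never received an index into the sorted list):

#include <stdint.h>
#include <vector>

static const uint32_t kEmpty = 0xFFFFFFFFu;  // unused hash table slot

// Walking from the end of the table to the start, every empty slot is filled
// with the nearest valid value to its right, so that a later lookup can use
// table[h] and table[h + 1] to bound its bin. Slots past the last valid value
// remain kEmpty (invalid).
void BackFillHashTable(std::vector<uint32_t>* table) {
  uint32_t last_valid = kEmpty;
  for (size_t j = table->size(); j-- > 0; ) {
    if ((*table)[j] != kEmpty)
      last_valid = (*table)[j];    // remember the most recent valid entry
    else
      (*table)[j] = last_valid;    // fill the gap with the value to its right
  }
}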

The second match phase, which will generate CAL for reads using the index:

  • Read in all reads from file, parse them and encode them
  • For each index
    • For each read
      • For each kmer in the read created by iterative offsets (skip invalid reads, those too short or those with N)
        • for each direction
          • Obtain kmer value, query hashtable:
            • query may be out of range (in multi index mode), no hits returned
            • query returns a valid i value; we record 2 values:
              • low, the i value obtained using the hash value as the index
              • high, the i value obtained using the hash value + 1 as the index, then subtracting 1
          • With a valid pair of low and high values, there are a few scenarios (a rough lookup sketch follows this list):
            • low > high (no hits), return nothing
            • low = high (1 hit), query the list of positions with value = i to return a CAL
            • low < high, conduct a binary search querying the list of positions with values between low and high until the specific i value is obtained (single or multiple CALs may be returned)
      • record the position, normalized against the offset, the mask (also normalized), the contig of the position, and the strand
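
That lookup plus binary search, roughly (illustrative names only: sorted_kmers[i] and positions[i] are the kmer value and genome position at index i of the sorted list, and table is the back-filled hash table from the indexing phase):

#include <stdint.h>
#include <vector>

static const uint32_t kEmptySlot = 0xFFFFFFFFu;  // unused hash table slot

// Returns the CALs (genome positions) for one kmer of a read: low and high
// bound the bin of kmers sharing the hash prefix, and a binary search inside
// the bin finds the exact kmer.
void LookupCals(const std::vector<uint64_t>& sorted_kmers,
                const std::vector<uint32_t>& positions,
                const std::vector<uint32_t>& table,
                uint32_t hash, uint64_t kmer,
                std::vector<uint32_t>* cals) {
  uint32_t low = table[hash];
  if (low == kEmptySlot) return;                     // no hits at all
  uint32_t high = (hash + 1 < table.size() && table[hash + 1] != kEmptySlot)
                      ? table[hash + 1] - 1
                      : static_cast<uint32_t>(positions.size() - 1);
  if (low > high) return;                            // empty bin for this hash
  while (low < high) {                               // find the leftmost match
    uint32_t mid = low + (high - low) / 2;
    if (sorted_kmers[mid] < kmer) low = mid + 1; else high = mid;
  }
  for (uint32_t i = low; i < positions.size() && sorted_kmers[i] == kmer; ++i)
    cals->push_back(positions[i]);                   // one or more CALs
}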

The third align phase, which will use SW to align the reads to the reference using the list of CALs generated in the last step

  • to be continued... although I can fortunately skip over large portions of this, as we already have a working vSW

Next up:

  • Continue tracing BFAST
  • create a BFAST mapper, with classes that can be used for creating a Maq mapper

08/17/10

Had a meeting with Chris and Jay:

  • instead of going into hash-based mappers, a few modifications should instead be made to the current mappers
  • instead of returning processed SAM entries, mappers should instead return the bare minimum information
    • the driver, using the values returned by the mapper, can then decide what to do
    • IDEA: drivers should be the main flow of control
  • all global macros and constants should go to config.h
    • other useful functions and constants should go to some utility class

Also helped Jay integrate the FM index into the bwa_mapper

  • compiles... after much work... but crashes

Next up:

  • integrate FM index into bowtie mapper with Jay
  • concentrate on talk...
    • This being said, not much non-talk progress will be happening once this gets underway

08/18/10

Integration of custom index into bowtie mapper completed

Next up:

  • Talk preparations
    • Finish integration with bwa mapper and quality mapper to obtain running times for talk
  • Abandon hash based mappers
  • Modify output
    • Instead of outputting full SAM, output bare minimum, and leave SAM output optional
    • This may occur during talk preparations for time comparisons too...
  • UPDATE: Unit tests for all created classes and mapper copy constructor

08/27/10

Recovered from an unplanned computer format, and finishing off the last few changes before I take my leave...

To do:

  • Finish integration of the FM index into the quality and bwa mappers, and check the correctness of all three
  • Change output method of mappers, so they return sp/ep pairs instead of SAM entries
  • Create an output formatter that is able to generate the full SAM entry from sp and ep
  • Add copy constructors into mappers

Will not be able to do:

  • Write formal unit tests for three mappers, although I do leave behind two .fq read files which I use to conduct manual "unit testing"...
  • Half index, for the bowtie mapper, idea as follows:
    • Runs all reads on all forward index steps, tracks all reads that failed to obtain any hits
    • Runs all marked reads with all reverse index steps
    • This way, only 1 index needs to be in memory at a time, and thus can be twice as big and much faster... (a control-flow sketch follows this list)
    • Some complications....
      • What about the case where a better reverse index hit exists, but a bad forward hit has already been found?
      • Potential solution: run all reads on the reverse index, regardless of whether they had hits on the forward index...
        • Yet another problem... you'll now need to track all results for all reads...
        • Incidentally, this was also the complication that I see no pretty way of resolving when using multiple hash tables
  • Advanced hash table implementations
    • Gapped seed support
    • Efficient algorithms... including that complication mentioned three points above
  • Pair end
    • See documentation in match_maker.cc
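
For the record, the control flow I had in mind for the half index is roughly the following (everything here is hypothetical; none of these classes, methods or file names exist in NGSA):

#include <string>
#include <vector>

struct Hit { long position; bool reverse_strand; };

class HalfIndex {  // stand-in for one half of the real index
 public:
  explicit HalfIndex(const std::string& path) { /* load that half (omitted) */ }
  std::vector<Hit> Map(const std::string& read) const {
    return std::vector<Hit>();   // placeholder; the real mapping goes here
  }
};

std::vector<Hit> MapWithHalfIndices(const std::vector<std::string>& reads) {
  std::vector<Hit> all_hits;
  std::vector<size_t> missed;               // reads with no forward hits
  {
    HalfIndex forward("genome.fwd.idx");    // hypothetical file name
    for (size_t i = 0; i < reads.size(); ++i) {
      std::vector<Hit> hits = forward.Map(reads[i]);
      if (hits.empty()) missed.push_back(i);  // queue for the second pass
      else all_hits.insert(all_hits.end(), hits.begin(), hits.end());
    }
  }                                         // forward half freed here
  HalfIndex reverse("genome.rev.idx");      // hypothetical file name
  for (size_t k = 0; k < missed.size(); ++k) {
    std::vector<Hit> hits = reverse.Map(reads[missed[k]]);
    all_hits.insert(all_hits.end(), hits.begin(), hits.end());
  }
  return all_hits;
}

This naive version keeps whatever the forward index finds, so it still suffers from the "better reverse hit" complication noted above.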

08/30/10

Finished integrating FM index into three main mappers.

Started simplifying the output returned by mappers.

Next up:

  • Finish simplifying output

08/31/10

Simplified mapper output:

  • All FM mappers return sp/ep results
  • All mapper classes return positions
  • An FM adaptor mapper class can encapsulate any FM mapper and make it return positions (a rough sketch follows)
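
The adaptor is basically this shape (simplified stand-ins for the real interfaces; I'm assuming a GetPosition()-style call on the wrapped FM mapper resolves a suffix array index to a genome position):

#include <string>
#include <vector>

struct SpEpRange { long sp; long ep; };  // one suffix array interval

class FmMapperInterface {                // simplified stand-in
 public:
  virtual ~FmMapperInterface() {}
  virtual std::vector<SpEpRange> Map(const std::string& read) = 0;
  virtual long GetPosition(long suffix_index) = 0;  // sp/ep -> genome position
};

// Wraps any FM mapper and converts its sp/ep results into plain positions,
// so callers never see the FM-specific representation.
class FmAdaptorMapper {
 public:
  explicit FmAdaptorMapper(FmMapperInterface* fm_mapper)
      : fm_mapper_(fm_mapper) {}

  std::vector<long> Map(const std::string& read) {
    std::vector<long> result;
    std::vector<SpEpRange> ranges = fm_mapper_->Map(read);
    for (size_t r = 0; r < ranges.size(); ++r)
      for (long i = ranges[r].sp; i <= ranges[r].ep; ++i)
        result.push_back(fm_mapper_->GetPosition(i));
    return result;
  }

 private:
  FmMapperInterface* fm_mapper_;
};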

Next up:

  • Finish position to SAM converter
  • Change bowtie/quality mapper to work in number space
    • Also use LFAll()

09/01/10

Finished removing SAM generation from all mappers, and made SAM generation an optional post process:

  • As mentioned yesterday, all mappers now return positions
  • To generate SAM entries, the following is done:
    • Using the position, a subsequence from the reference is extracted
      • This subsequence is typically slightly longer than the read itself, to allow for indels, if present (see the sketch after this list)
    • An alignment is conducted, and the aligned read and reference (with dashes) are returned
    • The dashed alignments are used to generate the CIGAR and MD
  • Some major problems:
    • LocalAligner has a tendency to lop off mismatches at the end of reads to generate a higher score
      • This is highly evident when aligning with mismatch_test.fq
    • Setting the mismatch penalty to 0 can prevent the removal of mismatches at one end, but not both...
      • When this is done, the CIGAR breaks... by inserting a lot of S operations (soft clipping, i.e. "free bases" in the alignment)
    • The indel preference of the SW algorithm is more similar to BWA, as described on 08/06/10, and thus, most of the solutions provided in indel_test.fq are incorrect...
    • Fixing these requires some changes to all aligners, and also the CIGAR and MD generation functions inside SamEntry
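
The window extraction itself is simple (a sketch, with max_indels standing in for however much padding the aligner is allowed):

#include <algorithm>
#include <string>

// Extracts a reference window around a mapped position, padded on both sides
// by the maximum number of indels allowed and clamped to the contig ends.
std::string ExtractReferenceWindow(const std::string& reference,
                                   long position, size_t read_length,
                                   size_t max_indels) {
  long begin = std::max(0L, position - static_cast<long>(max_indels));
  size_t length = read_length + 2 * max_indels;
  if (begin + static_cast<long>(length) > static_cast<long>(reference.size()))
    length = reference.size() - begin;   // clamp at the end of the reference
  return reference.substr(begin, length);
}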

Other changes to mapper:

  • LFPair and LFAll are used in all possible places
  • bowtie and quality mappers work completely in number space
    • bwa still works in letter space... and converts when required (conversion sketch below)
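
For reference, the letter space to number space conversion amounts to this (assuming the usual A=0, C=1, G=2, T=3 convention; not the actual NGSA utility):

// Letter space <-> number space conversion (N and anything unexpected map to
// -1 / 'N').
inline int BaseToNumber(char base) {
  switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
  }
}

inline char NumberToBase(int number) {
  static const char kBases[] = {'A', 'C', 'G', 'T'};
  return (number >= 0 && number < 4) ? kBases[number] : 'N';
}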

And assuming that last infinite loop in getPosition isn't a mapper problem, I guess that's it :D
