DanielAprilArchive - jujubix (last revised 2010-05-04)

04/01/10

The basic working bacterial aligner is complete, next up:
04/01/10

The basic working bacterial aligner is complete, next up:

  • Include quality information
    • i.e. fill in a meaningful mapping quality value in the SAM output
    • details are to be taken from the Maq paper. Specifically, the portion about "single-ended mapping qualities" in the methods section.
  • Include an alignment score, which is a tag in the SAM format
  • Inexact alignments with PH principle
    • Involves partitioning reads into multiple kmers and querying each against the reference
    • Of these multiple kmers, the ones with hits are seeds
  • Paired end reads...
    • Thinking about it...

04/02/10

Went through bowtie, maq, and bwa documentation, and compared to when I first started, I actually see some method to the madness...

Importantly, I've come to a few conclusions:

  • Mapping quality:
    • In Maq, it is the Phred-scaled probability that the read alignment is wrong (e.g. Mapping Quality of 30 = 1/1000 chance of error, 20 = 1/100 chance of error)
    • In BWA, it is the same in theory as Maq, but the exact value calculated differs from Maq when the same input is used
    • bowtie... doesn't compute a meaningful mapping quality, instead, uses 255 for aligned reads, and 0 for non-aligned reads
    • I assume we want some meaningful value like Maq's (or preferably the corrected BWA formula)
      • Unfortunately, the BWA paper doesn't explain the formula in depth, so I have no clue what adjustment was made
      • The Maq paper does give the formula... but refers the reader to a supplemental section to derive one of the terms
        • Scanning the supplement materials, I have not been able to make much sense of it...
      • Long story short, I have no clue how to calculate a mapping score the way maq does... (short of staring at the code and digging it out, which I have tried a bit without success)
    • Hence, I will leave this for later, and take bowtie's 0 or 255 approach for now.
    • I'll probably check out what Mosaik does if I have time... as I recall it also has SAM output
  • Alignment Score
    • None of the three aforementioned programs has this value in its final output, and I assume it's a property more specific to SW alignments
    • That being said, none of the papers have mentioned any approach of using base qualities to calculate an alignment score
    • Instead, both Maq and bowtie only use base qualities when considering mismatches (and maq/bwa uses them to calculate mapping quality)
      • Specifically, given multiple hits with equal number of mismatches, the hit whose mismatched bases' base qualities sum to a lower score is deemed the better hit
    • Thus, for now, I'll have this as the raw SW alignment score, appended as an AS:i:## tag as specified in section 2.2.4 of the SAM specifications
  • The "third score"
    • From what I've read about Maq's quality score, and what Sohrab once described months ago, I have a slight idea of what this "third score" may be about...
      • In Maq, when a read maps equally well (i.e. equal mapping quality) to multiple positions on the reference, "MAQ will randomly pick one position and give it a mapping quality of zero"
        • Incidentally, Mosaik's manual questions this behaviour, and considers picking a random position odd... to say the least
        • Mosaik's solution is to either keep them all, or throw them all out, decided by an option argument
      • Sohrab said something about splitting the score and dividing it between the multiple positions?
        • I think Sohrab wants something more similar to Mosaik's implementation of keeping them all, and assigning some sort of compromised non-zero score to it
    • By definition, one cannot divide mapping scores between multiple reads, so as this isn't a mapping score, or an alignment score, maybe this is that "third score"?
      • I mean, by definition, if a read aligns perfectly to two locations, then there is a 1/2 chance that the chosen position is wrong (ignoring other errors), and that by itself is a mapping quality of about -10 * log10(1/2) ≈ 3, which isn't all that useful either
    • I guess, where mapping quality is a "probability that the read is in the wrong position", this third score would be a "probability that the read is derived from this position of the reference" (thus it can be divided when multiple reference positions occur)
  • Inexact alignments
    • My current implementation handles only exact seed hits, and allows any mismatches outside this initial k-base (e.g. k = 8) prefix, but let's just say it handles "exact matches in the seed" for simplicity's sake...
    • Per your instructions, I assume the following is what you suggest:
      • Let's assume we permit up to 3 mismatches in the first 28 bases of a sequence (maq and bowtie also use this value)
      • We would then divide 28 into 3 + 1 equal parts, and query the index with these 4 segments
      • Assuming that the read exists in the reference with 3 mismatches or less, at least 1 segment must return an exact hit, by the pigeonhole principle...
      • We proceed to align all hits, extracting the correct reference subsequence based on the relative position of the segment within the read...
        • ...instead of assuming it's always the start of the read, as previously done for exact hits
    • You also mentioned "Perhaps its easiest to ignore gaps for the moment, and just check the hamming distance."
      • Quite frankly, I've got no clue what that statement means...
      • I know of hamming distances, as they're related to hamming weights, which I had to lookup for my talk, but other than that... I'm confused.
  • Paired-end reads
    • No progress or plans on this as of yet, but I'll probably see what maq and bowtie outputs, and try to replicate it
  • Other changes (in the spirit of following bowtie and maq output)
    • For the aligner to consider alignment on the reverse strand...
    • A base-quality sensitive method of choosing best hits, instead of my maximum-alignment-score method
      • Previously stated, choosing the match with the lowest value when summing the base qualities of mismatched bases (if any)
    • Add 2 extra tags to the SAM output
      • NM:i:## for the number of mismatched nucleotides
      • MD:Z:"" for a string encoding of the mismatched nucleotides
  • Some other outstanding questions:
    • For the SW alignment, during a specific alignment, do we ever expect multiple alignments to derive the same maximal score? (i.e. should/can traceback return multiple alignments?)
    • I have slight concerns that much of what I'm doing seems to be hardcoded into the "driver program". I assume I should start pulling out classes and dropping them back into the ngsa library...
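The segment-partitioning idea and the Hamming-distance check above both lend themselves to a few lines of code. A hypothetical sketch (function names are mine, not the project's): to tolerate m mismatches in the first len bases, split them into m + 1 non-overlapping segments; and the Hamming distance is simply the count of differing positions between two equal-length strings, which seems to be exactly the gap-free comparison the quoted suggestion means.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Pigeonhole seeding: split the first `len` bases of a read into
// max_mismatches + 1 non-overlapping segments. If the read occurs in the
// reference with at most max_mismatches mismatches, at least one segment
// must match the reference exactly.
std::vector<std::string> PartitionSeed(const std::string& read,
                                       int len, int max_mismatches) {
  std::vector<std::string> segments;
  int parts = max_mismatches + 1;
  int base = len / parts;  // segment length (last segment absorbs remainder)
  for (int i = 0; i < parts; ++i) {
    int start = i * base;
    int size = (i == parts - 1) ? len - start : base;
    segments.push_back(read.substr(start, size));
  }
  return segments;
}

// Hamming distance between equal-length strings: the number of positions
// at which they differ (mismatches only, no gaps considered).
int HammingDistance(const std::string& a, const std::string& b) {
  assert(a.size() == b.size());
  int d = 0;
  for (size_t i = 0; i < a.size(); ++i)
    if (a[i] != b[i]) ++d;
  return d;
}
```

With the 28-base / 3-mismatch parameters above, this yields four 7-base segments per read.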

04/03/10

Made some changes to the bacterial aligner:

  • Now considers the reverse complement reference strand when aligning
    • Implemented by aligning the reverse complement read to the unmodified reference
      • The seed is also extracted from the last k bases (instead of the first k bases), as these are the ones with the highest quality
    • It seems to work correctly, insofar as the output is identical to bowtie's for reads with perfect seed hits
      • It can successfully find when the best alignment is on the reverse strand when the seed is a perfect hit
      • When the seed is not a perfect hit (i.e. SNP occurs within the first k bases), it will consistently map it to a very bad position
      • Interestingly, if the SNP is very close to the other end (not the seeded end), it will often be clipped by my SW alignment (yes, it's not a bug by the algorithm, but you're losing nice SNP data doing so)
    • Unfortunately, the aligner now runs ridiculously slow (about 5 times slower, keeping in mind that the original took 2 minutes)
      • I think reversing and complementing the string sequence is not a very efficient operation
  • To accommodate the above change, a few classes were modified and created:
    • sequence_utils.h
      • Has some miscellaneous string functions for reversing and/or complementing nucleotide sequences
    • FASTQ Entry
      • Uses the aforementioned functions to reverse both the sequence and the quality
      • Also has a const function that returns a reversed sequence only
    • SAM Entry
      • Previously a struct, now a class
      • Contains a CIGAR-generating function
      • ...and LOTS of accessors and mutators
      • and a ToString function for writing to output
  • Next up:
    • Generating the tags for mismatched nucleotide count and the actual mismatched nucleotide
    • Inexact seed alignments
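Since reversing and complementing seems to be the bottleneck, here is a hedged sketch of what the helpers in sequence_utils.h might look like (actual names and signatures in the project may well differ); a single reverse pass plus an in-place transform avoids repeated string reallocation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Map a nucleotide to its complement; anything outside GCTA becomes N.
char Complement(char base) {
  switch (base) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
  }
}

// Reverse-complement a sequence: build the reversed string in one pass via
// reverse iterators, then complement each base in place.
std::string ReverseComplement(const std::string& seq) {
  std::string rc(seq.rbegin(), seq.rend());
  std::transform(rc.begin(), rc.end(), rc.begin(), Complement);
  return rc;
}
```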

I guess I should also mention that the test data I'm using is 10000 35bp reads from the first 10000 bases of the E. coli genome, with 10 SNPs

  • Currently, my output gets nearly all of the reads the same as bowtie, but probably needs inexact reads to get all reads the same as bowtie
  • Of course, using this dataset, it also means I have no clue how it'll handle indels... (although I could easily introduce them by messing with the reference)

Incidentally, I looked a bit into what Mosaik does for Mapping Qualities:

  • They're called "Alignment Qualities" in Mosaik
    • They are technology specific, as different sequencing technologies make different types of errors (substitutions vs. indels)
    • They use logistic regression... which I haven't the slightest clue of
  • They are the same in theory as the ones in Maq, outlined on Page 9 of the current documentation, namely:
    • quality decreases as reference length increases
    • quality decreases as alignment mismatches increase
    • quality increases as read length increases
    • quality increases as "information content" increases (I assume this is base quality)
  • Unfortunately, Mosaik having not yet been published, the documentation doesn't describe exactly how this is done
  • I've stared at the code, and the function that calculates the alignment quality is wrapped in a nice class
    • Unfortunately, it doesn't look that easy to comprehend

04/06/10

Added functions to calculate the two tags present in both BWA and bowtie output, namely:

  • a function to compute the differences between the aligned reference and read (ignoring clipped sections)
  • a function to encode the reference relative to the query
    • Given this encoded string, one can derive the reference with just the SAM file, without the actual reference
  • in order to accommodate these new tags (and potentially other new tags), I've added a tags_ data member to the SAM Entry class, which is a vector of strings
    • Of course, being private, updating and changing tags in tags_ is rather cumbersome, so improvements are definitely needed
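For the gapless case, the two tag computations can be sketched as follows (a hypothetical illustration; the real functions may differ). NM counts mismatched bases, and MD encodes runs of matches as numbers with each mismatch written as the *reference* base, which is what makes the reference recoverable from the SAM file alone.

```cpp
#include <cassert>
#include <string>

// NM: number of mismatched bases between the aligned reference and read
// (equal-length, clipping already removed, no indels).
int ComputeNM(const std::string& ref, const std::string& read) {
  int nm = 0;
  for (size_t i = 0; i < ref.size(); ++i)
    if (ref[i] != read[i]) ++nm;
  return nm;
}

// MD: runs of matches emitted as counts, each mismatch emitted as the
// reference base, e.g. "3T4" = 3 matches, ref base T mismatched, 4 matches.
std::string ComputeMD(const std::string& ref, const std::string& read) {
  std::string md;
  int run = 0;  // length of the current run of matching bases
  for (size_t i = 0; i < ref.size(); ++i) {
    if (ref[i] == read[i]) {
      ++run;
    } else {
      md += std::to_string(run);  // emit match count (possibly 0)
      md += ref[i];               // then the mismatched reference base
      run = 0;
    }
  }
  md += std::to_string(run);      // trailing match count
  return md;
}
```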

I'm starting to be a little concerned that what was supposed to be a simple "driver program" has now turned quite large and complicated in an attempt to keep the core parts of the aligner modularized and easily replaceable.

Next up:

  • Inexact seeds (...and creating a ridiculously inefficient aligner)

04/07/10

Modified the aligner so it will allow some mismatch in the seed region (currently 2k):

  • Instead of 1 seed, for every read it will take two seeds
    • two non-overlapping seeds of length k are taken from the high-quality end of the read
  • This also means that the number of seeds considered is effectively doubled
  • As expected, the run time has also doubled (now taking ~15 minutes)

With these inexact seeds, the aligner output is now essentially identical to those generated by bowtie (correct positions for all reads). However, as previously mentioned, due to the rather high mismatch penalty, my aligner will often clip SNPs on edges of reads.

I've also appended the basic SW alignment score as a tag.
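For reference, a minimal Smith-Waterman score computation of the kind the AS:i tag would carry; the scoring parameters here are illustrative, not the aligner's actual values, and the traceback is omitted since only the score is tagged.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Smith-Waterman local alignment *score* only: fill the DP matrix H, where
// each cell is the best of extending diagonally (match/mismatch), opening a
// gap in either sequence, or restarting (0), and return the matrix maximum.
int SmithWatermanScore(const std::string& a, const std::string& b,
                       int match = 2, int mismatch = -1, int gap = -1) {
  size_t n = a.size(), m = b.size();
  std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1, 0));
  int best = 0;
  for (size_t i = 1; i <= n; ++i) {
    for (size_t j = 1; j <= m; ++j) {
      int diag = H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
      H[i][j] = std::max({0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap});
      best = std::max(best, H[i][j]);
    }
  }
  return best;
}
```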

As for paired-end alignments I assume it'll probably work out as follows, but I guess I should confirm with Chris before I move on here:

  • In the normal case, both ends are aligned separately and independently
  • However, when one of the two pairs has an ambiguous position (e.g. multiple best seeds)
    • Using the better-aligned end as the "anchor", a decision can be made to pick the better seed (i.e. the one closer to the average insert distance)
    • Some changes needed to accommodate this
      • Keep track of the average insert length
      • Some way to store alignment of all previous alignments, including potential anchors (can you write to and read from a file simultaneously...? or perhaps a first output file?)
      • Some way to quickly find alignment results of anchors when needed (such as opening and scanning the first output file)
      • As the first read of a pair cannot be assumed to always be the anchor, some smart way is needed to handle the case where the anchor is the later-aligned read...
        • Align them in parallel...?
        • Mark down low-scoring alignments on the first run to potentially re-align later (which presents some problems in the write-only SAM file)
          • I think Maq does a brute force SW over a stretch of the reference to save low-scoring pairs
  • Regardless of how the pairs are aligned in the end, the SAM file will need to undergo some major changes to reflect paired alignments, including:
    • flag values (lots of small checks)
    • reference of mate (assumed to be the same for now)
    • mate position (tricky if one or both reads align to the reverse strand)
    • inferred insert size (I think this is simply the difference using the above value)
      • Of course, many of these will require some way to sync up the reads by their name, whether that be scanning output written to some file, or some ridiculously large structure kept in memory... or some smart way of feeding reads into the aligner... I haven't a clue how best to do this...

As seen then, due to the rather... tricky nature of the ambiguous case, I'll wait for some input from Chris. Of course, I could just completely forget about that middle "ambiguous case" for now (as the resolution will probably depend highly on the aligner and index used), and focus on getting the pair-end-specific SAM outputs correct, which seems to be a rather tricky problem itself.

And finally, as previously mentioned, I'm a bit lost as what to do for Alignment Quality and Mapping Quality.

Small update about performance:

k value   index time   align time   total time
7         ?            ?            ~15 min
10        15 s         59 s         74 s
11        23 s         33 s         56 s
12        53 s         31 s         84 s

Apparently, I had been using a very bad k value... I have changed k to 11, to strike a balance between index time and alignment time.

04/08/10

Met with Chris today, some notes:

  • A few changes that could increase my aligner performance
    • Compile with ccmake as "build release", to turn on all compiler optimization flags
    • Pack the reference and work with that, instead of pulling out reference subsequences and packing those individually
  • Start a To Do list on the wiki
    • Now linked above
  • In cases with inexact seeds, use the mutant generator
    • When querying a succinct text index, it's supposedly faster with a perfect hit than an inexact hit
    • We hope that one of the many mutants generated will get a perfect hit, faster than it would take to find an inexact hit
  • Figure out Mapping Qualities... as if I was to give a talk on it...
  • Pull out things from the aligner
  • Handle paired-end reads as two parallel-running single-read instances, and generate the correct PE SAM output

04/09/10

Met with Chris and Sohrab today at 1300, and spent a few hours before reading up on Maq qualities, some notes:

  • RNA-Seq (transcriptome) data specific needs
    • Works in letter space
    • Align against reference with introns + exon-exon junction data
      • Needs to handle split reads (i.e. a split with two halves aligning to exons split by an intron on the reference)
      • Needs to handle residual introns within reads (when aligned against the junctions)
      • Needs to handle residual flanking introns in reads
    • Eliminate the need to use the full reference + junction data
      • Novel splice graph structure?
  • SOLiD genome data specific needs
    • Works in colour space
    • Fast with small memory footprint
      • Non-hash table index?
  • General needs
    • User-specific choice of how to handle multimapped reads (i.e. read mapping equally well to multiple positions)
    • Two new qualities (3 and 4):
      • 1) Alignment Score (computed by SW alignment, based on alignment only)
      • 2) Mapping Quality (defined by MAQ, a phred-scaled probability that the aligned position of the read is incorrect, given the read and the reference)
        • Sohrab finds Mosaik's calculation of this most palatable
      • 3) Uniqueness (how uniquely this read maps to the reference)
        • For now... let's just make it a percentage, with 100% for a unique 1 position hit, 50% for two positions... etc...
      • 4) Fragment Quality (a probability that the aligned position of the fragment is incorrect, given the alignment of its two pair-ends and the reference)
        • Posterior probability...? Similar to MAQ's Mapping Quality formula?
        • High if both ends map within an expected distance
        • Lower if distance unexpected
        • Undefined behavior if one or both ends not mapped (technically, there would be no "fragment"...)
    • Handle reads about 75bp in length, up to 128bp
    • CIGAR... padding would be nice
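Taking quality 3 (uniqueness) literally as defined in the meeting notes, a trivial sketch (the name and the treatment of the unmapped case are my own assumptions):

```cpp
#include <cassert>

// Uniqueness as a percentage: 100% for a unique best hit, 50% for two
// equally good positions, and so on. For an unmapped read (0 hits) the
// value is arguably undefined; 0 is used here as a placeholder.
double UniquenessPercent(int num_best_hits) {
  if (num_best_hits <= 0) return 0.0;
  return 100.0 / num_best_hits;
}
```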

As previously planned then, I will continue to work out how to incorporate mapping qualities into my aligner. Having read the MAQ paper in detail, I now have a very clear idea of what this value is, and how it's derived... in theory. Instead of trying to figure out exactly how it's implemented in MAQ (the theoretical formula cannot be used directly as is, as one of its steps would be too inefficient in practice), I will instead concentrate on Mosaik's implementation, as this is what Sohrab wants eventually. Having looked at Mosaik's class, the function is straightforward, except for the actual computational part, which is done by iterating through a neural network using pre-defined matrices.

I highly doubt I'll be able to comprehend the exact logic behind the neural network any time soon (and I'm even more clueless as to how those specific pre-defined matrices were selected), but due to the way the function was setup, I believe I can easily modify the input and output steps of the function and insert it into my aligner.

Note to self: rename "TO DO" to "Specifications", add items from meeting, then scour old wiki entries for others...

04/12/10

Some coding today:

  • Rather than deciphering MAQ's paper on "Mapping Qualities", or figure out how Mosaik calculates its analogous "Alignment Quality", I have simply imported Mosaik's implementation into the aligner
  • I've changed the output and input of Mosaik's CalculateQuality function to work with my aligner
  • With this change, nearly all quality values generated by my aligner match those output by Mosaik on the same dataset
    • Some minor differences exist, due to the way Mosaik handles multimaps
    • Some other quality differences exist, which I suspect stems from a different alignment algorithm (I suspect... I honestly have no clue)
    • Incidentally, my computer doesn't have sufficient memory to run Mosaik...
  • I've also taken Chris's advice and reconfigured my cmake builds to be "Release" builds, cutting the run time to 26 seconds

Next up for coding:

  • Processing PE output (treating them as parallel SE runs, putting the whole "rescuing" business aside for now)

Other things that I should tackle one day...

  • Integrating mutant seeds
  • Packing the reference (maybe once I get into optimization)

And finally... I have some wiki house keeping to do.

04/13/10

Modified the bacterial aligner to handle pair-end data on a new branch. Some notes:

  • Abstracted the single end alignment process into a function, which returns a SAM entry
  • Opened two FastqFile objects, and read one FASTQ read from each file at a time, feeding them into the aforementioned function
  • LOTS of cross-checking between the two resultant files was done to modify them according to SAM specifications
  • The end result is a single SAM output conforming to SAM guidelines (at least, as much as bowtie and MAQ conform to it, as my output is essentially the same as both of theirs for PE data)
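The cross-checking above mostly concerns the FLAG field, whose bit values come from the SAM specification (0x1 paired, 0x2 proper pair, 0x4/0x8 read/mate unmapped, 0x10/0x20 read/mate on reverse strand, 0x40/0x80 first/second in pair). A sketch of how a paired-end FLAG might be assembled (function name and argument set are mine):

```cpp
#include <cassert>

// Assemble a SAM FLAG for one end of a read pair from its alignment state.
// Bit values are those defined in the SAM specification.
int PairedEndFlag(bool mapped, bool mate_mapped, bool reverse,
                  bool mate_reverse, bool first_in_pair, bool proper_pair) {
  int flag = 0x1;                      // read is paired
  if (proper_pair)  flag |= 0x2;       // mapped in a proper pair
  if (!mapped)      flag |= 0x4;       // this end unmapped
  if (!mate_mapped) flag |= 0x8;       // mate unmapped
  if (reverse)      flag |= 0x10;      // this end on reverse strand
  if (mate_reverse) flag |= 0x20;      // mate on reverse strand
  flag |= first_in_pair ? 0x40 : 0x80; // position within the pair
  return flag;
}
```

The familiar 99/147 values for a properly paired forward/reverse pair fall out directly.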

Next up...

  • Packing the reference...
  • Integrating mutant seeds...

Personally, I think that working with a different aligner/index would put these two more into context... (and potentially eliminate one)

Nonetheless, I'll see what Chris has to say.

04/20/10

Resuming work after final exams.

Starting to look into replacing my kmer index with existing implementations of succinct full-text indexes. Progress as follows:

  • looked into the implementations at PizzaChili
  • specifically, studied the API
  • attempted and failed to incorporate the DNA specific compressed suffix array
    • exact reasons unknown, but it's not due to cmake... probably something with the compiler, and my poor understanding of how to use C libraries in a C++ program...
  • under that assumption, I tried the Compressed Compact Suffix Array, implemented in C++, and had no troubles getting it to compile
  • while all the implementations at PizzaChili share the same interface, a lot of type casting is required to use them...

Next:

  • incorporating the CCSA into my aligner
  • potentially build an interface on top of the PizzaChili interface to simplify access commands
    • I'm somewhat skeptical about this, since I believe that in the end, the types used by PizzaChili may be better in terms of performance... but I'll treat that as part of optimization and leave it aside for now.

04/21/10

Incorporated the CCSA into my aligner, details:

  • Split the .cpp and .h files into their respective locations in the project directory, modified the CMakeLists to build a library
  • Unfortunately, CCSA also has an internal C library... and that had to be pulled out, explicitly built, then linked to the aforementioned library
    • Much headache and time spent getting the above two steps to work...
  • Created an OOP-styled container class that greatly abstracts the original CCSA implementation
    • For example, a function will take a single string, and return the result
    • ...instead of supplying a pointer to an array of unsigned chars, the length of the unsigned chars, multiple references for the various parameters of the results to be written to, a void pointer to the index and returning a status code
    • For now, I'm aiming to make everything as friendly as possible... to make things easier for myself
    • I have a bad feeling that this will have to change when we get to optimization...
  • Created an IndexInterface class that will be inherited by all index implementations
    • Successfully inherited by KmerIndex and CompressedCompactSuffixArray classes
  • Modified the bacterial aligner and helper functions to call indexes in a polymorphic fashion
    • Swapping indexes is now as simple as changing a single line
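The interface might look something like the following (a hypothetical sketch; the real IndexInterface, and its method names, may differ), with a trivial linear-scan index standing in for KmerIndex or CompressedCompactSuffixArray to show polymorphic use:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Pure virtual base for all index implementations, so the aligner can call
// any index through the same interface.
class IndexInterface {
 public:
  virtual ~IndexInterface() {}
  virtual void Build(const std::string& reference) = 0;
  // Return all positions where `query` occurs exactly in the reference.
  virtual std::vector<size_t> Locate(const std::string& query) const = 0;
};

// Trivial stand-in implementation (linear scan) just to demonstrate usage;
// a real index would be a k-mer hash or a succinct full-text structure.
class NaiveIndex : public IndexInterface {
 public:
  void Build(const std::string& reference) { ref_ = reference; }
  std::vector<size_t> Locate(const std::string& query) const {
    std::vector<size_t> hits;
    for (size_t p = ref_.find(query); p != std::string::npos;
         p = ref_.find(query, p + 1))
      hits.push_back(p);
    return hits;
  }
 private:
  std::string ref_;
};
```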

Long story short, I can now swap a KmerIndex and CompressedCompactSuffixArray by doing the following:

// KmerIndex index(k);
CompressedCompactSuffixArray index;

The time to build the index has now dropped from 12 seconds to 3 seconds, alignment time remains the same (so index access difference is probably negligible... for 1000 reads).

With this set up, importing indexes from PizzaChili should be simple in terms of coding (although getting them to work with CMake has made me cry on several occasions...)

Need to ask Chris what to do next...

04/22/10

Paraphrasing an email from Chris, the next plan is for me to improve two things with the current aligner:

  • Pair-end alignment strategies
    • No strategy: treat reads as separate independent single reads
      • What I currently do
    • Strict insert size: only allow pairs with reasonable insert sizes
    • Other strategies: see what other aligners do
  • Improve exact alignments
    • Support mismatches
    • Support gaps

Personally, I feel that pair-end resolution is a sub-case of resolving multimaps, so rather than focusing on pair-end strategies, I will focus on resolving multimaps, which will undoubtedly draw on pair-end data when available.

  • Multimap resolution strategies
    • Return the last positions found (current strategy)
    • Pick a random position (MAQ's strategy)
    • Discard all maps
    • Report all maps
    • Pick one intelligently
      • Infer from pair-end
      • Use qualities
    • See what other aligners do

As for the inexact alignments... I will have to clarify a few things with Chris before tackling that in full force.

Finally, to test both of these, I will require a more realistic and larger dataset.

  • As I type this, I'm currently downloading an 11GB "dataset" file from Sohrab's lab
    • I will not know exactly what is in this file until sometime around midnight, but from the description of the software, I hope it will have pair-end fastq read files
    • If I use these files, I'll probably take excerpts from these files
  • If that fails, I will proceed to checkout the 1000 genomes project

Also, one of these days, I should go through all existing files and do the following:

  • remove trailing whitespace
  • minimize includes in header file, and resolve include dependencies in non-header files
  • change the way ngsa files are included for non-library sources

04/23/10

Continued the hunt for an appropriate dataset:

  • The 11GB "dataset" extracted into a 40 gig tar file... that was corrupt, so naturally I couldn't use it (and from what little I could extract, it seemed to be an indexed reference).
    • In retrospect, this may have been a good thing, as I have extreme technical difficulties manipulating text files over a few hundred megabytes
  • I then realized that using any human data would be a bad idea, for two reasons
    • I'd rather not deal with multiple chromosomes for the reference at the moment
      • I believe one would simply concatenate all the sequences into a super-reference, keeping track of the positions that separate the chromosomes
      • Post-processing the position returned from the super-reference would be done to determine the respective position on the specific chromosome
    • Alternatively, I could deal with a single chromosome, but even the smallest one is ~50 million bp (10 times of E. coli)
  • Thus, I'm going to stick to E. coli for a little longer
    • Having said that, I obtained pair-end data reads for E. coli K-12 MG1655 from the NCBI Sequence Reads Archive
    • I tested the 4 million pairs by aligning it to the E. coli 536 reference I was previously using with BWA... I naively thought they'd be similar enough to get decent results
      • I was wrong...
    • I redid it after obtaining the genome it supposedly came from
      • The results were still dreadful...
    • While this would be very good for simulating very bad data later... I will leave that for a later day...
  • Finally, I settled on using a dataset containing both reference and fastq pair-end reads from CLC bio
    • 2.5 million pairs aligned nicely to the reference using BWA, but had enough noise and errors to make testing relevant.
    • Occasionally, the errors do occur on the lowest quality bases, but this was anything but consistent
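The super-reference bookkeeping sketched above (concatenate all chromosomes, remember each one's starting offset, then map a global hit position back to a per-chromosome position) reduces to a binary search over the offsets; names here are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Map a position in the concatenated super-reference back to
// (chromosome index, local position). `offsets` holds each chromosome's
// starting offset, sorted ascending and beginning with 0.
std::pair<size_t, size_t> GlobalToLocal(const std::vector<size_t>& offsets,
                                        size_t global_pos) {
  // The first offset strictly greater than global_pos, minus one,
  // identifies the chromosome containing the position.
  size_t chrom = std::upper_bound(offsets.begin(), offsets.end(),
                                  global_pos) - offsets.begin() - 1;
  return std::make_pair(chrom, global_pos - offsets[chrom]);
}
```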

Finally settled on a dataset, started working on a completely different direction than what was previously planned...

  • After discussing with Chris over email, I've decided to implement what I call the "exact read as seed" aligner
  • Simply put, using the "locate" function of succinct indexes, I replace the alignment process with a simple lookup
    • Consequently, "alignment" of 1000 pairs has now been reduced from 3 seconds to half a second
  • Of course, there are two things that I need to work on resolving next:
    • How to handle multimaps... which do occur
      • I imagine this will tie in all the pair end stuff
    • How to handle the case of no maps (which oddly... have not occurred, but that's probably just the dataset)
      • This will involve inexact matches
        • The naive way is to use the mutate function
        • Alternatively, I can modify the locate function to allow a maximum number of mismatches via backtracking, which will be more efficient than the aforementioned method
          • Of course... this will also be just a bit harder in terms of coding... a bit.
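The naive mutate-function approach can be sketched as follows (hypothetical names): generate every 1-mismatch variant of a seed so that each can be located as an exact hit. For a seed of length k over {A, C, G, T} this yields 3k mutants per seed, which is why backtracking inside the locate itself should scale better.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Generate all sequences at Hamming distance exactly 1 from `seed`:
// for each position, substitute each of the three other bases.
std::vector<std::string> MutateOneMismatch(const std::string& seed) {
  static const char kBases[] = {'A', 'C', 'G', 'T'};
  std::vector<std::string> mutants;
  for (size_t i = 0; i < seed.size(); ++i) {
    for (char b : kBases) {
      if (b == seed[i]) continue;  // skip the original base
      std::string m = seed;
      m[i] = b;
      mutants.push_back(m);
    }
  }
  return mutants;
}
```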

Incidentally... I'm still using the tiny 1000 read pair end dataset I've used up until now (which takes 4 seconds to align), as I see no real need to use the larger dataset yet... and I most definitely will not be using the entire thing anytime soon anyway (BWA takes a good 10~20 minutes to align it... so I'd rather not think about what mine will do...).

Next up then:

  • Resolving multimaps
  • The naive inexact seed implementation with mutants
  • Give the more advanced back-tracking inexact locate a stab, and if that doesn't seem even remotely possible... seek help from Chris

Incidentally... is there a good way to track bugs and to do's with git?

  • I'd like to keep in mind that the KmerIndex crashes when you give it a query with characters other than "GCTA"

04/26/10

Modified the bacterial aligner to resolve multimaps, and started to add support for pair end resolution:

  • First, a general note that from going over Mosaik and BWA, I noticed that they handled pair-end reads as follows:
    • Steps:
      • Align the two sets of data separately as if they were individual reads and output them
      • Post-process two set of outputs, and link together all the pairs
    • Although counter-intuitive at first (as I would've paired them up DURING alignment, not after), I've decided to adopt this pipeline, as it allows multimap resolution and pair-end to be separated to an extent
  • Multimap resolution then, only applies strictly to single-end alignments
    • By changing a boolean then, two ways to handle multimaps are possible
      • Treat it as unmapped
      • Report all multimapped locations, writing each to the file with a unique name (by appending a decimal number)
    • I considered adopting a random select approach, but doing so would require more than a boolean decider... so I decided against it
    • Similarly, I refrained from doing any "smart" selecting (given just single-end data), as I don't exactly know how to do so, and decided to leave that for downstream analysis
  • Pair-end resolution is done by taking the two sets of results from the individual alignments
    • There will be three general types of results: unmapped, unique map, multimap.
    • Consequently, there will be 3 x 3 = 9 different combinations of results, which collapse into 6 unique cases that need to be handled:
      • Both unmapped: no resolution possible, fragment lost
      • Unmapped + unique: a search function can do a SW to find the unmapped end
      • Both unique: no explicit resolution required, just mate the two ends to generate the correct fragment data
      • Unique + multimap: use a select function to use the unique end to select the best location from the multimap site
      • Unmapped + multimap: use the search function across all possible locations, and return the best resulting fragment
      • Both multimap: use select in a cross-product fashion
    • Three key functions have been implemented:
      • mate, previously implemented to handle basic pair-end data, modified and debugged
        • A problem is that it'll accept even erroneous mappings with illogical fragment sizes
      • select, basic version implemented, but relies on the average fragment size
        • It's not actually a true "average", as a plain average is too sensitive to outliers (erroneous mappings)
          • I do this by keeping a total insert size and a fragment count, and I only update it if the absolute size of the new fragment is below 500bp
        • This average is initialized to 0; I worry about what happens if the very first pair of reads requires this function
      • search, to be implemented, will probably be a simple SW
    • The many variations should be a simple matter of calling these basic functions
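
As a sanity check of the logic above, here is a minimal sketch of the 3 x 3 dispatch, assuming mate, select, and search behave as described; all signatures and details are my own assumptions, not the actual aligner code.

```python
def classify(hits):
    """Classify one end's alignment result by its number of hits."""
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "multimap"

def resolve_pair(hits1, hits2, mate, select, search):
    """Dispatch the 6 unique pair-end cases; returns a fragment or None.

    Assumed callables: mate(a, b) mates two unique ends, select(a, multi)
    picks the best fragment anchored at a, search(a) does an SW search for
    the missing mate of a. All are hypothetical stand-ins.
    """
    a, b = classify(hits1), classify(hits2)
    # Normalize order so the better-resolved end comes first.
    rank = {"unique": 0, "multimap": 1, "unmapped": 2}
    if rank[a] > rank[b]:
        hits1, hits2, a, b = hits2, hits1, b, a
    if a == "unmapped":                      # both unmapped: fragment lost
        return None
    if a == "unique" and b == "unique":      # just mate the two ends
        return mate(hits1[0], hits2[0])
    if a == "unique" and b == "multimap":    # pick the best multimap site
        return select(hits1[0], hits2)
    if a == "unique" and b == "unmapped":    # SW search for the lost mate
        return search(hits1[0])
    if a == "multimap" and b == "multimap":  # cross-product select
        return max(select(x, hits2) for x in hits1)
    # multimap + unmapped: search from every candidate, keep the best
    return max(search(x) for x in hits1)
```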

Next up:

  • Finish pair-end resolution
    • The basic search function
    • All the complicated cases using the basic functions
  • Naive quality/mutation-based inexact alignment support
  • Complex traceback inexact alignment support

04/27/10

Finished all pair-end resolution cases:

  • Completed the three basic functions
    • mate: to generate correct SAM output
    • select: to resolve multimaps
    • search: to find unmapped mates
  • Variations were also completed; combinations of the above basic functions seem to take care of all 6 general cases outlined yesterday
  • A few known issues:
    • The "average fragment size" should probably be a bit smarter...
    • Average fragment size is calculated by total / count, which can throw a floating point exception when count is 0
    • In the multimap x multimap pair end scenario, it is possible that two equally good fragments can be mated
      • I currently select the last one created... but something else should be done
      • Besides selecting one... or all of them, this is probably also where Sohrab's "fragment quality" comes into play (which I haven't touched)
  • Testing was done with the e_coli_1000_#.fq files, which actually has 1000 perfect reads (no mismatches)
    • search was tested by mutating the reads
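
The average-fragment-size issues above could be handled with something like this minimal sketch; the 500bp outlier cutoff comes from the notes, while the class name and the fallback default are hypothetical.

```python
# Hypothetical sketch of the outlier-filtered running fragment-size
# average, with a guard for the count == 0 case that currently risks a
# floating point exception.

class FragmentSizeEstimator:
    def __init__(self, max_insert=500, default=250):
        self.max_insert = max_insert   # the journal's 500bp outlier cutoff
        self.default = default         # assumed fallback before any update
        self.total = 0
        self.count = 0

    def update(self, insert_size):
        # Only accept plausible fragments, as in the scheme described above.
        if abs(insert_size) < self.max_insert:
            self.total += abs(insert_size)
            self.count += 1

    def average(self):
        # Guard the count == 0 case instead of dividing blindly.
        if self.count == 0:
            return self.default
        return self.total / self.count
```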

Next up:

  • Naive quality/mutation-based inexact alignment support
    • Simple enough in theory and implementation
    • I will probably create a few mutated reads from the 1000 perfect reads, and tweak the qualities to fit the "ideal" case
    • Once that seems to run decently, I can throw more realistic data onto it (hopefully with real qualities)
  • Complex traceback inexact alignment support
    • A bit more sketchy in theory, and somewhat scared of the implementation
    • The bowtie paper seems to mention something about this
    • Should be easier to test though... (only considers sequence, not quality)

04/28/10

Added the naive quality/mutation-based inexact alignment support into the bacterial aligner, some notes:

  • The mutation aligner:
    • Feels very brute-force... as it basically generates mutants until one of them gets a hit with locate
      • I'm quite surprised that it actually finds the correct location, as I assumed that a few bases would be enough to map it to another random location
      • ...then again, I'm using a bacterial genome...
    • It then has to call align using the location hit by the mutant (for the reference), and the original wild type sequence
    • You pass the function a threshold integer, for which all bases with qualities below the threshold can be mutated
      • If by some unfortunate chance, the mismatched base has a quality higher than this threshold, the correct mutant will never be generated (ignoring random hits)
        • When this occurs, the mutant generator will NOT stop until it has made all mutants it can
        • This almost feels like an infinite loop... so I've added another iteration_limit parameter to limit this
      • Finding a good combination of threshold and iteration_limit will probably be required...
    • Needless to say... it's very slow
    • I only tested this by playing around with a single read, so I have no idea what performance will be like when I toss real data onto it... (which I probably won't be doing anytime soon yet, at least not until some optimization and the traceback function for comparison)
  • Pair-end Resolution
    • In the case where both return multimapped locations, a cross product pairing is used to generate the best fragment
      • There are cases when multiple identical fragments can be mapped to the reference
        • Previously, I selected one, much like how bowtie and BWA handled multimapped fragments
        • I've now modified the code to behave similarly to how read multimaps are resolved
          • Either multimapped fragments will be treated as unmapped
          • ...or all pair-end combination resulting in the fragments will be written in the output, differentiated by a suffixed ID number
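
The mutant generation described above (mutate only bases below the quality threshold, stop at iteration_limit) could look roughly like this; the function name and exact enumeration order are my own assumptions.

```python
from itertools import product

def generate_mutants(read, quals, threshold, iteration_limit):
    """Yield mutated reads, substituting only at bases whose quality is
    below `threshold`, and stopping after `iteration_limit` mutants (the
    guard against near-infinite enumeration)."""
    bases = "ACGT"
    # Positions eligible for mutation: low-confidence bases only.
    low = [i for i, q in enumerate(quals) if q < threshold]
    emitted = 0
    # Enumerate all substitution combinations over the low-quality sites.
    for choice in product(bases, repeat=len(low)):
        mutant = list(read)
        for pos, base in zip(low, choice):
            mutant[pos] = base
        mutant = "".join(mutant)
        if mutant == read:
            continue                     # skip the unchanged wild type
        yield mutant
        emitted += 1
        if emitted >= iteration_limit:
            return
```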

Next up:

  • Complex traceback inexact alignment support
    • Start by reading the bowtie and bwa papers
    • I assume I'll probably have to look at the PizzaChili index's code
    • Modify the code...?

04/29/10

Started on implementing the mismatch allowing locate:

  • Chris gave me a brief outline of how to do this:
    • The count function in the af-index takes a pattern of length m and has three parts:
      • Initialization of two pointers
      • Iteratively lengthen the pattern by adding characters in reverse order from the end; stop when the entire pattern has been added or the pointers overshoot each other
        • the pointers are updated for each lengthening of the pattern, and typically converge
      • Termination
        • the number of entries between the pointers is the count returned
        • or return 0 if the pointers overshoot each other
    • The same pattern is seen for all count-like functions, such as locate
  • In the iterative step then:
    • Each pattern substring is associated with a pair of pointer values (i.e. a range)
    • For a pattern then, we can store the range associated with each pattern substring
      • This way, we can easily change the pattern, determine the shared subsequence with the original pattern, and save a few iterations by starting the iterative step with the pointers initialized to the pattern-specific range.
  • This should then work in theory by adding a "retry" block of sorts in between the initialization step and the iterative step
  • In a related note then
    • As each pattern subsequence is associated with a specific range
    • We can precompute ranges for specific patterns, which can be looked up to save iterations at the cost of memory
    • This tradeoff between space and time can be user-designated
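
The range-caching idea can be illustrated with a toy backward search over a naive BWT. This is not the af-index API, just a sketch of the iterative step, storing each pattern suffix's (sp, ep) range so a changed pattern could resume from the longest shared suffix.

```python
def bwt_index(text):
    """Build a toy FM-index (BWT plus C array) for text ending in '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(r[-1] for r in rotations)
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return bwt, C

def occ(bwt, c, i):
    """Count occurrences of c in bwt[:i] (naive rank, fine for a toy)."""
    return bwt[:i].count(c)

def backward_search(bwt, C, pattern):
    """Backward search; also returns the (sp, ep) range for each suffix of
    the pattern, so a retry can restart from a shared suffix instead of
    re-running all iterations (the idea outlined above)."""
    sp, ep = 0, len(bwt)
    ranges = {}                          # pattern suffix -> (sp, ep)
    for i in range(len(pattern) - 1, -1, -1):
        c = pattern[i]
        if c not in C:
            return ranges, 0
        sp = C[c] + occ(bwt, c, sp)
        ep = C[c] + occ(bwt, c, ep)
        ranges[pattern[i:]] = (sp, ep)
        if sp >= ep:
            return ranges, 0             # pointers overshot: no match
    return ranges, ep - sp               # count = entries between pointers
```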

Now, I've managed to convince myself that this should work in theory... but I've hit a wall attempting to implement it... as I cannot even import the af-index into the project to start. Thus, I'm currently seeking help from Chris.

04/30/10

Installed ticgit for issue tracking purposes. I did the following, in case anyone ever wants to install it on Cygwin...

  • Install Ruby via the cygwin installer/updater
  • Install Gem as described here
  • Download the source for jeffWelling's branch
  • Build a gem out of the ticgit.gemspec file
  • Go to your git repo and install the newly built gem

By Chris's decision, I've given up on trying to work with the af-index; instead, we're to begin dismantling and assimilating the very new aligner software from Mäkinen and friends:

  • The index (FMIndex inheriting TextCollection) seems to be abstracted across two distinct classes
    • TextCollectionBuilder to construct, load input, and initialize the index
    • TextCollection for locating suffix patterns
      • This class is used by various Query classes, which "align" reads to the reference in different ways, subclasses include:
        • IndelQuery
        • MismatchQuery
        • RotationQuery, a faster, but more space-consuming mismatch query
  • I did actually find the paper and attempted to read it a bit, but quickly gave up.
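
To make sure I understand the split, here is a hypothetical skeleton of the described abstraction: the builder constructs and loads, the collection locates, and Query subclasses align. Only the class names come from the readaligner; the methods are stand-ins (locate is a naive scan in place of the FM-index, and the mismatch query just tries 1-substitution variants).

```python
class TextCollection:
    """Stands in for the FM-index; locates exact pattern occurrences."""
    def __init__(self, text):
        self.text = text

    def locate(self, pattern):
        # Naive scan in place of the real FM-index lookup.
        return [i for i in range(len(self.text) - len(pattern) + 1)
                if self.text[i:i + len(pattern)] == pattern]

class TextCollectionBuilder:
    """Constructs the collection, loads input, initializes the index."""
    def __init__(self):
        self.parts = []

    def add_text(self, text):
        self.parts.append(text)
        return self

    def build(self):
        return TextCollection("".join(self.parts))

class Query:
    """Base query: subclasses "align" reads to the reference differently."""
    def __init__(self, collection):
        self.collection = collection

    def align(self, read):
        return self.collection.locate(read)

class MismatchQuery(Query):
    """Tolerates mismatches by trying every 1-substitution variant."""
    def align(self, read):
        hits = set(self.collection.locate(read))
        for i in range(len(read)):
            for base in "ACGT":
                variant = read[:i] + base + read[i + 1:]
                hits.update(self.collection.locate(variant))
        return sorted(hits)
```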

Next up then:

  • Getting everything compiling properly within the project
  • Adapting the index to my interface so it can be used with my aligner
  • Adapting the query classes?
  • A much needed overhaul of the project organization?
 