[[https://bugs.cs.ubc.ca/cgi-bin/twiki/view/BETA/JayMayArchive][May 2010 archive]]

---+++ 06/01/10

Here are a few more times, this time with all bin and data files in the =/tmp/= folder and times calculated from =/usr/bin/time=. Using commit =5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa= and =saligner=, with the =125000= readset (real time):

| *Type* | *k* | *max_results* | *Unmapped (+1662 invalid)* | *Search Time (s)* | *Avg time/read (s)* |
| Exact | - | 1 | 17732 | 4.727 | 0.000037816 |
| Exact | - | 10 | 17732 | 5.01 | 0.00004008 |
| Mismatch | 2 | 1 | 5466 | 8.251 | 0.000066008 |
| Mismatch | 2 | 10 | 5466 | 9.438 | 0.000075504 |

Using BWA with flags =-l 99 -k 2=, outputting to SAM:

| *Type* | *k* | *max_results* | *Unmapped* | *Search Time (s)* | *Avg time/read (s)* |
| Mismatch, I guess | 2? | ? | ? | 2.832 | 0.000022656 |

Using readaligner with flags =-k 2 --sam --fastq=:

| *Type* | *k* | *max_results* | *Unmapped* | *Search Time (s)* | *Avg time/read (s)* |
| Mismatch | 2 | 1 | ? | 6.662 | 0.000053296 |

Bowtie with version info:
<verbatim>
64-bit
Built on sycamore.umiacs.umd.edu
Sat Apr 24 15:56:29 EDT 2010
Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
Options: -O3 -m64 -Wl,--hash-style=both
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
</verbatim>

| *Mode* | *Flags* | *Mapped* | *Unmapped* | *Search Time (s)* | *Avg time/read (s)* |
| 0-Mismatch | =-v 0= | 105606 | 19394 | 1.310 | 0.00001048 |
| 1-Mismatch | =-v 1= | 115631 | 9369 | 1.850 | 0.0000148 |
| 2/3-Mismatch | =-v 2= | 118356 | 6644 | 1.878 | 0.000015024 |
| 2/3-Mismatch | =-v 3= | 119430 | 5570 | 3.441 | 0.000027528 |
| Seeded Quality | =-n 0= | 112317 | 12683 | 1.478 | 0.000011824 |
| Seeded Quality | =-n 1= | 117659 | 7341 | 1.679 | 0.000013432 |
| Seeded Quality | =-n 2= | 118356 | 6644 | 1.974 | 0.000015792 |
| Seeded Quality | =-n 3= | 118330 | 6670 | 2.560 | 0.00002048 |

Also had a meeting with Chris today, where he clarified some of his previous communications. Our first priority should now be to compare Bowtie's exact matching to saligner's or readaligner's exact matching. Daniel took on the job of writing a better locate function for exact matches. I will be looking for more areas to optimize, fixing some bugs, and writing a better saligner wrapper that actually takes in arguments. Also, since mismatches in CIGAR strings are recorded no differently from matches, both mismatch and exact alignments can have a predetermined CIGAR (e.g. =35M=), instead of having to compare reads to references to generate our own.

To do:
   * In Exact and <nop>MismatchMappers, just make the CIGAR on the fly instead of passing stuff to <nop>SamEntry (sketched below).
   * More low-level optimizations.
   * Make a better saligner wrapper.
   * Fix a bug in CIGAR generation (ticket 53df76).
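Since matches and mismatches are both recorded as =M=, the CIGAR for an exact or mismatch-only alignment is fully determined by the read length. A minimal sketch of the "on the fly" idea (the function name and buffer handling are hypothetical, not the actual <nop>SamEntry interface):

<verbatim>
#include <cstdio>

// A read aligned end-to-end with only matches/mismatches always has the
// CIGAR "<read length>M", e.g. "35M" for a 35bp read, so we can format it
// directly instead of comparing read and reference.
// (Hypothetical helper; not the actual mapper code.)
void MakeUngappedCigar(int read_length, char *cigar_buffer) {
  std::sprintf(cigar_buffer, "%dM", read_length);
}
</verbatim>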
---+++ 06/02/10

I made a bunch of minor changes to the IO files to optimize them a little further. Most of the changes were just reserving space for strings/vectors beforehand, removing a few unneeded lines, etc. I also looked into ticket 53df76 (buggy CIGAR output on indels), and found that it's not really a bug with the CIGAR generator. The bug actually comes from using the =LocalAligner= after a gap alignment to generate the "dashes" that the CIGAR generator needs. However, the behaviour of =LocalAligner= is unpredictable when used in this context; sometimes it will clip the start or end of the read and throw in mismatches instead of just inserting the gaps (which is actually what it's supposed to do, given it's local).

So, there actually aren't bugs in either the CIGAR generator or the =LocalAligner=. Unfortunately, we can't really fix the bug given the current situation, because there's no other way to put dashes into the aligned ref/reads. The only solution is to either figure out how =IndelQuery= does its CIGAR generation or write our own =IndelQuery= (which we might just do anyway). Since indel support is lower priority, I'm going to table this bug for later.

Another optimization I did today was to add a convenience method in =SamEntry= to generate an entry representing an exact match (as per Chris' suggestion yesterday). Because exact matches are so predictable, it saves us having to generate the CIGAR and a bunch of the tags. We also don't need to extract the original reference from the index, since it's not really needed. Mismatches should be similar, since mismatches aren't represented differently from matches in CIGARs. However, we do have an =MD:Z= tag that records mismatches (like BWA), but I'm not sure whether this is needed at all.

Finally, I added support for arguments in saligner, where we can specify the reference, reads and output for convenience, so we don't have to recompile every time we need to use a different read file.

Anyways, I decided to run some tests on my newest changes using the 2622382 reads file, just for completeness and more accuracy, since I didn't do any tests on that file yesterday. Under commit =c6f55e33aa732c8952d1f56fa4c1fe6aa3875677=, with the 2622382 reads file and Release build:

| *Type* | *k* | *max_results* | *Unmapped (+20957 invalid)* | *Search Time (s)* | *Avg time/read (s)* |
| Exact | - | 1 | 350717 | 86.634 | 0.000033036 |
| Exact | - | 10 | 350717 | 95.99 | 0.000036604 |

And with Bowtie (same version info as above):

| *Mode* | *Flags* | *Mapped* | *Unmapped* | *Search Time (s)* | *Avg time/read (s)* |
| 0-Mismatch | =-v 0= | 2250708 | 371674 | 27.341 | 0.000010426 |

To do:
   * Do some profiling again...I haven't done that in a while and there have been a lot of changes to the code.
   * Start reading up on the FM index.

---+++ 06/03/10

Worked on more optimizations, this time guided by Valgrind. From <nop>KCacheGrind, I found that about 10-15% of our cycles were being used up in IO, especially making/writing SAM entries. So I went in and did a few more minor optimizations, such as using =ostringstream= instead of =stringstream= (see the sketch at the end of this entry). Preliminary results (tested on my machine) show it runs something like 10% faster now.

Up until now, I've really only been testing exact matches with saligner over the 125,000 reads file. While doing one final test on the 2M read file (to verify output and to do a preliminary speed test), I noticed a lot of disk activity after about 40k reads! I'm guessing this is what cache misses look like (it could just be reading from the reads file; I'm not really sure on this and have to confirm), so I'm going to have to do some more profiling with Callgrind. Granted, my system doesn't have as much memory as =skorpios=, but this might be one of the areas we can improve on.

Other than that, I think it might be a good idea to start reading up on FM indexes, because I really can't do much else without knowing how the aligner works. So tomorrow, I might start on my reading (probably following in Daniel's footsteps).

To do:
   * Check if we're really getting lots of cache misses.
   * Start on my reading!

Update: Ran a cachegrind over 2M reads, and didn't get very many cache misses (something like <1%), so I guess it was actually reading from input? I'll have to check out the log a bit more tomorrow. On a side note, the profiling took about 2.5 hours with 2M reads on my machine (I don't even know the exact time because the built-in timer went negative), so it's not a very practical thing to do...
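For reference, here's the flavour of change these I/O tweaks involved, sketched with hypothetical names (not the actual =SamEntry= writing code):

<verbatim>
#include <sstream>
#include <string>
#include <vector>

// Two of the micro-optimizations mentioned above, in sketch form:
// 1) ostringstream is output-only, avoiding the overhead of the
//    bidirectional stringstream;
// 2) reserve() up front avoids repeated reallocations while appending.
std::string BuildLine(const std::string &qname, int flag, long pos) {
  std::ostringstream out;
  out << qname << '\t' << flag << '\t' << pos << '\n';
  return out.str();
}

void CollectLines(std::size_t expected_count,
                  std::vector<std::string> *lines) {
  lines->reserve(expected_count);
}
</verbatim>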
---+++ 06/04/10

I feel like I've done all the minor optimizations on the I/O classes at this point, and probably won't be improving speeds much more there. So I spent the day reading up on [[http://www.google.ca/url?sa=t&source=web&ct=res&cd=3&ved=0CCcQFjAC&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.77.2997%26rep%3Drep1%26type%3Dpdf&rct=j&q=Indexing+Compressed+Text&ei=Q_0GTIfEIsSqlAekh_2LDg&usg=AFQjCNHFq9u2yRmKDlS4LvByeEcNH1p0ag][FM indexes]] using the same paper Daniel listed a while back. I now know how the BWT works and how =count= functions work on the index (I've sketched my understanding at the end of this entry). I haven't really gotten through =locate= yet, but it shouldn't be too bad. With my new-found knowledge, I took another look at the exact matching implementation in readaligner, and now I feel like I get it a bit more. I tried to look up some papers on wavelet trees (I _think_ that's what =wv_tree= stands for) and the rank functions, since that's pretty much the bottleneck, but I couldn't really find much on them (not that I tried that hard).

So for later:
   * Find some literature on wavelet trees and take a look at the rank function some more.
   * Do some more reading...
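To check my own understanding of =count=, here's the standard backward-search procedure from the paper, sketched in C++. This is the textbook algorithm, not readaligner's actual code; =C= and =Rank= stand in for the structures a wavelet tree would provide:

<verbatim>
#include <string>
#include <vector>

// Backward search over the BWT: maintain the suffix-array interval [sp, ep)
// of suffixes prefixed by the current (growing) suffix of the pattern.
// C[c]       = number of characters in the text strictly smaller than c.
// Rank(c, i) = occurrences of c in bwt[0..i) -- the wavelet tree's job.
long Count(const std::string &pattern, const std::vector<long> &C,
           long (*Rank)(unsigned char c, long i), long text_length) {
  long sp = 0, ep = text_length;
  // Extend the match one character at a time, right to left.
  for (int i = static_cast<int>(pattern.size()) - 1; i >= 0 && sp < ep; --i) {
    unsigned char c = pattern[i];
    sp = C[c] + Rank(c, sp);
    ep = C[c] + Rank(c, ep);
  }
  return ep - sp;  // number of occurrences of the full pattern
}
</verbatim>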
---+++ 06/07/10

Chris suggested finding the ratios between locating with 1 mismatch vs doing an exact locate, for both bowtie and saligner. So here it is, run on =skorpios= with saligner commit =5f7f77021a321eab0019825bec36969209e707b6=, using the =time= command:

| *Aligner* | *Exact time (s)* | *1-mismatch time (s)* | *Mismatch:Exact ratio* |
| saligner, max_results = 1 | 79.397 | 107.989 | 1.360 |
| saligner, max_results = 10 | 86.937 | 127.223 | 1.463 |
| bowtie (=-v 0/1=) | 27.115 | 29.966 | 1.105 |
| readaligner (=-k 0/1=) | 68.646 | 87.841 | 1.280 |

Note: readaligner returns 1 max result, I believe.

Afternoon update: Well, I managed to shave another 5 or so seconds off saligner. Doing a brief benchmark under the same conditions as above: saligner, max_results = 1: 73.58 s for exact alignment. It's a lot closer to readaligner now. I ended up changing all the =ostringstreams= and =ofstreams= to =sprintf= and =fprintf=, which made it quite a bit faster. I'm debating whether I should change the reading code to =cstdio= as well, but it might get a bit more complicated, since I don't know the actual size of each line and the C input functions all require a =char= array buffer. Our saligner also does a bit more than readaligner. For example, our FASTQ file processing is much more robust and follows the FASTQ spec better.

Chris also mentioned that something else I can do is replace the =rank= structure. He mentioned Daniel was looking at a 64-bit implementation, so that might be something to look at? I'll need to get more details later.

For tomorrow:
   * A few more minor optimizations.
   * Take a look at the bowtie source code, maybe?

---+++ 06/08/10

Decided to do one last benchmark with 32-bit bowtie instead of 64-bit bowtie, which might be a more accurate comparison, given that readaligner doesn't really take advantage of 64-bit registers (I don't think...). Anyways, here are the results:

| *Aligner* | *Exact time (s)* | *1-mismatch time (s)* | *Mismatch:Exact ratio* |
| bowtie (-m32 compiled) (=-v 0/1=) | 41.526 | 46.473 | 1.119 |

Also had a meeting with Daniel and Chris today (see Daniel's journal, June 8 entry, for details). After the meeting, I did a bit of tinkering with the <nop>FASTQ reader. After some discussion with Daniel, we decided it might be better to trade robustness for speed in the reader, as most <nop>FASTQ output these days is more standardized (for example, no wrap-around for sequence or quality strings). So by taking all those checks out, I was able to get another roughly 10-15% increase in speed.

I also started doing some work on converting our C++-style strings to C strings. I started with the <nop>FASTQ reader again, but it's going to end up being a huge change that will affect almost all of our classes. At the same time, I think I should also start looking into using regular arrays instead of vectors where applicable.

To do:
   * Pretty much refactoring (converting C++ strings to C strings).

---+++ 06/09/10

Experimented a bunch with =char= arrays vs C++ strings. I did a quick and dirty replacement of strings with =char= arrays in the =FastqEntry= and =FastqFile= classes, and then fixed all the compile errors that came with it, but there actually wasn't too big a difference. In fact, I couldn't really see a difference at all. Granted, I did a pretty poor job of the replacement (just wanted results), so it might not have been as optimized as it could be, but it seems that changing over wouldn't make that much of a difference.

I also benchmarked my changes from yesterday (swapping out the FASTQ read function). It seems saligner is now faster than readaligner (at least in exact matching). Anyways, here are the times:

| *Aligner* | *Exact time (s)* | *1-mismatch time (s)* | *Mismatch:Exact ratio* |
| saligner, max_results=1 | 64.461 | 94.434 | 1.465 |

The mismatch:exact ratio is still a bit off from bowtie's and readaligner's, but that might also be because mismatches still unnecessarily go through the <nop>CIGAR generator (which I should fix sometime). There was quite a bit of discussion on some points from yesterday, detailed in Daniel's journal, June 10 entry. Finally, I spent the rest of my time converting strings to char arrays, though I'm still not sure if it's doing anything. Regardless, it should be more future-proof.

There's also one point I'd like to look at in the =ExactMapper=, though I doubt it has too much of a performance impact. When the total number of results for an alignment is greater than =max_results=, it seems we do a =Locate= on all the results, not only the first =max_results=. This should be a pretty easy fix (sketched after the list below).

To do:
   * More converting.
   * Look at the fix mentioned above.
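The =ExactMapper= fix would look something like this (hypothetical names and structure; the real mapper code is organized differently):

<verbatim>
#include <algorithm>
#include <cstddef>
#include <vector>

// Only resolve reference positions for the first max_results hits,
// instead of calling Locate on every hit and discarding the extras.
// Locate is passed in here so the sketch is self-contained; in the real
// code it would be the index's locate function.
void LocateHits(const std::vector<long> &hits, std::size_t max_results,
                long (*Locate)(long), std::vector<long> *positions) {
  std::size_t n = std::min(hits.size(), max_results);
  positions->reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    positions->push_back(Locate(hits[i]));
  }
}
</verbatim>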
---+++ 06/10/10

Bahhhhh, spent all day converting strings to C strings. I've decided to convert only the things having to do with sequences and qualities. To make the conversion easier to follow, I put up some typedefs:

<verbatim>
#define NGSA_READLEN 128

typedef char SequenceBase;
typedef char QualityBase;
typedef SequenceBase SequenceString[NGSA_READLEN+1];
typedef QualityBase QualityString[NGSA_READLEN+1];
</verbatim>

I'm about 75% done, I think. I finished the I/O classes and the drivers, as well as half the mappers. Just need to convert the rest, and then start on the paired-end matcher and some other stuff. Also optimized some of the lower-level algorithms (such as =ngsa::ReverseComplement=) to make better use of =char= arrays (see the sketch below).

I hope the end result is a speed-up, or at least not a slow-down! Some parts still aren't very efficient, because there are other methods that still take in C++ strings, where I end up having to do a conversion (which takes O(n) time). I think =strlen= is also O(n), versus O(1) for =std::string::size= (I _think_ those are the right complexities), since =std::string= can just update its size every time the string grows or shrinks, so I should make sure to avoid calling =strlen= too often.

To do:
   * More conversion...
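For illustration, here's what a =char=-array routine in the spirit of =ngsa::ReverseComplement= could look like under the typedefs above. The real implementation isn't in this journal, so treat the details as assumptions; note that it takes the length as a parameter instead of calling =strlen=:

<verbatim>
#include <cstddef>

#define NGSA_READLEN 128
typedef char SequenceBase;
typedef SequenceBase SequenceString[NGSA_READLEN+1];

// Write the reverse complement of in[0..length) into out, one pass,
// taking the length as a parameter so we avoid an O(n) strlen call.
// (Sketch only; the actual ngsa::ReverseComplement signature may differ.)
void ReverseComplement(const SequenceBase *in, std::size_t length,
                       SequenceString out) {
  for (std::size_t i = 0; i < length; ++i) {
    switch (in[length - 1 - i]) {
      case 'A': out[i] = 'T'; break;
      case 'T': out[i] = 'A'; break;
      case 'C': out[i] = 'G'; break;
      case 'G': out[i] = 'C'; break;
      default:  out[i] = 'N'; break;  // ambiguous bases stay ambiguous
    }
  }
  out[length] = '\0';
}
</verbatim>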