06/01/10

Here are a few more times, this time with all bin and data files in the /tmp/ folder, with times taken from /usr/bin/time.
Using commit
Using BWA with flags
Using readaligner with flags
Bowtie with version info: 64-bit Built on sycamore.umiacs.umd.edu Sat Apr 24 15:56:29 EDT 2010 Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-46) Options: -O3 -m64 -Wl,--hash-style=both Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
Also had a meeting with Chris today, where he clarified some of his previous communications. Our first priority should now be to compare Bowtie's exact matching functions to saligner or readaligner exact matching. Daniel took on the job of writing a better locate function for exact matches. I will be looking for more areas to optimize, fixing some bugs, and writing a better saligner wrapper that actually takes in arguments. Also, since mismatches in CIGARs are recorded no differently from matches, both mismatch and exact alignments can have their own predefined CIGAR (e.g. 35M), instead of us having to compare reads to references to generate our own CIGAR. To do:
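Since a CIGAR records mismatches no differently from matches, the shortcut amounts to one line. A sketch (match_cigar is a hypothetical helper for illustration, not actual saligner code):

```cpp
#include <cstddef>
#include <string>

// A full-length alignment, exact or with mismatches, is all "M"
// columns, so its CIGAR is just "<read length>M". No comparison
// against the reference is needed.
std::string match_cigar(std::size_t read_length) {
  return std::to_string(read_length) + "M";
}
```

So a 35nt read, whether it matched exactly or with mismatches, always gets the CIGAR 35M.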
06/02/10

I did a bunch of minor changes to the IO files to optimize them a little further. Most of the changes were just reserving space for strings/vectors beforehand, removing a few unneeded lines, etc. I also looked into ticket 53df76 (buggy CIGAR output on indels), and found that it's not really a bug in the CIGAR generator. The bug actually comes from using the LocalAligner after a gap alignment to generate the "dashes" that the CIGAR generator needs. However, the behaviour of LocalAligner is unpredictable when used in this context; sometimes it will clip the start or end of the read and throw in mismatches instead of just inserting the gaps (which is actually what it's supposed to do, given that it's local). So there actually aren't bugs in either the CIGAR generator or the LocalAligner. Unfortunately, we can't really fix the bug given the current situation, because there's no other way to put dashes into the aligned ref/reads. The only solution is to either figure out how IndelQuery does its CIGAR generation or write our own IndelQuery (which we might just do anyway). Since indel support is lower priority, I'm going to table this bug for later.
Another optimization I did today was adding a convenience method. Finally, I added support for arguments in saligner, so we can specify the reference, reads, and output files without having to recompile every time we need to use a different read file. Anyways, I decided to run some tests on my newest changes using the 2622382-read file, just for completeness and for more accuracy, since I didn't do any tests on that file yesterday either.
Under commit
And with Bowtie (same version info as above):
To do:
06/03/10

Worked on more optimizations, this time guided by Valgrind. From KCacheGrind, I found that about 10-15% of our cycles were being used up in IO, especially making/writing SAM entries. So I went in and did a few more minor optimizations, such as using ostringstream instead of stringstream. Preliminary results (tested on my machine) show it runs something like 10% faster now.
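For illustration, the SAM-entry change looks something like this (a sketch with hypothetical field handling, not saligner's actual writer; the point is that ostringstream is output-only, so it skips the input-side buffer machinery that stringstream carries around):

```cpp
#include <sstream>
#include <string>

// Build one SAM line with a single output-only stream. The unused
// mate/insert fields are written as "*\t0\t0" per the SAM format.
std::string BuildSamLine(const std::string& qname, int flag,
                         const std::string& rname, long pos, int mapq,
                         const std::string& cigar, const std::string& seq,
                         const std::string& qual) {
  std::ostringstream out;
  out << qname << '\t' << flag << '\t' << rname << '\t' << pos << '\t'
      << mapq << '\t' << cigar << '\t' << "*\t0\t0\t" << seq << '\t' << qual;
  return out.str();
}
```

The same swap applies anywhere we only ever write to the stream.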
Up until now, I've really only been testing exact matches with saligner over the 125,000-read file. While doing one final test on the 2M-read file (to verify output and to do a preliminary speed test), I noticed a lot of disk activity after about 40k reads! I'm guessing this is what cache misses look like (it could just be it reading from the reads file; I'm not really sure, and will have to confirm), so I'm going to have to do some more profiling with Callgrind. Granted, my system doesn't have as much memory. Other than that, I think it might be a good idea to start reading up on FM indexes, because I really can't do much else without knowing how the aligner works. So tomorrow, I might start on my reading (probably following in Daniel's footsteps). To do:
Update: Ran a cachegrind over 2M reads, and didn't get very many cache misses (something like <1%), so I guess it was actually reading from input? I'll have to check out the log a bit more tomorrow. On a side note, the profiling took about 2.5 hours with 2M reads on my machine (don't even know the time because the built-in timer went negative), so it's not a very practical thing to do...
06/04/10

I feel like I've done all the minor optimizations on the I/O classes at this point, and probably won't be improving speeds much more. So I spent the day reading up on FM indexes, and now I have a better idea of how the count functions work on the index. I haven't really gotten through locate yet, but it shouldn't be too bad.
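What I took away from the reading is that count boils down to the backward-search recurrence sp' = C[c] + Occ(c, sp), ep' = C[c] + Occ(c, ep). A naive, self-contained sketch (illustration only: Occ here rescans the BWT on every call, which is exactly the O(n) work the index's rank structure replaces):

```cpp
#include <map>
#include <string>

// Naive FM-index "count" via backward search over a BWT string.
// Occ(c, i) = number of occurrences of c in bwt[0, i).
int FmCount(const std::string& bwt, const std::string& pattern) {
  // C[c] = number of characters in the text strictly smaller than c.
  std::map<char, int> counts;
  for (char ch : bwt) ++counts[ch];
  std::map<char, int> C;
  int total = 0;
  for (auto& kv : counts) {
    C[kv.first] = total;
    total += kv.second;
  }

  // Linear-scan Occ; a real index answers this in O(1) with rank.
  auto Occ = [&bwt](char c, int i) {
    int n = 0;
    for (int j = 0; j < i; ++j)
      if (bwt[j] == c) ++n;
    return n;
  };

  // Extend the match one pattern character at a time, last to first,
  // narrowing the half-open suffix range [sp, ep).
  int sp = 0;
  int ep = static_cast<int>(bwt.size());
  for (int i = static_cast<int>(pattern.size()) - 1; i >= 0 && sp < ep; --i) {
    char c = pattern[i];
    sp = C[c] + Occ(c, sp);
    ep = C[c] + Occ(c, ep);
  }
  return ep - sp;
}
```

For example, the BWT of "ACAACG$" is "GC$AAAC", and counting "AC" over it gives 2, matching the two occurrences in the text.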
With my new-found knowledge, I took another look at the exact matching implementation in readaligner, and now I feel like I get it a bit more. I tried to look up some papers on wavelet trees (I think that's what readaligner's index uses). So for later:
06/07/10

Chris suggested finding the ratios between locating with 1 mismatch vs doing an exact locate, for both bowtie and saligner. So here it is:
Run on
Afternoon update: Well, I managed to shave another 5 or so seconds off saligner. A brief benchmark, with the same conditions as above: saligner, max_results = 1: 73.58s for exact alignment.
It's a lot closer to readaligner now. I ended up changing all the
Chris also mentioned that something else I can do is to replace the For tomorrow:
06/08/10

Decided to do one last benchmark with 32-bit bowtie instead of 64-bit bowtie, which might be a more accurate comparison, given that readaligner doesn't really take advantage of 64-bit registers (I don't think...). Anyways, here are the results:
Also had a meeting with Daniel and Chris today (see Daniel's journal, June 8 entry, for details). After the meeting, I did a bit of tinkering with the FASTQ reader. After some discussion with Daniel, we decided it might be better to trade robustness for speed in the reader, as most FASTQ output these days is more standardized (for example, no wrap-around for sequence or quality strings). By taking all those checks out, I was able to get another roughly 10-15% increase in speed. I also started doing some work on converting our C++-style strings to C strings. I started with the FASTQ reader again, but it's going to end up being a huge change that'll affect almost all of our classes. At the same time, I think I should also start looking into using regular arrays instead of vectors where applicable. To do:
06/09/10

Experimented a bunch with char arrays vs C++ strings. I did a quick and dirty replacement of strings with char arrays in the FastqEntry and FastqFile classes, then fixed all the compile errors that came with it, but there actually wasn't too big a difference. In fact, I couldn't really see a difference at all. Granted, I did a pretty poor job of the replacement (I just wanted results), so it might not have been as optimized as it could be, but it seems that changing over wouldn't make that much of a difference.
I also benchmarked my changes from yesterday (swapping out the Fastq read function). It seems saligner is now faster than readaligner (at least in exact matching). Anyways, here are the times:
The mismatch:exact ratio is still a bit off from bowtie's and readaligner's, but that might also be because mismatches still unnecessarily go through the CIGAR generator (which I should fix sometime). There was quite a bit of discussion on some points from yesterday, detailed in Daniel's journal, June 10 entry. Finally, I spent the rest of my time converting strings to char arrays, though I'm still not sure if it's doing anything. Regardless, it should be more future-proof.
There's also one point I'd like to look at in the To do:
06/10/10

Bahhhhh, spent all day converting strings to C strings. I've decided to convert only things having to do with sequences and qualities. To make it easier to see, I put up some typedefs:

#define NGSA_READLEN 128
typedef char SequenceBase;
typedef char QualityBase;
typedef SequenceBase SequenceString[NGSA_READLEN+1];
typedef QualityBase QualityString[NGSA_READLEN+1];

I'm about 75% done, I think. I finished the I/O classes and the drivers, as well as half the mappers. I just need to convert the rest, then start on the pair-end matcher and some other stuff. I also optimized some of the lower-level algorithms (such as ngsa::ReverseComplement) to make better use of char arrays. I hope the end result is a speed-up, or at least not a slow-down! Some parts still aren't very efficient, because other methods still take in C++ strings, so I end up having to do a conversion (which takes O(n) time). I think strlen is also O(n), versus O(1) for std::string::size, since std::string can just update its size every time the string grows/shrinks, so I should avoid calling strlen too often.
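As an illustration of the char-array style (my sketch, not the actual ngsa::ReverseComplement), the function can reuse the typedefs above and take the length as a parameter precisely so it never has to call the O(n) strlen:

```cpp
#include <cstddef>

#define NGSA_READLEN 128
typedef char SequenceBase;
typedef SequenceBase SequenceString[NGSA_READLEN + 1];

// Complement one base; anything unrecognized maps to 'N'.
static SequenceBase Complement(SequenceBase b) {
  switch (b) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
  }
}

// In-place reverse complement over a C-string sequence. The caller
// passes the length, since unlike std::string a char array doesn't
// track its own size.
void ReverseComplement(SequenceBase* seq, std::size_t len) {
  if (len == 0) return;
  for (std::size_t i = 0, j = len - 1; i < j; ++i, --j) {
    SequenceBase tmp = Complement(seq[i]);
    seq[i] = Complement(seq[j]);
    seq[j] = tmp;
  }
  if (len % 2 == 1) seq[len / 2] = Complement(seq[len / 2]);
}
```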
To do:
06/11/10

Finally finished converting almost all the NGSA library classes to C strings, except the local aligners. I'm going to forgo those for now because I'm planning to do some more optimization on them later, so I'll have to change that code anyway. I'm still not sure whether all my changes work, though, because I still can't seem to compile! There seems to be a problem with my common header, where I store all the constants and typedefs. For some reason, when it gets to linking all the libraries, it keeps giving errors in whatever headers single_end_aligner.cc includes that also use my typedefs, saying it can't find the definitions. I spent a long time on Google looking up forward-declaring headers and typedefs, etc., but I still can't find a solution...maybe I'll have to ask Chris on Monday or Tuesday.
I also did a few benchmarks on a bunch of string vs char array functions, and I've developed the following guidelines for any future compatibility issues or conversions:
Finally, I think I might take Chris up on his offer for taking a break to study for my MCAT. I'll see how far I get on the weekend, but I might take Monday off for an extra study day. I guess we'll see after the weekend. To do:
06/18/10

Back from my MCAT! Finally got my code to compile for the C-string conversion. Now it's off to fix bugs and segmentation faults! I managed to clear up all the segmentation faults during the actual alignment process, but the output's still not quite right. My new build seems to miss most of the reads that are alignable in an old build. Also, I get a segmentation fault during the index-building process, which is really strange because I haven't even touched the code for that. No new code is even called until after the index has been built! Maybe I'll have to run the build through the debugger...
I also spent a large portion of the day experimenting with an idea I had last night: when calculating the rank value for a position i, instead of going to the next smallest pre-calculated position and counting from there.

Edit: Just noticed Daniel's started to implement a different rank structure, so my change might not even be relevant :(. Finally, I noticed Daniel's made a lot of changes...not looking forward to the huge merge later... To do:
06/22/10

I realized I forgot to make a post yesterday, so here it is:
Still worked on moving over to C strings. I managed to fix all the segmentation faults, even in the index-building step. Also fixed most of the bugs that arose from the conversion; now the exact aligner returns exactly the same output. Also worked on merging the branches, which might take a bit of time. There are so many new changes!

Entry for today: Had a meeting with Anne and Chris today (see Daniel's journal for notes). Anne and Chris have been discussing giving me a bigger aspect of the project to work on, which is great. Chris suggested three things off the top of his head:
So far, option 2 seems the most interesting to me, but all of them are pretty good! Anyways, I managed to finish merging everything. Mismatch and exact matching work fine so far. The jumpstart mapper doesn't work (segfaults somewhere), but I think that might have something to do with the jump table not saving to the right place. Daniel also made some bug fixes, so I think I'll have to do some final (minor) merges, and then I'll be done with the conversion! To do:
06/23/2010

FINALLY finished porting everything over to char arrays. I'm pretty sure everything's solid: I've done as much as possible to prevent buffer overflows, and the output looks the same as the old aligner's using saligner and most mappers. Speed-wise, it's pretty similar; I really can't tell if there's even been an improvement. I guess this port was more about future-proofing (with packed strings) than anything.
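The overflow prevention mostly amounts to bounded copies into the fixed-size buffers. A sketch (CopySequence is a hypothetical helper for illustration, not actual NGSA code):

```cpp
#include <cstddef>

#define NGSA_READLEN 128
typedef char SequenceString[NGSA_READLEN + 1];

// Copy at most NGSA_READLEN bases and always NUL-terminate, so an
// over-long record in the input can't overflow the fixed buffer.
// Returns the number of bases actually copied.
std::size_t CopySequence(SequenceString dest, const char* src) {
  std::size_t i = 0;
  while (src[i] != '\0' && i < NGSA_READLEN) {
    dest[i] = src[i];
    ++i;
  }
  dest[i] = '\0';
  return i;
}
```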
Anyways, some outstanding bugs:
The similarities of the above two cases seem to point to something that they have in common going wrong. They both use a
Chris also suggested I take a look at threading the aligner in the meantime, while he's still thinking up something for me to do. He suggested looking into Boost threads, which I did. I don't know...from the To do:
06/24/10

Did some last-minute verifying of my C-string conversions (still a bit nervous about it!), and it all looks fine, aside from the KmerMapper and MutantMapper issues detailed above. So I finally pushed the changes into dev.
Also began looking at multi-threading. It doesn't look too hard, and I've verified that the only places requiring locks should be with
On another note, I tried integrating the Boost.Thread libraries in, but I really don't know how with CMake. I tried extracting the Thread library using boost's To do:
06/25/10

Finally got the Boost.Thread library to compile with NGSA. Interesting thing...to test whether linking worked, I did an #include <boost/thread/thread.hpp> in the saligner code, but after all our other includes. When I tried to compile, I would get compile errors from within Boost. But when I moved the include statement to the top of the include list, it compiled perfectly fine! None of the other NGSA classes include the Boost thread library either, so that was quite strange. I guess there's yet more incentive to follow the Google guidelines!
Anyways, I've added options in the To do:
06/26/10

It works! The multi-threaded driver works, at least with a minimal CMakeLists.txt and no bells and whistles. For some reason, I had a bit more trouble linking, and the test file I was able to compile on Friday wouldn't compile, despite being pulled directly from the website! So I had to reinstall Boost, and it seems to work now...I feel like I've spent more time trying to get this thing to build than actually coding it!
I noticed something strange while running the threaded driver: when I specify 2 threads, it doesn't use both cores of my CPU, but when I specify >=3 threads, it starts utilizing both cores.
The threaded driver does work, though; at least, it aligns all the 125k reads fine. I still have to add locking to the driver files (single-end driver and pair-end driver), because I'm still running into a minor race condition when approaching the end of the file. Also, I still have to add alignment status reporting to the threaded driver (e.g. report how many reads were aligned, not aligned, failed, etc.). Finally, I need to verify the output of my threaded aligner, which I'm not really sure how to do; it seems to me that the threaded aligner might print reads out of order, so a simple line-by-line comparison with the old output won't work. To do:
06/29/10

Finally got the multi-threaded aligner working, with pretty much the same functionality as before. I'm pretty sure I've solved all the race conditions (with the exception of the two Query classes from readaligner), but I still don't really know how to compare the output to the old version, since everything is out of order. For now, I just took a quick glance through the output to check that every line lines up at the end properly (since I'm using the exact aligner and each read is 35nt, every line should have pretty much the same length, given 8 spaces for each tab). It looks pretty good, but even going through the file like that is a huge job.
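The locking scheme for the shared reader can be sketched like this; I'm using std::thread/std::mutex here just to keep the example self-contained (the project uses Boost.Thread, whose interface is very similar), and the worker body is a counting stub rather than a real aligner:

```cpp
#include <atomic>
#include <istream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Workers pull the next read under a single lock. The EOF check and
// the read happen inside the same critical section; doing them in
// separate critical sections is the end-of-file race described above.
class LockedReader {
 public:
  explicit LockedReader(std::istream* in) : in_(in) {}

  bool NextRead(std::string* read) {
    std::lock_guard<std::mutex> lock(mu_);
    return static_cast<bool>(std::getline(*in_, *read));
  }

 private:
  std::mutex mu_;
  std::istream* in_;
};

// Spawn num_threads workers that drain the input; returns how many
// reads were processed in total.
int ProcessAll(std::istream* reads, int num_threads) {
  LockedReader reader(reads);
  std::atomic<int> processed(0);
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&reader, &processed] {
      std::string read;
      while (reader.NextRead(&read)) {
        ++processed;  // a real worker would align the read here
      }
    });
  }
  for (std::thread& w : workers) w.join();
  return processed.load();
}
```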
Finally, I got a new project from Chris: I'm to build a new rank structure he proposed, which should utilize the cache more efficiently and do away with the wavelet tree. Basically, we will be using two arrays for the popcount, so that we can accurately represent the four bases (as opposed to one array, which would only give us two values). The implementation will involve:
To do:
06/30/2010

Finished up the multi-threading implementation. I've still yet to verify the output (I'm kind of putting it off...). I also ran some tests on skorpios, as follows:
Tested on commit
Same as above, but using
Notes:
Shifting gears, I've also started to implement the new rank structure, as discussed yesterday. I've completed step 1 and can now build one 64-bit block from a 64-character sequence of A/T/G/C. I'm using Daniel's representation for now, where A=0, C=1, G=2, T=3. It's interesting that one rank operation is slightly faster than the other three, so it may be advantageous to assign the faster encoding to whichever base is most common. Determining the most common base is probably too slow, though, and/or the speedup probably wouldn't be worth it. To do:
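Step 1, plus the rank lookup it enables, can be sketched like this (my illustration of the two-bit-plane idea, not the project's actual code; in practice the popcount loop would be a single __builtin_popcountll):

```cpp
#include <cstdint>

// One 64-base block stored as two bit-planes: hi holds the high bit of
// each 2-bit code, lo the low bit. A=00, C=01, G=10, T=11.
struct RankBlock {
  std::uint64_t hi;
  std::uint64_t lo;
};

// Portable popcount; real code would use a hardware/builtin popcount.
static int PopCount(std::uint64_t x) {
  int n = 0;
  for (; x; x &= x - 1) ++n;
  return n;
}

// Build a block from up to 64 bases.
RankBlock BuildBlock(const char* seq, int len) {
  RankBlock b = {0, 0};
  for (int i = 0; i < len; ++i) {
    int code;
    switch (seq[i]) {
      case 'A': code = 0; break;
      case 'C': code = 1; break;
      case 'G': code = 2; break;
      default:  code = 3; break;  // 'T'
    }
    if (code & 2) b.hi |= 1ULL << i;
    if (code & 1) b.lo |= 1ULL << i;
  }
  return b;
}

// Occurrences of `base` among the first i bases of the block. Note that
// T (code 11) needs no complements at all, which may be why one base's
// rank comes out slightly faster than the other three.
int Rank(const RankBlock& b, char base, int i) {
  std::uint64_t mask = (i >= 64) ? ~0ULL : ((1ULL << i) - 1);
  std::uint64_t h = b.hi, l = b.lo;
  switch (base) {
    case 'A': h = ~h; l = ~l; break;
    case 'C': h = ~h; break;
    case 'G': l = ~l; break;
    default: break;  // 'T'
  }
  return PopCount(h & l & mask);
}
```

One caveat of this sketch: unused positions past the sequence length decode as A (00), so rank queries should stay within the built length.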