
02/01/10

Made FastqEntry into a class (it was a struct), which required adding a lot of accessor methods. More importantly, it now has a new MutateBelowThreshold() method, which will:

  • mutate all nucleotides with a quality below a given threshold
  • for every nucleotide, append the 3 mutants and the unmutated copy to the end of a given vector
  • the vector will end up with 4^{number of nucleotides with quality below threshold} entries...
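The combinatorial blow-up above can be sketched as a small recursion. This is a hypothetical stand-in, not the real FastqEntry interface: it takes plain sequence and quality strings, and a character threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of MutateBelowThreshold(): at every position whose
// quality is below the threshold, branch into the unmutated base plus the
// 3 mutants, appending every combination to `out`. Produces 4^k sequences,
// where k = number of below-threshold positions.
void MutateBelowThreshold(const std::string& seq, const std::string& qual,
                          char threshold, std::vector<std::string>* out,
                          std::size_t pos = 0, std::string current = "") {
  if (pos == seq.size()) {
    out->push_back(current);
    return;
  }
  if (qual[pos] < threshold) {
    // Branch over all 4 bases: the original plus its 3 mutants.
    for (char base : {'A', 'C', 'G', 'T'})
      MutateBelowThreshold(seq, qual, threshold, out, pos + 1, current + base);
  } else {
    // High-quality position: keep the original base.
    MutateBelowThreshold(seq, qual, threshold, out, pos + 1, current + seq[pos]);
  }
}
```

With one below-threshold position this yields 4 entries; with k of them, 4^k.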

Finally, I should note that the FASTQ parser (in FastqFile.NextRead()), will fail miserably when faced with Windows style CRLF line endings...

As usual... everything is in the \src folder, and should build by typing the following commands:

  • make
  • test.exe, which will create "output.sam" and print out the list of mutants...

PS: I should probably learn how to create child topics and split this journal up by month... or week, as it's getting a bit long

02/02/10

Chris went over what I had briefly, inserted plenty of comments... Changed things according to the comments. Broke everything... fixed many things. Uploaded working preliminary revisions in scratch/src_temp_revising. Some comments on the comments:

  • Why the double FASTQ in the FastqEntry.h guard? Artifact from my previous attempts of putting subdirectories into the src folder. Makefile didn't like that, so I scrapped that idea, but forgot to change the guard.
  • As for the FastqEntry.Clone() method, I actually created it specifically to avoid creating a copy constructor as suggested here (Decision, 2nd paragraph). I'll make one (along with an assignment operator), just for the fun of it though ;)
  • Awaiting instructions on #include standardization
  • Awaiting instructions on shared project namespace
  • I should mention that I'm typically very generous with vertical whitespace... but reading Google's style guide has made me go to the other extreme (to the dismay of Chris' aging eyes). I will attempt to strike a balance henceforth.
  • Awaiting discussion on logging class for errors
  • I hope it's not overkill making both the method itself and its inputs const (I don't think I've made return values const yet... but I think it's possible)

Will continue tomorrow...

02/03/10

Went and addressed comments from Chris, and attempted to adhere to the guidelines from Google. The revised files are now in scratch/src_temp_revised. Major changes include:

  • Vertical white space
  • Lots of comments... although still not full documentation (e.g. no example usage)
  • const and & in all places possible...

Also met with Anne, and debriefed her on current progress and potential next steps, namely:

  • Another implementation of the MutateBelowThreshold(). Maybe... as we lack the proper formatted data to fully use this... but I'll see what I can do with FASTQ qualities.*
  • Test suites
  • A preview into algorithms, with dynamic programming...

* Proving to be quite difficult... will need to think about this on paper...

Two hours of thrashing, 10 minutes of pen and paper... then 1 more hour of implementation, and I have a newly implemented MutateBelowThreshold() that should (now having just typed it out) first be renamed. Putting that aside, I've got it to work somewhat like what Chris described on the whiteboard. There are some problems with it though:

  • As FASTQ doesn't have quality values for each possible nucleotide at a position, only a single quality for the called base, I had to "make up" qualities for the other nucleotides. (For now, this is the difference with the max quality value found in the sequence)
  • The made up numbers work fine for the algorithm, but are dreadfully incorrect in terms of statistics and biology.
  • If we seriously consider using this method for FASTQ inputs only... I may look into tweaking these values so they reflect the biology and statistics a bit better...
  • My assumption was we would get real quality values for the other nucleotides, which could be used to replace the currently made up qualities.
  • Finally, every mutant created is another FastqEntry, with its own quality. For now, I simply copy the quality string from the wild type to the mutant, but needless to say, the quality doesn't mean much anymore, as the sequence has changed.

I've put the new implementation in scratch/src_temp_revised, completely replacing the old one.

02/04/10

Started reading up on Google Test documentation. Having used JUnit in Eclipse to create test cases for Java, Google Test seems quite familiar due to many similarities (which is also mentioned by the primer document).

I've downloaded and compiled the library in scratch, and I've been able to run the provided example files successfully. I'll try to write a few tests for my own classes just to get a feel for things.

Also attended the Pipeline Taskforce Meeting at the BCCRC today. Contrary to Sam's general lab meeting, these meetings would be very helpful once our implementation gets going, as it's essentially a meeting bringing up bugs and frustrations about their current pipeline. It may be a long while before that happens though...

One potential thing to consider is how to handle reads with large (e.g. 1/3 or length) indels in the middle.

02/05/10

Started writing tests for my own classes. However, as most of the classes are simply glorified containers, there aren't many ways to test except for confirming their contents...

Also starting to think about implementing the Mutate function as an on demand generative iterator, instead of simply returning an entire list.

02/08/10

Finished writing Google tests for FastqEntry, FastqFile and SamFile. However, as previously mentioned, the lack of methods in these classes makes the tests rather... lacking.

Some points:

  • Currently, tests for all three classes are in one huge test.cc file. I assume we should eventually have a separate file for the tests of every class.
  • The tests themselves are not that good... many are dependent on each other (well, maybe this is a good thing...) and not very exciting in terms of computations.
  • It compiles with a lot of warnings... (all of them about comparing signed with unsigned integers...). Do I simply cast them to make the warnings go away?
  • All files in /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/src_temp, complete with working Makefile.
  • Success of tests seems to be platform dependent... running the same tests on my Windows system gets quite a few failures.
    • The problem seems to be specifically about how file pointers are initialized. On both systems, file pointers are "good" (via good()) even prior to opening a file. I didn't want this, so I manually set the failbit in the constructor. The Linux init() seems to update this failbit correctly, while Windows does not. Maybe I should define my own "failbit" instead of using the standard library's...
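The portability trap above can be shown in a few lines. This is a sketch, not the real FastqFile: a hypothetical wrapper that tracks its own readiness flag instead of poking the stream's failbit, sidestepping the platform difference.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical wrapper illustrating the issue: a default-constructed
// ifstream reports good() even though no file is attached. Rather than
// manually setting std::ios failbit (whose behaviour after open/init
// differed between our Linux and Windows builds), the wrapper keeps its
// own "ready" flag driven by is_open().
class FastqFileSketch {
 public:
  bool Init(const std::string& path) {
    file_.open(path.c_str());
    ready_ = file_.is_open();
    return ready_;
  }
  bool ready() const { return ready_; }

 private:
  std::ifstream file_;
  bool ready_ = false;  // our own flag, portable by construction
};
```

The surprising part is that a bare `std::ifstream` is good() before any open() call, which is what made the original code misbehave differently per platform.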

Next up... the iterator. For now, I'm completely ignoring the C++ iterator part of it, and like Chris suggested, concentrate on getting a working algorithm. At the moment... it's slowly taking shape as some sort of state machine, which I hope will fit nicely into an iterator that simply tracks the current sequence and state to output the next sequence.

02/09/10

Clarified some points with Chris concerning the iterator. As previously mentioned, still ignoring the iterator portion. So far, it looks like the algorithm will do the following:

  • Take a sequence and a threshold phred quality score
  • Ignore all nucleotides with qualities above this threshold, and potentially mutate all remaining nucleotides below the threshold
  • Create mapping between position and quality for nucleotides below threshold
  • Sort mapping by increasing quality
  • Mutate in a specific order according to the sorted map (with changes applied to positions on the sequence), i.e.:
    • Consider all 1 nucleotide mutations first, starting from the lowest to highest
    • Consider all 2 nucleotide mutations next... and etc.
  • The actual order of mutation is a specific pattern easily represented as a single array the size of the sequence, and incremented in an easy algorithm, and strictly dependent only on the previous pattern (plus 1 other variable... I think)
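The ordering described above (all 1-site mutations first, then all 2-site mutations, and so on, over the quality-sorted positions) can be sketched by enumerating position subsets by size. This shows only the ordering, not the single-array increment trick; the function name and bitmask representation are my own for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the mutation order: given the below-threshold positions already
// sorted by increasing quality, list every subset of sites to mutate,
// all 1-site subsets first, then all 2-site subsets, etc. Each subset is a
// bitmask over the sorted positions (fine for small k; the journal's
// pattern-array algorithm avoids materialising all of these at once).
std::vector<std::vector<std::size_t>> MutationOrder(
    const std::vector<std::size_t>& sorted_positions) {
  const std::size_t k = sorted_positions.size();
  std::vector<std::vector<std::size_t>> order;
  for (std::size_t weight = 1; weight <= k; ++weight) {
    for (unsigned mask = 1; mask < (1u << k); ++mask) {
      if (std::bitset<32>(mask).count() != weight) continue;
      std::vector<std::size_t> subset;
      for (std::size_t i = 0; i < k; ++i)
        if (mask & (1u << i)) subset.push_back(sorted_positions[i]);
      order.push_back(subset);  // one combination of sites to mutate
    }
  }
  return order;
}
```

For k positions this produces 2^k - 1 non-empty subsets, grouped by increasing mutation count.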

Of course, this is all still in my head, and I came to this "easy algorithm" after failing miserably and wasting 2 hours creating an extremely inefficient (and incorrect) Turing Machine-styled algorithm (yes... it read one symbol at a time until it found specific symbols... unfortunately, it ended up requiring a dynamically increasing number of loops for every mutation it considered... I was essentially hard-coding patterns until I gave up after 2 mutations). Putting that aside... I'm pretty sure this new algorithm should work out fine (at the very least, I can verbally describe it in a few consistent and general steps that also seem to handle the edge cases well). I'll find out tomorrow.

02/10/10

I've got the bulk of the iterator completed, based on the algorithm I came up with last night. It works correctly... assuming you do not overshoot the sequence and cause a seg fault (I've yet to define itr.end() properly). I've still got a few loose ends to tie up, namely:

  • Implement FastqEntry.MutateBelowThreshold() (I've reverted the name once again...) so it returns a properly initialized iterator.
  • Define iterator.end() and iterator.begin()
  • Define == and != for the aforementioned functions to work properly
  • Potentially extend the C++ iterator class for compatibility...
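The == and != pair on the list can be sketched as below. The struct is hypothetical (the real FastqEntryIterator state is richer); the point is defining != in terms of == so the two never disagree, and making all "end" iterators compare equal.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal iterator state: a position index plus a done flag.
struct MutantIteratorSketch {
  std::size_t index = 0;
  bool done = false;
};

// Two iterators are equal if both are at the end, or neither is and they
// point at the same position.
bool operator==(const MutantIteratorSketch& a, const MutantIteratorSketch& b) {
  return a.done == b.done && (a.done || a.index == b.index);
}

// Define != via == so the operators can never fall out of sync.
bool operator!=(const MutantIteratorSketch& a, const MutantIteratorSketch& b) {
  return !(a == b);
}
```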

There is one final "small" detail that I thought would be trivial... but is coming back to haunt me, which is getting every mutation site to cycle through all 3 other nucleotides (and the 3 nucleotides to mutate will vary... according to the one originally present). Currently, I never worried about it, as I was simply mutating a string of A's to N's. I have an idea how to tackle this, though it will probably double the length of the code already present...

After I get something satisfactory, I'll get it onto the server, and starting implementing some basic dynamic programming alignment algorithms.

02/11/10

Worked out some pseudocode that should resolve the "trivial" detail I mentioned yesterday. Also confirmed with Chris that I should extend C++'s iterator class, so that should keep me occupied for a few days.

Also dropped by the BCCRC and finally got my CRC account working, was able to check my CRC mailbox, and realized I had unfortunately missed a presentation by Sohrab last Monday.

02/13/10

I have resolved the "trivial detail", and besides some odd behavior at the end and start cases, I have a FastqEntryIterator that generates all unique mutants with all nucleotide patterns. Some final things to sort out:

  • Make the begin() and end() functions in FastqEntry spawn the respective FastqEntryIterator objects.
  • There probably will not be a begin() function, and will in reality be the MutateBelowThreshold() function that acts identically to begin(), but takes a numerical threshold as an argument.
  • Fix the start and end cases...
    • I know the start is funny, probably not because of the implementation, but because the way prefix-incrementation works. I will probably resolve it by creating a postfix-incrementation function.
    • I've yet to sort out the end case, except I know that it ends prematurely...
  • Finally... extend the C++ iterator class.

02/14/10

The iterator is complete, and the toy executable /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/iterator/test demos the iterator in a small loop (run it, then hit ENTER to obtain next item). Some details:

  • End cases have been resolved, through implementation and documentation
    • The first element of the list of mutants is the original sequence (i.e. 0 mutations).
    • Incrementing the iterator beyond the end will cause a segmentation fault
  • ==, != and end() have been declared in their respective locations to keep clients from overshooting the iterator
  • The class extends STL iterator and is tagged as an output_iterator, so it should theoretically work with certain STL Algorithms... in theory...
  • FastqEntry has undergone a huge makeover to accommodate the iterator. Including an end() function. Note there is no begin() function, but instead a MutateBelowThreshold() which acts identically, except that it takes a threshold int.
  • I've disallowed copy and assignment operations for the iterator
  • There is no post-increment, although the implementation is trivial using pre-increment.

One concern about the mutant FastqEntry objects returned is that only the sequence is changed, while the quality and description string are identical to the original wild type.

  • Should the description be altered to differentiate from the original? (e.g. appending an identification number? easily achieved with a counter variable in the iterator)
  • Should the quality be altered, as they clearly do not represent the true qualities of the mutated nucleotides? (e.g. we could set them to 0, although this would require some minor implementation changes)

Finally, I now understand the meaning of this Google Style rule. It was a hard-learned lesson after creating a circular header file reference with header guards... causing the compiler to prematurely exit a file without reading the definitions.

I guess I should get to creating Google Tests for the new function. One concern I have is the fact that Google Tests all seem to be black box tests, meaning I can only infer the correctness of all the complicated private functions through the output of one single public function. But I guess that's all that matters to the client...

After the tests, I'll start on some dynamic programming.

02/15/10

The tests are done, and the executable is /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/src_temp/test. I've also made some slight changes to the file IO (for both SAM and FASTQ) classes so:

  • The Open() function is now known as Init(), to conform to the Google style guide
  • file pointer statuses are explicitly updated, so platform-dependent errors no longer occur

Next up... Implementing a "basic Smith-Waterman [and] Needleman-Wunsch algorithm".

I've looked over the lectures suggested by Chris, searched up a few pseudocode examples, and had no problems with the theory of things. Fortunately, having learned Smith-Waterman in a previous CPSC course, and having just finished the Viterbi dynamic programming algorithm last week as an assignment, I have a pretty good idea of how to start the code implementation tomorrow.

02/16/10

Finished a working and (AFAIK) bug-free Needleman-Wunsch global aligner. The class is in /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/global/, and a toy executable test is in there as usual (and yes, I remembered to set group permissions this time). It does the following:

  • Given 2 sequences (of lengths x and y), construct an object and initialize two matrices (score and pointer), both (x + 1) by (y + 1) in size.
  • Initialize the matrices
  • Iterate through the entire score matrix to determine the max score values of each sub-alignment, recording the path taken in the pointer matrix
  • Retrieve all alignments with the maximum score with a recursive Traceback function
  • Return all optimal alignments in a vector< pair< string,string > >... or pairs of strings in an array
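A score-only version of the fill described above can be sketched compactly (the pointer matrix and recursive Traceback are omitted here). The scoring constants are assumptions for illustration: match +1, mismatch -1, linear gap -1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the Needleman-Wunsch score matrix fill (global alignment,
// linear gap penalty). Returns only the optimal score; the real class also
// records a pointer matrix for traceback.
int NeedlemanWunschScore(const std::string& a, const std::string& b) {
  const int kMatch = 1, kMismatch = -1, kGap = -1;
  std::vector<std::vector<int>> score(a.size() + 1,
                                      std::vector<int>(b.size() + 1, 0));
  // First row/column: pure gaps.
  for (std::size_t i = 1; i <= a.size(); ++i)
    score[i][0] = static_cast<int>(i) * kGap;
  for (std::size_t j = 1; j <= b.size(); ++j)
    score[0][j] = static_cast<int>(j) * kGap;
  for (std::size_t i = 1; i <= a.size(); ++i)
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int diag = score[i - 1][j - 1] +
                 (a[i - 1] == b[j - 1] ? kMatch : kMismatch);
      score[i][j] = std::max({diag, score[i - 1][j] + kGap,
                              score[i][j - 1] + kGap});
    }
  return score[a.size()][b.size()];  // bottom-right cell = global score
}
```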

Some notes on the implementation:

  • Works with sequences of length 0, unknown behavior with very long strings...
  • I have learned that switch statement cases require braces if local variables are declared in C++...
  • I tried manually allocating 2D arrays with int**, but the destructor (where I explicitly delete []) would cause an error (Google suggests I was freeing memory I wasn't supposed to, to my extreme confusion). Rather than assuming that the array was indeed being freed by something somewhere I had no clue of... I opted to use vectors, which I knew were objects that would free themselves after use (I hope...).
  • The most difficult part was figuring out the Traceback() that would generate all maximal alignments, and how the pointer_matrix_ had to be filled to accommodate this
  • Testing was not extensive, but I was able to recreate the example seen here
  • I've been commenting a lot, but I have this feeling it's not following the Google style guide...
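The switch-statement rule mentioned in the notes comes down to scope: a case label that declares a local needs its own braces, or the declaration's scope extends into the later cases and the compiler rejects jumping past the initialization. A minimal example (the function and its values are illustrative only):

```cpp
#include <cassert>

// Without the braces around case 'D', `score`'s scope would extend to
// case 'L', and jumping to 'L' would bypass its initialization -- a
// compile error in C++.
int Classify(char pointer) {
  switch (pointer) {
    case 'D': {  // braces required: `score` is local to this case
      int score = 1;
      return score;
    }
    case 'L':
      return 2;
    default:
      return 0;
  }
}
```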

Having finished this implementation, I imagine implementing Smith-Waterman should be easier tomorrow.

02/17/10

Finished Smith-Waterman, found in /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/local/. Some details:

  • Copy and pasted files from yesterday, did mass string replace of "Global" to "Local" on everything from filename to header guards.
  • Changed how the grid was initialized
  • Added an additional case to Max (i.e. 0)
  • Added a few class members to track the grid(s) with the maximum score
  • Added some code in the matrix population loop to track and update the grids with the maximum score
  • Added a loop to StartTraceback(), so all subsequences starting from all maximal score grids would be returned...

The most difficult part was introducing the changes to track, update, and retrieve all subsequence alignments, done with yet another vector.

I will give the Affine Gap Model a go next...

As mentioned this morning, I have implemented the affine gap model with Needleman-Wunsch located at /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/global_affine/. Due to the rather odd way it aligns, I'm not quite sure how to confirm it's actually working correctly. Ran into some rather silly problems:

  • I defined gap opening and extending penalties as negative values, and I was subtracting them in the matrix calculation. Of course, the alignments that came out favoured gaps rather than actual matches, which took me some time to catch...
  • Playing with the static class constants kGapOpen, kGapExtend, kMatch and kMismatch (prefixed k as Google style dictates), really alters the outcome of alignments... I've yet to find a combination that I'm satisfied with.
  • In the initialization of a few matrices, there is a requirement to make the value as negative as possible, so I tried using INT_MIN. Unfortunately, the algorithms then subtract from this value, so the number wraps around and becomes a huge positive integer. Needless to say, this spells chaos for the alignment. Thus, I've defined yet another static constant kNegativeValue to be -1000000.
  • Overall, some rather confusing alignments appear, which I assume is due to the combination of global alignment and affine gap penalties. I hope this scoring system does better with local alignments, which I will try tomorrow.
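Both gotchas above (adding, not subtracting, the negative penalties, and using a safe sentinel instead of INT_MIN) show up in a score-only sketch of the global affine-gap recurrence (Gotoh's three-matrix formulation). The constants here are assumptions, not the project's tuned values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Score-only global alignment with affine gaps: M = match/mismatch state,
// Ix = gap in b (consuming a), Iy = gap in a (consuming b). Penalties are
// negative and *added*; kNegInf is a large-but-safe "minus infinity" so
// that adding kGapOpen/kGapExtend to it cannot wrap around like INT_MIN.
int GlobalAffineScore(const std::string& a, const std::string& b) {
  const int kMatch = 2, kMismatch = -1, kGapOpen = -4, kGapExtend = -1;
  const int kNegInf = -1000000;
  const std::size_t n = a.size(), m = b.size();
  std::vector<std::vector<int>> M(n + 1, std::vector<int>(m + 1, kNegInf));
  std::vector<std::vector<int>> Ix = M, Iy = M;
  M[0][0] = 0;
  for (std::size_t i = 1; i <= n; ++i)
    Ix[i][0] = kGapOpen + static_cast<int>(i) * kGapExtend;
  for (std::size_t j = 1; j <= m; ++j)
    Iy[0][j] = kGapOpen + static_cast<int>(j) * kGapExtend;
  for (std::size_t i = 1; i <= n; ++i)
    for (std::size_t j = 1; j <= m; ++j) {
      int s = (a[i - 1] == b[j - 1]) ? kMatch : kMismatch;
      M[i][j] = s + std::max({M[i-1][j-1], Ix[i-1][j-1], Iy[i-1][j-1]});
      // Opening a gap pays kGapOpen + kGapExtend; extending pays kGapExtend.
      Ix[i][j] = std::max(M[i-1][j] + kGapOpen + kGapExtend,
                          Ix[i-1][j] + kGapExtend);
      Iy[i][j] = std::max(M[i][j-1] + kGapOpen + kGapExtend,
                          Iy[i][j-1] + kGapExtend);
    }
  return std::max({M[n][m], Ix[n][m], Iy[n][m]});
}
```

With these constants, a single 2-base gap costs -6 total, so long gaps are much cheaper per base than repeatedly opened short ones, which is the whole point of the affine model.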

02/18/10

Combining everything over the last two days, I have implemented a Smith-Waterman local aligner with an affine gap penalty. Not much to mention, as it took 20 minutes of actual coding. Located at /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/local_affine/ with proper permissions and test executable. Assuming I can get there via transit, I'll be meeting with Chris at the BCCRC this afternoon.

Met with Chris and Andrew this afternoon... listened to lots of technical talk that went over my head. More importantly, talked with Chris about the next few things to consider, including:

  • A rigorous test suite, which may involve hand computed alignments for specific cases, then randomized runs confirming that different implementations output the same alignments
  • Have a unified interface for all the alignment implementations; I imagine this would literally be an OOP interface
  • Allow user-defined scoring values
  • Compare performance between allowing user-defined scoring values, and hard-coded scoring values
  • Compute only the alignment score, and not the matrix, given two sequences (for better performance)
  • Only recompute the entire matrix and retrieve the alignment when the operation is explicitly called
  • Some performance testing suite...

02/19/10

Attempted to find some Smith-Waterman code using affine gap penalties in order to check the correctness of my implementation, see how other people were doing it, and find a set of penalty constants to use in my own tests. I have indeed found a few rather interesting ones, and I'll take some time to look carefully at them some other day.

Also found that Google Test actually does display the elapsed times for tests, so that may be a potential way to compare run times, without the need to use the Boost timer class (which, AFAIK, also simply reports elapsed time). As for CPU time splits, I've previously done this in assignments with the simple unix time command, but I'm unsure if that'll be enough for extensive testing and comparison.

Either way, I will start with modifying my local aligner as follows:

  • Does not create an object to align every pair of sequences (i.e. create a single reusable object with a function that takes in and aligns pairs)
  • Does a ScoreOnly alignment (which will only require two matrix rows) which only returns alignment scores.
  • Has a FullAlignment when required (which will require a full matrix) and return aligned sequences.
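The two-row trick behind ScoreOnly can be sketched for the linear-gap local case. This is an illustration of the space-saving idea, not the project's LocalAligner API; scoring constants are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Smith-Waterman score with a linear gap penalty, keeping only two matrix
// rows: space is O(n) in the read length rather than O(mn). The max cell
// value anywhere in the (virtual) matrix is the local alignment score.
int ScoreOnlyLocalAlignment(const std::string& ref, const std::string& read) {
  const int kMatch = 2, kMismatch = -1, kGap = -1;
  std::vector<int> prev(read.size() + 1, 0), curr(read.size() + 1, 0);
  int best = 0;
  for (std::size_t i = 1; i <= ref.size(); ++i) {
    curr[0] = 0;  // local alignment: first column stays 0
    for (std::size_t j = 1; j <= read.size(); ++j) {
      int s = (ref[i - 1] == read[j - 1]) ? kMatch : kMismatch;
      curr[j] = std::max({0, prev[j - 1] + s,  // 0 = restart the alignment
                          prev[j] + kGap, curr[j - 1] + kGap});
      best = std::max(best, curr[j]);
    }
    prev.swap(curr);  // recycle the rows instead of storing the full matrix
  }
  return best;
}
```

FullAlignment needs the whole matrix because traceback has to walk back through it; ScoreOnly never looks more than one row back, which is why two rows suffice.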

Other things will hopefully follow.

02/22/10

Finished making the three aforementioned changes to my linear gap penalty aligner, but modifying this to an affine gap penalty should be relatively simple tomorrow. Some details:

  • The constructor LocalAligner() takes a string reference as an argument, to which all reads are aligned. I could remove this to make the function more general (it's just adding an argument, and replacing one variable name), but I assume we align multiple reads to a single reference (and it makes downstream parameter passing a bit simpler).
  • On that note, as a single aligner object can be used repeatedly to align any number of sequences, there are essentially no class members. Thus, functions used in the alignment process have very long parameter lists, and the class is one parameter (the reference sequence) short of becoming a static class (using Java terminology).
  • There is now a ScoreOnlyAlignment and a FullAlignment function, the former requiring O(n) space while the latter takes O(mn), where n is the read sequence length and m is the reference length. (I assume the reference length is greater than the read length, but it should have no effect on the correctness of the algorithm).
  • ScoreOnlyAlignment uses variable length arrays (a GCC extension rather than standard C++), which was somewhat new and interesting for me, and I hope will be safe in terms of memory management.

Also set up my public git repository in /ubc/cs/research/condon/people/jujubix/repo/ngsa_suite.git, and a private repository on my home Windows computer (using Cygwin, after installing ssh, git, and cmake... for later).

I'll try to start tossing classes into their respective folders in my private repository and push to my public one after everything's sorted out (including figuring out CMake).

Upcoming next:

  • Modify the local aligner to have an ScoreOnlyAlignment with an affine gap penalty.
  • Figure out what CMake is and how to make it work...
  • Figure out what to put where in my private repo and start pushing to my public repo
  • Create a C++ abstract class as a unified interface for all aligner classes (C++ has no "interface" class that Java has, but an abstract class is essentially the same)
  • Performance and correctness testing suites, I've looked into the Boost timer class, which seems pleasantly simple enough to work with. (Create object to start, destroy to get elapsed time)

02/23/10

Did not manage to get any coding done today, as I was set back by a few rather odd Windows problems. In point form:

  • Due to some rather bad PATH environment variables, my make and GCC were actually being run from a version included in a Python installation I had
  • CMake (from Cygwin) didn't like this version for reasons unknown, so I started installing new versions and playing with the PATH
  • Long story short... after trying a lot of odd things (like installing CMake for Windows and MinGW), I found it easiest to have everything installed to Cygwin and to set the correct PATH

After getting CMake working and running through some CMake tutorials, I finally figured out enough to make sense of Chris's instructional email. Well, more like I finally realized that CMakeLists.txt files were the key to everything, and I'm going to have to start learning how to write them.

Then dropped all my existing files into their respective folders in my private repo, made a branch, checked out the branch, added everything, committed everything, (installed vim to Cygwin, as the commit prompt ran in vi... and then had to learn how to use some vim commands), and finally pushed everything to my public repository, following Chris's email, and this nice tutorial.

I will next attempt to make CMakeLists.txt files to get everything compiled (I will probably ignore gtest at first, as the thought of that makes me cry at the moment)... and resume coding once everything works.

I guess I should also mention that I'm to learn about rank and select queries for an upcoming talk, which I assume will eventually apply to navigating BWT indexes of the reference.

02/24/10

By suggestion of Chris, I have put my talk (scheduled for March 16th) first priority until preparations are complete. Thus, time was spent reading the documents Anne sent me. Some notes:

  • While Anne suggested reading sections 6 and 6.1 of Navarro, I was extremely confused by the notation and context, so I decided to read everything up to and including 6.1.
  • Most basic concepts in the first few sections were okay (very tough... but okay), except for everything concerning k-th order empirical entropy... which I hope won't be key to understanding things down the road. In the case that it is, I may require help.
  • The concept of rank and select seem simple enough, but the algorithms elude me. However, I think I should be able to get it after some concrete examples with pen and paper, past all the abstract values and unfamiliar variable names...
  • As I'm still struggling with the basic implementation, I'm going to leave the broadword implementation paper for later...

02/25/10

Went over Navarro section 6.1 in detail, and I think I have a pretty good grasp of the implementation of rank, and an iffy grasp of select, in n + o(n) space. To summarize (well... I should disclaim that what started out as an attempt to make a nice summary has spiraled into a gruesome notational self-expository as I attempted to make sense of the things I was typing; nonetheless, a very educational experience for myself):

  • Given an uncompressed bit vector B of length n
  • For rank_1(B, i):
    • Divide the sequence into superblocks of length log^2(n)
    • Have a corresponding dictionary superblockrank with n/log^2(n) entries (1 per superblock), having values corresponding to the cumulative number of 1s seen up to the i'th superblock, where 0 < i < n/log^2(n) (i.e. stores the number of 1s up to a certain superblock)
    • For each superblock, divide it into log(n) blocks, each log(n) in length, all corresponding to a second level dictionary: blockrank
    • blockrank has log(n) entries per block, for a total of n/log^2(n) * log(n) = n/log(n) entries.
    • Rather than thinking of blockrank as a single large dictionary, it is easier to picture every superblock having its own blockrank dictionary of size log(n), and the entries of the blockrank contain the relative number of 1s between the current blockrank and its respective superblock. (i.e. superblockrank is the "base" and blockrank is the "displacement", and sum to the actual answer)
    • Finally, there is a global table (or 2D matrix) smallrank, of size [2^t][t], where t = log(n)/2. Running vertically, it lists all the unique binary strings of length t (thus, 2^t of them), and running horizontally, it lists the number of 1s up to the i'th position of the unique binary string, for 0 < i < t (i.e. the precomputed ranks of all binary strings up to length t)
    • Thus, in constant time, we can lookup rank_1(B, i) = superblockrank[floor(i/log^2(n))] + blockrank[floor(i/log(n))] + smallrank[B[floor(i/log(n))log(n) + 1 : floor(i/log(n))log(n) + log(n)]][i - floor(i/log(n))log(n)]
    • *In retrospect, that last portion is incorrect, as the remainder of the query after blockrank may be [0, log(n)), and as the max length of a query to smallrank = t = log(n)/2, there may need to be up to two queries to smallrank... so that last query should be to "rank1", and in reality be a few lookups to smallrank (again... this whole expository is more to my own benefit than anyone else...)
    • Now, in layman terms... given some position, use superblockrank to count the number of 1s contained in all preceding (non-inclusive) superblocks, then repeat with blockrank, and finally use smallrank (up to 2 times) to get the exact count.
    • Noting that rank_0(B, i) = i - rank_1(B, i), so this solves rank_0
  • For select_1(B, i):
    • Similar... well, not really, but indexed over all possible arguments (i.e. the number of 1s in the sequence of size n, call this m)
    • Divide all possible arguments into superblocks of log^2(m), with a corresponding superblockselect similar to above
      • Two types of superblocks exists, those spanning more than log^4(m) bits in B are long, and those less than are short
        • The answers to select queries landing in long superblocks are explicitly stored (due to something about the fact that the values the answers can take in such a large range are not guaranteed to all fit in a limited number of bits)
        • short superblocks (having log^2(m) arguments, and spanning less than log^4(m) bits in B) are split into miniblocks of log^2(log^2(m)) arguments
    • miniblocks are split into two types again, long if they span more than log(m) bits in B, and short otherwise
      • Answers to long miniblocks are stored explicitly
      • Answers to short miniblocks are stored in a precomputed table analogous to smallrank

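The base-plus-displacement structure for rank can be shown with a toy sketch over a plain bit vector. This uses concrete block sizes (4-bit blocks, 16-bit superblocks) in place of log n and log^2 n, and scans one block directly instead of consulting a smallrank table, so the final step is O(block size) rather than true O(1); the layering is the point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy superblockrank/blockrank sketch: superblockrank stores the "base"
// (1s before each superblock) and blockrank the "displacement" (1s from the
// superblock's start to each block's start); their sum plus a short scan
// inside one block answers the query.
class RankSketch {
 public:
  explicit RankSketch(const std::vector<bool>& bits) : bits_(bits) {
    std::size_t super = 0, block = 0;
    for (std::size_t i = 0; i < bits_.size(); ++i) {
      if (i % 16 == 0) { superblockrank_.push_back(super); block = 0; }
      if (i % 4 == 0) blockrank_.push_back(block);
      if (bits_[i]) { ++super; ++block; }
    }
  }
  // rank1(i) = number of 1s in bits[0..i] inclusive, for i < bits.size().
  std::size_t rank1(std::size_t i) const {
    std::size_t r = superblockrank_[i / 16] + blockrank_[i / 4];
    for (std::size_t j = (i / 4) * 4; j <= i; ++j)  // scan within one block
      if (bits_[j]) ++r;
    return r;
  }

 private:
  std::vector<bool> bits_;
  std::vector<std::size_t> superblockrank_;  // 1s before each superblock
  std::vector<std::size_t> blockrank_;       // 1s from superblock start to block start
};
```

rank_0 then comes for free as (i + 1) - rank1(i) under this inclusive convention.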
I'm much more comfortable with rank than select, and I should mention this layer system feels somewhat similar to skiplists I previously learned, and it wasn't too bad once I looked past all the logs and space calculations riddling the paper...

I've started the broadword paper, but got sidetracked before any real progress was made, will continue tomorrow.

Edited to add:

After some quick thought, I recalled that an integer of value x can be represented in a binary string of length floor(log(x)) + 1, or O(log(x)). Thus, multiplying the number of entries by the number of bits required to represent the maximum value that every dictionary value can take, the bit memory space required for the three levels of the auxiliary structure for rank is:

  • superblockrank = number of superblockrank entries * bits needed to represent integer n (i.e. length of B) = n/log^2(n) * log(n) = O(n/log(n)) bits
  • blockrank = number of blockrank entries * bits needed to represent integer log(n) = n/log(n) * log(log(n)) = O(nlog(log(n))/log(n)) bits
  • smallrank = product of matrix dimensions(2^t and t, where t = log(n)/2) * number of bits required to represent integer t = n^(1/2) * log(n)/2 * log(log(n)/2)= O(n^(1/2)log(n)log(log(n)))

And with that, I finally understand where Navarro got all those numbers, and why this solution is possible in n + o(n)-bit space... which of course, has little to do with understanding the implementation, but it's just a personal relief (and surprise) that all the numbers worked out...

02/26/10

Went over Vigna's broadword implementation of rank today, some comments about the paper:

  • It is riddled with errors, some harmless, others not so (I also got another version of the paper off his website, which unfortunately contains the same errors, plus a frightfully enlarged section on k-bit comparisons):
    • The most serious are two mistakes in Algorithm 1, which, as you can imagine, didn't help with the understanding
    • To resolve the above error, I attempted to access source code via the provided URL... which also contains a typo
    • Google was used to resolve both aforementioned problems, and for once, I was more glad to stare at the source code than the paper
    • A few other less significant problems were also present, but corrections were either obvious, or the error was very unobvious

Now, I will summarize the implementation of Vigna's rank9. Again, this is more for me, and not for the reader...:

  • Given a bit vector B of length n
  • Imagine the string stored in 64-bit memory, thus, every word is 64-bits
  • Thus, B has n/64 words
  • Partition all words into groups of 8, called a basic block (analogous to superblock)
    • I will refer to the words in basic blocks with indexes [0,7]
  • Every basic block is associated with 2 entries within a dictionary, I will call count
    • Each entry is a word, or 64-bits
      • Entry Type A: The first word contains the rank of basic block 0
      • Entry Type B: The second word contains seven 9-bit values, representing the rank count relative to block 0 for blocks 1 to 7
    • Thus, the Type A is analogous to superblockrank, while Type B is analogous to blockrank
    • Notice that Type A and Type B are contiguous in the same structure, meaning that a single cache miss to Type A will bring in its respective Type B to ensure a subsequent hit. (versus two separate structures in "basic solution")
    • Assuming count[0] is Type A, then all even count entries will be A, while all odd entries will be B
  • There is no structure analogous to smallrank, instead sideways addition is used to determine the rank of any sub-sequence within a word
  • Similar to the basic solution then, given some position i in a rank query:
    • Using some tricky integer division and mod operations, we can determine the exact Type A cumulative count entry to look up
    • + 1 to that index to obtain the value of the Type B entry, then apply a LOT of bit shifting and masking to extract the correct 9-bit relative count value
    • Some final bit-masking and sideways addition to find the rank of sub-sequences within words
    • Sum all three values to obtain the correct rank
  • Memory requirement
    • As every 8 word basic block requires 2 word-length entries from a count structure, it requires only 25% additional space to answer rank in constant time
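The sideways-addition step itself is short enough to write out. This is the classic broadword popcount (the well-known bit-twiddling version, not copied from Vigna's source): pairwise bit sums, then nibble sums, then a multiply that gathers the byte sums into the top byte.

```cpp
#include <cassert>
#include <cstdint>

// Broadword population count of a 64-bit word, i.e. the "sideways addition"
// rank9 uses in place of a smallrank table.
uint64_t SidewaysAddition(uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);           // 2-bit partial sums
  x = (x & 0x3333333333333333ULL) +
      ((x >> 2) & 0x3333333333333333ULL);               // 4-bit partial sums
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;           // 8-bit partial sums
  return (x * 0x0101010101010101ULL) >> 56;             // sum all 8 bytes
}
```

Masking the word down to the bits below the query position and then applying this function gives the in-word portion of a rank query.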

I will note that I spent quite a lot of time convincing myself that sideways addition really worked, and I'm still working out a few mental knots when it comes to all the bit masking and shifting.

Having just typed all that out though, I've come to believe that maybe I don't have to understand the exact details to present the overall idea...

I will probably have to accept this train of thought as I move onto select9, which seems to include quite a few complex bit operations...

Topic revision: r3 - 2010-05-27 - jujubix
 