
Daniel's January 2010 Research Journal

01/20/10

Spent the afternoon familiarizing myself with the BETA Lab workstation leros and the following NGS aligners (and tool) currently available:

Binaries of the programs for Linux i686 were deposited in /ubc/cs/research/condon/share/projects/ngs-aligner/scratch, and example inputs (when provided) were attempted.

Unfortunately, the examples were often too simplified (assuming they even worked) to give a real appreciation of the entire interface.

01/21/10

Pipeline Taskforce Meeting with Sohrab at the BCCRC. Based on what little I was able to understand, some points that (I think) were brought up:

  • Current downstream analysis is limited by high false positive rates (30% to 40%), potentially caused by sequencing errors and/or bad alignments.
  • The exact reason for bad alignment is unknown; it'd be nice if we could characterize sequences with this property... e.g. low complexity?
  • Let's start out our aligning approach using full Smith-Waterman on all reads (rough sketch after this list)... (I think Mosaik does something similar, but the paper is unfortunately still in manuscript)
  • From this baseline, we then attempt to trim and optimize
  • A specific example to watch out for: polynucleotide stretches... and low-complexity regions in general
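
So I don't forget the gist of it: a scoring-only Smith-Waterman toy I sketched afterwards (my own code, nothing to do with Mosaik's internals; the match/mismatch/gap values are arbitrary placeholders, and the traceback needed to recover the actual alignment is omitted):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Scoring-only Smith-Waterman: fill the DP matrix, return the best
    // local alignment score found anywhere in it.
    int SmithWatermanScore(const std::string& read, const std::string& ref) {
      const int kMatch = 2, kMismatch = -1, kGap = -2;  // Placeholders.
      std::vector<std::vector<int> > h(read.size() + 1,
                                       std::vector<int>(ref.size() + 1, 0));
      int best = 0;
      for (size_t i = 1; i <= read.size(); ++i) {
        for (size_t j = 1; j <= ref.size(); ++j) {
          int diag = h[i - 1][j - 1] +
                     (read[i - 1] == ref[j - 1] ? kMatch : kMismatch);
          int up   = h[i - 1][j] + kGap;    // Gap in the reference.
          int left = h[i][j - 1] + kGap;    // Gap in the read.
          h[i][j] = std::max(0, std::max(diag, std::max(up, left)));
          best = std::max(best, h[i][j]);   // Local: take the best cell.
        }
      }
      return best;
    }

Quadratic in read length times reference length, which is presumably why nobody runs it genome-wide as-is... hence "trim and optimize".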

Then... talk shifted away from aligning general full-genome reads, to aligning transcriptome-derived reads to reference sequences.

  • Chris's original plan: design a data structure representing splices as a graph, instead of as a list of junctions (rough sketch after this list)
  • Advantages would include the easy addition of splice sites and representation of alternative splicing
  • Structure could be overlaid onto the full genome, and one could toggle easily between the full genome and transcripts
  • Chris may or may not have to rethink a few details of this structure to include exome information too...
  • Exome data, aligned to a full genome reference (unfortunately... or else a smaller reference would make this faster), can detect non-expressed mutant alleles or mono-allelically expressed genes
  • Also keep RNA edits in mind, which may result in variant transcripts from a non-variant genome.
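
Just to get my head around it, a hypothetical sketch of what such a splice graph might look like (entirely my own guess, not Chris's actual design): exons become nodes anchored to genome coordinates, and each directed edge is a splice.

    #include <vector>

    struct ExonNode {
      long genome_start;            // Offset into the full genome reference.
      long genome_end;              // Exclusive end; length = end - start.
      std::vector<int> splice_out;  // Indices of exons reachable by a splice.
    };

    struct SpliceGraph {
      std::vector<ExonNode> exons;

      // A newly discovered splice site is a single edge insertion; no
      // re-indexing of reads or genome required.
      void AddSplice(int from_exon, int to_exon) {
        exons[from_exon].splice_out.push_back(to_exon);
      }
    };

Alternative splicing then falls out naturally as multiple out-edges from one exon, and the genome anchoring is what would let us toggle between transcript and full-genome views.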

Back to the tools...

  • They use SAMtools, so we're aiming to output a SAM (or its binary equivalent BAM, according to Anamaria)
  • It'd be nice if the aligner could bootstrap SNP results, propagating newly discovered SNPs to subsequent alignments (simple with a splice graph... doesn't require complete re-indexing of reads)
  • There were three probabilities deemed useful for downstream analysis:
    • "Alignment Probability": As Sohrab puts it, what's the probability that the read in question actually arose from the aligned portion of the reference genome. It is not the confidence based on how unique the alignment is in the genome. That is the next one...
    • "Alignment Quality": How unique the alignment is in the genome, the less aligned positions in the genome, the better. Sohrab's dissatisfied with the current software, which sets quality to 0 whenever more than 1 positions is found in the sequence, consider assigning partial probabilities to both? Evenly spread out?
    • Was there a third one...? A placement score? Well.. maybe not.

Current issues

  • Memory is limited
  • Aligning is typically slow on the cluster... but quite fast if maps are unique.
  • Indel detection is required
  • Need specificity in SNP detection

Now, of course, this is a very limited and potentially misinterpreted version of the meeting on the second day of the job. Maybe I'll go over this with Chris to clarify some things.

01/22/10

Started to look into source code of the aforementioned programs... Quite frankly, I'm overwhelmed by it all (an undergraduate education sure doesn't help much... not that I didn't know).

Took a different approach and started reading up on the corresponding papers available for the programs:

Didn't quite help with understanding the interface, but gave a good overall picture of software usage and may help put things into context down the road. Maybe I'll go look over the user manuals again now that I have a better idea.

Incidentally, the Maq paper defines the "probabilities" quite well.
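
To paraphrase for my own notes: the mapping quality there is Phred-scaled, i.e. Qs = -10 log10 Pr{read is wrongly mapped}, so Q30 means roughly a 1-in-1000 chance the placement is wrong. As a one-liner:

    #include <cmath>

    // Phred scaling: a probability p of the read being wrongly placed maps
    // to Q = -10 * log10(p); p = 0.001 gives Q30 (rounded to nearest int).
    int PhredScale(double p_wrong) {
      return static_cast<int>(-10.0 * std::log10(p_wrong) + 0.5);
    }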

01/25/10

Met with Chris; the initial plan is to create a dummy interface that (roughly sketched after this list):

  • Reads and parses a FASTQ file containing multiple short (potentially variable-length) reads
  • Holds each parsed FASTQ sequence and its respective quality in a struct
  • Feeds this struct into an alignment function
  • "Magic goes here"
  • The alignment is returned in a SAM-friendly format
  • Output alignment data to SAM file
  • Convert SAM to BAM...? (Is this required? Or can we directly output in BAM?)
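
Here's the bare-bones skeleton I have in mind, FASTQ struct in, SAM line out (every name below is my own placeholder, nothing is settled):

    #include <ostream>
    #include <string>

    struct FastqRead {
      std::string name;      // From the @ header line.
      std::string sequence;
      std::string quality;   // Same length as the sequence.
    };

    struct SamRecord {       // A few of the 11 mandatory SAM fields.
      std::string qname, rname, cigar, seq, qual;
      int flag, pos, mapq;
    };

    // "Magic goes here": a pass-through that marks every read unmapped.
    SamRecord Align(const FastqRead& read) {
      SamRecord rec;
      rec.qname = read.name;
      rec.flag  = 4;         // 0x4 = unmapped; honest for a dummy aligner.
      rec.rname = "*";
      rec.pos   = 0;
      rec.mapq  = 0;
      rec.cigar = "*";
      rec.seq   = read.sequence;
      rec.qual  = read.quality;
      return rec;
    }

    // One alignment per tab-separated line, per the SAM spec.
    void WriteSam(std::ostream& out, const SamRecord& r) {
      out << r.qname << '\t' << r.flag << '\t' << r.rname << '\t' << r.pos
          << '\t' << r.mapq << '\t' << r.cigar << "\t*\t0\t0\t"
          << r.seq << '\t' << r.qual << '\n';
    }

A real SamFile class would also have to emit the @-prefixed header lines the spec defines; the sketch skips those.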

Leaving aside the iffy issues that will probably resolve themselves once implementation gets underway, I've found the following resources, which may be of use:

  • Maq's FASTQ specifications page and another one from the IIGB
  • Real data from the 1000 Genomes Project in Sanger FASTQ format, freely available for testing use. Using this circumvents the need to obtain potentially confidential data from the BCCRC.
  • An MIT-licensed FASTA/FASTQ parser in C by the author of Maq and SAMtools, which should solve half my work.
  • SAMtools... which I will probably need, although I don't know when. At the very least, the page provides the specifications for the file format.

01/26/10

Well, I've made something that feels like a first year programming project... but it's a start. I've dropped it off in /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/dummy_interface, since it is far from being /src/ worthy...

While I started out wanting to use existing code and libraries, such as the FASTQ parser (which was actually just a single .h file extracted from SAMtools), due to the lack of a proper API and the fact that everything went against our proposed guidelines, I felt it would be easier to simply start from scratch.

It is currently one large main(), two structs, and a "magic" function (as described above), all in a single .cc file. Things to do:

  • Make it more robust; the parsing algorithm assumes ideal FASTQ files, without line wraps or empty lines. (If required, I could use the parsing algorithm from the FASTQ parser mentioned above.)
  • Test the outputted SAM file in an actual alignment viewer, such as MAQViewer... which unfortunately wasn't as simple as I thought
  • Figure out some way of outputting a BAM file at the end, which may involve converting SAM to BAM, or directly generating BAM output.
  • Properly learn to write Makefiles and header files... as it's always been a self-taught trial-and-error process that's barely gotten me through assignments.

If possible, I should meet with Chris soon and decide what to do with this...

01/27/10

Met with Chris, will forge on ahead. Some things to do:

  • Refactor code so it conforms to the Google C++ Style Guide. To aid this, I've found that looking at some example code using the same style is quite helpful (at least a tad more entertaining than staring at the guide all day).
  • Pull functions out of the gigantic main() in the .cc, and start forward declaring and documenting in the .h, so it'll hopefully become part of the library (I have my doubts... the way I'm handling files probably won't cope with FASTQ files of over 100k reads efficiently enough...)
  • Rewrite the FASTQ parsing function so it's robust (i.e. takes line wraps, empty lines, possibly FASTA, and detects erroneous input)
  • Make sure output SAM file works in tview, which is a command in SAMtools. (I've given up attempting to display it in MAQViewer, as it doesn't seem to take SAM files natively)
  • Contemplating whether to output BAM or not... as bowtie outputs SAM, and tells end users to convert it themselves using SAMtools...
  • Think of some way of representing a single read that aligns partially to two parts of the reference (i.e. an exon read split by an intron in the reference)
  • Eventually... think of some algorithm that, when given a sequence and a quality, will generate all possible sequences above a certain quality threshold (i.e. a nucleotide with a low quality has a higher probability of actually being another nucleotide, so we substitute in the 3 other possibilities to generate 4 read sequences in total... and the number of reads then explodes exponentially; rough sketch below)
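
A naive recursive sketch of that last idea (purely my interpretation; the Sanger offset-33 quality encoding and the threshold parameter are assumptions):

    #include <string>
    #include <vector>

    // Branch every base whose Phred score is below `threshold` into all
    // four nucleotides; confident bases are kept as-is. Output size grows
    // as 4^(number of low-quality bases) -- the blow-up noted above.
    void Expand(const std::string& seq, const std::string& qual,
                int threshold, size_t i, const std::string& current,
                std::vector<std::string>* out) {
      if (i == seq.size()) {
        out->push_back(current);
        return;
      }
      int phred = qual[i] - 33;  // Assumes Sanger (offset-33) encoding.
      if (phred >= threshold) {
        Expand(seq, qual, threshold, i + 1, current + seq[i], out);
      } else {
        const char kBases[] = { 'A', 'C', 'G', 'T' };
        for (int b = 0; b < 4; ++b)
          Expand(seq, qual, threshold, i + 1, current + kBases[b], out);
      }
    }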

01/28/10

Modified dummy_interface... now known as fastq_parser (but still in /ubc/cs/research/condon/share/projects/ngs-aligner/scratch/dummy_interface), as follows:

  • Started pulling procedures out of the large main(), considering whether to make them methods (i.e. OOP). Also declared them in the header... but I've yet to get to documentation.
  • FASTQ parser is now very robust, thanks to the wonderfully coded and commented parser implementation I found in the open-source Biopython package (see the sketch after this list). It now deals correctly with:
    • Whitespace/comments between and before FASTQ reads
    • Linewrapped sequence and/or qualities
    • Qualities starting with @ and +
    • Invalid FASTQ input (it's not very accommodating though... as it'll simply abort all operations once this occurs)
  • Confirmed that outputted SAM can be displayed in tview. (After you change it to BAM, which requires a FASTA reference, which has to be indexed, and then you have to sort the BAM, then index the sorted BAM, then you can display it against the reference... phew... I've done all this already, so simply typing samtools tview sorted.bam ex1.fa will display the results)
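
The core trick I took away from the Biopython parser, re-sketched in C++ as I understand it (error handling mostly omitted): read qualities by length, not by delimiter.

    #include <istream>
    #include <string>

    // One record: '@' header, wrapped sequence until the '+' line, then
    // qualities consumed until their total length matches the sequence --
    // so quality lines starting with '@' or '+' can't be mistaken for the
    // next record.
    bool ReadFastqRecord(std::istream& in, std::string* name,
                         std::string* seq, std::string* qual) {
      std::string line;
      while (std::getline(in, line) && line.empty()) {}  // Skip blanks.
      if (!in || line.empty() || line[0] != '@') return false;
      *name = line.substr(1);
      seq->clear();
      while (std::getline(in, line) && (line.empty() || line[0] != '+'))
        *seq += line;                      // Handles line-wrapped sequences.
      qual->clear();
      while (qual->size() < seq->size() && std::getline(in, line))
        *qual += line;                     // Stop once lengths match.
      return qual->size() == seq->size();  // Mismatch = malformed input.
    }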

01/29/10

After confirming with Chris, I've begun restructuring everything in OOP style. I've started to toss some working classes into /ubc/cs/research/condon/share/projects/ngs-aligner/src. Some notes:

  • While the Google style guide discourages the use of streams, I've used them because:
    • C++ streams are object oriented
    • They work with the C++ string class
    • The alternative is to work with C's stdio library, which uses char buffers and FILE pointers in procedures... and while I've built FTP and HTTP servers out of them, I'd prefer to avoid them if possible.
  • Classes yet to be made for the "output" end of the interface, i.e. SAM classes
  • Still need to document everything afterwards...
  • I've given up on using macros in Makefile... and I'm just sticking to manually typing lines...
  • Will need to ask exactly how to include libraries in header and class files... i.e. should I try to include everything in the .h, which will propagate to the .cc? Or include only the bare minimum in the .h to allow it to compile, and have everything else in the .cc?
  • I've also given up on putting sub-directories into /src, e.g. /src/fastq, since I can't seem to figure out how to target a file in a different directory in my Makefile...

I'll finish off the last few classes for the interface IO, and potentially start documenting...

01/30/10

Finished OOP implementation of original parser, now split into the following classes:

  • FastqFile: a FASTQ file... you can open it, and get reads from it
  • FastqEntry: a FASTQ read; has a description, a sequence, and a quality. Will probably need methods to output quality numerically (which is interesting, due to the various different encodings; quick sketch after this list)
  • SamFile: a SAM file, you open it, and write into it
  • SamEntry: a SAM entry, has 11 mandatory parameters, and additional optional ones
  • Aligner: magic goes here... currently has a dummy method that changes a FastqEntry to a SamEntry
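
On the quality-encoding point, a rough sketch of what a numeric quality method for FastqEntry might look like (hypothetical; detecting which encoding a file actually uses is a separate problem I haven't touched):

    #include <string>
    #include <vector>

    // Sanger FASTQ stores Phred + 33; Illumina 1.3+ stores Phred + 64, so
    // the same character decodes to different scores depending on origin.
    enum QualityEncoding { kSangerOffset33, kIlluminaOffset64 };

    std::vector<int> DecodeQuality(const std::string& qual,
                                   QualityEncoding enc) {
      const int offset = (enc == kSangerOffset33) ? 33 : 64;
      std::vector<int> phred;
      phred.reserve(qual.size());
      for (size_t i = 0; i < qual.size(); ++i)
        phred.push_back(static_cast<int>(qual[i]) - offset);
      return phred;
    }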

Everything is in the /src folder, and should run by typing the following commands:

  • make
  • test.exe, which will create "output.sam"

Still... no documentation, which will probably be my next goal.
