JaysTermTwoJournal

01/05/2011

Started reading into CUDA in more depth. Made some notes (I'll post them later).
 
  • Optimize!
  • Figure out the efficiency of my current implementation

02/07/2011

Haven't done much updating lately. As per Chris' suggestion, I've been working on ways to assess and benchmark performance gains in the GPGPU kernel code. I've also finished a working version of a full Smith-Waterman alignment (no affine gap scores yet); a rough sketch of the idea is below.
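
Just to record the general shape of it, here's a minimal sketch of the kind of per-read kernel I mean: one thread scores one fixed-length read against the reference with a linear gap penalty. The names, scoring constants, and memory layout here are just illustration, not the actual project code (for one thing, the real version keeps its DP rows in shared memory rather than per-thread arrays):

<verbatim>
// Sketch only: one thread computes the Smith-Waterman score of one read.
// Linear gap penalty (no affine gaps yet); constants are placeholders.
#define MAX_REF_LEN 256
#define MATCH     2
#define MISMATCH -1
#define GAP      -1

__global__ void sw_score_kernel(const char *reads, int read_len,
                                const char *ref, int ref_len,
                                int *scores, int num_reads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_reads) return;

    const char *read = reads + tid * read_len;  // fixed-length reads, packed

    // Only two rows of the DP matrix are needed at a time.
    int prev[MAX_REF_LEN + 1];
    int curr[MAX_REF_LEN + 1];
    for (int j = 0; j <= ref_len; ++j) { prev[j] = 0; curr[j] = 0; }

    int best = 0;
    for (int i = 1; i <= read_len; ++i) {
        for (int j = 1; j <= ref_len; ++j) {
            int sub = (read[i - 1] == ref[j - 1]) ? MATCH : MISMATCH;
            int s = prev[j - 1] + sub;           // read base vs ref base
            s = max(s, prev[j] + GAP);           // read base vs gap
            s = max(s, curr[j - 1] + GAP);       // ref base vs gap
            s = max(s, 0);                       // local alignment: floor at 0
            curr[j] = s;
            best = max(best, s);                 // best cell anywhere = SW score
        }
        for (int j = 0; j <= ref_len; ++j) prev[j] = curr[j];
    }
    scores[tid] = best;
}
</verbatim>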

Luckily, it turns out that my emulator, =gpuocelot=, is able to produce some pretty good performance benchmarks. Some of the numbers that can be looked at are:

  • Memory occupancy (given as a percentage) - this one is generated by a spreadsheet provided by Nvidia. It is basically a measure of how many warps can be active at a given time versus the maximum number of allowable warps, which is determined by how much shared memory and how many registers each thread uses, since these are shared resources. The effects of occupancy, however, are lessened by increasing the number of thread blocks (i.e. kernels should run at least one thread block per multiprocessor). I believe the end aligner will actually use as many thread blocks as possible (one thread block per read, and I can scale the number of reads up as much as I want), so occupancy might not be as much of an issue. There's a back-of-the-envelope version of this calculation after this list.
  • Activity factor - I'm not sure if this number is given specifically by gpuocelot; it might just be a number used for other calculations. Basically, the activity factor is determined by branching: it tracks the ratio of the average number of threads running at one time to the maximum number of threads. The activity factor decreases when there are more divergent branches (e.g. if a branch splits a warp in two, the activity factor may drop to something like 50%).
  • Memory intensity - basically, a measure of how much global memory is being used. A lower memory intensity means the kernel is more compute-bound than latency-bound.
  • Memory efficiency - measures how efficiently global memory is being accessed. I think divergent branches and bank conflicts may lower this number.
  • Inter-thread dataflow - measures how much shared memory is being used. This number will affect the aligner quite a bit, I think, since I use shared memory heavily for each alignment.
  • Parallelism - measures MIMD and SIMD parallelism. These values indicate how scalable the kernel is if more multiprocessors become available. MIMD parallelism pretty much depends on the number of blocks (i.e. what benefit adding more multiprocessors would give), while SIMD parallelism is determined by the efficiency of the kernel within a block (I think this is akin to the benefit of adding more warps, but I'm not sure on this count). I believe the SIMD parallelism is a reflection of the activity factor.
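
To make the occupancy number concrete, here's a rough version of what Nvidia's spreadsheet computes: which per-multiprocessor resource runs out first. The hardware limits below are the ones for a compute 1.3 part (e.g. a GTX 280); the per-kernel resource numbers are made up for illustration:

<verbatim>
/* Back-of-the-envelope occupancy arithmetic. (The real spreadsheet
 * also applies allocation-granularity rules that I'm glossing over.) */
#include <stdio.h>

int main(void)
{
    /* Per-multiprocessor limits, compute capability 1.3 */
    const int max_warps_per_sm   = 32;
    const int max_threads_per_sm = 1024;
    const int max_blocks_per_sm  = 8;
    const int smem_per_sm        = 16 * 1024;  /* bytes */
    const int regs_per_sm        = 16384;

    /* Hypothetical kernel resource usage */
    const int threads_per_block = 128;         /* 4 warps */
    const int smem_per_block    = 4 * 1024;    /* bytes */
    const int regs_per_thread   = 16;

    int by_threads = max_threads_per_sm / threads_per_block;              /* 8 */
    int by_smem    = smem_per_sm / smem_per_block;                        /* 4 */
    int by_regs    = regs_per_sm / (regs_per_thread * threads_per_block); /* 8 */

    /* Resident blocks per SM = the tightest of the four limits */
    int blocks = max_blocks_per_sm;
    if (by_threads < blocks) blocks = by_threads;
    if (by_smem    < blocks) blocks = by_smem;
    if (by_regs    < blocks) blocks = by_regs;

    int active_warps = blocks * (threads_per_block / 32);
    printf("occupancy = %d/%d warps = %.0f%%\n", active_warps,
           max_warps_per_sm, 100.0 * active_warps / max_warps_per_sm);
    return 0;
}
</verbatim>

With these made-up numbers, shared memory is the binding resource: only 4 blocks (16 of the 32 possible warps) fit per multiprocessor, so occupancy comes out to 50%.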

So there are a lot of numbers being thrown around. I found this [[http://www.gdiamos.net/papers/iiswcOcelot.pdf][paper]] helpful in this regard. I'm still having a little trouble getting all of these numbers to actually display, so this is still an area I'm working on.

On the flip side, I'm also working on getting a testing framework set up so I can verify the correctness of my results. I'm thinking I'll just generate some random sequences and run them through one of my existing local aligners to get reference scores; then I'll just make sure the GPU scores match.
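
Something along these lines (assuming the kernel sketched above is in the same file; the CPU scorer just repeats the same recurrence, and all the names here are again just illustration):

<verbatim>
// Sketch of the test harness: random reads, CPU reference scores,
// then compare against the GPU kernel's output.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CPU reference: same linear-gap Smith-Waterman recurrence as the kernel.
static int cpu_sw_score(const char *read, int rl, const char *ref, int fl)
{
    int *prev = (int *)calloc(fl + 1, sizeof(int));
    int *curr = (int *)calloc(fl + 1, sizeof(int));
    int best = 0;
    for (int i = 1; i <= rl; ++i) {
        for (int j = 1; j <= fl; ++j) {
            int s = prev[j - 1] + ((read[i - 1] == ref[j - 1]) ? 2 : -1);
            if (prev[j] - 1 > s) s = prev[j] - 1;
            if (curr[j - 1] - 1 > s) s = curr[j - 1] - 1;
            if (s < 0) s = 0;
            curr[j] = s;
            if (s > best) best = s;
        }
        int *tmp = prev; prev = curr; curr = tmp;  // roll the rows
    }
    free(prev); free(curr);
    return best;
}

int main(void)
{
    const int num_reads = 1024, read_len = 100, ref_len = 200;
    const char bases[4] = {'A', 'C', 'G', 'T'};
    srand(42);

    // Random reads and reference (ref_len must stay <= MAX_REF_LEN).
    char *reads = (char *)malloc(num_reads * read_len);
    char *ref   = (char *)malloc(ref_len);
    for (int i = 0; i < num_reads * read_len; ++i) reads[i] = bases[rand() % 4];
    for (int j = 0; j < ref_len; ++j) ref[j] = bases[rand() % 4];

    char *d_reads, *d_ref;
    int *d_scores;
    cudaMalloc(&d_reads, num_reads * read_len);
    cudaMalloc(&d_ref, ref_len);
    cudaMalloc(&d_scores, num_reads * sizeof(int));
    cudaMemcpy(d_reads, reads, num_reads * read_len, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ref, ref, ref_len, cudaMemcpyHostToDevice);

    sw_score_kernel<<<(num_reads + 127) / 128, 128>>>(d_reads, read_len,
                                                      d_ref, ref_len,
                                                      d_scores, num_reads);

    int *scores = (int *)malloc(num_reads * sizeof(int));
    cudaMemcpy(scores, d_scores, num_reads * sizeof(int), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < num_reads; ++i)
        if (scores[i] != cpu_sw_score(reads + i * read_len, read_len, ref, ref_len))
            ++mismatches;
    printf("%d/%d score mismatches\n", mismatches, num_reads);
    return 0;
}
</verbatim>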

 -- Main.jayzhang - 05 Jan 2011
 