20091213

20091208

Parallel Beast

Been too long since I've posted anything about parallel programming, considering my addiction with the topic. Guess I should also make this clear, that what I post here is my (sometimes radical) personal view and not necessarily representative of my employer.


Perpetual "Computational" Burnout

Burnout as in data moves around and around in a circle, but just cannot seem to put the power to any useful computation. Bill Dally's SC09: The Future of GPU Computing Presentation puts this to numbers on page 14,

"Moving a word across die = 10FMAs"
"Moving a word off chip = 20FMAs"



Data Routing and Work Scheduling Problem

Efficient parallel programming is effectively a data routing and work scheduling problem, balancing the cost of moving data and computation. Chip is effectively a static routed network, going parallel involves reappropriating chip area towards computation at the cost of something else. Massively parallel machines will continue to enable algorithms which take advantage of limiting data movement to scale in performance.

Scheduling is of critical importance because it insures data is reused instead of being moved to and from a holding area as a result of inefficient ordering of computation.


Memory Hierarchy

Furthermore as the balance of chip area enables more computation, the area devoted to holding data is used more for storing active computation rather than simply storing data. Area devoted to holding data includes registers, cache, and/or program managed local store.

The difference between a massively scalar processor and a massively parallel processor can be visualized by thinking of the on-chip memory hierarchy pyramid. At the tip of the massively scalar processor pyramid is the registers and at the bottom is L3. In contrast on a massively parallel processor the registers might be the widest part of the pyramid (inverted pyramid)!

The life time of the data in this holding area is greatly reduced as throughput increases. This places even more importance on scheduling, data must be reused in the ever smaller window of opportunity. The roll played by the memory hierarchy changes from holding data for a long time for use in much later computations, to more of a roll of smoothing out a short window of access patterns.

What this literally means is that to be efficiently parallel at a fine granularity, work must be explicitly scheduled group reusing data. Anything fine granularity and dynamic comes at a cost. The reason for vectorization and grouping vectors into blocks is part way to insure the tight scheduling required to group data access to common data.


Medium Granularity Scheduling: OS Hand of Perf Death

Preemption is a nightmare for the latency of task parallel programming dependencies. When the scheduler preempts your worker thread while it is in the middle of processing a task, dependent tasks wait on the rescheduling of that thread so the task can finish. Often that thread was just placed on the end of a queue, so there is going to be a good delay before that thread will run again. What is desired in this case is a cooperative or task aware scheduling, so that a thread will yield to another thread at a task boundary. Working around these system level problems causes a cascade of design complexities.

What I want personally in the future of Operating Systems and massively parallel machines is an OS which can allocate entire cores and memory to a application for exclusive access! Take your awful mess and get out of the way. Let me return to the days of microsecond and nanoseconds response time instead of the abysmal peaks of milliseconds of latency on scheduling.

Actually no need to wait, can do this right now.


Raw Programming a Parallel Beast

Hidden under the guise of a GPU is an awesome parallel machine, and a driver which provides chunks of exclusive access to the entire machine, and during those periods of exclusive access, no operating system or driver to screw anything up.

Getting Naked With the Parallel Beast!

Dead simple to play in CUDA and a GT200 series GPU: launch enough warps via 32xN sized CTAs to fill the machine just once using a Cx1 size grid. Where C is the number of CTAs to fill the machine. The result is N*C "hardware threads" (as in a traditional CPU) with 32-wide vector units, texture units, and unified memory! Use nvcc's "--maxrregcount" to set the maximum number of registers per warp such that C CTAs will correctly fill machine (and be careful with shared memory usage).

Use clock() to insure the warps exit at a task aware boundary at some Hz, then simply repeatedly relaunch the same kernel to continue the machine. Allocate a huge chunk of GPU memory to work within without fear that accessing that memory might ever incur a page fault. Use page-locked host memory to roll your own queue to communicate to and from the CPU for keyboard and mouse IO. Do a little OpenGL interop for drawing to the display, and you have a 100% software massively parallel machine to prototype or code on.

Scalar code in CUDA is easy, simply write C code and don't explicitly use "threadIdx.x" to have the code adapt to the lane of the 32-wide vector. Use a mix of scalar and vector code at will.

The launch kernel becomes the binary of the "operating system" and "application". Use "threadIdx.y + N*blockIdx.x" to obtain an index to the "hardware thread". CUDA 2.3 PTX ISA does NOT provide branch or call by register, workaround to provide a data dependent branch is a giant switch (GPU's actually a lot better at it than you think it is). Use global memory atomics to communicate between warps or CTAs.

Could run N*C instances of something like Contiki and do message passing through global memory. Or could test data structures for 100's of "cpu thread" scaling. Perfect for those who want to return to the glory days of 64KB but with 1Tflop and 1GB of memory ... 320x200 at 30 fps provides 0.5 Mflops per pixel per frame on a GTX 275. What's not to love about that!

Imagination is your limit.

Demo Tube 2













20091202

GigaVoxels Mandelbulb

GigaBroccoli: The Mandelbulb into GigaVoxels Blog Post

Really cool work by Cyril, this time run-time on demand generation of 3D geometry into the GigaVoxels brick cache. See the blog link above for more info, videos, and screen shots.

PixelJunk Shooter Dec 10th!

PixelJunk Shooter Thursday, December 10th

20091201

Real-Life "CFD Visualization"



Longer exposure night shot of the sparks from a backyard bonfire visiting family on Thanksgiving. Thanks to my wife Kathryn for the photograph, and my nephew Joe for motivation required to take the time to burn wet firewood,

PixelJunk Shooter Youtubes