20100103

PC CPU Task Parallelism Limits

All numbers below are from a Windows XP + 2.39 GHz Intel Core 2 Duo laptop. Performance on other machines and configurations will vary (duh).


Some Numbers

Histogram of the measured clock cycles/task of one thread processing the same ~10K clock cycle task many times (in a loop) on the above laptop,

835330 ... 0-16K cycles
8005 ..... 16K-64K cycles
1178 ..... 64K-256K cycles
66 ....... 256K-1M cycles
20 ....... 1M-4M cycles
5 ........ greater than 4M cycles


This specific task is a fully ALU bound dummy task which spins doing pure math (all values in registers). Run time variability is a function of hyper-threading (minor effect) and operating system preemption (major effect). Time scale here is somewhere around 4 seconds of total run time (for 0.8M tasks), which could be 120 frames of a 30 Hz game on this laptop. Note the variability.


Preemption

A major problem for any task parallel system on the CPU is preemption at non-task boundaries. The net result is that tasks at random could easily get stalled for over 1 millisecond (happened 5 times in 4 seconds in the above example). Any tasks which depend on the result of those stalled tasks also get stalled.

The visible result is variable frame rates and a screwed over player/user experience when the application attempts to have more than a few task parallel dependencies per video frame.


Workarounds

The best solution is cooperative multitasking: tasks release the CPU when they are finished, programmers insure task run time is at the desired program response latency. As desktop operating systems have de-evolved over the years we have lost the ability to provide a correct solution.

Are there any workarounds?

UNDER-UTILIZE CPU - In order to always hit v-sync, the program can under-utilize the CPU such that the maximum amount of time taking to compute a frame is the minimum amount of time in the worst case preemption. Clearly preemption is not a bounded problem, a program will always have to accept some kind of frame drop.

DUPLICATE TASKS - With a painful case of over-engineering, one could build a task system with transactional memory, such that if a currently needed task dependency is not completed due to preemption, that a running thread could manually duplicate and run the dependent task.

Anything less painful?


Manual Self Preemption?

Seemed like a great idea: keep track of thread run time and yield execution to another task worker thread before the operating system would preempt the thread's time-slice. This way the program could simulate cooperative multitasking switching at task boundaries.

Epic fail on the XP Laptop!

Of the two ways to do this with fixed thread affinity: pair of threads yielding execution, or pair of threads blocking/releasing each other via signals, all have high overhead and do not fully solve the variability problem.

So anything less painful?

No!

20091213

20091208

Parallel Beast

Been too long since I've posted anything about parallel programming, considering my addiction with the topic. Guess I should also make this clear, that what I post here is my (sometimes radical) personal view and not necessarily representative of my employer.


Perpetual "Computational" Burnout

Burnout as in data moves around and around in a circle, but just cannot seem to put the power to any useful computation. Bill Dally's SC09: The Future of GPU Computing Presentation puts this to numbers on page 14,

"Moving a word across die = 10FMAs"
"Moving a word off chip = 20FMAs"



Data Routing and Work Scheduling Problem

Efficient parallel programming is effectively a data routing and work scheduling problem, balancing the cost of moving data and computation. Chip is effectively a static routed network, going parallel involves reappropriating chip area towards computation at the cost of something else. Massively parallel machines will continue to enable algorithms which take advantage of limiting data movement to scale in performance.

Scheduling is of critical importance because it insures data is reused instead of being moved to and from a holding area as a result of inefficient ordering of computation.


Memory Hierarchy

Furthermore as the balance of chip area enables more computation, the area devoted to holding data is used more for storing active computation rather than simply storing data. Area devoted to holding data includes registers, cache, and/or program managed local store.

The difference between a massively scalar processor and a massively parallel processor can be visualized by thinking of the on-chip memory hierarchy pyramid. At the tip of the massively scalar processor pyramid is the registers and at the bottom is L3. In contrast on a massively parallel processor the registers might be the widest part of the pyramid (inverted pyramid)!

The life time of the data in this holding area is greatly reduced as throughput increases. This places even more importance on scheduling, data must be reused in the ever smaller window of opportunity. The roll played by the memory hierarchy changes from holding data for a long time for use in much later computations, to more of a roll of smoothing out a short window of access patterns.

What this literally means is that to be efficiently parallel at a fine granularity, work must be explicitly scheduled group reusing data. Anything fine granularity and dynamic comes at a cost. The reason for vectorization and grouping vectors into blocks is part way to insure the tight scheduling required to group data access to common data.


Medium Granularity Scheduling: OS Hand of Perf Death

Preemption is a nightmare for the latency of task parallel programming dependencies. When the scheduler preempts your worker thread while it is in the middle of processing a task, dependent tasks wait on the rescheduling of that thread so the task can finish. Often that thread was just placed on the end of a queue, so there is going to be a good delay before that thread will run again. What is desired in this case is a cooperative or task aware scheduling, so that a thread will yield to another thread at a task boundary. Working around these system level problems causes a cascade of design complexities.

What I want personally in the future of Operating Systems and massively parallel machines is an OS which can allocate entire cores and memory to a application for exclusive access! Take your awful mess and get out of the way. Let me return to the days of microsecond and nanoseconds response time instead of the abysmal peaks of milliseconds of latency on scheduling.

Actually no need to wait, can do this right now.


Raw Programming a Parallel Beast

Hidden under the guise of a GPU is an awesome parallel machine, and a driver which provides chunks of exclusive access to the entire machine, and during those periods of exclusive access, no operating system or driver to screw anything up.

Getting Naked With the Parallel Beast!

Dead simple to play in CUDA and a GT200 series GPU: launch enough warps via 32xN sized CTAs to fill the machine just once using a Cx1 size grid. Where C is the number of CTAs to fill the machine. The result is N*C "hardware threads" (as in a traditional CPU) with 32-wide vector units, texture units, and unified memory! Use nvcc's "--maxrregcount" to set the maximum number of registers per warp such that C CTAs will correctly fill machine (and be careful with shared memory usage).

Use clock() to insure the warps exit at a task aware boundary at some Hz, then simply repeatedly relaunch the same kernel to continue the machine. Allocate a huge chunk of GPU memory to work within without fear that accessing that memory might ever incur a page fault. Use page-locked host memory to roll your own queue to communicate to and from the CPU for keyboard and mouse IO. Do a little OpenGL interop for drawing to the display, and you have a 100% software massively parallel machine to prototype or code on.

Scalar code in CUDA is easy, simply write C code and don't explicitly use "threadIdx.x" to have the code adapt to the lane of the 32-wide vector. Use a mix of scalar and vector code at will.

The launch kernel becomes the binary of the "operating system" and "application". Use "threadIdx.y + N*blockIdx.x" to obtain an index to the "hardware thread". CUDA 2.3 PTX ISA does NOT provide branch or call by register, workaround to provide a data dependent branch is a giant switch (GPU's actually a lot better at it than you think it is). Use global memory atomics to communicate between warps or CTAs.

Could run N*C instances of something like Contiki and do message passing through global memory. Or could test data structures for 100's of "cpu thread" scaling. Perfect for those who want to return to the glory days of 64KB but with 1Tflop and 1GB of memory ... 320x200 at 30 fps provides 0.5 Mflops per pixel per frame on a GTX 275. What's not to love about that!

Imagination is your limit.

Demo Tube 2













20091202

GigaVoxels Mandelbulb

GigaBroccoli: The Mandelbulb into GigaVoxels Blog Post

Really cool work by Cyril, this time run-time on demand generation of 3D geometry into the GigaVoxels brick cache. See the blog link above for more info, videos, and screen shots.

PixelJunk Shooter Dec 10th!

PixelJunk Shooter Thursday, December 10th

20091201

Real-Life "CFD Visualization"



Longer exposure night shot of the sparks from a backyard bonfire visiting family on Thanksgiving. Thanks to my wife Kathryn for the photograph, and my nephew Joe for motivation required to take the time to burn wet firewood,

PixelJunk Shooter Youtubes



20091124

Battlefield: Bad Company 2

Been playing the PS3 beta because it is an awfully addictively fun game. Don't think I will ever want to go back to a game with a trivially static play area again, no matter how pretty the developer makes it. Baked global illumination, forget it, just not good enough any more. Nothing compares to the fun factor of being able to leave your mark, to be able to interact with your surroundings, and to do that in a tank!



The point of excellence in game design is when a game provides more than just the sum of canned content generated in production. Otherwise a game is simply an interactive movie: you play to see the next scene, and when the end is found, there is not anything more to it. Sure there is the market for the interactive novel, but games since the dawn of humans have been about enjoying the interaction between people in a system engineered to produce fun in the process. By this definition, BBC2 multi-player beta definitely captures the essence of what it is to be a game.



Other devs out there should be taking notes.

You've got greater than 1 Tflop to play with on high end GPUs in the PC space, over 100,000 times more processing power than many of us started programming with, and more to come in future years.

What are you going to do with that?

Many of you are simply going to increase your resolution, provide a more pretty interactive novel. Others are going to push forward with something new, provide gamers with an interactive experience which brings renewed energy into the industry, and learn to wield the scaling of the parallel machine for something beyond just graphics!

20091120

Real Time Global Illumination Using Temporal Coherence

Martin Knecht has posted a video and his thesis on Real Time Global Illumination Using Temporal Coherence.

20091101

Link Soup

meshula.net : Stone Soup - Thanks Nick for this post reminiscing of the exotic time before the common C compiler!

Real-time Parallel Hashing on the GPU - Neat paper, and poster at GTC. Builds upon Cuckoo hashing (using N>=2 hash functions instead of one, lookup requires at most a check of N places, insertion requires recursive eviction and insertion of the filled bin using the other hash functions), cuckoo hashing for N=3 hash functions can almost achieve 90% hash table occupancy. Paper presents parallel cuckoo hashing. First step is to use a high level hash function to divide the input into bins sized to fit into the local store, followed by parallel version of cuckoo hashing for the bin. Read the paper for more details.

Stochastic Progressive Photon Mapping - PPM extended to compute the average radiance over a region instead of a point.

Amortized Supersampling - Interesting, still problems with rapidly changing shading such as specular highlights, but great progress towards a complete solution.


Rudebox by Alcatrax




Jellyfish

Visited the Monterey Bay Aquarium the week before Halloween,



20091028

Random End of October

Looks like Insomniac updated their website: Link to R&D Page. Going to have to finish Uncharted 2 before I get onto A Crack in Time however. Been meaning to watch Nürburgring 24 Hour Race in 3D.