Arch Linux Website | x86-64 Package List
At home for OpenGL and CUDA development I now have Arch Linux on my new AMD64 machine. Lots of thanks to Judd Vinet and Aaron Griffin for the best Linux distro out there!
pacman -S awesome
20090729
MinWM
Not sure why I didn't do this years ago: I open sourced my tiny X window manager on Google Code.
Minimal Window Manager [MinWM]
Project Hosted on Google Code
Simple and tiny (10KB) single screen tiling X Window Manager. No messy configuration files. Provides the following keyboard interface to manage windows:
ALT+ESC .......... Close window.
ALT+TAB .......... Cycle through window list (just like Windows).
ALT+` ............ Cycle window shape between full, and left/right halves.
And that is it!
MinWM requires that the user start a terminal before running MinWM; that terminal can then be used to start other applications.
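To make the ALT+` behavior concrete, here is a minimal sketch of how cycling between full, left half, and right half could map to window geometry. It is not taken from minwm.c; the names `SHAPES`, `next_shape`, and `window_rect`, and the cycle order itself, are assumptions made for illustration.

```python
# Hypothetical sketch of the ALT+` shape cycle; not MinWM's actual code.
SHAPES = ("full", "left", "right")  # assumed cycle order


def next_shape(shape):
    """Advance to the next shape in the cycle, wrapping around."""
    return SHAPES[(SHAPES.index(shape) + 1) % len(SHAPES)]


def window_rect(shape, screen_w, screen_h):
    """Return (x, y, width, height) for a window of the given shape."""
    half = screen_w // 2
    if shape == "full":
        return (0, 0, screen_w, screen_h)
    if shape == "left":
        return (0, 0, half, screen_h)
    if shape == "right":
        # The right half takes the leftover column when the width is odd.
        return (half, 0, screen_w - half, screen_h)
    raise ValueError("unknown shape: " + shape)
```

With only three states and no per-window placement to remember, there is nothing to configure, which fits the "no messy configuration files" goal.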
History
Moving windows around the screen is a waste of time. I got sick of traditional X Window Managers and decided to write my own. 100% of the time I either wanted one window full screen, or two windows split screen, and a very simple way to manage windows without the mouse. From this MinWM was born.
I have been using it for maybe a decade, and still the current tiny tile based X Window Managers are too complicated and get it all wrong. So MinWM is now open source so anyone can enjoy working fast!
Compile From Source
Grab minwm.c from Google Code. You might have to modify your include and library paths, but something like this should work:
gcc minwm.c -Os -o minwm -I/usr/X11/include -L/usr/X11/lib -lX11
Install
Compile yourself a binary. Set up your .xinitrc file to start a terminal and then minwm:
xterm -rv -ls +sb -sl 4096 &
$HOME/minwm
Then run xinit to start X.
Morphological AA Part III - GPU Ideas
Filtering Via Fractional Pixel Offset
It seems a filtering pass could be done with one bilinear fetch per pixel, using a computed fractional pixel offset to blend within the neighborhood of 4 pixels. This should be enough to remove aliasing. Consider the simple case of a horizontal span:
XXXX .,xXX -- example horizontal span
XXXXXXXX .,xXXXXXX
XXXXXXXXXXXX to ,xXXXXXXXXXX
XXXXXXXXXXXX XXXXXXXXXXXX
In this case (horizontal) the fractional pixel offset would be mostly a vertical adjustment. This offset would always be on the side of the pixel center with the highest intensity gradient. As drawn in ASCII below, the offset would be highest in the CC pixels, would taper off to a low offset by the Zs, and would reverse direction across the CC area (left side would point down, right side would point up). The CC area is where the edge crosses the pixel center.
Z CC Z
| || |
XXXXXXX -- talking about this span
XXXXXXXXXXXXXX
XXXXXXXXXXXXXX
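As a rough illustration of the filtering step only (not of how the offset is found), the sketch below does a software bilinear fetch at a pixel center plus a fractional offset. The function names, the convention that integer coordinates are pixel centers, and the border clamping are all assumptions; on a GPU the hardware texture unit would do this blend.

```python
def bilinear_fetch(image, x, y):
    """Sample image (a list of rows of floats) at continuous (x, y).

    Assumes integer coordinates are pixel centers and clamps at the borders.
    """
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally on the two rows, then vertically between them.
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def filter_pixel(image, px, py, offset_x, offset_y):
    """One fetch per pixel: sample at the pixel center plus the offset."""
    return bilinear_fetch(image, px + offset_x, py + offset_y)
```

For a horizontal span the offset is mostly vertical, so `offset_x` stays near zero and `offset_y` alone decides how much of the row above or below is blended in.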
Temporal Amortizing Searching?
After re-thinking this a little: not worth it!
The expensive part is finding the "CC" points on the rows and columns of pixels in the image, and then the "Z" to "Z" length, which is used to compute the sub pixel offset to filter with.
The first part, finding "CC" points, is NOT going to have much temporal coherence other than being close to edges in the image. The span length ("Z" to "Z" length) however should have some good temporal coherence, but this span length can be easily computed from the "CC" point data.
Making This Fast on the GPU
This is just a bunch of notes for some later point when I have both some free time to try stuff and a desire to use this...
I have a feeling that using the offset method one could fully separate this into vertical and horizontal passes (if one wanted to do this, which I wouldn't). Meaning fractional pixel offset in X direction is dependent on Y spans only, and Y direction is dependent on X spans only.
Problem with the GPU is that one needs tens of thousands of threads to be fast. So the trivial case of one scan line (or vertical line) per thread isn't going to work. The algorithm will have to be tiled.
Searching for "CC" points in parallel might be possible with the following process.
1. First map the image into X and Y intensity gradients (fold this into the first reduction pass), then reduce by highest gradient, keeping position of highest gradient during the reduction. With CUDA/CS/CL one might be able to do the entire reduction in one kernel pass (unlike with traditional GPGPU methods).
2. Then run a full screen pass parallelized in tiles, using the reduction tree to search for a set of important "CC" points per the vertical and horizontal spans in the tile (note the search will have to extend outside the area of the tile to support AA of near vertical and horizontal edges). Use the diff in closest "CC" coordinates between lines to compute the "Z" to "Z" length per span. From this information, the fractional pixel offset could be computed...
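The first step can be sketched serially on a single row of pixels. This is only an illustration under assumptions: it works on one row rather than a 2D image, the gradient is a plain neighbor difference, and the names `horizontal_gradients` and `build_max_tree` are made up here. On a GPU each level of the loop would be one parallel step.

```python
def horizontal_gradients(row):
    """Absolute intensity difference between each pixel and its right neighbor."""
    return [abs(row[i + 1] - row[i]) for i in range(len(row) - 1)]


def build_max_tree(values):
    """Pairwise max reduction that keeps the position of the winner.

    Returns every level, from the leaves up to the single root, so a later
    pass can walk the tree to search for high gradient positions.
    """
    level = [(v, i) for i, v in enumerate(values)]
    levels = [level]
    while len(level) > 1:
        reduced = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]  # the last pair may hold a single element
            reduced.append(max(pair, key=lambda entry: entry[0]))
        level = reduced
        levels.append(level)
    return levels
```

Keeping all levels rather than just the root is what makes the second step possible: a tile can descend from a coarse level toward the leaves instead of scanning every pixel.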
20090728
Morphological AA Part II
This is a follow up to Justin's question on the Morphological Antialiasing Post and more crazy AA ideas.
What I find most interesting about these types of post filtering techniques is that they might help with things like specular, cubemap, and alpha test aliasing, which can be tough to solve using other methods.
Filtering via Fractional Pixel Offset
EDIT. Nope that idea needed a lot of work... so I killed it!
EDIT. Actually, idea is sound, see third post!
Temporal Feedback
I did a prior test on a polygon based renderer in which I only used motion vector based framebuffer reprojection feedback and a little bit of smooth camera breathing motion even when still (as you often get in a FPS) to remove aliasing. Results were promising, but I never did have the opportunity to use it anywhere (yet).
Note, any method which is amortized temporally will require some form of data feedback, which will at a minimum be another RT, and thus be costly like a full screen post pass. Another important thing to note is that under motion the eye won't see over-filtering or noise as much, as long as it converges to a good enough looking sharp but not aliased edge. One could take advantage of noise or excess filtering under motion when temporal feedback likely won't help as much.
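A minimal 1D sketch of motion vector based reprojection feedback follows. It is not the test described above: the integer motion vectors, the `blend` weight, and the function name are assumptions chosen to keep the example short.

```python
def temporal_feedback(current, history, motion, blend=0.1):
    """Blend each pixel of the current 1D frame with reprojected history.

    motion[i] is the (assumed integer) displacement of pixel i since the
    previous frame, so its old location is i - motion[i]. Pixels whose old
    location falls off-screen have no history and keep the current value.
    """
    out = []
    for i, color in enumerate(current):
        src = i - motion[i]
        if 0 <= src < len(history):
            out.append(blend * color + (1.0 - blend) * history[src])
        else:
            out.append(color)
    return out
```

The `history` list plays the role of the extra RT mentioned above: the output of one frame is fed back in as the history of the next.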
Crazy Depth Pre-Pass for AA
Jake Cannell's Deferred Rendering Without MSAA but With MSAA Z Prepass Idea Post talks about doing a 4x MSAA depth only pass, followed by careful (to keep accelerated coarse Z cull working) Z restore at non-MSAA resolution for shading, followed by bilateral up-sample based on MSAA depth for antialiasing. This would be crazy expensive (Z pre-pass with deferred, and bilateral up-sample), but with DX11 CS maybe not as bad.
Could use depth information to compute an X,Y sub-pixel offset instead of the full bilateral up-sample.
Twitter: Database Lag or Account Broken?
Decided to start using Twitter: @TimothFarrar (Timothy sans y)
Except the interface seems rather broken right now (at least for me)?
Sorry if a few of you got multiple follow notifications!
Follow links act as if they work. Got emails back, yet nothing shows in my following list. Tweets from those I followed showed in my home page until today and everything vanished. Nice. Help says "database replication lag"... also Firefox likes to spin forever on twitter.com. I smell multidimensional lag.
Morphological Antialiasing
Linked off Real-Time Rendering, Alexander Reshetov's Morphological Antialiasing Paper is quite awesome. Unoptimized approach is about 150 cycles/pixel in a CPU implementation. For real-time performance, I bet you could amortize the cost of something like this over multiple frames!
EDIT: Yeah, after a second look, amortizing this over multiple frames would be tough! See second post...
20090727
Final DX11 Hardware Predictions
My last few rounds of hardware predictions (or better, guesses) were quite wild; this one is perhaps more conservative! BTW, I don't ask for or receive any info about yet-to-be-released hardware beyond what I gather from public sources which anyone has access to; otherwise I wouldn't be writing any of this.
A little on the progression of NVidia hardware. Naming might be a little off because I don't follow NVidia's wild naming changes all too well...
G80 - base GPU arch
G92 - global atomics
GT200 - shared memory atomics, global coalescing, double precision
G210 - added DX10.1 support
G300 - ?
Each line of GPUs sees a steady progression of improvements on the base arch set in the G80 line. Many of these improvements are things required for DX11. Almost like each generation of hardware was a chance to test and later improve upon the design of individual components required for DX11. NVidia was early on CS, while ATI was early on framebuffer and tessellation.
IMO, DX11 = DX10.1 + NVidia's CUDA + ATI's Tessellation
Seems like nothing revolutionary will be required this DX round; perhaps both primary vendors will be running with a design similar to what they had for DX10. I would NOT be surprised if some DX11 features are "emulated" in software instead of having dedicated hardware. For example, append/consume, or maybe part of the tessellation pipeline pipes through memory instead of having direct GPU routing, or even interleaved MPMD via software emulated dynamic branching (link a bunch of pipeline stage kernels into one binary)?
Also not surprised that ATI is ready with DX11 in the R800 series for desktop GPUs, but perhaps DX11 notebook GPUs not until next year. R900 is likely to be the real interesting arch from ATI for the DX11 generation.
Likely DX11 Games Will be Designed for DX9/DX10 Hardware First
Let's rewind to DX10 generation history. It seems as if the DX10 generation was largely a generation of faster DX9 cards in the eyes of the average user. We can place the blame on many factors, such as a Vista upgrade requirement, consoles based on mostly DX9 generation hardware, developers not being in a situation conducive to pushing the DX10 feature set, etc.
Not much of this has changed. DX11 is way beyond consoles, and the next generation of consoles is well off. Many PC gamers are still using XP (like 50% on Valve Survey). Hardcore PC devs are leaving the PC market behind (Id, Epic, Crytek all now have a console focus). When DX10 and DX11 support is used only to improve a core engine which must run on DX9+ consoles, this is crippling for tech.
Given this, I'd bet that DX11 supporting engines will be mostly designed for DX11 CS4 for some time. Which means that many games will likely run just fine on current generation DX10 hardware under DX11 drivers...
What about NVidia and DX11?
My current guess is that the 300 series will be all about increased ALU performance and power efficiency.
Not sure if we will see an arch revision or an arch derived from G80 evolution. Could be no core arch changes from a programming perspective, simply better DX9 and DX10 performance all around, with a tiny amount of DX11 stuff added (maybe an FF tessellation unit). Looks like at least some rumors agree with this speculation: that NVidia adopted a G80 derived arch for DX11 launch, and that the next major arch revolution happens after the DX11 launch GPUs (perhaps like ATI getting R800 early, with a later DX11 arch in R900)?
What about CUDA 3.0?
If the above speculation is true, something must be new for 3.0 and 300 series hardware. Maybe now the hardware can run different kernels simultaneously instead of in series? Maybe CUDA 3.0 is this plus something like DX11's better dynamic linking via better JIT and shader patching?
Or perhaps 300 series is a completely new arch?
It is all a very intriguing mystery!
20090723
Pixel Junk Shooter Video with Ice
Lots of really nice attention to detail in this game: shooting water out of pools from under the surface, ice melting and cooling your player, dunking to quench overheating.
Gather Centric Parallel Programming
The tool box of ideas to efficiently tackle complex problems on single threaded serial machines is rather well explored. IMO one of the core themes is scatter centric design. Many linear time algorithms depend on efficient scatter. One common use for scatter is binning (a linear pass through N objects, writing each object pointer or index into 1 or more of M bins). A step up from binning is full sorting, at which point, when N is 1000 or larger, radix sort is used. Radix sort of 32-bit integers on PCs is often fastest with one parallel histogram generation pass (binning), followed by 3 scatter passes. Binning or sorting (both of which depend on scatter) is crucial for many forms of optimization. For example, branching can be factored out by sorting objects into categories and processing all similar cases together.
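As a sketch of the radix sort layout described above: one pass builds every histogram, then three stable scatter passes place the keys. The 11-bit digit size is an assumption (it is one way to cover 32 bits in 3 passes), and the function name is made up for illustration.

```python
def radix_sort_u32(keys):
    """LSD radix sort of unsigned 32-bit integers using 11-bit digits."""
    bits, passes = 11, 3          # 3 * 11 = 33 bits, enough to cover 32
    radix = 1 << bits
    mask = radix - 1

    # Single histogram pass (binning): count every digit of every key.
    hist = [[0] * radix for _ in range(passes)]
    for k in keys:
        for p in range(passes):
            hist[p][(k >> (p * bits)) & mask] += 1

    # Turn counts into starting offsets (exclusive prefix sum per pass).
    for p in range(passes):
        total = 0
        for d in range(radix):
            hist[p][d], total = total, total + hist[p][d]

    # Three scatter passes, least significant digit first.
    src = list(keys)
    for p in range(passes):
        dst = [0] * len(src)
        for k in src:
            d = (k >> (p * bits)) & mask
            dst[hist[p][d]] = k   # scatter the key to its bin slot
            hist[p][d] += 1
        src = dst
    return src
```

Every write in the scatter passes lands at a data dependent address, which is exactly the access pattern that is cheap on a serial CPU with a write-able cache.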
Given that general purpose APIs for GPU parallel programming are relatively new, IMO there is very much left to discover (or re-discover) in terms of methods and tools to solve complex problems on highly parallel GPUs. IMO one of the core themes in GPU programming is gather centric design, which contrasts greatly with serial optimization.
Why Gather Centric?
(1.) LIMITED ATOMIC OPERATION FUNCTIONALITY. Atomic operations are designed for unordered usage because fine grain ordered atomic operations simply do NOT scale to massively parallel machines. Scattering data via global atomics is limited to a small subset of single ALU operations, and these operations are relatively expensive. So global communication via global atomic operations is limited in usefulness. Instead data must be gathered to an independent in-order thread for complex logic.
(2.) READ ONLY CACHES. Excluding Larrabee, current GPUs effectively have only read-only caches. Write-able caches used for scatter centric communication do NOT scale because of cache contention. The most important feature caches provide on GPUs is more bandwidth efficient fine granularity gather (at the cost of high latency). Global vector gather is expensive because performance is a function of the number of cache lines or memory segments required to service the gather.
(3.) SCATTER IS EXPENSIVE. Global scatter is not bandwidth efficient as it is a function of the number of cache lines or memory segments touched. In the case of un-cached global masked writes (non-full writes), hardware would have to both read and write back all touched lines or segments.
Gather Centric Rules for SIMD GPUs
(1.) Algorithm split into independent stages.
(2.) Coarse gather from vector sized and aligned global reads.
(3.) Fine granularity gather from texture reads.
(4.) Reuse read data in a program managed local store.
(5.) All threads do the same amount of work.
Likely this means that a tree structure is accessed via TEX reads but object data is carefully setup in vector sized packets (for efficient global memory access). Parallel methods where all threads of a vector process the same data are fast on all platforms. Parallel methods where all threads of a vector gather from different local objects require an extra transpose pass on Larrabee but on CUDA only require that objects be loaded with different bank offsets (or that threads access data from a different bank given objects loaded on the same bank offset).
Gather Centric Toolbox
Note, some serial programming techniques live on inside the thread group (or CUDA block). More serial programming techniques live on within a Larrabee thread because of the larger write-able cache.
What to do with no scatter? Scatter via gather.
Scatter via gather means emulating scatter by having the scatter destination search for the result of the scatter. This REQUIRES data in a prior pass to be efficiently searchable, which means one of two things:
(1.) Either do a hierarchical parallel reduction in a prior pass to generate some kind of acceleration structure to reduce search cost to log time. If the data is static then this prior pass was done offline.
(2.) Or have the data kept and updated at run-time in a hierarchical tree structure such that no extra parallel reduction is required to generate the acceleration structure for searching.
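A small sketch of the idea, with loud assumptions: a sorted array plus binary search stands in for the hierarchical acceleration structure, each source is assumed to target a unique destination, and the function names are made up. Each loop iteration in `scatter_via_gather` is independent, standing in for one GPU thread per destination.

```python
from bisect import bisect_left


def scatter(dest_size, targets, values, empty=None):
    """Serial-style scatter: each source writes to its target slot."""
    out = [empty] * dest_size
    for t, v in zip(targets, values):
        out[t] = v
    return out


def scatter_via_gather(dest_size, targets, values, empty=None):
    """Each destination searches for the source that targets it."""
    # Prior pass: make the source data searchable by sorting on target.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    sorted_targets = [targets[i] for i in order]
    out = []
    for slot in range(dest_size):                  # one "thread" per slot
        pos = bisect_left(sorted_targets, slot)    # log-time search
        if pos < len(sorted_targets) and sorted_targets[pos] == slot:
            out.append(values[order[pos]])
        else:
            out.append(empty)
    return out
```

Both functions produce the same output, but the second never writes to a data dependent address; the cost has moved into the search, which is why the quality of the acceleration structure matters so much.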
Gather centric optimization has more to do with the following,
(1.) Limiting the amount of gather searching by "good for the GPU" acceleration structures. Often this means having fixed N-ary trees where N packs well with the SIMD vector size of the machine.
(2.) Keeping thread groups doing the searching grouped by good data locality, and maintaining a high amount of local store data reuse (to avoid getting bandwidth bound).
(3.) Keeping all threads doing a fixed (instead of variable) amount of work. To me this screams "amortize sorting and/or searching over multiple passes".
The "Holy Grail" of GPU Ray Tracing
This post wouldn't be complete without some talk about GPU ray tracing, given that ray tracing is a perfect parallel example of gather centric design.
GPU ray tracing breaks rule (5), which states that all threads do the same amount of work. GPUs lack the ability to efficiently do fine granularity regrouping of divergent threads. However, a recent CUDA paper found a 2x speed up from bringing down the cost of long running threads from thread block granularity to warp granularity. This is the current state of the art.
The problems of dynamic objects and performance have not yet been solved for ray tracing (or ray casting). This fact alone has made the topic a worthy background obsession of mine.
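To see why uneven work hurts, here is a deliberately simple model, assuming each group of `vector_width` threads runs in lockstep until its slowest thread finishes. It ignores latency hiding and scheduling; the function name and the model itself are illustrative assumptions, not measurements from any paper.

```python
def simd_utilization(work_per_thread, vector_width):
    """Fraction of lane-steps doing useful work under lockstep execution.

    A partially filled last group still pays for the full vector width.
    """
    useful = sum(work_per_thread)
    issued = 0
    for start in range(0, len(work_per_thread), vector_width):
        group = work_per_thread[start:start + vector_width]
        issued += max(group) * vector_width
    return useful / issued if issued else 1.0
```

In this model, four rays needing 1, 1, 1, and 8 traversal steps on a 4-wide vector keep the lanes busy only about a third of the time, while four rays needing 4 steps each keep them fully busy.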
Temporal Coherence
I'm convinced that a solution can be found in the use of temporal (in combination with spatial) coherence to lower search cost. I've managed to get part way to a solution using a scatter centric design in the form of what I call stochastic visibility. However, if I want something which will scale on GPUs and which I can port to OpenCL to run on any GPU, I need to re-formulate a gather centric solution to the problem of real-time tree-traversal of dynamic data.
A solution has been eluding me for some time!
Last night / this morning, a new possible solution became evident thanks to exploring a combination of unrelated topics (including thinking about doing a hierarchical cellular automaton to increase the "speed of sound" in a fluid computation), writing out the above rules of gather centric design for SIMD GPUs again, and rethinking how to solve reprojection without scatter. The solution involves maintaining a GPU side scene tree structure, in which I use the previous frame's tree traversal to accelerate the search for the solution to the next frame's visibility...
More on this later, if it works!
Given that general purpose APIs for GPU parallel programming are relatively new, IMO there is very much left to discover (or re-discover) in terms of methods and tools to solve complex problems on highly parallel GPUs. IMO one of the core themes in GPU programming is gather centric design, which contrasts greatly with serial optimization.
Why Gather Centric?
(1.) LIMITED ATOMIC OPERATION FUNCTIONALITY. Atomic operations are designed for unordered usage because fine grain ordered atomic operations simply do NOT scale to massively parallel machines. Scattering data via global atomics is limited to a small subset of single ALU operations, and these operations are relatively expensive. So global communication via global atomic operations is limited in usefulness. Instead data must be gathered to an independent in-order thread for complex logic.
(2.) READ ONLY CACHES. Excluding Larrabee, current GPUs effectively have only read-only caches. Write-able caches used for scatter centric communication does NOT scale because of cache contention. The most important feature caches provide on GPUs is more bandwidth efficient fine granularity gather (at the cost of high latency). Global vector gather is expensive because performance is a function of the number of cache lines or memory segments required to service the gather.
(3.) SCATTER IS EXPENSIVE. Global scatter is not bandwidth efficient as it is a function of the number of cache lines or memory segments touched. In the case of un-cached global masked writes (non-full writes), hardware would have to both read and write back all touched lines or segments.
Gather Centric Rules for SIMD GPUs
(1.) Algorithm split into independent stages.
(2.) Coarse gather from vector sized and aligned global reads.
(3.) Fine granularity gather from texture reads.
(4.) Reuse read data in a program managed local store.
(5.) All threads do the same amount of work.
Likely this means that a tree structure is accessed via TEX reads but object data is carefully setup in vector sized packets (for efficient global memory access). Parallel methods where all threads of a vector process the same data are fast on all platforms. Parallel methods where all threads of a vector gather from different local objects require an extra transpose pass on Larrabee but on CUDA only require that objects be loaded with different bank offsets (or that threads access data from a different bank given objects loaded on the same bank offset).
Gather Centric Toolbox
Note, some serial programming techniques live on inside the thread group (or CUDA block). More serial program techniques live on within a Larrabee thread because of the larger write-able cache.
What to do with no scatter? Scatter via gather.
Scatter via gather means emulating scatter by having the scatter destination search for the result of the scatter. This REQUIRES data in a prior pass to be efficiently searchable. Which means one of two things,
(1.) Either do a hierarchical parallel reduction in a prior pass to generate some kind of acceleration structure to reduce search cost to log time. If the data is static then this prior pass was done offline.
(2.) Or have the data kept and updated at run-time in a hierarchical tree structure such that no extra parallel reduction is required to generate the acceleration structure for searching.
Gather centric optimization has more to do with the following,
(1.) Limiting the amount of gather searching by "good for the GPU" acceleration structures. Often this means having fixed N-ary trees where N packs well with the SIMD vector size of the machine.
(2.) Keeping thread groups doing the searching grouped by good data locality, and maintaining a high amount of local store data reuse (to avoid getting bandwidth bound).
(3.) Keeping all threads doing a fixed (instead of variable) amount of work. To me this screams "amortize sorting and/or searching over multiple passes".
The "Holy Grail" of GPU Ray Tracing
This post wouldn't be complete without some talk about GPU ray tracing, given that ray tracing is a perfect parallel example of gather centric design.
GPU ray tracing breaks rule (5) which states that all threads do the same amount of work. GPUs lack the ability to efficiently do fine granularity regrouping of divergent threads. However a recent CUDA paper found a 2x speed up from bringing down the cost of long running threads from thread block granularity to warp granularity. This is the current state of the art.
The problems of dynamic objects and performance have not yet been solved for ray tracing (or ray casting). This fact alone has made the topic a worthy background obsession of mine.
Temporal Coherence
I'm convinced that a solution can be found in use of temporal (in combination with spatial) coherence to lower search cost. I've managed to get part way to a solution using a scatter centric design in the form of what I call stochastic visibility. However if I want something which will scale on GPUs and something which I can port to OpenCL to run on any GPU, I need to re-formulate a gather centric solution to the problem of real-time tree-traversal of dynamic data.
A solution has been eluding me for some time!
Last night / this morning, a new possible solution became evident thanks to exploring a combination of unrelated topics (including thinking about doing a hierarchical cellular automata to increase the "speed of sound" in a fluid computation), writing out the above rules again of gather centric design for SIMD GPUs, and rethinking how to solve reprojection without scatter. The solution involves maintaining a GPU side scene tree structure of which I use the previous frame's tree traversal to accelerate the search for the solution to the next frame's visibility...
More on this later, if it works!
20090720
My Ideal Parallel Machine
Clearly I am NOT a hardware engineer, but I am seeing a very consistent pattern of what I want in massively parallel hardware from what I attempt to create in software, and from this I've derived my ideal parallel machine below.
The machine is a very dense 3D computing/memory matrix of which a 2D slice looks somewhat like this,
(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)
(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)
(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)
(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)
(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)
.................................
.............SCALES_IN_X_Y_Z.....
.................................
(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)
- Each (CNM) is a compute/network/memory node in the matrix.
- Each node is connected to neighboring nodes in the matrix.
- All memory is distributed among the nodes.
- There is no external memory.
- There is no hardware cache coherency between nodes.
The network (N) part of each node would route data from local memory (M) in the background of computation (C). Total interconnect bandwidth would be tremendous, due to the highly parallel neighbor to neighbor routing. Network latency would be mostly a function of the distance between the sending and receiving nodes.
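To make the latency claim concrete, here is a rough sketch assuming nodes sit on an integer 3D grid, packets hop only between direct neighbors with no wraparound links, and every hop costs the same. The names `hop_count`, `estimated_latency`, and `per_hop` are made up for illustration:

```python
def hop_count(src, dst):
    """Neighbor to neighbor hops between two (x, y, z) node coordinates."""
    # With only nearest-neighbor links, the shortest route is the
    # Manhattan distance between the two nodes.
    return sum(abs(a - b) for a, b in zip(src, dst))

def estimated_latency(src, dst, per_hop=1.0):
    """Latency model: a constant (assumed) cost per hop."""
    return per_hop * hop_count(src, dst)
```

Under this model a packet to a direct neighbor costs one hop, while a packet across the machine costs roughly the sum of its dimensions, so keeping communicating data in nearby nodes is what keeps latency low.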
Crucial Change from Current Practice
I am just starting to really understand how I would program for this kind of machine in a general purpose way, and have algorithms scale to a variable number of compute nodes.
The crucial change is that memory is no longer external to the CPU, and we abandon the "load, compute, store" model of programming at larger scales.
Parallel computation on current hardware is either distributed to fixed locations or variable locations based on core load balancing, and data is either distributed among local and distant caches, or distant banks of memory. Effectively computation becomes a fine grain loop of loading data from distant locations, doing work, and then storing the results back to other distant locations. This does NOT scale well at all.
The problem is that the physical data flow in the machine does NOT match the needs of the program, and computation does NOT follow the required data flow.
Micro Scale Example
When looking at the current model of CPU atomic operations, parallel computation interfacing with shared data is distributed to many nodes, requiring an expensive serialization of computation because the atomic logic plays out finely interleaved across nodes in which there is a long communication latency between nodes.
For example, with CPU nodes fetching or writing to a shared queue, the queue pointer ends up on a shared cache line, and each add to the pointer happens on a different node requiring a synchronous transfer of the full cache line over long distances before an operation can finish.
Data and computation are NOT flowing through the machine in sync with the needs of the algorithm. Data and computational locality should be in parallel with each other!
I believe GPUs, excluding Larrabee, have traditionally had atomic operations on globally shared memory computed at the point of communication synchronization near the memory controller (MC) or ROP, with no need to transfer "cache lines" back and forth around the machine for the ALU operations which compose the atomic operation.
In this case data is flowing through the machine in sync with the needs of the algorithm. Computation on data which needs to be joined physically happens at the point where data is routed to a common location.
The DX11 CS, OpenCL, and CUDA parallel computing models (on non-Larrabee hardware) logically terminate here, and are vastly limited by a small microcosm of transfer of "compute locality" expressed in the set of supported atomic operations. Any other transfer of "compute locality" requires complex routing to/from distant memory and node-to-node synchronization. Clearly in the case of Larrabee, this routing might be of lower cost (or lower latency) because of the coherent caches, but still suffers from the same problem, just less so!
Macro Scale Example
A more macro example would be the non-general GPU fixed function ROP and framebuffer logic. Data is transferred on chip, results pass from compute units to distributed serialization points (ROP units) for a given tile of the framebuffer. Work distribution is to my knowledge a fixed layout to provide good load balancing and to better parallelize framebuffer memory requests. Note that a complex set of logic happens with respect to Z, stencil, and blending in the ROP units themselves. Compute locality transfers from ALU units to ROP units as does the data.
This is a macro example because the entire set of ROP blend operations is conditionally dependent on fine granularity Z and stencil checks. This sort of thing cannot be done efficiently in software via the small subset of unordered atomic operations provided by DX11 CS, OpenCL, or CUDA.
An attempt to emulate this ROP unit in software on non-Larrabee GPUs is a disaster. The problem results from high overhead in all the many possible ways to do the required node to node communication to provide framebuffer functionality. At some point node to node communication must go through global memory, or data must be fetched from global memory (or TEX cached) many times (gather/searching) to avoid the node communication (from scatter, etc). Either way, performance is screwed compared to the fixed function hardware. Note, Larrabee should be better at software ROP emulation because of lower overhead (lower communication latency) due to caches, but it looks to me as if it degrades into worst case behavior as the granularity of writes to the framebuffer becomes small (either the binning costs grow huge, or the synchronization on Z buffer check/write grows huge, choose your poison).
The Big Picture
The programming model on my suggested compute/memory matrix becomes one of direct data flow. Data flows through the machine based on the needs of program synchronization and communication, sync points become physical nodes where data is transferred to be jointly processed (or serialized), the program's next step of computation (the joint processing) happens at the destination node, instead of the compute node where the data originated (as with traditional programming models).
In the ROP example, compute nodes would process fragment shader kernels, and then output a network packet of fragment shader intermediate results. This network packet gets background routed to the node(s) assigned to the corresponding ROP tile(s) hit by the output samples. These ROP nodes do any programmable blending and can even continue doing more programmable shading (which might include transferring new packets between other nodes for computation). The ROP tile nodes can route back coarse Z/stencil data packets to raster nodes to early cull.
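A toy sketch of this flow, with everything simulated serially on one machine: fragments are routed to the node owning their framebuffer tile, and the depth test and write then happen at that destination node on its local memory. The tile size, the `(x, y, z, color)` fragment layout, and all function names are assumptions made for illustration:

```python
from collections import defaultdict

TILE = 8  # hypothetical tile size in pixels

def tile_owner(x, y):
    """Which ROP node owns the tile containing pixel (x, y)."""
    return (x // TILE, y // TILE)

def route_fragments(fragments):
    """Stand-in for the network: deliver each fragment to its owner's inbox."""
    inboxes = defaultdict(list)
    for frag in fragments:
        x, y, _z, _color = frag
        inboxes[tile_owner(x, y)].append(frag)
    return inboxes

def rop_node(inbox):
    """Runs at the destination node: Z test and write on local memory."""
    depth, color = {}, {}
    for x, y, z, c in inbox:
        if z < depth.get((x, y), float("inf")):  # nearer fragment wins
            depth[(x, y)] = z
            color[(x, y)] = c
    return color

def render(fragments):
    """Route all fragments, then let every ROP node resolve its own tile."""
    framebuffer = {}
    for inbox in route_fragments(fragments).values():
        framebuffer.update(rop_node(inbox))
    return framebuffer
```

No atomics appear anywhere: serialization happens naturally because each pixel has exactly one owner node, and the computation moves to that node along with the data.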
For load balancing, nodes themselves could easily service many different destination compute domains at the same time. Also note with this programming model, data need not be transferred to and from fixed long term set locations of memory, data structures can instead continuously be dynamically moved around in the distributed memory.
Undoubtedly more posts on this later when I get more time...
20090719
Windows SDK, Windows from eSATA, ETC
This post is an open request for comments from anyone who wants to share advice regarding the topics below.
Visual Studio Express 2008 or Windows SDK
For personal development I really need just one thing, a compiler/linker binary which runs from the command line. Clearly I also need the system headers, but I absolutely DO NOT need an IDE or debugger, those just slow me down.
Anyone have experience with the free Visual Studio Express 2008?
I didn't even know it existed until today, and it seems easier to just get Express than attempting to maintain a manual install of MinGW/MSYS for GCC.
What about just installing the Windows SDK without Express?
Looks like I really only need the SDK and NOT Express. If this is the case, damn, thanks Microsoft! Love to get any feedback anyone has as to this. For example, are there any limitations as to the Windows SDK compilers? Like hidden missing support for assembly or something crippling like that? What about with Windows 7 Beta? Are Windows 7 GPU drivers good or buggy?
Bootable Thumb/USB or eSATA
The dead machine is helping me rethink my development environment. I'm now looking at the option of building bootable external drives for development, to make it easy to switch between operating systems and physical machines (I used to switch internal drives). I made sure to get eSATA on the new motherboard. Too bad the iMac requires boot-camp.
XP from USB : Cedrick Collomb Blog | Ngine.de
Anyone run XP or Vista from eSATA?
Switching Linux Distro
My first Linux experience I think was Slackware, and post that I did my own from scratch because I wanted pthreads before it was available and later wanted everything to run from a ZIP drive. I've also played with NetBSD, OpenBSD, and FreeBSD. Settled with Crux for a long while (KISS), but went to Gentoo on the last machine for easy 64-bit support (big mistake). Now Slackware is just getting 64-bit support (in current branch only), and FreeBSD needs some kernel changes to get 64-bit NVidia drivers, so those are out for me.
Turns out that Arch Linux is a very Crux like (KISS) distro with good 64-bit support, so I'm going to move to it on the new machine.
20090718
Weak Computer Took One for the Team
Insanely Twisted Shadow Planet looks really cool.
Had a weekend of freedom to really work (wife is out of town), and the damn PC Linux dev machine died today. After testing 3 different GPUs and 3 different power supplies I've ruled out everything but one or more of motherboard, memory, or CPU. So instead my brother is driving up to visit and I'm catching up on PS3 games. It is amazing how much waiting on cheap parts costs in opportunity cost.
I'd really like to just immediately transition development to my older iMac (with its 8800M GTS GPU), but STILL waiting on September for Snow Leopard for OpenGL 3 support (I'm not an ADC Premier or Select Member so don't have access to the beta, assuming GL3 was in the beta, which I don't really know). Going back to pre-GL3 isn't really an option.
Atom Update
Tested out having 4K unique scene tree roots with heavy overlap and sparse l-systems. Visibility worked with everything moving. Drawing is mostly constant regardless of how many root nodes I have (as long as I stay within a low percent of root nodes to total nodes). Will have no problem with the 64K dynamic object target. Also pushed the GTX275 to 8M nodes (4x my high-end GPU target), but couldn't hit 60Hz. BTW, I'm now targeting 60Hz on low end, 30 just isn't smooth enough for me.
I have a new better way to store the node tree in a modified 360 fisheye projection mapping alone instead of having to reproject between an octahedron (scene tree storage) and fisheye mapping (drawn for player). The new mapping is a spherical 360 degree fisheye with resulting xy coordinates modified by sin() to push the projection into half way between a rectangle and sphere. Screenshot would have been included if computer was alive.
This was an accidental discovery, I didn't expect it to work well. It has the side effect of a more rectangular (normal) view in the center, and actually performs better with the stochastic visibility. The view expansion near the end of projection (the edge of the projection is from one "point" behind the viewer) actually helps leave room for more of the tree structure without collisions, and the view "compression" under fast forward motion as objects fly behind the viewer requires fewer frames to fully hole fill.
The result of all this is that the scene tree itself is stored directly in a very thick "frame buffer", and it works due to having some kind of 360 degree view mapping.
If this sounds alien, or like I'm in crazy land, it should. I spend more than 40 hours a week programming in reality (at Humanhead), so going off the deep end is quite refreshing indeed!
Console Art Detailed
Neat to see how others build things, awesome character artist at work shared these links from Ben Sprout Artworks : Prefabs in KZ2 and Environments in MP3.
Alaska : Juneau
We ended our trip in Juneau to catch our flight back to Chicago (which went indirectly through Anchorage). Juneau is one of the very common ports for Alaska cruise ships so be warned. I'd suggest skipping the Mount Roberts Tramway and instead hiking up/down the deadly path (very dangerous in the rain) which leads to/from the top (it is only 1800 feet, and the local kids run up the path).
Best local food we found in a very short amount of time was the Sandpiper Cafe. Other places of interest included the Alaska State Museum (where they had a great Science on a Sphere exhibit with hundreds of different visualizations of planetary data), the Alaskan Brewery Company tour, and the salmon hatchery sporting a really nice set of aquariums to keep people visiting outside of spawning season,


This is it From Alaska Folks
After this I will stick to the usual crazy programming stuff for a while. By the way, nearly all the photos thanks to the lovely lady below,

Hope you all enjoyed a glimpse of Alaska!
One Last Departing Gift : Alaska Stream Crossing Epic Fail!
An attempt to do something really stupid for the camera turned out horribly right, we laughed non-stop for about an hour after this...
20090717
Where is my Raytraced Physics Toy?
Someone really needs to join GPU physics with GPU raytracing into an interactive physics toy. The scene has shown the way, interesting 11k Intro by FRequency,
Another 4k by FRequency,
Another 1k by FRequency,
20090716
DX10/GL3 GPGPU : Stream Out with Scatter
In regards to DX10/OpenGL3 level hardware and point scatter, moving PS work to the VS stage was a huge win. In the case where there isn't enough VS work to amortize the setup cost, perhaps using stream out (or GL transform feedback) would enable adding a secondary task to be done in the VS stage. So VS would stream out a set of outputs which are not passed along to the PS stage (PS stage gets separate VS outputs).
This might be perfect for the case where an object needs to self update (do that via stream out), and then scatter its index into another data structure (via point drawing).
20090714
Wow.
Damn, Naughty Dog wizards have set the graphics bar again! Particles colored to match scene tint, particles getting shadowed, perhaps sub-surface scattering, huge numbers of triangles, and more...
Hybrid Cached Raycast and Reproject Raster
I've been suggesting this idea and permutations of this idea partly because I'd like to see someone try it, and I'm busy "traversing" a different solution to a similar problem (just geared towards dynamic rather than more static geometry). Undoubtedly there are those who have thought of this, just I haven't seen a paper on it yet.
The rough idea is to amortize the ray traversal cost over many frames and to make use of both spatial and temporal coherence. This shares a lot of similarities to video compression, in regards to reprojection being like motion prediction in B/P-frames, and the tile cache being like bits of I-frames.
The idea uses a traversal tile cache. A tile is a warp or vector (SIMD) friendly packet of the result of raycasts into an object in the scene. For example for CUDA, a tile might be an 8x4 grid of raycasts. An object might be the entire static world, or a character, or a dynamic object. Tiles are associated with an object ID so that the tile can be correctly reprojected as the object and view move.
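One possible layout for such a tile, sketched under several assumptions: hit points are stored in object space, each object has a 3x4 row-major object-to-view matrix for the current frame, the view looks down positive z, and projection is a simple pinhole divide. The `Tile` class, `reproject` function, and `focal` parameter are hypothetical names:

```python
from dataclasses import dataclass

TILE_W, TILE_H = 8, 4  # one 32-wide warp worth of raycasts

@dataclass
class Tile:
    object_id: int  # which object's transform reprojects this tile
    hits: list      # TILE_W * TILE_H object-space (x, y, z) hits, None for a miss

def transform(m, p):
    """Apply a 3x4 row-major matrix to a point."""
    x, y, z = p
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3]
                 for r in range(3))

def reproject(tile, object_to_view, focal=1.0):
    """Project cached hits using the owning object's current transform.

    Returns one (screen_x, screen_y, depth) per hit, or None where the
    hit is missing or falls behind the viewer.
    """
    m = object_to_view[tile.object_id]
    out = []
    for p in tile.hits:
        if p is None:
            out.append(None)
            continue
        vx, vy, vz = transform(m, p)
        if vz <= 0.0:  # behind the viewer this frame
            out.append(None)
            continue
        out.append((focal * vx / vz, focal * vy / vz, vz))
    return out
```

Because the hits live in object space, moving the object or the view only changes the matrix looked up by `object_id`; the cached raycast results themselves are untouched until the tile is judged too stale to keep.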
Tracing into a skinned character is a separate problem for which I don't have a good direct solution. However, given a solution to this problem, one could use the per tile object ID to "re-skin" the tile during reprojection. Actually one would want to store the tile in the base pose coordinate space, and simply apply the bones to the fragments in the tile to do the reprojection.
Tiles in the cache become less valid as objects move or the view changes. Each frame the cache gets updated to fill areas of the view(s) (note, there might be multiple views such as shadowmaps, etc) which have sparse coverage. Tile cache update rate would scale to match the capacity of the GPU.
Drawing the frame would be done via a composite pass which re-projects, lights, and blends all the cached tiles (and fills in any holes). Part of lighting could very well be done in tile space, and at a frequency lower than the screen refresh for low frequency lighting, and perhaps in image space at the screen refresh rate for high frequency lighting like specular and hard shadows.
Ultimately the success or failure of this type of system depends on whether the composite/reprojection cost is lower than the full traversal.
Compared to What I'm Doing
With classic ray traversal the structure is geared towards answering the question "if a ray intersects the object, what does it hit?". With my structure the traversal is geared towards answering the question, "given a lower quality version of the object in this 3D region, what does the next higher quality level of detail look like". In my case the traversal could be of any tree, even if the tree self intersects. In the case of a skinned object, classic ray traversal is a complex problem, in my case it is trivial.
I go directly down the tree without divergence, because my traversal doesn't go by ray intersection. However I depend on a reprojection scatter pass (which can be redone as a coarse binned scatter, local gather, or whatever form is fastest on the hardware). This reprojection scatter pass is used to prune out traversals of the tree which are not adding to visible nodes.
My scene is that of a forest of l-systems (64K roots, 2M nodes). Each traversal step is amazingly expensive (decompress rule, apply rule) compared to a triangle or voxel raycast or raytrace, and in my case the composite/reprojection step isn't as costly as traversal, so the hybrid traverse/raster works.
Not sure if the hybrid cached trace/raster approach will work well for classic raycasting, but I'd bet one of you out there will be answering this question in the next year or so!
20090713
Amusement Machine Video
The Amusement Machine guy posted an interesting video (a month ago): real-time GPU raycasting on an ATI 4850.
Alaska : Gustavus
Gustavus, with its somewhere over 300 residents (you know a town has its priorities right when the airport is larger than the town itself), is a launching point for all things done in Glacier Bay. We stayed at Aimee's Guest House and would highly recommend Aimee to anyone interested in staying in Gustavus,

The most dangerous animal in Alaska (2nd to people of course) viewed from the deck of the guest house during breakfast,

Dinner was mostly a combination of biking to Pep's Packing for some smoked salmon and the local grocery. The best dinner in town was the Gustavus Inn where you have to reserve your seat before noon (so they know how much fish to get), food is served family style, and the highlight of dinner is a mix of local fish/crab and greens/fruit from the owner's extensive garden.
An eight mile bike ride from the guest house is Glacier Bay National Park, which only really sports 3 hiking trails from the lodge (and only one sized for a day trip). Aimee's Guest House provides vintage 1-speeds for the journey,

And en route is a path to a military plane crash site (sans the awesome radial motors),

In the other direction is the road into town, and from there, en route to the pier, is the Mount Fairweather Golf Course,

20090710
Video of Particles via L-System and Stochastic Visibility
I talked about this a while back and never did post a video, and I had to post something cool before SIGGRAPH 2009, so below are screen shots and a quick video of an old stochastic visibility with particle generation effect test. The OpenGL2 (NVidia Specific) GLSL demo was built a long time ago for an 8600 GTS, and runs way too fast on the GTX275 (even bottle-necked by triangle setup in point scatter). Unfortunately all I had time to do was add a very slow serial frame capture and write MJPEG, which when turned on slows the program down too much to make a good video (it is hard to think in slow motion). The conversion from MJPEG to MPEG4 isn't ideal either (it is a huge file) at 60fps 720p.
There are a lot of things I never bothered fixing in the demo, such as the boundary between black background and the fractal (you will see a blurry 16x16 block pattern outline late in the video), or the pops when a bad floating point number gets into the particle buffer. This only shows the result of the tree expansion limited to one level per frame (fill artifacts from edges of occluders). Also hole filling was broken. So it isn't really representative of what I'm planning on for Atom, but it was a huge milestone in learning GPGPU techniques!
Video
Link to MPEG4 Video - 720p 60fps (115 MB)
Other Screen Shots


What is it?
I'd qualify it as bad coder art used to test GPGPU tree data structures (everything is computed on the GPU, the CPU just sends in view position and view direction). The full fractal structure is huge, something like 2^24 pixels wide before precision problems set in?
The scene tree is an 8-ary tree with one simple l-system rule for the 8 children based on the parent (position/scale/quaternion). It isn't limited to just one rule, I'm just lazy, can look up any rule from a texture. The engine maintains the scene tree on the GPU (256K coarse nodes, 2M fine nodes). The projection is a 360 fisheye, and the particle effect runs in projected fisheye space based on motion vectors and parent tree position relative to child tree position. The visibility is computed in an octahedron space (different mapping from the view), so the edges of the fisheye projection lose quality fast. Color is instanced via a direct visualization of eye relative world position,
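// Eye relative world position to color. The function name and signature are illustrative.
vec3 positionColor(float x, float y, float z) {
  vec3 rgb = vec3(0.0);
  // Three banded terms, one per axis; higher exponents give thinner bands.
  rgb += vec3(0.7,0.625,0.5) * vec3(pow(abs(sin(sqrt(abs(y/4096.0))*2.0)),16.0));
  rgb += vec3(0.3,0.4,0.5) * vec3(pow(abs(sin(sqrt(abs(x/4096.0))*32.0)),4.0));
  rgb += vec3(0.5,0.4,0.3) * vec3(pow(abs(sin(z/4096.0*256.0)),4.0));
  return rgb * rgb;
}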
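To make the "one rule for the 8 children based on the parent" part concrete, here is a small CPU-side C sketch of how a child node could be derived from its parent's position/scale/quaternion. The Rule contents (octant offsets, half scale, identity rotation) and all names are hypothetical stand-ins; the demo's real rule is not shown in this post.

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { float x, y, z, w; } Quat;            /* unit quaternion, w last */
typedef struct { Vec3 pos; float scale; Quat rot; } Node;
typedef struct { Vec3 offset[8]; float scale; Quat rot; } Rule;

static Vec3 cross(Vec3 a, Vec3 b)
{
    Vec3 r = { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
    return r;
}

/* Rotate v by unit quaternion q: v + w*t + u x t, with t = 2 (u x v). */
static Vec3 rotate(Quat q, Vec3 v)
{
    Vec3 u = { q.x, q.y, q.z };
    Vec3 t = cross(u, v);
    t.x *= 2.0f; t.y *= 2.0f; t.z *= 2.0f;
    Vec3 c = cross(u, t);
    Vec3 r = { v.x + q.w * t.x + c.x, v.y + q.w * t.y + c.y, v.z + q.w * t.z + c.z };
    return r;
}

/* Hamilton product a * b. */
static Quat quat_mul(Quat a, Quat b)
{
    Quat r;
    r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
    r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
    r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
    r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
    return r;
}

/* Hypothetical rule: children sit at the 8 octant corners at half scale. */
static Rule octant_rule(void)
{
    Rule r;
    for (int i = 0; i < 8; ++i) {
        r.offset[i].x = (i & 1) ? 0.5f : -0.5f;
        r.offset[i].y = (i & 2) ? 0.5f : -0.5f;
        r.offset[i].z = (i & 4) ? 0.5f : -0.5f;
    }
    r.scale = 0.5f;
    r.rot = (Quat){ 0.0f, 0.0f, 0.0f, 1.0f };
    return r;
}

/* Apply the rule to a parent to get child i (0..7). */
static Node child_of(const Node *parent, const Rule *rule, int i)
{
    assert(i >= 0 && i < 8);
    Vec3 o = rotate(parent->rot, rule->offset[i]);
    Node c;
    c.pos.x = parent->pos.x + parent->scale * o.x;
    c.pos.y = parent->pos.y + parent->scale * o.y;
    c.pos.z = parent->pos.z + parent->scale * o.z;
    c.scale = parent->scale * rule->scale;
    c.rot   = quat_mul(parent->rot, rule->rot);
    return c;
}
```

Swapping in a different Rule (for example one fetched per node from a table, as the texture lookup would allow) changes the fractal without touching the traversal.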
Future
I'm still fascinated with the problem of solving visibility for fully dynamic geometry requiring very expensive ray traversals (if ray cast or traced), but without ray casting (or tracing) and instead using what I've termed as "stochastic visibility" and blogged about. Effectively keeping a consistent tree structure of the scene as required by visibility (and shading), then only expanding and contracting the tree structure enough each frame to service rendering to the quality level required for human frame to frame perception. Stochastic visibility collides a point per scene tree node in a view space mapping to both help compute tree updates and solve for visibility. It is stochastic because the points collided are randomly positioned inside the bounding volume of the node. Collisions help prune out nodes which yield non-visible geometry. Collisions also directly solve the tree node memory allocation problem (there is no memory allocation).
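A serial C sketch of the collision idea, under loud assumptions: a tiny 4x4 grid stands in for the view space mapping, the jitter is only in x/y, the nearest depth is used as the priority, and the names (Node, scatter, rand01) are made up for the example. On the GPU this would be a scatter with a depth test, not a loop.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

#define GRID_W 4
#define GRID_H 4
#define EMPTY  0xFFFFFFFFu

/* Node in a normalized view space mapping: x,y in [0,1), z = depth. */
typedef struct { float x, y, z, radius; } Node;

/* Tiny LCG returning a float in [0,1). */
static float rand01(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (float)(*state >> 8) / 16777216.0f;
}

/* Scatter one randomly jittered point per node. When two points collide in
   a cell the nearer node wins and the other is pruned. The cell index is
   the surviving node's storage slot, so nothing is ever allocated.
   Returns the number of surviving nodes. */
static uint32_t scatter(const Node *nodes, uint32_t count,
                        uint32_t cells[GRID_W * GRID_H], uint32_t *seed)
{
    float nearest[GRID_W * GRID_H];
    uint32_t survivors = 0;
    assert(count < EMPTY);
    for (int c = 0; c < GRID_W * GRID_H; ++c) {
        cells[c] = EMPTY;
        nearest[c] = FLT_MAX;
    }
    for (uint32_t n = 0; n < count; ++n) {
        float px = nodes[n].x + (rand01(seed) * 2.0f - 1.0f) * nodes[n].radius;
        float py = nodes[n].y + (rand01(seed) * 2.0f - 1.0f) * nodes[n].radius;
        if (px < 0.0f || px >= 1.0f || py < 0.0f || py >= 1.0f)
            continue;                       /* jittered point fell off the view */
        int c = (int)(py * GRID_H) * GRID_W + (int)(px * GRID_W);
        if (nodes[n].z < nearest[c]) {
            if (cells[c] == EMPTY)
                ++survivors;
            nearest[c] = nodes[n].z;
            cells[c] = n;
        }
    }
    return survivors;
}
```

Because the jitter changes every frame, a node hidden behind a collision this frame can land in a free cell the next, which is the source of both the stochastic coverage and the random pruning problem.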
Vacation has provided some new ideas to improve upon my old non-CUDA/OpenCL "stochastic visibility" which I will be trying soon,
(1.) IMPROVED TEMPORAL CONSISTENCY VS HISTORY BUFFER. Collisions would sometimes prune out a part of the required scene tree branch, and keeping persistent nodes was way too costly to do in OpenGL3 with GPGPU methods on an 8600 level graphics card. One trivial solution to this problem would be to run a pass which checks the source data for collisions and does a reduction of the source nodes keeping the highest priority node which had a collision. If I had a set of reducing resolution history buffers, I could likely ensure good temporal consistency (solving the random pruning problem).
(2.) ADD PARENT LINKS. My scene tree nodes didn't have links to the parents. Was a problem for animated l-systems and particle fluid dynamics, because I wanted a physical constraint which pulled nodes back to the static position in parent (as defined as a possibly animated l-system rule). Turns out that the history buffer pass enables me to correct parent links in child nodes, even though parents move to different memory locations each frame. If this works, this GPU data structure truly becomes awesome because it solves the 1M node memory allocation problem per frame with fully dynamic trees, automatic "defragmentation" of memory, automatic regrouping for good data/branch locality and cache performance, and now nodes maintaining parent links even though all memory locations change per frame.
(3.) MULTI-LEVEL TREE UPDATE PER FRAME. With LOD transparent blend-in and a triangle based scene, I've found in previous results that I can add/prune nodes at only one LOD level per frame and effectively service visibility if I have a conservative amount of overlap. However when the scene tree goes down to the pixel or near-pixel level, the tree must be able to expand by more than one level per frame to fill visibility gaps in dynamic geometry. I'm planning on a new method where I first re-project the current node set, then do a hierarchical image space reduction to choose the highest priority nodes for a varying level, multi-level tree update per frame.
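The reductions mentioned in (1.) and (3.) could be as simple as a repeated 2x2 "keep the highest priority" pass, much like building a mip chain. The C sketch below is a serial stand-in for that idea; the Candidate struct, the float priority, and the function name are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* One image space entry: a node reference and how badly it needs servicing. */
typedef struct { uint32_t node; float priority; } Candidate;

/* Reduce a w x h grid (w and h even) to (w/2) x (h/2), keeping the highest
   priority candidate from each 2x2 block. Calling this repeatedly yields
   the set of reducing resolution buffers. */
static void reduce_2x2(const Candidate *src, int w, int h, Candidate *dst)
{
    assert(w > 0 && h > 0 && w % 2 == 0 && h % 2 == 0);
    for (int y = 0; y < h / 2; ++y) {
        for (int x = 0; x < w / 2; ++x) {
            Candidate best = src[(2 * y) * w + 2 * x];
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    Candidate c = src[(2 * y + dy) * w + (2 * x + dx)];
                    if (c.priority > best.priority)
                        best = c;
                }
            }
            dst[y * (w / 2) + x] = best;
        }
    }
}
```

Each coarser level then answers "which node in this screen region most needs expanding", which is one way to pick where the tree grows by more than one level.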
20090709
Cool Blog Posts
I'm starting to really like the Followers tool on the blog,
Sanders' Blog : Deferred Virtual Texture Shading Post Series - Sander van Rossen has a really neat idea to combine virtual texturing with texture space shading and do shading in the physical pages of the texture cache itself. Dynamic geometry would require updating normals (and coordinates) of the texels in the physical texture page cache. But lighting calculations could be cached (or not done every frame) and transparent rendering is simple. The challenge is efficiently rendering lights into physical texture page cache. I think this would be a highly worthy problem to solve in DX11 with CS5. With unique texturing, for each texture page in the page cache, have an associated bounding geometry to intersect with lights. Bin lights GPU side per texture page in the CS. Do highly coherent shading in the page cache with the lights in the page bin. This could be done by running enough threads to fill the hardware and just extracting work from the bins to load balance all the shading.
EnterTheSingularity - Jake Cannell has a bunch of cool ideas, such as this, "Taking the spatial-temporal coherence optimization idea to its logically conclusion results in a very interesting radical way to approach a renderer. Really what you want is something like a real-time video compression loop, but without the encoding step. When you need new fresh image macrotiles, you invoke the rasterizer, sometimes rendering them at reduced spatial resolution depending on the amount of motion blur, and then these are fed into a forward reprojection system which composites most of the image, along with some holes and errors introduced by new/unpredicted moving objects, lighting changes, etc. But that would be an entire new line of research, and is more suited to a ray tracer or really tight tree-based micro-rasterizer that can effeciently handle micro draw call invocations."
GPU Ray Traversal Efficiency
Understanding the Efficiency of Ray Traversal on GPUs
by Timo Aila and Samuli Laine of NVidia Research
I really like this HPG2009 paper which explores the primary factors affecting performance of ray traversal on NVidia GPUs. They compared actual results with results from a simulator which had zero cost memory performance. So actuals vs upper bound on best (unobtainable) GPU performance of only ALU cycle counts.
They found work distribution yielding lower SIMD efficiency was the primary bottleneck for ray traversal and NOT the memory system. Quote from the paper, "We have also shown that, contrary to conventional wisdom, ray tracing performance of GTX285 is not significantly hampered by the lack of cache hierarchy. In fact, we can also expect good scaling to more complex scenes as a result of not relying on caches."
Upper End Result in the Paper on a GTX285
- 140M rays/sec for primary rays (coherent)
- 40M rays/sec for secondary randomly shuffled GI rays (non-coherent)
They found around (and often better than) a 2x performance increase via two primary optimizations. With these changes they managed between 77% and 91% of the theoretical maximum GPU performance assuming infinite memory speed.
(1.) PERSISTENT THREADS. Instead of launching a thread per pixel, launch enough threads to fill the GPU once, and have the threads pull work from a global pool. The problem is that ray traversal has varying execution time. The advantage of threads grabbing their own work is that long running warps only stall the warp instead of stalling the entire block (another block would have to wait until the longest running warp finished before being issued). Also note the paper pulls work from the global pool into small local (shared memory) pools to reduce pressure on a single global atomic counter.
(2.) PER-RAY TRAVERSAL. Typical CPU SIMD tracers do packet traversal partly because of lack of vector gather and because CPUs don't automatically handle branch divergence. The paper found that per-ray traversal is faster than packet traversal on the GTX285 in all cases, and for diffuse primary rays about 2x faster (even though per-ray traversal decreases memory coherency)!
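The persistent threads idea in (1.) boils down to a shared counter that workers bump to claim work. Here is a CPU-side C11 sketch of that pattern, not the paper's CUDA code: WorkPool, pool_grab, and worker are names invented for the example, and grabbing a batch per atomic add only loosely mirrors the paper's small local pools.

```c
#include <assert.h>
#include <stdatomic.h>

/* Global pool of `total` work items (rays), handed out by index. */
typedef struct { atomic_uint next; unsigned total; } WorkPool;

static void pool_init(WorkPool *p, unsigned total)
{
    atomic_init(&p->next, 0u);
    p->total = total;
}

/* Claim up to `batch` items with a single atomic add, which keeps pressure
   on the shared counter low. Returns how many items were claimed and writes
   the index of the first one. */
static unsigned pool_grab(WorkPool *p, unsigned batch, unsigned *first)
{
    assert(batch > 0);
    unsigned start = atomic_fetch_add(&p->next, batch);
    if (start >= p->total)
        return 0;                           /* pool is drained */
    *first = start;
    unsigned left = p->total - start;
    return left < batch ? left : batch;
}

/* One persistent worker: loop until the pool is empty rather than being
   launched once per item. */
static unsigned worker(WorkPool *p, unsigned batch, unsigned char *done)
{
    unsigned first = 0, n, processed = 0;
    while ((n = pool_grab(p, batch, &first)) != 0) {
        for (unsigned i = 0; i < n; ++i)
            done[first + i] += 1;           /* stand-in for tracing one ray */
        processed += n;
    }
    return processed;
}
```

A worker that finishes a cheap batch simply comes back for more, so a few slow rays no longer hold up everything launched alongside them.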
Implementation
- 192 threads/block
- 21-25 registers/thread
- node data fetched via 1D texture fetch (cached)
- triangle data fetched via direct global memory access (non-cached)
- per-ray stacks via global memory access (non-cached)
- binary BVH traversal with maximum leaf size of 8
- tree built using greedy surface-area heuristic
- 64 bytes per the two BVH nodes (one full global memory segment access)
- pair of BVH nodes always fetched and tested together
- Woop's unit triangle intersection test
- 48 bytes per triangle (1 to 2 global memory segments per access)
- rays assigned in Morton order (good node texture coherency on primary rays)
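To sanity check the byte counts in the list above, here is a C sketch of layouts that would fit them. The field arrangement is my guess (two child boxes plus links for the node pair, three 4-float rows for the Woop triangle), and the 64 byte segment size is taken from the "one full global memory segment" remark; none of it is copied from the paper's source.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float x, y, z, w; } Float4;

/* Two sibling BVH nodes fetched and tested together (assumed layout). */
typedef struct {
    float   lo0[3], hi0[3];   /* bounds of child 0 */
    float   lo1[3], hi1[3];   /* bounds of child 1 */
    int32_t child[2];         /* links to the children (node or leaf) */
    int32_t pad[2];           /* padding up to 64 bytes */
} BvhNodePair;

/* Triangle stored as three rows of a transform into unit triangle space. */
typedef struct { Float4 row[3]; } WoopTriangle;

_Static_assert(sizeof(BvhNodePair) == 64, "node pair should fill one segment");
_Static_assert(sizeof(WoopTriangle) == 48, "triangle should be 48 bytes");

/* Number of `segment` byte memory segments overlapped by an access of
   `size` bytes starting at byte `offset`. */
static unsigned segments_touched(unsigned offset, unsigned size, unsigned segment)
{
    assert(size > 0 && segment > 0);
    return (offset + size - 1) / segment - offset / segment + 1;
}
```

With tightly packed 48 byte triangles, every other pair straddles a segment boundary, which is where the "1 to 2 global memory segments per access" comes from.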
I'm making some assumptions here (which I didn't see in the paper directly but should be obvious?), that the pairs of BVH nodes and 48 byte triangle leaf data (referred to below as a "data block") are fetched using 1-2 global memory transactions into shared memory.
(1.) Threads write address of data block into shared memory.
(2.) All threads share in loading the data block for one thread at a time.
(3.) Data blocks are loaded with increasing bank offsets.
This ensures very fast memory transactions both globally (perfect coalescing on GT200), and no bank conflicts locally afterward when threads move from coordinated loading to individual parallel computation.
Notice stack data is also thread-local storage, which means non-cached global memory accesses. I'm going to assume again that they ensured good global memory segment coalescing (GT200) via keeping per thread stacks spaced out by a stride of 16. This way global memory segments required for stack update would be a function of divergence in stack size. So stacks of similar size would get serviced in a single global memory transaction (aside, this is exactly the same thing required on Larrabee to do fast parallel stacks since gather performance is a function of cache lines accessed).
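The stride-of-16 guess above amounts to interleaving the stacks so that entry k of 16 neighboring threads sits in 16 consecutive words. A small C sketch of that addressing follows; MAX_DEPTH, the 4 byte entry size implied by "word", and the function names are assumptions for the example.

```c
#include <assert.h>

#define STRIDE    16   /* threads interleaved per group (the guess above) */
#define MAX_DEPTH 64   /* hypothetical maximum stack depth */

/* Word index of stack entry `level` for `thread` in the interleaved layout:
   one row of STRIDE words per stack level, one column per thread. */
static unsigned stack_index(unsigned thread, unsigned level)
{
    assert(level < MAX_DEPTH);
    unsigned group = thread / STRIDE;
    unsigned lane  = thread % STRIDE;
    return group * STRIDE * MAX_DEPTH + level * STRIDE + lane;
}

/* Distinct rows (and so, with 4 byte entries, distinct 64 byte segments)
   touched when each of the 16 threads accesses its own stack top. */
static unsigned rows_touched(const unsigned top[STRIDE])
{
    unsigned char seen[MAX_DEPTH] = {0};
    unsigned rows = 0;
    for (int t = 0; t < STRIDE; ++t) {
        assert(top[t] < MAX_DEPTH);
        if (!seen[top[t]]) {
            seen[top[t]] = 1;
            ++rows;
        }
    }
    return rows;
}
```

When all 16 threads have the same stack depth, rows_touched is 1 and the push or pop is a single transaction; every extra distinct depth adds one more.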
Dedicated Warp-Wide Scan or Prefix Sum Instruction
This enables lower software overhead for fine thread granularity work regrouping on the GPU. IMO NVidia needs this (or something like Larrabee's 2 related instructions), covered it in prior blog posts, have hopes that this might be in NVidia's GT300 arch. The paper covers this in the section on "Improving the SIMD efficiency further".
The paper found that, with the current software overhead, replacing terminated rays didn't yield a consistent performance improvement. However the simulator suggested a max 20% improvement with the addition of two warp-wide hardware instructions (prefix sum and population count) assuming no memory bottleneck.
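To show why prefix sum and population count are the two instructions that matter for regrouping, here is a serial C emulation of compacting the live lanes of a 32-wide warp. The loop over lanes stands in for what each lane would do in parallel; the function names are mine and this is not how any particular GPU exposes these operations.

```c
#include <assert.h>
#include <stdint.h>

/* Population count: number of set bits in v. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1;     /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Exclusive prefix sum over the mask bits: how many live lanes sit below
   `lane`. This is the slot a live lane compacts into. */
static unsigned prefix_below(uint32_t mask, unsigned lane)
{
    assert(lane < 32);
    return popcount32(mask & ((1u << lane) - 1u));
}

/* Pack the values of live lanes to the front of `out`. Returns the number
   of live lanes, which is also the first slot free for fresh rays. */
static unsigned compact(const int in[32], uint32_t alive, int out[32])
{
    for (unsigned lane = 0; lane < 32; ++lane) {
        if (alive & (1u << lane))
            out[prefix_below(alive, lane)] = in[lane];
    }
    return popcount32(alive);
}
```

With hardware support each lane gets its slot in one instruction; done in software it takes several steps through shared memory, which is the overhead being discussed.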
Paper also has some very rough speculation towards going to fully dynamic "Work queues" using the simulated new warp-wide instructions showing a max potential of 40% and 80% performance improvements in two cases assuming no memory bottleneck.
SIMD Width : 32-wide vs 16-wide vs 1-wide (Scalar)
Given impossibly infinitely fast memory, and identical computational power, simulator results showed 16-wide SIMD to be 6-19% faster, and scalar to be 30-144% faster than 32-wide SIMD.
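A toy C model of why narrower SIMD wins when rays take different numbers of steps: each group of lanes runs as long as its slowest ray, so wider groups burn more idle lane slots. The model and its names are my own simplification and are not meant to reproduce the paper's percentages.

```c
#include <assert.h>

/* Lane slots consumed when `n` rays with the given per-ray iteration counts
   run in groups of `width` lanes. Every lane in a group stays occupied
   until the group's slowest ray finishes. */
static unsigned simd_slots(const unsigned *iters, unsigned n, unsigned width)
{
    assert(width > 0);
    unsigned slots = 0;
    for (unsigned base = 0; base < n; base += width) {
        unsigned longest = 0;
        for (unsigned i = base; i < n && i < base + width; ++i) {
            if (iters[i] > longest)
                longest = iters[i];
        }
        slots += longest * width;
    }
    return slots;
}
```

For four rays needing 1, 4, 1, and 1 steps, a 4-wide group spends 16 lane slots, two 2-wide groups spend 10, and scalar execution spends exactly the 7 steps of useful work.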
20090708
Alaska : Glacier Bay by Air
A 75 minute flight-seeing tour of Glacier Bay Area is the best way to see the park. We waited for a clear day before doing the tour and got rewarded by the ability to fly over the Brady Ice Field. Flight-seeing provides a way to see many glaciers up close and personal, or choose specific areas of the park to look at, such as the peaks of mountains.

Or the glaciers themselves.

Here are some of Margerie Glacier,



Unfortunately a majority of the shots taken didn't turn out well (a mixture of a new camera with auto-exposure not adapting to the snow and overexposing everything, and that I would rather see stuff than take photos). The experience was hugely better than what you will ever see from a set of photos or even videos.

Giga Voxels at Siggraph 2009
Beyond Triangles : GigaVoxels Effects In Video Games
Friday, 7 August | 3:45 PM | Room 260-262

Looks like a really neat talk,
Mixing with existing triangle scenes.
More efficient CUDA implementation of the pipeline.
Faster octree traversal.
Fully GPU side brick cache management.
Efficient soft shadows and depth-of-field approximation.
Instancing of millions of octree-based volumetric objects.
Image space method supporting transparency with overlapping instances.
20090706
Alaska : Glacier Bay by Boat
The marine wilderness of Glacier Bay includes tidewater glaciers, snow-capped mountain ranges, ocean coastlines, deep fjords, and freshwater rivers and lakes. This diverse land and seascape hosts a mosaic of plant communities and a variety of marine and terrestrial wildlife and presents many opportunities for adventure. Here is the Glacier Bay Area viewed from google,
View Larger Map
The park provides an 8 hour boat tour of the area which visits many of the glaciers in the bay. The tour starts from the park dock at Bartlett Cove (which can be seen by zooming in the map above, east of Gustavus).
Kathryn and I decided to take the tour after seeing photos of the tour where people were getting on and off the boat. Turns out this was misleading: those photos were of picking up and dropping off kayakers, and not off-boat excursions as part of the tour as I expected. Eight hours is a long time even on the large boat, and I spent part of the tour (most of the return) sleeping. But in their defense, they did provide a great lunch, and the views en route were spectacular.

Steller's Sea Lions (above), Brown Bear (aka Grizzly, below), and other animals were seen on the tour. At one point we watched a Brown Bear traverse a patch of mostly vertical rock en route to the beach where we ended up picking up the first batch of kayakers. Apparently a few other kayakers had been picked up before from the same spot with a bear taking a nap on the beach.


Glacier views are provided halfway through the tour along with lunch. Below is Margerie Glacier plus detail shots. The gray silt when dispersed yields the turquoise water color which proliferates in the Glacier Bay area. We did see some calving at Margerie, but didn't get any good shots of it.



The Lamplugh Glacier currently sports an internal waterfall,

A far better way to see the bay is by air, and that post will have to wait until Kathryn gets the photos finished...
Alaska : Mendenhall Glacier
This 12 mile long and 1.5 mile wide glacier is located in Mendenhall Valley, about 12 miles from Juneau.

Trails in the valley lead around the lake created by the receding ice mass. This lake is filled with very blue icebergs. In the glacial areas, ice takes on this blue color due to air bubbles trapped inside the ice getting squeezed out in the glacier.

Around the lake there are 2 primary trails leading towards the glacier, the shorter easy hike ends at a huge waterfall (picture below). A secondary trail loop off this waterfall hike was closed (probably due to the trail being destroyed by a landslide).

On the other end of the lake starting at the camp area is another longer 7 mile trail which leads to a view over the glacier. We would have liked to do the trail, but we had to catch our flight back to Chicago in a few hours, so we opted to cheat and take a helicopter tour which landed on the glacier itself.
Helicopter Tour
Looking back towards the visitors center. The trail which follows the shore leading to the huge waterfall (which is out of view) is on the left side of this image.

Detail of calved icebergs at the glacier's terminus,

Glacial detail viewed from the chopper was fantastic,

Looking towards the landing spot,

Similar view after touching down on the glacial surface,

And yes, glacial melt water is quite refreshing,

These melt water streams carve through the top of the glacier surface,

Views can be quite stunning,

But eventually you have to return to the helicopter holding your hat on, to return the ice boot covers they provided you. For some reason I felt compelled to duck my head down, perhaps as a subconscious reaction that I could self terminate by jumping upward. Also note how they position someone to help, so stupid tourists don't deviate into the rear tail rotor.

The helicopter tour was the most expensive excursion we did. Something to do with turbine motor vs piston motor and amortizing the cost of the flying machine over the number of tourists... It was totally a last minute decision made along the lines of "we haven't done this before, and we probably are not going back to get a second chance to land on a glacier, might have enough time to do this before flying back to Chicago". Alaska's coastal weather is unpredictable. After a morning of near ground level clouds, and seeing the overcast cloud ceiling lifting, we took off for the airport looking for a building with lots of helicopters...
Google Map View of the Area
View Larger Map

Alaska : Eagles
Gustavus has a lot of Bald Eagles which hang out around the pier at low tide. One evening, a bunch of them were fighting over what were likely fishermen's scraps,
One eagle managed to steal some skin, and flew away to an isolated post on the pier to consume it. Food flying into the mouth, not out...

The eagle was close enough to get a good video without the camera's sorry excuse of digital zoom for lack of telephoto,
20090705
Multicore IC for Realtime Ray Tracing
800TFLOPS Multicore IC for Realtime Ray Tracing describes a monster multi-core board dedicated to real-time ray tracing. Quick stats pulled from the diagram,
Board Target
- 1000W.
- 792 TFLOPs (only)!
- 73 cores on chip x 9 chips.
- 1920 x 1080 output resolution.
- Images traced using 35 channel color.
- Chips at 750MHz on 45nm process.
- 130 million gates/chip at 17mm footprint.
Totals Across 9 Chips on Board
- 864 KB L2 I$.
- 7056 KB L2 D$.
- 4608 KB L1 I$.
- 9216 KB L1 D$.
- 954 KB registers.
- 928 KB L0 memory.
- 2592 GB/s bandwidth.
Each Chip
- 8 clusters of 9 cores.
- 1 64-bit RISC CPU.
- 96 KB L2 I$.
- 784 KB L2 D$.
- 512 KB L1 I$ (total over 8 clusters).
- 1024 KB L1 D$ (total over 8 clusters).
- 106 KB registers (total over 8 clusters).
- 103 KB L0 memory (total over 8 clusters).
- 288 GB/s bandwidth (total over 8 clusters).
Each Cluster
- 64 KB L1 I$.
- 128 KB L1 D$.
- 13.25 KB registers.
- 12.9 KB L0 memory.
- 36 GB/s via 3 interconnect buses each 12 GB/s and 128-bit.
- 9 cores sharing registers (?).
CORE 0,
- Ray generation.
- Reserved register file.
- Register file : 8 bytes x 16 registers x 4 banks (512 B).
- 256 B L0 memory.
CORE 1,
- Pixel Traverse.
- 2-way SIMD.
- Register file : (4x2) bytes x 16 registers x 4 banks (512 B).
- ? B CAM (content-addressable memory) L0 memory.
CORE 2,
- Bounding box processing.
- 2-way SIMD x 3.
- (4x2) bytes x 16 registers x 8 banks (1 KB)
- 4 KB L0 memory.
CORE 3,
- Intersection processing.
- 8-way SIMD x 24.
- (4x8) bytes x 16 registers x 16 banks (8 KB)
- 5.4 KB L0 memory.
CORE 4,
- Other ray-tracing processing.
- 4 bytes x 16 registers x 16 banks (256 B)
- 256 B L0 memory.
CORES 5, 6, AND 7,
- "Light Processing", write processing.
- 4 bytes x 16 registers x 16 banks (1 KB)
- 1 KB L0 memory.
CORE 8,
- Not shown in the diagram?
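The totals above can be cross-checked with a little arithmetic. The sketch below sums the stated per-core byte sizes into cluster, chip, and board totals, assuming 1 KB = 1024 B and leaving out core 1's unknown CAM size and the unshown core 8. Note that core 4's listed factors (4 x 16 x 16) multiply to 1 KB, but the stated 256 B is the figure that makes the cluster total come out to 13.25 KB.

```python
KB = 1024

# Per-core register file sizes in bytes, as stated in the list (core 8 is not shown).
core_registers = {
    "core0": 512, "core1": 512, "core2": 1 * KB, "core3": 8 * KB,
    "core4": 256, "core5": 1 * KB, "core6": 1 * KB, "core7": 1 * KB,
}
# Per-core L0 memory in bytes; core 1's CAM size is unknown and left out.
core_l0 = {
    "core0": 256, "core2": 4 * KB, "core3": 5.4 * KB,
    "core4": 256, "core5": 1 * KB, "core6": 1 * KB, "core7": 1 * KB,
}

CLUSTERS_PER_CHIP = 8
CORES_PER_CLUSTER = 9
CHIPS_PER_BOARD = 9

cluster_reg_kb = sum(core_registers.values()) / KB
cluster_l0_kb = sum(core_l0.values()) / KB
chip_reg_kb = cluster_reg_kb * CLUSTERS_PER_CHIP
board_reg_kb = chip_reg_kb * CHIPS_PER_BOARD
chip_l0_kb = cluster_l0_kb * CLUSTERS_PER_CHIP

# 8 clusters of 9 cores plus the single 64-bit RISC CPU.
cores_per_chip = CLUSTERS_PER_CHIP * CORES_PER_CLUSTER + 1

# 36 GB/s per cluster, scaled up to the whole board.
board_bandwidth_gbs = 36 * CLUSTERS_PER_CHIP * CHIPS_PER_BOARD

print(cluster_reg_kb, chip_reg_kb, board_reg_kb)
print(cluster_l0_kb, chip_l0_kb)
print(cores_per_chip, board_bandwidth_gbs)
```

The L0 figure matches the stated 12.9 KB per cluster without any contribution from the CAM, which suggests the CAM was not counted in that total.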
Odd Transparent Rendering Ideas
Transparent content tends to require a trade-off between overdraw and sharpness. The ideal solution in my mind is to be able to have sharp features (high resolution) with low amounts of overdraw (more opaque) intermixed with smooth features (low resolution) and high amounts of overdraw (more transparent).
With 100% order-independent rendering, such as only additive transparency, one possible solution would be as follows,
(1.) Have a set of accumulation buffers at reducing resolutions.
(2.) Distribute draws to buffers based on sharpness and overdraw.
(3.) Combine all buffers.
Combining the buffers requires repeated passes of up-sample and merge, in order from the smallest buffer to the largest buffer. Note it might be a good idea to filter better in the up-sample passes to avoid the typical mip magnification artifacts. Also note these combine passes are going to be bandwidth bound, so it's a good idea to pair them with some ALU bound work in the same kernel if possible!
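A minimal CPU sketch of the combine step, using 1D buffers and additive blending only. The 2x resolution step between buffers and the simple linear up-sample filter are assumptions for illustration; a real implementation would work on 2D render targets and would probably want a better filter.

```python
import math


def upsample_2x(buf):
    """Double the resolution with linear interpolation, clamped at the edges."""
    n = len(buf)
    out = []
    for i in range(2 * n):
        # Center of fine sample i expressed in coarse-sample coordinates.
        x = (i + 0.5) / 2.0 - 0.5
        lo = math.floor(x)
        frac = x - lo
        a = buf[min(max(lo, 0), n - 1)]
        b = buf[min(max(lo + 1, 0), n - 1)]
        out.append(a * (1.0 - frac) + b * frac)
    return out


def combine(buffers):
    """Merge additive accumulation buffers, given ordered largest to smallest."""
    result = buffers[-1]
    for finer in reversed(buffers[:-1]):
        up = upsample_2x(result)
        # Additive merge of the up-sampled coarse data into the finer buffer.
        result = [f + u for f, u in zip(finer, up)]
    return result


full = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # sharp, low overdraw content
half = [0.0, 2.0, 0.0, 0.0]                       # softer content
quarter = [1.0, 1.0]                              # smooth, high overdraw content
print(combine([full, half, quarter]))
```

Each pass reads a small buffer and writes one twice its size, so the work per pass is almost entirely memory traffic, which is why the passes end up bandwidth bound.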
TANGENT ON BLEND MODES
Typical blend modes used in order-independent rendering are ADD and MUL. ADD to lighten, and MUL to darken. Often using both requires 2 passes, and when mixed with multiple resolutions, artifacts are found such as background MUL particles (say smoke) drawing over foreground ADD particles (say fire).
Another issue is gamma-correct (linear space) blending: are transparent surfaces combined in linear space, or after the linear to gamma transform?
In the case of gamma space (non-linear) blending, I've never liked ADD blending at all for additive transparency. Perhaps this is because of years of photography where I go to complex lengths to avoid over-exposure. HDR in games is often (IMO) ridiculously overdone (really, should 25-50% of the frame be white?). A visually better option (IMO) is Photoshop's SCREEN blend mode. SCREEN blending results in a soft approach to white, with blending limited at white. From a photography perspective, SCREEN works more like the way film responds to over-exposure. SCREEN is an inverted multiply,
d = 1 - ( (1-a) * (1-b) )
The inversion required by SCREEN can be factored out to doing MUL blending with an inverted framebuffer when attempting to port this over to DX or GL or GCM.
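A small sketch of that factoring, on single-channel values in the 0 to 1 range. The function names are hypothetical; the point is only that keeping the framebuffer inverted turns a chain of SCREEN blends into a chain of plain multiplies with one inversion at the end.

```python
def screen(a, b):
    """Photoshop-style SCREEN of two values in [0, 1]."""
    return 1.0 - (1.0 - a) * (1.0 - b)


def screen_direct(dest, sources):
    """Apply SCREEN one source at a time, inverting on every blend."""
    for s in sources:
        dest = screen(dest, s)
    return dest


def screen_via_inverted_mul(dest, sources):
    """Same result using MUL blending against an inverted framebuffer."""
    inv = 1.0 - dest          # framebuffer stores the inverted value
    for s in sources:
        inv *= (1.0 - s)      # plain MUL blend with the inverted source
    return 1.0 - inv          # single inversion when the buffer is resolved


sources = [0.5, 0.25, 0.75]
direct = screen_direct(0.0, sources)
factored = screen_via_inverted_mul(0.0, sources)
reordered = screen_via_inverted_mul(0.0, list(reversed(sources)))
print(direct, factored, reordered)
```

Because the inverted form is just a product, the result does not depend on draw order, and it approaches white without ever exceeding it.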
In the case of linear blending, at least in Photoshop, I've found MUL for darkening unacceptable with both 8-bit and 16-bit modes. To test this in Photoshop, one must actually build a linear color profile (which is quite easy). The problem is the loss of precision (banding, etc) when the MUL results in colors near black (even in 16-bit mode). Perhaps with 16-bit float framebuffer outside of Photoshop this isn't a problem.
BILLBOARDS AND SPHERICAL PARTICLES
Particle billboards suffer from the problem of hard intersection with solid geometry, and in the case of sorted billboards, pop when sort order changes.
Both of these problems have similar solutions: dynamically adjusting transparency across the particle to avoid the artifact. Spherical (or Z buffer aware) particles easily solve the first problem of intersecting with solid geometry. The second problem of pop on order change is more complex.
TRANSFORMING ORDER DEPENDENT TO ORDER INDEPENDENT?
This is a wild, completely untested idea I haven't had time to try. Note the application of this would be for a non-triangle renderer which doesn't have standard opaque triangle surfaces. Application of this to a typical game engine is likely not worth it (requires two alphas, one for weight and one for transparency, or other complicated things)!
The idea is to solve the following problems in one go,
(1.) Removal of the order change pop on billboards.
(2.) Ability to easily work in multiple resolutions for performance.
(3.) Ability to do blend, lighten and darken.
(4.) Ability to do all this in one "pass".
The idea is similar in function to front first blending, in which RGB accumulates color and A accumulates coverage. Except now RGB accumulates weighted color, and A accumulates weight. Actual RGB color is weighted RGB divided by weight A (assuming an FP16 buffer).
However, drawing would be order-independent, billboards rendered by computing a weight factor of visibility across the billboard's surface (or perhaps for smaller billboards, constant weight factor across the billboard's surface). This visibility factor would take into account both the billboard's transparency and expected occlusion by other billboards and surfaces. RGB would be billboard color times weight. Also for lighten and darken, weight stored in A could be adjusted.
Clearly this is adding another problem to solve (visibility estimation for weight factor). However I believe the weight factor could easily be computed at a lower effective frame rate and at a much lower resolution, possibly leading to better performance overall.
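Since the idea is untested, the following is only a sketch of the accumulate-and-resolve part, with the weight treated as a given input. The `visibility_weight` helper (alpha times one minus an expected occlusion) is a hypothetical stand-in for the visibility estimate described above, not a worked-out solution.

```python
def visibility_weight(alpha, expected_occlusion):
    """Hypothetical weight: billboard alpha scaled by how visible it is expected to be."""
    return alpha * (1.0 - expected_occlusion)


def accumulate(fragments):
    """Additively accumulate (color * weight) into RGB and weight into A."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for (r, g, b), w in fragments:
        acc[0] += r * w
        acc[1] += g * w
        acc[2] += b * w
        acc[3] += w
    return acc


def resolve(acc):
    """Final color is weighted RGB divided by the accumulated weight."""
    r, g, b, a = acc
    if a == 0.0:
        return (0.0, 0.0, 0.0)   # nothing was drawn to this pixel
    return (r / a, g / a, b / a)


fragments = [
    ((1.0, 0.0, 0.0), visibility_weight(0.5, 0.0)),   # unoccluded red billboard
    ((0.0, 0.0, 1.0), visibility_weight(0.5, 0.5)),   # half occluded blue billboard
    ((0.0, 1.0, 0.0), visibility_weight(0.5, 0.5)),   # half occluded green billboard
]
front_to_back = resolve(accumulate(fragments))
back_to_front = resolve(accumulate(list(reversed(fragments))))
print(front_to_back, back_to_front)
```

Both the color and weight channels are plain sums, so the draw order drops out entirely and any pop would have to come from the weight estimate changing, not from the sort.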
20090703
Alaska : Whale Watching
Hopefully flickr won't mind the deep linking...
Alaska was incredible, not the typical vacation location, and IMO perfect for the independent traveler. My wife Kathryn and I spent over a week in the Glacier Bay area, and over a few blog posts I'm going to share a little of the experience.
The TAZ
I'd highly suggest the Icy Straits afternoon whale watching excursion of Tod Seben's Cross Sound Express, also known as The TAZ in Gustavus. Easy to see how the Humpbacks were nearly hunted to extinction. They are simple to find if one knows where they feed, grab a set of binoculars and head towards the sprays. Being up-wind of the spray provides the opportunity to experience "whale's breath". It is amazing that something so foul smelling could travel so far without dispersing (more).

Watching the Humpbacks is something else indeed, especially watching a 40 foot animal jump out of the water. Besides breaching, they spy-hop, tail-lob, slap, and wave. At one point in the afternoon, a pod of whales turned and closed the distance to the boat while we were stationary and watching (regulations specify a minimum distance to approach the whales, however they can choose to approach the boat on their own). Size becomes apparent when seeing something the size of the boat swim by.
Well worth it. At one point, Tod let Kathryn pilot the TAZ,

Bubble Net Feeding Viewed via Air Taxi
Towns and cities in the Glacier Bay area are connected via air and water only. Roads don't exist except locally. Our taxi pilot mentioned that he had seen a pod of Humpbacks from the air en route, and said he would attempt to find them again as we took the air taxi from Gustavus to Juneau. By some act of random chance, he did indeed find the pod, and they were bubble net feeding, which according to our pilot must be rare to see from the air because he had never seen that before. We got a first hand look via a bunch of descending tight banked loops.

Below a shot of the pod after feeding. BTW, that wasn't telephoto, all we had was a small point and shoot, which apparently has some issues with white point computation given light angle to the water...
20090701
iPhone 3GS GLBench
The GLBenchmark site provides a way to compare the GL performance of all the iPhone models including the 3GS! Results are very interesting indeed showing 3GS with 10x the triangle performance of the prior models, and 3x the texture fill rate.