Humus has an excellent blog which I need to read more often. In this post he makes a great point about compute shaders, "And in practice you'll probably see the number of local registers needed go down since your previously duplicated registers can now be shared with some care from the developer. As a result, the register pressure would go down, potentially increasing performance."
This is right in line with castano's comment, that deferred shading should see great benefit from CUDA (DX11 CS and OpenCL as well). Something like combinding multiple lights (the Naughty Dog method) will see a large amount of duplicate state between pixels (or samples) which compute will have in shared registers, reducing register pressure for more complex shaders.
20090529
Cinematic Trailers
Early 2009 E3 coverage made me think of this awesome Cinematic Trailer (below). Why don't they just make a movie? CG this good creates impossibly high expectations for in game graphics. Oh and this CFD stuff is exactly why we need 1000 Watt GPUs (I'm being serious here).
Per Cluster Shared Shared-Memory
More GT300 Speculation
I've been rather dense in that I didn't bother to fully read the older GT300 rumors. I just saw the words "MIMD" and "scratch-cache" and immediately got distracted thinking about hardware support for fine grain DWF, which I don't think we will see any time soon!
However, now things are starting to make more sense. MIMD really means Multiple-Program-Multiple-Data (MPMD). Larrabee also would be MPMD under those terms. Also Theo referred to G80's shared memory as "scratch-cache" prior in something I had never seen.
Theo also talks about better data sharing via shared memory between cores in a given cluster. If I was to read between the lines, this says to me that cores (the 8-wide SIMD multiprocessors in GT200 which have 16KB of individual shared memory) now share a per-cluster shared memory. GT300 rumor numbers are 512 "cores" as in threads, and 16 clusters of 32 threads. If the clusters had a similar 8-wide hot clocked SIMD unit with single instruction issue for pair of half-warps, that would result in 4 multiprocessors/cluster (compared to the 3/cluster in GT200) sharing one much larger set of shared memory (shared memory is in fact actually a large banked shared-register file interfaced through a special crossbar). Working with the 32KB maximum shared memory per block in DX11, that would be a 4*32KB or 128KB shared memory block per cluster in GT300 (banking and porting of this shared register file would be a mess?).
So why a cluster shared shared-memory when CUDA and DX11 CS and OpenCL don't allow sharing of shared memory outside of the 32KB maximum per block?
DX11 has a lot more varied pipeline stages. Data needs to be passed (or shared) between the stages. In the tessellation stages, HS outputs need to be reused by a possible huge number of DS invocations. This in combination with Append/Consume, got me thinking in a previous post about general purpose queuing (GRAMPS style). Now with a shared-memory shared between cores in a cluster, one wouldn't need to queue, data doesn't get transferred! Going back to the previous example, HS runs, output ends up in shared memory, TS amplification creates huge amounts of DS threads which get packed into warps and run on the cores in the cluster. Each core can read the original HS outputs right from shared memory (since shared-memory is shared per cluster). Also perhaps cluster shared shared-memory could be used to share an Append/Consume buffer between many cores in the same cluster very efficiently.
As always I could be getting warmer here, or could just be dead cold wrong! Hopefully if I have any readers who already know, ie NVidia employees or NDAs, you can get a chuckle out of the extents some people will go to attempt to start planning for your future hardware before it is released! Unfortunately this is all in my free time, for personal projects which I don't have time to finish, because the day job is all about the order of magnitude slower and (in comparison to DX11) totally outdated console hardware... anyway,
With cluster shared shared-memory there could be some remarkable similarities (and very important differences) between NVidia's GT300 and Larrabee. Keep in mind that with NVidia's 8/9/200 series arch, scaler operations can be done by predicating out all but one thread in a warp (or something more efficient like MOG). So both NVidia and Larrabee offer similar scaler abilities (just LRB has a separate scaler register file). With warp sized blocks, in combination with MPMD and register indexable branching, GT300 could be quite "general purpose" indeed. CUDA 3.0 could be huge.
Also I wonder how exactly NVidia will handle shared memory allocation. Regardless if cluster shared shared-memory is true, (I think) MPMD most certainly is and likely blocks from multiple programs will not have matching amounts of shared memory per block. With Larrabee the cache would efficiently handle variable block sizes, the driver could just over allocate all the time.
Another Related Crazy Idea
Crazy in that nothing I've seen suggests that this would ever be the case for GT300.
How about one slow per cluster scaler unit pared with multiple SIMD units sharing the same shared-memory? Slow in that the scaler unit accesses shared-memory when the SIMD units have idle shared-memory access cycles. Global memory accesses make less sense here. Seems like with Larrabee you eat a huge loss in vector capacity when going "general purpose" with scaler x86 code (likewise with going scaler on the GT200). IMO scaler isn't needed much and shouldn't be used very often, but as GPUs get more CPU function complete, I wonder if a small scaler unit might get area efficient?
BTW, I'm NOT talking about having a separate program stream for the puny scaler ALU, but rather that this scaler ALU gets issued instructions from the same program which runs on the vector SIMD units, just when the instruction scheduler detects that all but one lane of a warp is predicated out, that instruction gets sent to the puny scaler ALU instead!
Perhaps this idea wouldn't make sense with regards to messy register porting!
I've been rather dense in that I didn't bother to fully read the older GT300 rumors. I just saw the words "MIMD" and "scratch-cache" and immediately got distracted thinking about hardware support for fine grain DWF, which I don't think we will see any time soon!
However, now things are starting to make more sense. MIMD really means Multiple-Program-Multiple-Data (MPMD). Larrabee also would be MPMD under those terms. Also Theo referred to G80's shared memory as "scratch-cache" prior in something I had never seen.
Theo also talks about better data sharing via shared memory between cores in a given cluster. If I was to read between the lines, this says to me that cores (the 8-wide SIMD multiprocessors in GT200 which have 16KB of individual shared memory) now share a per-cluster shared memory. GT300 rumor numbers are 512 "cores" as in threads, and 16 clusters of 32 threads. If the clusters had a similar 8-wide hot clocked SIMD unit with single instruction issue for pair of half-warps, that would result in 4 multiprocessors/cluster (compared to the 3/cluster in GT200) sharing one much larger set of shared memory (shared memory is in fact actually a large banked shared-register file interfaced through a special crossbar). Working with the 32KB maximum shared memory per block in DX11, that would be a 4*32KB or 128KB shared memory block per cluster in GT300 (banking and porting of this shared register file would be a mess?).
So why a cluster shared shared-memory when CUDA and DX11 CS and OpenCL don't allow sharing of shared memory outside of the 32KB maximum per block?
DX11 has a lot more varied pipeline stages. Data needs to be passed (or shared) between the stages. In the tessellation stages, HS outputs need to be reused by a possible huge number of DS invocations. This in combination with Append/Consume, got me thinking in a previous post about general purpose queuing (GRAMPS style). Now with a shared-memory shared between cores in a cluster, one wouldn't need to queue, data doesn't get transferred! Going back to the previous example, HS runs, output ends up in shared memory, TS amplification creates huge amounts of DS threads which get packed into warps and run on the cores in the cluster. Each core can read the original HS outputs right from shared memory (since shared-memory is shared per cluster). Also perhaps cluster shared shared-memory could be used to share an Append/Consume buffer between many cores in the same cluster very efficiently.
As always I could be getting warmer here, or could just be dead cold wrong! Hopefully if I have any readers who already know, ie NVidia employees or NDAs, you can get a chuckle out of the extents some people will go to attempt to start planning for your future hardware before it is released! Unfortunately this is all in my free time, for personal projects which I don't have time to finish, because the day job is all about the order of magnitude slower and (in comparison to DX11) totally outdated console hardware... anyway,
With cluster shared shared-memory there could be some remarkable similarities (and very important differences) between NVidia's GT300 and Larrabee. Keep in mind that with NVidia's 8/9/200 series arch, scaler operations can be done by predicating out all but one thread in a warp (or something more efficient like MOG). So both NVidia and Larrabee offer similar scaler abilities (just LRB has a separate scaler register file). With warp sized blocks, in combination with MPMD and register indexable branching, GT300 could be quite "general purpose" indeed. CUDA 3.0 could be huge.
Also I wonder how exactly NVidia will handle shared memory allocation. Regardless if cluster shared shared-memory is true, (I think) MPMD most certainly is and likely blocks from multiple programs will not have matching amounts of shared memory per block. With Larrabee the cache would efficiently handle variable block sizes, the driver could just over allocate all the time.
Another Related Crazy Idea
Crazy in that nothing I've seen suggests that this would ever be the case for GT300.
How about one slow per cluster scaler unit pared with multiple SIMD units sharing the same shared-memory? Slow in that the scaler unit accesses shared-memory when the SIMD units have idle shared-memory access cycles. Global memory accesses make less sense here. Seems like with Larrabee you eat a huge loss in vector capacity when going "general purpose" with scaler x86 code (likewise with going scaler on the GT200). IMO scaler isn't needed much and shouldn't be used very often, but as GPUs get more CPU function complete, I wonder if a small scaler unit might get area efficient?
BTW, I'm NOT talking about having a separate program stream for the puny scaler ALU, but rather that this scaler ALU gets issued instructions from the same program which runs on the vector SIMD units, just when the instruction scheduler detects that all but one lane of a warp is predicated out, that instruction gets sent to the puny scaler ALU instead!
Perhaps this idea wouldn't make sense with regards to messy register porting!
DX11 Generation Atomic Operations
Recap, my speculation is that NVidia/ATI global atomic operations will be done at the memory controllers (MCs) and not in the unified shader cores.
Humus on Beyond3D confirmed that DX11 does NOT have floating point atomics in the standard, integer only (and that atomic float doesn't make sense clearly because of the ordering issues). Atomic operations are (Add, Min, Max, Or, Xor, CompareExchange, Exchange, and the new CompareStore). If any of you know how InterlockedCompareStore() works, please leave a comment below!
Note all of those integer atomics except Add, Min, and Max work as is on floating point numbers, if they are passed in asInteger(). However, Min and Max work as is on asInteger() floats as long as floats are insured to all be positive.
I'm thinking that a nice side effect of not having floating point atomics in the standard is to insure a very simple global atomic ALU in the MCs.
The following ATI DX11 presentation states that global atomics which require the return value will have higher latency than those that don't. Clearly if the atomic ALU operation was done in the shader core then the return value (which is always the value fetched prior to the atomic operation) would be known with no more latency.
As per Larrabee, someone mentioned on the B3D Forums that hyperthread to hyperthread atomic communication utilized CMPXCHG without the LOCK prefix (because of in order cores), and for more complex stuff the standard x86 LOCK instruction prefix utilizing core to core cache line coherency. In my eyes one of the defining features of Intel's Core i7 arch was the per core L2s backed by a shared L3 in which the cores didn't need to snoop the other L2s for cache coherency. I think they got the L3 hit to 40 cycles? Not sure what the shared-L2 hit will be in Larrabee, but perhaps it will be a bit faster than my previous expectations. Also not sure what the cache coherency protocol will be on Larrabee, and I'm more rusty than I should be one stuff like MESI+F (forward state), but it seems to me that global atomics on Larrabee would be quite fast on cache lines marked as Exclusive (think "shared memory" atomics). Highly contended cases (like global shared queues) I'm guessing bad as current PC's, but less of a hit because LRB cores are in-order (less expensive memory barrier)?
What about DX11 Append/Consume?
Clearly this is software via with Larrabee (mix of atomic operations and COMPRESS/EXPAND special purpose opcodes). Who knows with ATI and NVidia. NVidia has special inc/dec atomics in GT200 which could be used for software ring buffers (those atomics have an operand which sets the wrap around). I wonder if either ATI or NVidia provides similar Larrabee like instructions to get around needing to manually scan (assuming Append/Consume queues are done in software).
My biggest question is if one can both Append and Consume from the same buffer in one shader? If you know, please post a comment!!!!
I can think of lots of usages of Append/Consume in the same shader. Such as reusing a pool of objects and "back-door dynamic warp formation (DWF)" (ie dynamic re-grouping for efficient SIMD computation). One concrete example would be doing deferred shading where a kernel dynamically groups samples to have full SIMD utilization when shading with a MSAA G-buffer. All really depends on how much overhead is involved!
Humus on Beyond3D confirmed that DX11 does NOT have floating point atomics in the standard, integer only (and that atomic float doesn't make sense clearly because of the ordering issues). Atomic operations are (Add, Min, Max, Or, Xor, CompareExchange, Exchange, and the new CompareStore). If any of you know how InterlockedCompareStore() works, please leave a comment below!
Note all of those integer atomics except Add, Min, and Max work as is on floating point numbers, if they are passed in asInteger(). However, Min and Max work as is on asInteger() floats as long as floats are insured to all be positive.
I'm thinking that a nice side effect of not having floating point atomics in the standard is to insure a very simple global atomic ALU in the MCs.
The following ATI DX11 presentation states that global atomics which require the return value will have higher latency than those that don't. Clearly if the atomic ALU operation was done in the shader core then the return value (which is always the value fetched prior to the atomic operation) would be known with no more latency.
As per Larrabee, someone mentioned on the B3D Forums that hyperthread to hyperthread atomic communication utilized CMPXCHG without the LOCK prefix (because of in order cores), and for more complex stuff the standard x86 LOCK instruction prefix utilizing core to core cache line coherency. In my eyes one of the defining features of Intel's Core i7 arch was the per core L2s backed by a shared L3 in which the cores didn't need to snoop the other L2s for cache coherency. I think they got the L3 hit to 40 cycles? Not sure what the shared-L2 hit will be in Larrabee, but perhaps it will be a bit faster than my previous expectations. Also not sure what the cache coherency protocol will be on Larrabee, and I'm more rusty than I should be one stuff like MESI+F (forward state), but it seems to me that global atomics on Larrabee would be quite fast on cache lines marked as Exclusive (think "shared memory" atomics). Highly contended cases (like global shared queues) I'm guessing bad as current PC's, but less of a hit because LRB cores are in-order (less expensive memory barrier)?
What about DX11 Append/Consume?
Clearly this is software via with Larrabee (mix of atomic operations and COMPRESS/EXPAND special purpose opcodes). Who knows with ATI and NVidia. NVidia has special inc/dec atomics in GT200 which could be used for software ring buffers (those atomics have an operand which sets the wrap around). I wonder if either ATI or NVidia provides similar Larrabee like instructions to get around needing to manually scan (assuming Append/Consume queues are done in software).
My biggest question is if one can both Append and Consume from the same buffer in one shader? If you know, please post a comment!!!!
I can think of lots of usages of Append/Consume in the same shader. Such as reusing a pool of objects and "back-door dynamic warp formation (DWF)" (ie dynamic re-grouping for efficient SIMD computation). One concrete example would be doing deferred shading where a kernel dynamically groups samples to have full SIMD utilization when shading with a MSAA G-buffer. All really depends on how much overhead is involved!
Bad Industry Humor
Really you should probably avoid this post, unless you have a very dry sense of humor.
It is interesting that a majority of the (small) number of people I (think) know well doing game development (I believe) would rather be programming on the PC with its newer GPUs, but are stuck with the consoles simply because piracy in the process of killing the PC AAA (high budget) non-MMO game market. Like all energy conserving systems, this is causing all the PC guys (thats Epic, Id, etc) to flood the console market, slowly killing off wacky special purpose non-PC like console hardware (think about who prefers the 360 to the PS3) because honestly they get much better results on hardware which they can more easily port to from their PC development platforms.
BTW, PC ports of console games don't even get close to what is possible on the newer GPUs. The console and PC guys don't say this because the lot of them (including me) are still trying to fully grasp what is possible. This gets even more extreme with DX11. Seems as if DX11 is going to enter this PC bubble seemingly without any massively awesome DX11 only launch titles to help sell the crap out of new DX11 GPUs. Where is the buzz on a DX11 title? Unfortunately most of the best people who could be making a no-compromise DX11 title are busy making a living.
I think Microsoft is letting the mighty PC slide! With the piracy thing out of control in the minds of those who would fund development (backed with hard numbers from declining sales), the high upgrade incentive applications (games being big on that list) don't get built. People don't upgrade, therefore there isn't a large enough market with newer hardware/OS (like DX10 Vista requirement) to enable developers to make use the hardware. It looks like a rather nasty cycle to me. Looks like a cycle which is bound to be having a negative effect on their bottom line.
Microsoft, while remarkably good at destroying the PC, the 360 is like a shining star. PIX kicks ass! Yes I still like the PS3 better, but man you did a good job all around with the 360.
Back to the ever increasing hassle of using a PC. At 31, as someone who programmed on PCs as early as the 8086, I guess I'm already a dinosaur. To me needing to have a crippling mix of virus protecting, backup, defender, firewall, constant updates and reboots, etc, running in the background 100% of the time is just insane. For young kids, it's just the norm and that is really sad.
And as a dinosaur, what is with this insanity of code/complexity bloat? What is the use of Moore's Law when bloat growth is faster than performance growth? Bloat growth is more universal than taxes. The triad of Microsoft, Apple, and the sorry Linux group have caught this disease.
Yes I said sorry Linux group, as an avid Linux user, I know how bad it is. About the only people doing it right are those doing low level stuff. Like the awesome NVidia drivers, or the Linux kernel people. Nearly everything API or GUI side is in a sad state of attempted Windows emulation. Ironically, Windows is now in a sorry state of Windows emulation also, and it is in fact much easier to install Linux (if you can find the right distribution). My awesome father-in-law is living proof of this, one day I went to visit and wow there was Linux on a computer. Then they got a Mac (and the rest is history).
For equal opportunity bashing this rant now moves to Apple. With a history of disregard for gaming, they sure were caught off guard with the iPhone app store, but have capitalized on it quite well. I know they were caught off guard because it has been weeks since I've been half way through the iPhone registered developer process and they don't have time to get back to me. Besides anyone who couldn't realize that developers might actually want to use source control and easily share a signed XCode project between multiple developers, isn't thinking about gaming.
However, Apple does understand a great many things quite well. Like forcing upgrades, I've got a love/hate relationship with that one. When they placed NVidia 8 series cards in the Mac line, I hated having to pay for a new one, but when OpenCL is released, I'll be mightly thankful that it won't be long before everyone else is forced to upgrade so the market exists for OpenCL applications. They also get the "keep everything easy for the user" thing. As in I can install/uninstall an application by dragging it to/from the applications folder. Guess that would be too complicated for a Windows user? And they really get the piracy thing, the iPhone store is proof of that. Another love/hate relationship, sure signing is a hassle, but having a market to sell which the developer can make money, others need to start taking notes and fast.
Also Apple you should take some notes and fast, games sell systems and good games make people want to upgrade. You've got GL3 and OpenCL on the way, and a very easy to use low level interface to get access to the hardware (OpenGL full screen in MacOSX is like a paragraph of code last I checked). Where is your draw dropping OpenGL3/OpenCL AAA launch title! MacOSX is awesome, you PC devs really are missing out, Unix/BSD is insanely more multi-thread and OS friendly (try using Sleep() in Windows). However PC devs see the Mac as an alien, need to break that. Having page-up/page-down and home/end default to move the view and not the cursor is like cutting off a finger. Changing that to the one and only true way, the PC way, is likely to expand your user base more than anything else you could do. Also there is a very strange correlation between the API level and the quality of documentation. High level, ie stuff game devs want to avoid, has high quality documentation, low level, ie stuff game devs eat for breakfast, has low quality documentation. If only you could see how good PS3 and 360 system docs are. Lastly, games need something beyond the lower level GPUs in a majority of the systems, well unless you Wii.
How about Sony? Wish you'd built something like a GPU shader system as an interface for the Cell SPUs, to make it dead simple to program for (OpenCL is here way too late). Could have tricked the PC devs into actually liking the system. Still, PS3 games are and will continue to look better than the 360 titles. Something to do with not having EDRAM to mess up deferred shading with shadows? I don't know, all I know is that I bought my PS3 because I was told by a bunch of developers that 90% of the power of the PS3 was left unused (something to do with SPUs), so I thought I could make up the 360/PS3 price difference in my electric bill over the 10 year console cycle.
And Intel? Please don't x86 SOC (system on chip) everything and put everybody out of business. I know I'd be doing the same thing if I could, but really competition is great for the market. As for Larrabee, it is the most beautiful and yet ugly design (x86). The vector ISA is really really good, not wild about the cache coherent memory architecture, but likely that doesn't matter, I know you are going to provide some great OpenCL drivers!
ATI? Got OpenGL3 drivers (check), got OpenCL drivers (soon check), got shared memory for DX11 (R800 soon check). What's not to love, except always thinking in quads playing with shader code.
NVidia, saving my favorite company for last. My grandpa told me that the Christians believe that god built man in his image ... no surprise then that DX11 CS5 and OpenCL look just like CUDA, except that Charlie of Inquirer tells all the sheeple that any company without a piece of the x86 pie will suffer a very mortal death. Not sure if I should trust the company that somehow manages to provide FreeBSD drivers when there is probably not a single FreeBSD OpenGL3 user on the planet, over the words of an internet tabloid. Ok to be fair when I porting Glide to FreeBSD from Linux back in the day only required a few changes, and you are cheating with a mostly unified driver architecture. Well actually this blatant violation of the Moore's "Complexity Bloat" Law might be a sign of impending doom.
NVidia, if only you had something like a HalfLife2 or Doom3 Era PC user to supply graphics cards to (you know the kind that actually purchased games, and bought the new GPUs match). Guess supporting a flagship PC game named "Crysis" was just too literal in the new piracy age. Don't even tell me that none of you thought about changing the logo to "NVidia : The Way it Was Ment to be Pirated".
Also what's with all this hardware support for tessellation. First I thought that this was just ATIs way of getting you back for WDDM2 and DX10.1, then I realized that you were in bed with the idea. Don't you know that developers hate triangles? Perhaps you've never had to write a scalable mesh simplifier. Err scratch that, guess you are already employing most of those guys.
NVidia you need to lengthen the stick that you are poking Microsoft with to get a Microsoft App Store on the Tegra Zune HD, or help Apple remove the MBX problem on the iPhone (16K tri/frame without usable alpha test is sad). Or better yet do something wicked awesome like an NVidia console with a SOC ARM + GT380 combo. Guess you'd have a problem with a few buyers getting confused when they felt compelled to try to plug it into a PCI-E slot in their PC, only to realize that it only had a network cable, this massive power adapter, USB2 controller inputs, and a HDMI output. Sure 99.99999% of developers would be clueless how to program for it, but anyone not doing physics, AI, game logic and scene traversal on the GPU with DX11 is an idiot (man hope I can do that). It's just like that Car-Mac character saying that anyone who couldn't just thread their code was an idiot.
Anyway I'm still waiting for the day when I plug my tiny PC daughter card into my massive GT480 computer. Note on that box the logo will say, "NVidia : The Way it Was Ment to be Done!".
Not that any of this matters, because according to Tim Sweeney, in 10-15 years the GPUs will be able to simulate photo realism, so you could just emulate the screen of your PC in engine.
Ok good night, err good morning now!
It is interesting that a majority of the (small) number of people I (think) know well doing game development (I believe) would rather be programming on the PC with its newer GPUs, but are stuck with the consoles simply because piracy in the process of killing the PC AAA (high budget) non-MMO game market. Like all energy conserving systems, this is causing all the PC guys (thats Epic, Id, etc) to flood the console market, slowly killing off wacky special purpose non-PC like console hardware (think about who prefers the 360 to the PS3) because honestly they get much better results on hardware which they can more easily port to from their PC development platforms.
BTW, PC ports of console games don't even get close to what is possible on the newer GPUs. The console and PC guys don't say this because the lot of them (including me) are still trying to fully grasp what is possible. This gets even more extreme with DX11. Seems as if DX11 is going to enter this PC bubble seemingly without any massively awesome DX11 only launch titles to help sell the crap out of new DX11 GPUs. Where is the buzz on a DX11 title? Unfortunately most of the best people who could be making a no-compromise DX11 title are busy making a living.
I think Microsoft is letting the mighty PC slide! With the piracy thing out of control in the minds of those who would fund development (backed with hard numbers from declining sales), the high upgrade incentive applications (games being big on that list) don't get built. People don't upgrade, therefore there isn't a large enough market with newer hardware/OS (like DX10 Vista requirement) to enable developers to make use the hardware. It looks like a rather nasty cycle to me. Looks like a cycle which is bound to be having a negative effect on their bottom line.
Microsoft, while remarkably good at destroying the PC, the 360 is like a shining star. PIX kicks ass! Yes I still like the PS3 better, but man you did a good job all around with the 360.
Back to the ever increasing hassle of using a PC. At 31, as someone who programmed on PCs as early as the 8086, I guess I'm already a dinosaur. To me needing to have a crippling mix of virus protecting, backup, defender, firewall, constant updates and reboots, etc, running in the background 100% of the time is just insane. For young kids, it's just the norm and that is really sad.
And as a dinosaur, what is with this insanity of code/complexity bloat? What is the use of Moore's Law when bloat growth is faster than performance growth? Bloat growth is more universal than taxes. The triad of Microsoft, Apple, and the sorry Linux group have caught this disease.
Yes I said sorry Linux group, as an avid Linux user, I know how bad it is. About the only people doing it right are those doing low level stuff. Like the awesome NVidia drivers, or the Linux kernel people. Nearly everything API or GUI side is in a sad state of attempted Windows emulation. Ironically, Windows is now in a sorry state of Windows emulation also, and it is in fact much easier to install Linux (if you can find the right distribution). My awesome father-in-law is living proof of this, one day I went to visit and wow there was Linux on a computer. Then they got a Mac (and the rest is history).
For equal opportunity bashing this rant now moves to Apple. With a history of disregard for gaming, they sure were caught off guard with the iPhone app store, but have capitalized on it quite well. I know they were caught off guard because it has been weeks since I've been half way through the iPhone registered developer process and they don't have time to get back to me. Besides anyone who couldn't realize that developers might actually want to use source control and easily share a signed XCode project between multiple developers, isn't thinking about gaming.
However, Apple does understand a great many things quite well. Like forcing upgrades, I've got a love/hate relationship with that one. When they placed NVidia 8 series cards in the Mac line, I hated having to pay for a new one, but when OpenCL is released, I'll be mightly thankful that it won't be long before everyone else is forced to upgrade so the market exists for OpenCL applications. They also get the "keep everything easy for the user" thing. As in I can install/uninstall an application by dragging it to/from the applications folder. Guess that would be too complicated for a Windows user? And they really get the piracy thing, the iPhone store is proof of that. Another love/hate relationship, sure signing is a hassle, but having a market to sell which the developer can make money, others need to start taking notes and fast.
Also Apple you should take some notes and fast, games sell systems and good games make people want to upgrade. You've got GL3 and OpenCL on the way, and a very easy to use low level interface to get access to the hardware (OpenGL full screen in MacOSX is like a paragraph of code last I checked). Where is your draw dropping OpenGL3/OpenCL AAA launch title! MacOSX is awesome, you PC devs really are missing out, Unix/BSD is insanely more multi-thread and OS friendly (try using Sleep() in Windows). However PC devs see the Mac as an alien, need to break that. Having page-up/page-down and home/end default to move the view and not the cursor is like cutting off a finger. Changing that to the one and only true way, the PC way, is likely to expand your user base more than anything else you could do. Also there is a very strange correlation between the API level and the quality of documentation. High level, ie stuff game devs want to avoid, has high quality documentation, low level, ie stuff game devs eat for breakfast, has low quality documentation. If only you could see how good PS3 and 360 system docs are. Lastly, games need something beyond the lower level GPUs in a majority of the systems, well unless you Wii.
How about Sony? Wish you'd built something like a GPU shader system as an interface for the Cell SPUs, to make it dead simple to program for (OpenCL is here way too late). Could have tricked the PC devs into actually liking the system. Still, PS3 games are and will continue to look better than the 360 titles. Something to do with not having EDRAM to mess up deferred shading with shadows? I don't know, all I know is that I bought my PS3 because I was told by a bunch of developers that 90% of the power of the PS3 was left unused (something to do with SPUs), so I thought I could make up the 360/PS3 price difference in my electric bill over the 10 year console cycle.
And Intel? Please don't x86 SOC (system on chip) everything and put everybody out of business. I know I'd be doing the same thing if I could, but really competition is great for the market. As for Larrabee, it is the most beautiful and yet ugly design (x86). The vector ISA is really really good, not wild about the cache coherent memory architecture, but likely that doesn't matter, I know you are going to provide some great OpenCL drivers!
ATI? Got OpenGL3 drivers (check), got OpenCL drivers (soon check), got shared memory for DX11 (R800 soon check). What's not to love, except always thinking in quads playing with shader code.
NVidia, saving my favorite company for last. My grandpa told me that the Christians believe that god built man in his image ... no surprise then that DX11 CS5 and OpenCL look just like CUDA, except that Charlie of Inquirer tells all the sheeple that any company without a piece of the x86 pie will suffer a very mortal death. Not sure if I should trust the company that somehow manages to provide FreeBSD drivers when there is probably not a single FreeBSD OpenGL3 user on the planet, over the words of an internet tabloid. Ok to be fair when I porting Glide to FreeBSD from Linux back in the day only required a few changes, and you are cheating with a mostly unified driver architecture. Well actually this blatant violation of the Moore's "Complexity Bloat" Law might be a sign of impending doom.
NVidia, if only you had something like a HalfLife2 or Doom3 Era PC user to supply graphics cards to (you know the kind that actually purchased games, and bought the new GPUs match). Guess supporting a flagship PC game named "Crysis" was just too literal in the new piracy age. Don't even tell me that none of you thought about changing the logo to "NVidia : The Way it Was Ment to be Pirated".
Also what's with all this hardware support for tessellation. First I thought that this was just ATIs way of getting you back for WDDM2 and DX10.1, then I realized that you were in bed with the idea. Don't you know that developers hate triangles? Perhaps you've never had to write a scalable mesh simplifier. Err scratch that, guess you are already employing most of those guys.
NVidia you need to lengthen the stick that you are poking Microsoft with to get a Microsoft App Store on the Tegra Zune HD, or help Apple remove the MBX problem on the iPhone (16K tri/frame without usable alpha test is sad). Or better yet do something wicked awesome like an NVidia console with a SOC ARM + GT380 combo. Guess you'd have a problem with a few buyers getting confused when they felt compelled to try to plug it into a PCI-E slot in their PC, only to realize that it only had a network cable, this massive power adapter, USB2 controller inputs, and a HDMI output. Sure 99.99999% of developers would be clueless how to program for it, but anyone not doing physics, AI, game logic and scene traversal on the GPU with DX11 is an idiot (man hope I can do that). It's just like that Car-Mac character saying that anyone who couldn't just thread their code was an idiot.
Anyway I'm still waiting for the day when I plug my tiny PC daughter card into my massive GT480 computer. Note on that box the logo will say, "NVidia : The Way it Was Ment to be Done!".
Not that any of this matters, because according to Tim Sweeney, in 10-15 years the GPUs will be able to simulate photo realism, so you could just emulate the screen of your PC in engine.
Ok good night, err good morning now!
20090527
GPU Cache Between MC and DRAM
Recap from previous post, shot down my own idea of general purpose hardware supported queuing in the form of direct core to core data flow. Looking for better options...
The core idea of the GPU in my eyes is latency hiding via massive hyperthreading. In terms of Larrabee is it latency hiding via a combination of 4 way hardware hyperthreading and N way software hyperthreading via fibers. In terms of GT200 it is variable N<=32 way hardware hyperthreading. NVidia suggests to shoot for 256 threads per block which would provide for 4 blocks running per core (syncing threads in a block stall an entire block). This is somewhat like one of Larrabee's hyperthreads stalling (having only 4 hyperthreads is more acceptable for Larrabee when the likely stall in the software hyperthreading is less do to the cache). So while the hardware is different between both vendors the end result could be similiar. This could be a very good thing for portability via OpenCL and DX11 CS.
Given that GPUs have multiple parallel memory controllers (or interfaces to DRAM) and assuming the case where NVidia sticks to the variable N<=32 hardware hyperthreading for latency hiding, I think a possible good option for writable cache design is to place the caches between the memory controller and DRAM interface, and also to do atomic operations in these individual MC+cache+DRAM units.
Since memory requests are divided by address before the MC, there is no need for coherency between the caches. Since memory requests post MC are serialized for DRAM, atomic operations could be done without stall of needing to route to/from the multiprocessors.
The advantage of this system would be (1.) lower DRAM bandwidth utilization, (2.) definitely much lower latency for global memory accesses when accesses have good locality, and (3.) no problems of cache coherency and atomic operations stalling the "CPUs". The latency hiding of the N<=32 way hyperthreading handles keeping the ALU units busy, very high throughput and parallel friendly design!
The smaller per multiprocessor L1 caches (texture, constant, instruction) could be read only by design and only coherent (via invalidation of full L1 cache) to the MC+cache+DRAM state at very coarse grain points (draw call transitions or compute kernel transitions). This matches up quite well to the DX11 CS and OpenCL standards.
Perhaps in theory the hardware could make use of this post MC caching to handle the on-chip queuing between the variable output DX11 pipeline stages and the new DX11 Append/Consume buffers?
Could this be a huge part of the "radically new GT300 architecture"? Will this be a part of the R800 architecture?
I hope so.
The core idea of the GPU in my eyes is latency hiding via massive hyperthreading. In terms of Larrabee is it latency hiding via a combination of 4 way hardware hyperthreading and N way software hyperthreading via fibers. In terms of GT200 it is variable N<=32 way hardware hyperthreading. NVidia suggests to shoot for 256 threads per block which would provide for 4 blocks running per core (syncing threads in a block stall an entire block). This is somewhat like one of Larrabee's hyperthreads stalling (having only 4 hyperthreads is more acceptable for Larrabee when the likely stall in the software hyperthreading is less do to the cache). So while the hardware is different between both vendors the end result could be similiar. This could be a very good thing for portability via OpenCL and DX11 CS.
Given that GPUs have multiple parallel memory controllers (or interfaces to DRAM) and assuming the case where NVidia sticks to the variable N<=32 hardware hyperthreading for latency hiding, I think a possible good option for writable cache design is to place the caches between the memory controller and DRAM interface, and also to do atomic operations in these individual MC+cache+DRAM units.
Since memory requests are divided by address before the MC, there is no need for coherency between the caches. Since memory requests post MC are serialized for DRAM, atomic operations could be done without stall of needing to route to/from the multiprocessors.
The advantage of this system would be (1.) lower DRAM bandwidth utilization, (2.) definitely much lower latency for global memory accesses when accesses have good locality, and (3.) no problems of cache coherency and atomic operations stalling the "CPUs". The latency hiding of the N<=32 way hyperthreading handles keeping the ALU units busy, very high throughput and parallel friendly design!
The smaller per multiprocessor L1 caches (texture, constant, instruction) could be read only by design and only coherent (via invalidation of full L1 cache) to the MC+cache+DRAM state at very coarse grain points (draw call transitions or compute kernel transitions). This matches up quite well to the DX11 CS and OpenCL standards.
Perhaps in theory the hardware could make use of this post MC caching to handle the on-chip queuing between the variable output DX11 pipeline stages and the new DX11 Append/Consume buffers?
Could this be a huge part of the "radically new GT300 architecture"? Will this be a part of the R800 architecture?
I hope so.
Taking Advantage of Core to Core Data Flow?
Is it practical on the GPU to avoid the round trip to DRAM and instead directly take advantage of shared caches or fast core to core communication?
MULTI-CORE CPU WITH A SHARED L2 CACHE - Directly take advantage of the shared L2 (or L3) cache. Core to core communication through the L2 cache. OS thread scheduling can work against the tight synchronization necessary to make this possible. Also have to be very smart when dealing with cache lines which are shared between cores. If possible, lines should be only shared in read but not in write.
CELL SPU TO SPU DMA - Directly take advantage of the much higher SPU to SPU DMA bandwidth compared to DRAM to SPU DMA bandwidth.
--VS--
GPU - Direct core to core data flow not available on current GPUs in a general purpose form. Should it be, and does it make sense to do so?
This post is attempting to shoot holes in my previous general purpose queuing in hardware idea post and/or similar ideas such as GRAMPS using direct core to core data flow. Note I'm NOT attempting to shoot holes in the idea of using a queuing based programming or pipeline model, because that would indeed be possible with or without fast low level support for core to core data flow.
Problems,
BUFFER SPACE TRADE-OFF - Clearly there is a trade off between the space available for core to core communication buffering vs holding data for computation. Space being either in cache, in local store, or area on the die (in case of fixed function trade off).
REQUIREMENT OF TIGHT SYNCHRONIZATION - The less tight the synchronization, the more buffer space required for core to core communication. Tight synchronization can easily work against the needs of massively parallel systems (depending on design).
OVERFLOW / STALL ISSUE - Core to core communication needs to deal with the problem of buffer overflow. Which means either buffering data to a larger storage like DRAM (ironic) or stalling the warp (or CPU hyperthread) attempting to send data to another core.
LOAD BALANCING / WORK DISTRIBUTION - Even the simple case of two kernels communicating through a queue is a problem when the run-time of both kernels is different or variable. The kernel with lower run-time needs to have other work to do when idle (such as running the first kernel to self feed).
Detail on the Work Distribution Problem
Work distribution requires running kernels of different types on the same core. An immediate problem with this is allocation of shared memory for incoming work groups of different types of kernels.
1. If incoming work groups could be issued and retired in order, then ring buffer allocation would work fine. This will not work efficiently if work group run time is variable. Probably not practical.
2. If all work-groups of different kernels used the same amount of shared memory, then page allocation would work. Could just divide up shared memory into the maximum shared memory context size per block for 2 or more kernels. Cost of under utilized shared memory.
3. If a core could be dedicated to one type of kernel, then shared memory allocation becomes easy (page allocation). Note one could dynamically change core to execute different kernels on the fly by waiting for the first page(s) of shared memory to open up enough for the next type of kernel (possible execution bubbles). Or worst case, waiting for the entire core to finish all jobs (full bubble).
4. Having set pinned area(s) of shared memory could be used to support persistent kernel(s) along with dynamic kernels. If pinned areas are not actively utilized then latency hiding efficiency will be reduced.
With Larrabee, this problem of shared memory allocation is in a different form. There is no dedicated shared memory, size isn't bound by the DX11 32KB limit (however performance will be bound by fitting in the cache). Still will want to allocate per hyperthread shared memory in parallel for performance (a 32 core LRB might get 128 pools to allocate from), and also likely to need a really fast allocator (don't want to eat the 200 or so cycles from something like dlmalloc for the per pool allocator).
Part Epic Fail?
Yes, shot enough holes in my idea that it now seems more special purpose than general purpose or that it seems too complicated to easily make use of.
Related Idea of Pairing ALU and MEM Heavy Kernels
The problems mentioned above also make this idea (distributing different ALU and MEM heavy kernels to cores for simultaneous interleaved execution to maximize utilization of core resources) more impractical.
This could has some important side effects for future algorithm design. Namely that it is good to insure kernels themselves have a good ALU/MEM balance. If a kernel has free ALU, it might be a good idea to manually re-distribute work (and re-engineer the program) to fill that free capacity! Keep in mind that now one can choose to manually pair two kernels into one kernel if they have the same issue domain (grid). I think early on someone was talking about this on the CUDA forums, but this is a special purpose solution.
Continued in the next blog post...
MULTI-CORE CPU WITH A SHARED L2 CACHE - Directly take advantage of the shared L2 (or L3) cache. Core to core communication through the L2 cache. OS thread scheduling can work against the tight synchronization necessary to make this possible. Also have to be very smart when dealing with cache lines which are shared between cores. If possible, lines should be only shared in read but not in write.
CELL SPU TO SPU DMA - Directly take advantage of the much higher SPU to SPU DMA bandwidth compared to DRAM to SPU DMA bandwidth.
--VS--
GPU - Direct core to core data flow not available on current GPUs in a general purpose form. Should it be, and does it make sense to do so?
This post is attempting to shoot holes in my previous general purpose queuing in hardware idea post and/or similar ideas such as GRAMPS using direct core to core data flow. Note I'm NOT attempting to shoot holes in the idea of using a queuing based programming or pipeline model, because that would indeed be possible with or without fast low level support for core to core data flow.
Problems,
BUFFER SPACE TRADE-OFF - Clearly there is a trade off between the space available for core to core communication buffering vs holding data for computation. Space being either in cache, in local store, or area on the die (in case of fixed function trade off).
REQUIREMENT OF TIGHT SYNCHRONIZATION - The less tight the synchronization, the more buffer space required for core to core communication. Tight synchronization can easily work against the needs of massively parallel systems (depending on design).
OVERFLOW / STALL ISSUE - Core to core communication needs to deal with the problem of buffer overflow. Which means either buffering data to a larger storage like DRAM (ironic) or stalling the warp (or CPU hyperthread) attempting to send data to another core.
LOAD BALANCING / WORK DISTRIBUTION - Even the simple case of two kernels communicating through a queue is a problem when the run-time of both kernels is different or variable. The kernel with lower run-time needs to have other work to do when idle (such as running the first kernel to self feed).
Detail on the Work Distribution Problem
Work distribution requires running kernels of different types on the same core. An immediate problem with this is allocation of shared memory for incoming work groups of different types of kernels.
1. If incoming work groups could be issued and retired in order, then ring buffer allocation would work fine. This will not work efficiently if work group run time is variable. Probably not practical.
2. If all work-groups of different kernels used the same amount of shared memory, then page allocation would work. Could just divide up shared memory into the maximum shared memory context size per block for 2 or more kernels. Cost of under utilized shared memory.
3. If a core could be dedicated to one type of kernel, then shared memory allocation becomes easy (page allocation). Note one could dynamically change core to execute different kernels on the fly by waiting for the first page(s) of shared memory to open up enough for the next type of kernel (possible execution bubbles). Or worst case, waiting for the entire core to finish all jobs (full bubble).
4. Having set pinned area(s) of shared memory could be used to support persistent kernel(s) along with dynamic kernels. If pinned areas are not actively utilized then latency hiding efficiency will be reduced.
With Larrabee, this problem of shared memory allocation is in a different form. There is no dedicated shared memory, size isn't bound by the DX11 32KB limit (however performance will be bound by fitting in the cache). Still will want to allocate per hyperthread shared memory in parallel for performance (a 32 core LRB might get 128 pools to allocate from), and also likely to need a really fast allocator (don't want to eat the 200 or so cycles from something like dlmalloc for the per pool allocator).
Part Epic Fail?
Yes, shot enough holes in my idea that it now seems more special purpose than general purpose or that it seems too complicated to easily make use of.
Related Idea of Pairing ALU and MEM Heavy Kernels
The problems mentioned above also make this idea (distributing different ALU and MEM heavy kernels to cores for simultaneous interleaved execution to maximize utilization of core resources) more impractical.
This could has some important side effects for future algorithm design. Namely that it is good to insure kernels themselves have a good ALU/MEM balance. If a kernel has free ALU, it might be a good idea to manually re-distribute work (and re-engineer the program) to fill that free capacity! Keep in mind that now one can choose to manually pair two kernels into one kernel if they have the same issue domain (grid). I think early on someone was talking about this on the CUDA forums, but this is a special purpose solution.
Continued in the next blog post...
Irony
Irony, on my drive to work this morning,
1. I saw a post office vehicle parked in the local liquor store.
2. On a commercial street which is frequently patrolled by cops giving out speeding tickets, a caravan of hippies added another vehicle to the group which has been illegally parked for over three weeks in front land zoned as public utility (they are living out of their parked vehicles).
3. Had to call the Illinois Toll Authority to contest an invalid $230 violation due to their faulty iPass system (my wife gave me the iPass as a present to avoid this BS ... we work in two states in jobs that are >150 miles apart, so we both do a lot of driving on the weekends).
1. I saw a post office vehicle parked in the local liquor store.
2. On a commercial street which is frequently patrolled by cops giving out speeding tickets, a caravan of hippies added another vehicle to the group which has been illegally parked for over three weeks in front land zoned as public utility (they are living out of their parked vehicles).
3. Had to call the Illinois Toll Authority to contest an invalid $230 violation due to their faulty iPass system (my wife gave me the iPass as a present to avoid this BS ... we work in two states in jobs that are >150 miles apart, so we both do a lot of driving on the weekends).
20090520
Project Trico
Bought the Shadow of the Colossus a few weeks ago because I never played through the game, will be on to ICO next. Really liked the SotC tech paper (here is a translated version), and was wondering when I'd see what they had in store for the PS3. Seems as if more Project Trico media is to be exposed at E3? Animation looks incredible!
More Giga Voxels Videos
Cyril Crassin has posted two more youtube videos, the first showing DoF, and the second showing softness variation in real-time shadowing. Both algorithms make use of the volume mip chain pre-filtering for efficient computation. Softer shadows and DoF areas with a larger circle of confusion run faster do to only needing a shorter traversal through the volume. One can imagine using this to quite an advantage when providing a shadowing light which has an increasingly soft shadow as a function of increasing distance from the light.
20090519
GPU REYES
Interactive REYES rendering for multiple GPUs, the RenderAnts Technical Report and video can be found on Kun Zhou's Website, as well as An Efficient GPU-based Approach for Interactive Global Illumination and other impressive papers...
20090514
Giga Voxels + Ray-traced Shadows
Always excited to see new progress on Cyril Crassin's Giga Voxels Project, here is a youtube video with ray-traced shadows,
Also looks like an image of instanced models here,
Also looks like an image of instanced models here,
20090513
Gaussian KD-Trees and CUDA Implementation
Gaussian KD-Trees for Fast High-Dimensional Filtering
Quite interesting paper (great results!). Their CUDA implementation solves the divergent and recursive tree traversal problem with a form of software "dynamic warp formation". Each block maintains a shared structure of arguments to pending function calls. When threads in the block are idle they pull from this shared structure. When a thread's traversal needs to diverge, it places the divergent path arguments onto the shared pending work structure and continues with one path.
Quite interesting paper (great results!). Their CUDA implementation solves the divergent and recursive tree traversal problem with a form of software "dynamic warp formation". Each block maintains a shared structure of arguments to pending function calls. When threads in the block are idle they pull from this shared structure. When a thread's traversal needs to diverge, it places the divergent path arguments onto the shared pending work structure and continues with one path.
PhyreEngine Deferred Lighting and Post Processing
This great presentation can be found on the SCEE Research and Development Presentations Page in 2009 Conferences section. Some highlights,
(1.) Many titles take 33%+ of GPU time in post processing. Post processing samples uncompressed textures (usually most of the screen) and can be either TEX cache bound, or possibly TEX (filtering/addressing) throughput bound in the case of heavy TEX cache reuse. Multiple full screen passes (instead of combining functionality into one pass) is a huge problem here...
Enter SPUs, the 2nd "GPU" found on the PS3 hardware,
(2.) SPUs have better branch granularity than the GPU. SPUs have flexibility at the expense of not having TEX hardware. Take advantage of flexibility to do conditional framebuffer processing at small tile granularity.
(3.) Deferred Lighting on SPUs. GPU does g-buffer creation, SPUs apply all lights hitting a tile in one pass (per tile optimizations for things like MSAA). Results are awesome, 3 shadowed lights + 100 point lights at 2x MSAA at 720p at 60fps+ using 3 SPUs for 11ms!!!
Really shows that a lot of life is left in the PS3 in terms of creative thinking and pushing the hardware!!!
(1.) Many titles take 33%+ of GPU time in post processing. Post processing samples uncompressed textures (usually most of the screen) and can be either TEX cache bound, or possibly TEX (filtering/addressing) throughput bound in the case of heavy TEX cache reuse. Multiple full screen passes (instead of combining functionality into one pass) is a huge problem here...
Enter SPUs, the 2nd "GPU" found on the PS3 hardware,
(2.) SPUs have better branch granularity than the GPU. SPUs have flexibility at the expense of not having TEX hardware. Take advantage of flexibility to do conditional framebuffer processing at small tile granularity.
(3.) Deferred Lighting on SPUs. GPU does g-buffer creation, SPUs apply all lights hitting a tile in one pass (per tile optimizations for things like MSAA). Results are awesome, 3 shadowed lights + 100 point lights at 2x MSAA at 720p at 60fps+ using 3 SPUs for 11ms!!!
Really shows that a lot of life is left in the PS3 in terms of creative thinking and pushing the hardware!!!
Larrabee Die Shot
For those who aren't Beyond3D Forum regulars, the forum thread Analyzing the Larrabee Die Shot is quite interesting. 32 cores, 8 texture units, 4 memory controller connections?

EDIT: More from a link in the forum, expected 2010 Q1 release date, 45nm process, up to 32 cores (so that's the high end chip above) for first release (possible 64 core 32nm later), no public mention of clocks yet...

EDIT: More from a link in the forum, expected 2010 Q1 release date, 45nm process, up to 32 cores (so that's the high end chip above) for first release (possible 64 core 32nm later), no public mention of clocks yet...
20090512
CUDA Compute 1.3 Global Atomics Profiled
Got the GT275 in yesterday, was very exciting to move up to a card which has what might be 8x the ALU capacity of what I normally have at home and at work (I use a GPU with similar performance to the PS3 and 360). First test I ran was my point based Atom engine, and sure enough only a 2x frame rate improvement. As I expected, likely the engine is fully triangle setup (for points) bound, and couldn't make use of the added bandwidth and ALU capacity. Hence the interest in CUDA/OpenCL/DX11 CS, I'm currently limited by the fixed function hardware on high-end cards.
Global Atomics
The Compute 1.3 vs 1.1 test (GT275 vs 8600 GTS). Read the previous blog post for details on the test. Quick refresher, I'm doing a predicated first lane only (of each half-warp) atomicAdd, or (volatile load/store pair). This is to simulate the typical queue head pointer update when appending or consuming data from a queue. I do a forced NON-unrolled loop of atomic operations in order to get better cycle counts. Code logically functions like this in the atomicAdd() case,
unsigned int offset = blockIdx.x * overlap;
...
value = atomicAdd(buffer + offset, value);
When overlap=0, which causes 100% address collisions, performance between the two cards was nearly identical. Performance when overlap varied between (16,8,4,2,1) was quite different, and drastically better for the compute 1.3 card, coalescing (edit: or increase in memory units capable of doing atomics) made a huge difference. I only have VERY ROUGH numbers, but I can see the following trends in the number of clock cycles per atomic or non-atomic overlapping operations in my test (between these two cards and using CUDA 2.1),
- GTX 275 global atomic operations about 3x faster than 8600 GTS.
- Non-atomic operations about 4x faster than 1.1.
- 8600 GTS atomics ranged from about 8-16x more expensive than load/store pair.
- 8600 GTS atomics got slower the more collisions.
- GTX 275 atomics ranged from about 14-5x more expensive than load/store pair.
- GTX 275 non-atomics got much faster the lower the number of collisions.
- GTX 275 atomics seemed relatively faster on a higher number of collisions.
Conclusions which seem safe to draw from the very rough results,
(1.) Still wise to provide parallel queues, instead of one global queue. Worst case (100% of all atomicAdds colliding to same address) was around 337 cycles on average which is around 84 instructions of latency.
(2.) Atomics are not free, but relatively very fast on the GTX 275. Sustained best case in my tests was around 29 cycles on average of latency for an atomicAdd (keep in mind only the first thread of each half-warp was doing the global atomicAdd). Which is about 7 to 8 instructions (4 cycles per instruction/warp). That is damn fast!
There is always the danger of the compiler optimizing out operations which are being profiled, and while I believe I managed to avoid that, I'd really like to hear from anyone else who has any results they have gathered from their own testing to compare/contrast to what I've seen...
Edit: Thanks Jawed from Beyond3D for pointing out a few things about this test!
Global Atomics
The Compute 1.3 vs 1.1 test (GT275 vs 8600 GTS). Read the previous blog post for details on the test. Quick refresher, I'm doing a predicated first lane only (of each half-warp) atomicAdd, or (volatile load/store pair). This is to simulate the typical queue head pointer update when appending or consuming data from a queue. I do a forced NON-unrolled loop of atomic operations in order to get better cycle counts. Code logically functions like this in the atomicAdd() case,
unsigned int offset = blockIdx.x * overlap;
...
value = atomicAdd(buffer + offset, value);
When overlap=0, which causes 100% address collisions, performance between the two cards was nearly identical. Performance when overlap varied between (16,8,4,2,1) was quite different, and drastically better for the compute 1.3 card, coalescing (edit: or increase in memory units capable of doing atomics) made a huge difference. I only have VERY ROUGH numbers, but I can see the following trends in the number of clock cycles per atomic or non-atomic overlapping operations in my test (between these two cards and using CUDA 2.1),
- GTX 275 global atomic operations about 3x faster than 8600 GTS.
- Non-atomic operations about 4x faster than 1.1.
- 8600 GTS atomics ranged from about 8-16x more expensive than load/store pair.
- 8600 GTS atomics got slower the more collisions.
- GTX 275 atomics ranged from about 14-5x more expensive than load/store pair.
- GTX 275 non-atomics got much faster the lower the number of collisions.
- GTX 275 atomics seemed relatively faster on a higher number of collisions.
Conclusions which seem safe to draw from the very rough results,
(1.) Still wise to provide parallel queues, instead of one global queue. Worst case (100% of all atomicAdds colliding to same address) was around 337 cycles on average which is around 84 instructions of latency.
(2.) Atomics are not free, but relatively very fast on the GTX 275. Sustained best case in my tests was around 29 cycles on average of latency for an atomicAdd (keep in mind only the first thread of each half-warp was doing the global atomicAdd). Which is about 7 to 8 instructions (4 cycles per instruction/warp). That is damn fast!
There is always the danger of the compiler optimizing out operations which are being profiled, and while I believe I managed to avoid that, I'd really like to hear from anyone else who has any results they have gathered from their own testing to compare/contrast to what I've seen...
Edit: Thanks Jawed from Beyond3D for pointing out a few things about this test!
20090511
GRAMPS
GRAMPS Paper
GRAMPS is similar in design to what my previous "CUDA 3.0 and GT300 and Future Predictions" post was suggesting. In GRAMPS shaders are guaranteed to be able to run to completion without blocking via queue aware scheduling. GRAMPS also provides thread stages which are statefull (unlike shader stages) for task parallel jobs. GRAMPS has queue sets which provide the functionality of parallel distribution of work into a new output domain. Their results show peak queue sizes in the range of 1.5MB or so for GPU simulator.
Perhaps any hope of this in hardware is jumping ahead a few years...
GRAMPS is similar in design to what my previous "CUDA 3.0 and GT300 and Future Predictions" post was suggesting. In GRAMPS shaders are guaranteed to be able to run to completion without blocking via queue aware scheduling. GRAMPS also provides thread stages which are statefull (unlike shader stages) for task parallel jobs. GRAMPS has queue sets which provide the functionality of parallel distribution of work into a new output domain. Their results show peak queue sizes in the range of 1.5MB or so for GPU simulator.
Perhaps any hope of this in hardware is jumping ahead a few years...
Uncharted 2
Uncharted 2 videos seem to be sporting SSAO in shadows only, and lightmaps? Still want them to add one very low tech improvement however, would be really nice if they modified all the "gray" particles to instead pickup the global ambient tint (ie more blue at night, etc) to make them fit in better. Can grab the multi-player vid from Gamersyde.
20090508
CUDA 3.0 and GT300 and Future Predictions
As always I'm trying to wrap my mind around the direction the industry is heading. The irony of all this is that people much smarter than I already have a good idea of where it should go (in fact now developing future for years ahead), but are stuck waiting on people like me and you to get with the program and begin to use the hardware. They also can only tell us small hints of where they are going. What is perhaps even more ironic is that game developers (that's you and me) are probably the most thick headed and backwards minded. We have enormous legacy code bases, set in stone development pipelines and tools, and a relatively fixed way of thinking about solving the problem of game development with a very serial CPU mindset. Big changes take lots of time, and hardware is starting to get faster at a pace which is too fast for many to adapt to and push to the limits.
Q: What is next?
Q: What is general purpose mean for massively parallel processing?
Q: What as a developer, what do I want in the hardware?
A: General Purpose Hardware Queues and Job Distribution!
Much of the work in parallel processing is just grouping, moving data around, compacting, expanding, and other data routing problems. When you do all this using a CPU like coherent cache with a large vector processor, data routing becomes expensive. When you attempt to do all this routing manually with dedicated local memory and high throughput global memory, it is still expensive, just less expensive.
The real work happens after you are done grouping for good data locality, when you gather, compute (process/transform), and then transfer data to the next step.
A few blogs posts ago I talked about how GPUs don't have this middle tear memory system (meaning big L2, or core to core DMA). Instead GPUs have dedicated hardware for routing huge amounts of information in the fixed function pipeline. GPUs do dynamic grouping for efficient SIMD computation (think of packing vertexes which miss on the post transform cache, think about collecting groups of pixel quads for fragment shading). DX11 doubles the number of pipeline stages and adds direct support for queues in Append buffers, sure seems like a good time to generalize data routing.
To reactively preempt or proactively schedule?
The CPU style is to reactively preempt jobs when the job detects that it needs to stall, or the processor decides to limit a job's time slice. The problem with this model is that you have extra state which needs to be flushed to some slower memory to make room for a new job. Also timing and synchronization gets destroyed by non-deterministic behavior. To make matters worse, in many cases the next job starts with cold memory and tlb caches (caches are filled with data for the last task not the new job). The CPU way to start new jobs is to do a very expensive kernel call to start a new thread (or do less expensive user-space threads/fibers).
The GPU way is to setup a pipeline, and flow control inlet rate to load balance. For example verts get issued slowly when heavy pixel shading. The GPU way to start jobs is to schedule groups of independent jobs between dependent state changes from a master control stream (push buffer / command buffer). Conditional rendering and the DrawAuto draw call provide a starting point to on-the-fly modify the job list. DX11 extends this by enabling Compute Grid dimensions to come from device memory.
Now mix GPU style proactive scheduling with general purpose data routing.
I'm suggesting GPU state now includes a new first class member, the queue. Queues are associated with kernels. The hardware proactively manages jobs by reading from the queues, setting up thread blocks, routing data to an available core, and then starting execution of the thread block using the kernel associated with the queue. Queues are in hardware. No cold cache, data is always hot and ready. Hardware would also provide some fixed function data routing based on a new domain coordinate associated with entries in the queue.
This system would hardware accelerate the core basis of a majority of parallel programs: data compaction, expansion, routing and regrouping. Kernels which end up with forms of divergence (such as branch divergence or data locality divergence) could output the threads to new queues with a new domain coordinate to insure a new good grouping for continued computation.
What About GT300 and CUDA 3.0
Perhaps I am still thinking ahead by a few GPU generations, but I'm hoping that this is what NVidia is up to, that GT300's giant leap forward is indeed something similar to what I am describing in this blog post, and that CUDA 3.0 will provide support for this feature set.
Hopefully if you read my blog you realize that I have no fear of making a guess and expanding on an idea to some far out conclusions. So I could be way off here!
Q: What is next?
Q: What is general purpose mean for massively parallel processing?
Q: What as a developer, what do I want in the hardware?
A: General Purpose Hardware Queues and Job Distribution!
Much of the work in parallel processing is just grouping, moving data around, compacting, expanding, and other data routing problems. When you do all this using a CPU like coherent cache with a large vector processor, data routing becomes expensive. When you attempt to do all this routing manually with dedicated local memory and high throughput global memory, it is still expensive, just less expensive.
The real work happens after you are done grouping for good data locality, when you gather, compute (process/transform), and then transfer data to the next step.
A few blogs posts ago I talked about how GPUs don't have this middle tear memory system (meaning big L2, or core to core DMA). Instead GPUs have dedicated hardware for routing huge amounts of information in the fixed function pipeline. GPUs do dynamic grouping for efficient SIMD computation (think of packing vertexes which miss on the post transform cache, think about collecting groups of pixel quads for fragment shading). DX11 doubles the number of pipeline stages and adds direct support for queues in Append buffers, sure seems like a good time to generalize data routing.
To reactively preempt or proactively schedule?
The CPU style is to reactively preempt jobs when the job detects that it needs to stall, or the processor decides to limit a job's time slice. The problem with this model is that you have extra state which needs to be flushed to some slower memory to make room for a new job. Also timing and synchronization gets destroyed by non-deterministic behavior. To make matters worse, in many cases the next job starts with cold memory and tlb caches (caches are filled with data for the last task not the new job). The CPU way to start new jobs is to do a very expensive kernel call to start a new thread (or do less expensive user-space threads/fibers).
The GPU way is to setup a pipeline, and flow control inlet rate to load balance. For example verts get issued slowly when heavy pixel shading. The GPU way to start jobs is to schedule groups of independent jobs between dependent state changes from a master control stream (push buffer / command buffer). Conditional rendering and the DrawAuto draw call provide a starting point to on-the-fly modify the job list. DX11 extends this by enabling Compute Grid dimensions to come from device memory.
Now mix GPU style proactive scheduling with general purpose data routing.
I'm suggesting GPU state now includes a new first class member, the queue. Queues are associated with kernels. The hardware proactively manages jobs by reading from the queues, setting up thread blocks, routing data to an available core, and then starting execution of the thread block using the kernel associated with the queue. Queues are in hardware. No cold cache, data is always hot and ready. Hardware would also provide some fixed function data routing based on a new domain coordinate associated with entries in the queue.
This system would hardware accelerate the core basis of a majority of parallel programs: data compaction, expansion, routing and regrouping. Kernels which end up with forms of divergence (such as branch divergence or data locality divergence) could output the threads to new queues with a new domain coordinate to insure a new good grouping for continued computation.
What About GT300 and CUDA 3.0
Perhaps I am still thinking ahead by a few GPU generations, but I'm hoping that this is what NVidia is up to, that GT300's giant leap forward is indeed something similar to what I am describing in this blog post, and that CUDA 3.0 will provide support for this feature set.
Hopefully if you read my blog you realize that I have no fear of making a guess and expanding on an idea to some far out conclusions. So I could be way off here!
CUDA Compute 1.1 Global Atomics Profiled
The GT275 is to arrive on Monday, so I decided to do a few tests on Global Atomic performance with the soon to be retired Compute 1.1 8600 GTS.
TEST METHOD --- 256 threads/block. I predicated so that all the CUDA code ran in the first thread of each halfwarp, and all other threads of the halfwarp were unused. The kernel was clocked using clock() which returns a multiprocessor clock cycle counter (2 cycles/halfwarp, 4 cycles/warp, and captures the thread's time both active and asleep). So to get total time, I took the min and max of all the clock() times of all the threads in the grid (min/max computed CPU side).
TEST --- Did a repeated value = atomicAdd(address, value) (128 times, forced CUDA not to unroll the loop). Tested performance with various amounts of address segment collisions as a function of blockIdx. Tested interleaving ALU work. Tested doing a non-atomic load and volatile store instead of the atomic add (to get a baseline on performance).
The results with very rough numbers,
(1.) 11:1 to 12:1 ALU to MEM ratio (ALU thread/clk to MEM 32-bit word/clk).
(2.) Non-atomic paired global load/store performance varied based on number of segment (segment = aligned 64-bytes) collisions between running threads. When each block accessed a separate segment, performance was as (1.) above (note all non-predicated threads were accessing the same address per block). When all threads of the entire grid always accessed the same segment, performance cut in half. Suggests to me that this card has 2 memory controllers (hence the 50% reduction in performance). It will be interesting to see how this test scales as I move up to a card with almost an order of magnitude of more cores.
(3.) Global atomic operations had 1/8 the throughput of a paired global load/store in the case of blocks accessing separate segments. As the number of segment collisions increased to the worst case of all threads accessing the same segment, throughput dropped to 1/16 of the same test done without atomic operations.
Note, even though the throughput of atomic operations compared to non-atomic operations was large, the multiprocessors were able to keep doing ALU work in the background. I verified this by adding ALU operations until I saw an increase in total cycle time.
There are a few important things I haven't tested yet, and will likely wait until the GT275 arrives to do so,
(1.) Didn't test how much atomic operations effects the total capacity to utilize global memory bandwidth. So does the reduced throughput of atomic operations also result in a reduced throughput of memory bandwidth if one is doing multiple atomic and non-atomic global accesses? Do atomic operations reduce (and if so, by how much) the quantity of global memory accesses which can be issued? Or are atomic operations bottle-necked by a different system.
(2.) Didn't test various block sizes, effects of adding in the other threads of the half-warp.
CUDA Compute 1.1 Global Atomics vs OpenGL Point Drawing
I dug up some of my old GL point drawing stats for the same card. Unfortunately this isn't going to be apples to apples because I was likely ALU bound in the point drawing in the vertex shader! And a lot of my points (maybe 7/8 of them) ended up colliding to the same output pixel. Also I don't have the transfer time of point drawing with the same bits per operation.
- At 128-bit/point -> ~287 M points/sec (GL).
- At 256-bit/point -> ~170 M points/sec (GL).
- At 512-bit/segment -> ~64 M segments/sec (CUDA one atomic operation).
- At 512-bit/segment -> ~486 M segments/sec (CUDA non-atomic read/write).
Shows just how fast the fixed function pipeline is. With CUDA Compute 1.1 on this 8600 GTS, there is no way for me to get close to the fine granularity global atomic scatter performance (ie simulate a Z buffer) I get with the "doing all the work in the vertex shader" point scatter approach. However, for non-atomic scatter, I think Compute 1.1 would easily win.
So for what I was doing, GPGPU techniques proved faster than Compute 1.1.
Compute 1.2
Will be testing this next week to get some answers. Going to do the same global tests to get a baseline to get an idea of the following,
(1.) Should I work with separate global queues or will one queue be fast enough. How do colliding global atomics perform with many more cores and the new coalescing memory unit?
(2.) Should I do cross warp scan/scatter for the queues, or should I do cross block scan/scatter with added required __sync_threads() for the queues. This will depend on just how fast atomics are.
My current plan is to queue points per tile and fast shared memory scatter in the tile if I need the same fine granularity point drawing I got with OpenGL. Luckily my input data stream will have good tile locality so this should minimize global atomics (if necessary).
Compute 1.1 Global Atomics vs CPU Atomics
One can view the 8600 GTS like a CPU as a 1460 MHz 4 core device (I think mine is slightly overclocked). Best case atomics in my tests had somewhere around a 90 cycle throughput. That number is close to numbers I've seen documented for multi-core PCs (unfortunately I don't know the conditions which resulted in those cycle counts). However the big, and huge, difference here is that the GPU keeps on crunching while the atomic operation is serviced in the background, and the CPU eats the expense.
Future of GPU Global Atomics
My current view of global atomics is mostly a coarse (perhaps only upwards of around a few hundred million op/sec max) and UNORDERED way for global communication. Actually one will get ordered sections in the unordered communication when doing things like queues. This is in contrast to the sparse view that I have of atomic usage on CPUs.
Seems like most usages of global atomics are for data routing and things like stream compaction/expansion. The exception to this rule would be crazy stuff like fully random data transfer via global scatter that doesn't make sense to try and coarse bin / fine scatter using shared memory.
A few days ago on a Beyond 3D post I was thinking I wanted to see the following in hardware in the case where each memory controller had a bank of full 32-bit (or 64-bit) words,
(1.) C computing multiprocessors transfer global memory transactions (both atomic and non-atomic) to the GPU's internal routing (ring, torus, crossbar or whatever).
(2.) These transactions in the atomic case include the atomic operands (so naturally global atomic throughput will be 2x or 3x lower because of the operands taking up bandwidth on the internal network.
(3.) Memory transactions get distributed based on address to the proper one of the M memory controllers.
(4.) The memory controller does coalescing/re-ordering and also does the atomic operation via dedicated parallel ALU unit.
Performance in this case would likely be a function of,
(1.) C to M collisions. Clearly if all C multiprocessors do an atomic operation hitting only one M, throughput will suffer.
(2.) If all atomic operations hit the same address in M, then reduce throughput by the latency to do the ALU operations.
(3.) If in C all the threads (of a SIMD group) issue global atomic operations which hit the same address, then throughput is again reduced by the serialized ALU operations in M.
(4.) Extra cost of sending atomic operands to the memory controllers.
Other memory configurations are possible, such as banking memory by something other than a complete word, in which case the configuration of the above example would change. In any case it is easy to see that even in this somewhat fantasy best case, that atomic operations are always going to be some factor slower than non-atomic global interactions.
TEST METHOD --- 256 threads/block. I predicated so that all the CUDA code ran in the first thread of each halfwarp, and all other threads of the halfwarp were unused. The kernel was clocked using clock() which returns a multiprocessor clock cycle counter (2 cycles/halfwarp, 4 cycles/warp, and captures the thread's time both active and asleep). So to get total time, I took the min and max of all the clock() times of all the threads in the grid (min/max computed CPU side).
TEST --- Did a repeated value = atomicAdd(address, value) (128 times, forced CUDA not to unroll the loop). Tested performance with various amounts of address segment collisions as a function of blockIdx. Tested interleaving ALU work. Tested doing a non-atomic load and volatile store instead of the atomic add (to get a baseline on performance).
The results with very rough numbers,
(1.) 11:1 to 12:1 ALU to MEM ratio (ALU thread/clk to MEM 32-bit word/clk).
(2.) Non-atomic paired global load/store performance varied based on number of segment (segment = aligned 64-bytes) collisions between running threads. When each block accessed a separate segment, performance was as (1.) above (note all non-predicated threads were accessing the same address per block). When all threads of the entire grid always accessed the same segment, performance cut in half. Suggests to me that this card has 2 memory controllers (hence the 50% reduction in performance). It will be interesting to see how this test scales as I move up to a card with almost an order of magnitude of more cores.
(3.) Global atomic operations had 1/8 the throughput of a paired global load/store in the case of blocks accessing separate segments. As the number of segment collisions increased to the worst case of all threads accessing the same segment, throughput dropped to 1/16 of the same test done without atomic operations.
Note, even though the throughput of atomic operations compared to non-atomic operations was large, the multiprocessors were able to keep doing ALU work in the background. I verified this by adding ALU operations until I saw an increase in total cycle time.
There are a few important things I haven't tested yet, and will likely wait until the GT275 arrives to do so,
(1.) Didn't test how much atomic operations effects the total capacity to utilize global memory bandwidth. So does the reduced throughput of atomic operations also result in a reduced throughput of memory bandwidth if one is doing multiple atomic and non-atomic global accesses? Do atomic operations reduce (and if so, by how much) the quantity of global memory accesses which can be issued? Or are atomic operations bottle-necked by a different system.
(2.) Didn't test various block sizes, effects of adding in the other threads of the half-warp.
CUDA Compute 1.1 Global Atomics vs OpenGL Point Drawing
I dug up some of my old GL point drawing stats for the same card. Unfortunately this isn't going to be apples to apples because I was likely ALU bound in the point drawing in the vertex shader! And a lot of my points (maybe 7/8 of them) ended up colliding to the same output pixel. Also I don't have the transfer time of point drawing with the same bits per operation.
- At 128-bit/point -> ~287 M points/sec (GL).
- At 256-bit/point -> ~170 M points/sec (GL).
- At 512-bit/segment -> ~64 M segments/sec (CUDA one atomic operation).
- At 512-bit/segment -> ~486 M segments/sec (CUDA non-atomic read/write).
Shows just how fast the fixed function pipeline is. With CUDA Compute 1.1 on this 8600 GTS, there is no way for me to get close to the fine granularity global atomic scatter performance (ie simulate a Z buffer) I get with the "doing all the work in the vertex shader" point scatter approach. However, for non-atomic scatter, I think Compute 1.1 would easily win.
So for what I was doing, GPGPU techniques proved faster than Compute 1.1.
Compute 1.2
Will be testing this next week to get some answers. Going to do the same global tests to get a baseline to get an idea of the following,
(1.) Should I work with separate global queues or will one queue be fast enough. How do colliding global atomics perform with many more cores and the new coalescing memory unit?
(2.) Should I do cross warp scan/scatter for the queues, or should I do cross block scan/scatter with added required __sync_threads() for the queues. This will depend on just how fast atomics are.
My current plan is to queue points per tile and fast shared memory scatter in the tile if I need the same fine granularity point drawing I got with OpenGL. Luckily my input data stream will have good tile locality so this should minimize global atomics (if necessary).
Compute 1.1 Global Atomics vs CPU Atomics
One can view the 8600 GTS like a CPU as a 1460 MHz 4 core device (I think mine is slightly overclocked). Best case atomics in my tests had somewhere around a 90 cycle throughput. That number is close to numbers I've seen documented for multi-core PCs (unfortunately I don't know the conditions which resulted in those cycle counts). However the big, and huge, difference here is that the GPU keeps on crunching while the atomic operation is serviced in the background, and the CPU eats the expense.
Future of GPU Global Atomics
My current view of global atomics is mostly a coarse (perhaps only upwards of around a few hundred million op/sec max) and UNORDERED way for global communication. Actually one will get ordered sections in the unordered communication when doing things like queues. This is in contrast to the sparse view that I have of atomic usage on CPUs.
Seems like most usages of global atomics are for data routing and things like stream compaction/expansion. The exception to this rule would be crazy stuff like fully random data transfer via global scatter that doesn't make sense to try and coarse bin / fine scatter using shared memory.
A few days ago on a Beyond 3D post I was thinking I wanted to see the following in hardware in the case where each memory controller had a bank of full 32-bit (or 64-bit) words,
(1.) C computing multiprocessors transfer global memory transactions (both atomic and non-atomic) to the GPU's internal routing (ring, torus, crossbar or whatever).
(2.) These transactions in the atomic case include the atomic operands (so naturally global atomic throughput will be 2x or 3x lower because of the operands taking up bandwidth on the internal network.
(3.) Memory transactions get distributed based on address to the proper one of the M memory controllers.
(4.) The memory controller does coalescing/re-ordering and also does the atomic operation via dedicated parallel ALU unit.
Performance in this case would likely be a function of,
(1.) C to M collisions. Clearly if all C multiprocessors do an atomic operation hitting only one M, throughput will suffer.
(2.) If all atomic operations hit the same address in M, then reduce throughput by the latency to do the ALU operations.
(3.) If in C all the threads (of a SIMD group) issue global atomic operations which hit the same address, then throughput is again reduced by the serialized ALU operations in M.
(4.) Extra cost of sending atomic operands to the memory controllers.
Other memory configurations are possible, such as banking memory by something other than a complete word, in which case the configuration of the above example would change. In any case it is easy to see that even in this somewhat fantasy best case, that atomic operations are always going to be some factor slower than non-atomic global interactions.
20090507
20090506
Electromagnetic Spectrum and Rendering
I find it interesting that as humans working on rendering technology, we are obsessed with the visible spectrum when there is so much more,
Gamma Rays and X-Rays
Wavelength of Atomic Nuclei to Atoms. Below is a VACIS gamma-ray image of stowaways in a truck. Also check out Nick Veasey's X-ray Photography. Would be interesting to do real-time graphics like this, but dynamically adjusting per pixel exposure in order to capture views of large scenes...

Mid-Infrared
Wavelength slightly smaller than a really thin human hair. Real-time rendering done before in that awesome AC-130 level of Call of Duty 4 (screen shot below). The level had this analog real visual feel you don't get from our misguided attempts to simulate the visible spectrum in real-time.

I'd like to see a mix of mid-infrared and x-ray rendered in real-time...
Gamma Rays and X-Rays
Wavelength of Atomic Nuclei to Atoms. Below is a VACIS gamma-ray image of stowaways in a truck. Also check out Nick Veasey's X-ray Photography. Would be interesting to do real-time graphics like this, but dynamically adjusting per pixel exposure in order to capture views of large scenes...
Mid-Infrared
Wavelength slightly smaller than a really thin human hair. Real-time rendering done before in that awesome AC-130 level of Call of Duty 4 (screen shot below). The level had this analog real visual feel you don't get from our misguided attempts to simulate the visible spectrum in real-time.

I'd like to see a mix of mid-infrared and x-ray rendered in real-time...
20090505
PTX Simulator Paper
BTW, going from CUDA Compute 1.2 theory land to practice, just registered to be a CUDA developer, and my GT275 is on order! Beyond the efficiency I expect to gain with CUDA, this new card offers 4x the bandwidth and 7x the ALU capacity of my old 8600 GTS (time to get a new power supply).
Analyzing CUDA Workloads Using a Detailed GPU Simulator - Software PTX simulator running a bunch of CUDA apps, but no applications using global memory atomics for cross CTA communication. Results where quite interesting: performance more sensitive to interconnection network bisection bandwidth rather than latency, addition of caching hardware for global memory accesses caused a performance degradation for a bunch of the benchmarks, aggressive coalescing yielded up to a 41% improvement, and something I need to think about, lower CTAs/core sometimes improved performance.
Analyzing CUDA Workloads Using a Detailed GPU Simulator - Software PTX simulator running a bunch of CUDA apps, but no applications using global memory atomics for cross CTA communication. Results where quite interesting: performance more sensitive to interconnection network bisection bandwidth rather than latency, addition of caching hardware for global memory accesses caused a performance degradation for a bunch of the benchmarks, aggressive coalescing yielded up to a 41% improvement, and something I need to think about, lower CTAs/core sometimes improved performance.
CUDA Compute 1.2 Mega Post!
CUDA Compute 1.2 (GT260/GT280 and friends excluding GT250) provides two very important hardware features, shared-memory atomic operations, and coalescing for much better global memory performance. Below is a collection of some of my notes working through ideas to port over my Atom engine work...
Compute 1.2 GPU Described as CPU
- 30 cores on some higher end cards.
- Variable amount of hyperthreading up to 32-way.
- Fixed 32-wide SIMD (8-wide in hardware).
- 4 clock cycles per 32-wide SIMD instruction.
- 24 clock cycle instruction latency? (guessed from 192 thread min).
- Variable number of registers per hyperthread, minimum 16.
- Variable number of addressable shared register per hyperthread, minimum 4.
In theory, if given 200 CUDA threads clocked at 1 GHz, and 100 GBs bandwidth,
- Best case have 8 instructions per thread per 32-bit global load or store.
- Bad case have >128 instructions per thread per 32-bit global load or store.
- Bad case is non-coalesced scatter operation!
Compute 1.2 Configurations
Here are the configurations which make sense to me for maximum latency hiding. These suggestions keep with the multiple of 64 threads suggestion in the docs. BTW, if you are using __sync_threads() then forget about using 1 block per core (also likely not a good idea for other reasons)!
blks -> blocks per core
thrd -> threads used per core
t/blk -> threads per block
w/blk -> warps per block
reg/t -> registers per thread
s/blk -> shared memory WORDS (32-bit) per block
s/wrp -> shared memory words per warp
s/thr -> shared memory words per thread
blks thrd t/blk w/blk reg/t s/blk s/wrp s/thr
---- ---- ----- ----- ----- ----- ----- -----
1... 1024 1024. 32... 16... 4096. 128.. 4....
2... 1024 512.. 16... 16... 2048. 128.. 4....
3... 960. 320.. 10... 16... 1280. 128.. 4....
4... 1024 256.. 8.... 16... 1024. 128.. 4....
5... 960. 192.. 6.... 16... 768.. 128.. 4....
Dealing With Register Pressure
There are 16 registers per thread maximum for best latency hiding ability. If running out of room, and latency bound, one can attempt to trade ALU cycles for less registers, if data can be stored compressed in the registers (or shared memory), and decompressed when needed.
In Warp Parallel Scan
Starting with something simple, doing an in warp parallel scan without bank conflicts. This one uses 2x the shared memory as an optimization. I'm using an 8 wide warp in this example because the full warp description won't easily fit in a blog post. The description I'm using below is a little odd, the pair of characters represents the start and end of the terms in the sum per thread. So 'aa' means 'a', 'ab' means 'a+b', 'ac' means 'a+b+c', and so on. Start with registers having {a,b,c,d,e,f,g,h} in our virtual 8-wide warp,
aa bb cc dd ee ff gg hh
Write an extra warp to shared memory of all zeros. Or if doing this often, then just reuse the zeros. Actually the first half of the zero warp can be used for something else (the '__'s) because they never get read in the this algorithm. Write registers out to the next warp region in shared memory.
__ __ __ __ 00 00 00 00 aa bb cc dd ee ff gg hh
Now do log2(warp_size) sums each time shifting the address over by (1 << iteration_count) each iteration.
aa bb cc dd ee ff gg hh
00 aa bb cc dd ee ff gg + <- add -1 to address
-----------------------
aa ab bc cd de ef fg gh =
00 00 aa ab bc cd de ef + <- add -2 to address
-----------------------
aa ab ac ad be cf dg eh =
00 00 00 00 aa ab ac ad + <- add -4 to address
-----------------------
aa ab ac ad ae af ag ah = <- result
The total sum ends up in the last thread, it is easy to reverse this if need be. Disadvantage of this is the required zero space in shared memory. An alternative is to predicate each instruction. The advantage of an in-warp scan is that there is no need to do a __sync_threads(), just make sure to use volatile instead!
In Warp Parallel Parallel Scans
One can use the above algorithm to do 2 (or more) scans in parallel, if one knows that the total range of the scans will fit in a smaller number of bits. For example to scan both s and t at the same time where s and t both have the range of 16-bit integers, for each thread run scan on {(s << 16) + t}.
Atomic Operations
My first theory is that shared-memory atomics are dedicated ALU instructions. An alternative would be if the hardware enforced a group of instructions to go in series to simulate various atomic operations (hardware doesn't support floating point atomic operations so likely this isn't the case).
My second theory is that global-memory atomics are done outside the ALU units and perhaps done by the hardware which handles global memory access coalescing. First, 64-bit atomics are only supported for global memory (if the ALUs were doing the global atomic operation, then one would think that 64-bit shared memory atomics would also work). Also note, 64-bit atomics are only supported on a subset (add, exch, cas) of the supported 32-bit atomics (add, sub, exch, inc, dec, min, max, cas, and, or xor). Second, it makes the most sense for parallel performance to do global atomics after serialization of global memory transactions from all the cores. In this case, global atomics would be quite fast and probably only suffer slowdown in cases of atomic address collision in a given parallel transaction (which would cause a serialization of the atomic operations for the colliding addresses in a parallel request).
If my second theory holds true, atomics on the GPU would be damn fast and many times more useful than atomics on the CPU (CPUs ALU stalls, CPUs like PPC do a retry loop on cache contention during an atomic operation, CPUs like x86 >= P6 are faster, but still quite slow when cache lines are touched by multiple cores). GPU would be able to do huge amounts of atomic operations with cores sharing addresses with almost no slow down. This theory is supported by a least one paper I've listed before that found global atomics to only be 15% slower than non-atomic global writes.
Usage cases,
Variable Size Output with Global Destination
A variable amount of threads of a warp writing into a global list can be done with the following process,
1. Each thread marks 0=nowrite, 1=write into shared memory.
2. Run a scan of this shared memory.
3. Fetch the count of threads set to write from the scan.
4. Predicated warp thread=0 does an atomicAdd() of count to global head.
5. The old return of atomicAdd() is written to shared memory.
6. This provides the base address to scatter to.
7. The scan provides the offset.
8. Have each thread predicate write to list given address and offset.
Note this should be quite fast because scattered output will at worst case straddle one segment boundary (2 global memory transactions). In the case of frequent all-write or no-write cases, one can use the warp vote functions to attempt to avoid work (such as the scan). In the case of possible list overflow, one can either right shift to enable output to loop back to the beginning, or coherent branch to skip write on detected overflow.
Second option would be to do a per block scan and scatter to better amortize the global atomic (64-byte memory transaction), but at the cost of a __sync_threads()!
Variable Size Input with Global Source
Same structure as above, except add a negative number to the head, and gather instead of scatter. Note that one cannot share the same data structure for input and output as a stack in the same kernel invocation because head update and scatter to list are not atomic together.
Global Dynamic Object Pool Allocator
Combine variable size global output with variable size global input (with separate buffers, for reasons stated above) and you have a global dynamic object pool allocator. Note before using the lists in a kernel, make sure to atomic clamp the head in case of previous over or underflow.
Use two lists, and have the usage of the lists switch every kernel call. One list is used to collect freed objects, the other list is used to allocate objects from.
At this point, taken a very basic parallel primitive of scan mixed with global atomic operations, and the result is a form of CUDA parallel friendly dynamic memory allocation from a double buffered pre-allocated pool. This construct can help form the basis of any kind of dynamically modified persistent data structure. Next step is to look at efficient ways to store and access objects.
Efficient Scatter/Gather of Objects
Smallest memory transaction size is 32 bytes. So objects can be efficiently scattered and gathered in memory as long as they are a multiple of 32 bytes and aligned on an object sized boundary. Note this is in contrast to standard OpenGL GPGPU data storage which splits up objects into separate render targets. Threads cooperate to load objects into shared memory. For example for 32-byte objects, each pair of threads is used to do a global load/store of a 16-byte of part of one object. One can also use texture fetch to load objects.
Methods to avoid any AOS to SOA conversion,
1. All threads of a half-warp fetch from the same shared gathered object. This uses shared-memory broadcast ability. This is dead simple, and has a hugely good ALU to MEM ratio.
2. If individual threads need to fetch and work with different objects (meaning cannot all access the same object each time) store each object starting on a different memory bank. See the pattern below (where '.' spacer words are free for some other usage). Each word offset in each object is on a different bank, so when the threads run in lockstep (as they do in CUDA), there will be no bank conflicts!
0123456789abcdef <- bank
----------------
00000000.1111111 <- objects arranged in shared memory
1.22222222.33333 <- with number number representing thread
333.44444444.555 <- of half-warp associated with object
55555.66666666.7
7777777.88888888
.99999999.aaaaaa
aa.bbbbbbbb.cccc
cccc.dddddddd.ee
eeeeee.ffffffff.
Combine efficient ability to both allocate and use objects. It is easy using the above contructs to have threads walk a data structure in memory, allocate an object index, and do a fast scatter store of the new object.
Inefficient Global Scatter
Given the minimum segment size, semi-random scattering small 32-bit words into a global data structure is generally a bad idea. However there are options here,
1. Eat the cost of the inefficient scatter. If ~150 cycles is available for ALU bound work, might even be able to hide the scatter cost with actual work.
2. Reformulate scatter into gather. See the hystopyramid paper for a good example of this using classic GPGPU techniques.
3. Reformulate scatter into coarse gather, with a fine scatter step into local shared memory. For all groups of items which intersect a tile in shared memory, do a predicated scatter into the tile.
4. Do a coarse scatter into per tile queues, followed by a fine scatter step into local shared memory. See Variable Size Output with Global Destination above for an example of how to the queuing. This option would work good when a large majority of threads in a warp have scatter outputs which would end up in the same queues. One can use the In Warp Parallel Parallel Scans to attempt to do multiple scans (for multiple queues) at the same time. Will have to profile see if the extra bitpacking ALU is worth it!
Secondary Domain Binning by Highest Priority
The idea is to take a domain of nodes, apply a function {bin, priority} = f(node), which computes a new bin index in a new domain, and a priority for node to be in that bin. This for each bin find the node id of the highest priority node in the bin.
Example usages of this construct would be for spacial hashing, or to draw points to a display.
I use this construct in my GPGPU code (point scatter with all work in vertex shader) to scatter node indexes into a domain that has good data locality for many future computations. In the new domain, I run a pass which uses the node index to fetch node data and produce and store out a compressed representation of the attributes of the node I need for future computation (this is effectively a way to optimize for future pass repeated point sampled texture gathering).
Below is the binning algorithm using the "eat the cost" option from Inefficient Global Scatter above. This is likely a worst case way to do this, and in many cases the wrong way to do this (with global scatter), better options are described below.
1. Priority is a 32-bit unsigned integer with the top bit always set (1).
2. NodeId is a 32-bit unsigned integer with the top bit never set (0).
3. BinAddress, is an address to a 32-bit word in the new domain array.
Pass 1,
1. Before starting, insure new domain array is cleared to zero.
2. Scatter Priority using atomicMax() to the BinAddress for every thread.
3. Critical, do some other work per thread to hide scatter cost!!!!
Pass 2 (separate kernel!),
1. Do a second scatter atomicCAS(BinAddress, Priority, NodeId) for every thread.
2. Critical, do some other work per thread to hide scatter cost!!!!
Both Priority and NodeId are effectively 31-bit integers, with the hi bit insuring on the second pass that only one swap is successful. This is to correct for the following (very remote possibility) case happening to one bin,
atomicCAS(, Priority=2, NodeId=23) -> is successful
atomicCAS(, Priority=23, NodeId=56) -> would be successful, but should not be!!!
How about some variations on this idea,
VARIATION 1 : SINGLE PASS --- If Priority and NodeId can be packed into a single 32-bit word,
then only one pass is required. Just scatter {(Priority << ?) + NodeId}. Then later mask out NodeId when reading from the new domain. Unfortunately CUDA doesn't have a 64-bit atomicMin()
so this trick is limited to 32-bit words. This will work quite fast in the case of local shared-memory only scatter.
VARIATION 1B : GLOBAL SINGLE PASS + 3-BITS OF PRIORITY --- For case 1, one can get an extra 3-bits for Priority for the same global memory scatter cost at the expense of using 8 times the memory, and some extra ALU work. Scatter {(Priority << ?) + (NodeId >> 3)} to {(BinAddress << 3) + (NodeId & 7)}. The location of the word within each 32-byte bin provides the lower 3-bits of NodeId.
VARIATION 2 : NO COLLISION SPLIT WORD --- This case will only work correctly for scatter if you can insure no Priority collisions in a given bin. Also requires Priority to use a less than 32-bits (will varry). The idea is to reduce the number of required passes, in trade for more ALU work, and more memory consumption. Break up NodeId into multiple pieces and atomicMax scatter pairs of words in the form of something like this, {(Priority << 20) + (NodeId >> 12), (Priority << 20) + (NodeID & 4095)}. Then later fetch the pair and bit extract the NodeId. Make sure the pairs are aligned to go in one memory bus transaction. Clearly a bin collision of NodeIds associated with the same priority will cause this method to fail in cases where the low and hi halves of the bin+priority colliding NodeIds don't sort in the same order!
VARIATION 2B : SINGLE COLLISION DETECTION --- Case 2 can be made a little more robust
by scattering a second pair with the bits of NodeId inverted. Single bin collision can be detected and corrected with this extra information.
VARIATION 3 : MAX AND MIN PRIORITY BINNING --- One can get both the maximum and minimum Priority NodeIds per bin at the same time, by grouping a second set with priority inverted
(the Max of an inverted Priority is the Min). This will of course require a little more ALU work and double the memory, but will use the same bandwidth for scatter in the global scatter case!
VARIATION 3A : MULTI-PRIORITY BINNING --- One can take variation 3 and extend it to more than one type of Priority. An example case could be scattering particles out into a spacial hash and getting particles with the highest X, Y, and Z component of velocity or momentum all in one bin during the same global scatter operation.
VARIATION 4 : 8 IDs/BIN by RANDOM SCATTER PERMUTATION --- One can end up with up to 8 NodeIDs per bin with the same scatter global bandwidth cost, very low ALU cost, and without atomic scatter, by scattering {NodeID} to {(Bin << 3) + (HashFunction(Node) & 7)}. The HashFunction() should provide some semi-random value from 0-7 based on node properties and some other random factors. Clearly in this case each bin is 32-bytes wide.
Dynamic Global Parallel Tree Structures
Persistent dynamic structures can be built. Here is an example of a dynamic tree which matches some of the needs I have for Atom.
Constraints,
- Nodes have a fixed number of children.
- Nodes have an ID to parent.
- Nodes have a child index in parent (0 to max_children-1).
- Nodes are likely bit-packed into many integers (multiple of 32-bytes).
- Nodes do not directly link to children (not needed in my case).
- Nodes have a bitmask of child occupancy.
Usage,
- Leaf nodes (bitmask=0) can self prune by atomicOr() into parent bitmask.
- Free nodes can attach to new parent by atomicAnd() into new parent bitmask.
- Free nodes attaching also have to self update with new parent ID.
The most important rule when using a structure like this is to insure that the algorithm does not have more than one thread adding to the same child bin in the parent. Also nodes shouldn't be added to a node which is going to be freed.
... more next time ...
Compute 1.2 GPU Described as CPU
- 30 cores on some higher end cards.
- Variable amount of hyperthreading up to 32-way.
- Fixed 32-wide SIMD (8-wide in hardware).
- 4 clock cycles per 32-wide SIMD instruction.
- 24 clock cycle instruction latency? (guessed from 192 thread min).
- Variable number of registers per hyperthread, minimum 16.
- Variable number of addressable shared register per hyperthread, minimum 4.
In theory, if given 200 CUDA threads clocked at 1 GHz, and 100 GBs bandwidth,
- Best case have 8 instructions per thread per 32-bit global load or store.
- Bad case have >128 instructions per thread per 32-bit global load or store.
- Bad case is non-coalesced scatter operation!
Compute 1.2 Configurations
Here are the configurations which make sense to me for maximum latency hiding. These suggestions keep with the multiple of 64 threads suggestion in the docs. BTW, if you are using __sync_threads() then forget about using 1 block per core (also likely not a good idea for other reasons)!
blks -> blocks per core
thrd -> threads used per core
t/blk -> threads per block
w/blk -> warps per block
reg/t -> registers per thread
s/blk -> shared memory WORDS (32-bit) per block
s/wrp -> shared memory words per warp
s/thr -> shared memory words per thread
blks thrd t/blk w/blk reg/t s/blk s/wrp s/thr
---- ---- ----- ----- ----- ----- ----- -----
1... 1024 1024. 32... 16... 4096. 128.. 4....
2... 1024 512.. 16... 16... 2048. 128.. 4....
3... 960. 320.. 10... 16... 1280. 128.. 4....
4... 1024 256.. 8.... 16... 1024. 128.. 4....
5... 960. 192.. 6.... 16... 768.. 128.. 4....
Dealing With Register Pressure
There are 16 registers per thread maximum for best latency hiding ability. If running out of room, and latency bound, one can attempt to trade ALU cycles for less registers, if data can be stored compressed in the registers (or shared memory), and decompressed when needed.
In Warp Parallel Scan
Starting with something simple, doing an in warp parallel scan without bank conflicts. This one uses 2x the shared memory as an optimization. I'm using an 8 wide warp in this example because the full warp description won't easily fit in a blog post. The description I'm using below is a little odd, the pair of characters represents the start and end of the terms in the sum per thread. So 'aa' means 'a', 'ab' means 'a+b', 'ac' means 'a+b+c', and so on. Start with registers having {a,b,c,d,e,f,g,h} in our virtual 8-wide warp,
aa bb cc dd ee ff gg hh
Write an extra warp to shared memory of all zeros. Or if doing this often, then just reuse the zeros. Actually the first half of the zero warp can be used for something else (the '__'s) because they never get read in the this algorithm. Write registers out to the next warp region in shared memory.
__ __ __ __ 00 00 00 00 aa bb cc dd ee ff gg hh
Now do log2(warp_size) sums each time shifting the address over by (1 << iteration_count) each iteration.
aa bb cc dd ee ff gg hh
00 aa bb cc dd ee ff gg + <- add -1 to address
-----------------------
aa ab bc cd de ef fg gh =
00 00 aa ab bc cd de ef + <- add -2 to address
-----------------------
aa ab ac ad be cf dg eh =
00 00 00 00 aa ab ac ad + <- add -4 to address
-----------------------
aa ab ac ad ae af ag ah = <- result
The total sum ends up in the last thread, it is easy to reverse this if need be. Disadvantage of this is the required zero space in shared memory. An alternative is to predicate each instruction. The advantage of an in-warp scan is that there is no need to do a __sync_threads(), just make sure to use volatile instead!
In Warp Parallel Parallel Scans
One can use the above algorithm to do 2 (or more) scans in parallel, if one knows that the total range of the scans will fit in a smaller number of bits. For example to scan both s and t at the same time where s and t both have the range of 16-bit integers, for each thread run scan on {(s << 16) + t}.
Atomic Operations
My first theory is that shared-memory atomics are dedicated ALU instructions. An alternative would be if the hardware enforced a group of instructions to go in series to simulate various atomic operations (hardware doesn't support floating point atomic operations so likely this isn't the case).
My second theory is that global-memory atomics are done outside the ALU units and perhaps done by the hardware which handles global memory access coalescing. First, 64-bit atomics are only supported for global memory (if the ALUs were doing the global atomic operation, then one would think that 64-bit shared memory atomics would also work). Also note, 64-bit atomics are only supported on a subset (add, exch, cas) of the supported 32-bit atomics (add, sub, exch, inc, dec, min, max, cas, and, or xor). Second, it makes the most sense for parallel performance to do global atomics after serialization of global memory transactions from all the cores. In this case, global atomics would be quite fast and probably only suffer slowdown in cases of atomic address collision in a given parallel transaction (which would cause a serialization of the atomic operations for the colliding addresses in a parallel request).
If my second theory holds true, atomics on the GPU would be damn fast and many times more useful than atomics on the CPU (CPUs ALU stalls, CPUs like PPC do a retry loop on cache contention during an atomic operation, CPUs like x86 >= P6 are faster, but still quite slow when cache lines are touched by multiple cores). GPU would be able to do huge amounts of atomic operations with cores sharing addresses with almost no slow down. This theory is supported by a least one paper I've listed before that found global atomics to only be 15% slower than non-atomic global writes.
Usage cases,
Variable Size Output with Global Destination
A variable amount of threads of a warp writing into a global list can be done with the following process,
1. Each thread marks 0=nowrite, 1=write into shared memory.
2. Run a scan of this shared memory.
3. Fetch the count of threads set to write from the scan.
4. Predicated warp thread=0 does an atomicAdd() of count to global head.
5. The old return of atomicAdd() is written to shared memory.
6. This provides the base address to scatter to.
7. The scan provides the offset.
8. Have each thread predicate write to list given address and offset.
Note this should be quite fast because scattered output will at worst case straddle one segment boundary (2 global memory transactions). In the case of frequent all-write or no-write cases, one can use the warp vote functions to attempt to avoid work (such as the scan). In the case of possible list overflow, one can either right shift to enable output to loop back to the beginning, or coherent branch to skip write on detected overflow.
Second option would be to do a per block scan and scatter to better amortize the global atomic (64-byte memory transaction), but at the cost of a __sync_threads()!
Variable Size Input with Global Source
Same structure as above, except add a negative number to the head, and gather instead of scatter. Note that one cannot share the same data structure for input and output as a stack in the same kernel invocation because head update and scatter to list are not atomic together.
Global Dynamic Object Pool Allocator
Combine variable size global output with variable size global input (with separate buffers, for reasons stated above) and you have a global dynamic object pool allocator. Note before using the lists in a kernel, make sure to atomic clamp the head in case of previous over or underflow.
Use two lists, and have the usage of the lists switch every kernel call. One list is used to collect freed objects, the other list is used to allocate objects from.
At this point, taken a very basic parallel primitive of scan mixed with global atomic operations, and the result is a form of CUDA parallel friendly dynamic memory allocation from a double buffered pre-allocated pool. This construct can help form the basis of any kind of dynamically modified persistent data structure. Next step is to look at efficient ways to store and access objects.
Efficient Scatter/Gather of Objects
Smallest memory transaction size is 32 bytes. So objects can be efficiently scattered and gathered in memory as long as they are a multiple of 32 bytes and aligned on an object sized boundary. Note this is in contrast to standard OpenGL GPGPU data storage which splits up objects into separate render targets. Threads cooperate to load objects into shared memory. For example for 32-byte objects, each pair of threads is used to do a global load/store of a 16-byte of part of one object. One can also use texture fetch to load objects.
Methods to avoid any AOS to SOA conversion,
1. All threads of a half-warp fetch from the same shared gathered object. This uses shared-memory broadcast ability. This is dead simple, and has a hugely good ALU to MEM ratio.
2. If individual threads need to fetch and work with different objects (meaning cannot all access the same object each time) store each object starting on a different memory bank. See the pattern below (where '.' spacer words are free for some other usage). Each word offset in each object is on a different bank, so when the threads run in lockstep (as they do in CUDA), there will be no bank conflicts!
0123456789abcdef <- bank
----------------
00000000.1111111 <- objects arranged in shared memory
1.22222222.33333 <- with number number representing thread
333.44444444.555 <- of half-warp associated with object
55555.66666666.7
7777777.88888888
.99999999.aaaaaa
aa.bbbbbbbb.cccc
cccc.dddddddd.ee
eeeeee.ffffffff.
Combine efficient ability to both allocate and use objects. It is easy using the above contructs to have threads walk a data structure in memory, allocate an object index, and do a fast scatter store of the new object.
Inefficient Global Scatter
Given the minimum segment size, semi-random scattering small 32-bit words into a global data structure is generally a bad idea. However there are options here,
1. Eat the cost of the inefficient scatter. If ~150 cycles is available for ALU bound work, might even be able to hide the scatter cost with actual work.
2. Reformulate scatter into gather. See the hystopyramid paper for a good example of this using classic GPGPU techniques.
3. Reformulate scatter into coarse gather, with a fine scatter step into local shared memory. For all groups of items which intersect a tile in shared memory, do a predicated scatter into the tile.
4. Do a coarse scatter into per tile queues, followed by a fine scatter step into local shared memory. See Variable Size Output with Global Destination above for an example of how to the queuing. This option would work good when a large majority of threads in a warp have scatter outputs which would end up in the same queues. One can use the In Warp Parallel Parallel Scans to attempt to do multiple scans (for multiple queues) at the same time. Will have to profile see if the extra bitpacking ALU is worth it!
Secondary Domain Binning by Highest Priority
The idea is to take a domain of nodes, apply a function {bin, priority} = f(node), which computes a new bin index in a new domain, and a priority for node to be in that bin. This for each bin find the node id of the highest priority node in the bin.
Example usages of this construct would be for spacial hashing, or to draw points to a display.
I use this construct in my GPGPU code (point scatter with all work in vertex shader) to scatter node indexes into a domain that has good data locality for many future computations. In the new domain, I run a pass which uses the node index to fetch node data and produce and store out a compressed representation of the attributes of the node I need for future computation (this is effectively a way to optimize for future pass repeated point sampled texture gathering).
Below is the binning algorithm using the "eat the cost" option from Inefficient Global Scatter above. This is likely a worst case way to do this, and in many cases the wrong way to do this (with global scatter), better options are described below.
1. Priority is a 32-bit unsigned integer with the top bit always set (1).
2. NodeId is a 32-bit unsigned integer with the top bit never set (0).
3. BinAddress, is an address to a 32-bit word in the new domain array.
Pass 1,
1. Before starting, insure new domain array is cleared to zero.
2. Scatter Priority using atomicMax() to the BinAddress for every thread.
3. Critical, do some other work per thread to hide scatter cost!!!!
Pass 2 (separate kernel!),
1. Do a second scatter atomicCAS(BinAddress, Priority, NodeId) for every thread.
2. Critical, do some other work per thread to hide scatter cost!!!!
Both Priority and NodeId are effectively 31-bit integers, with the hi bit insuring on the second pass that only one swap is successful. This is to correct for the following (very remote possibility) case happening to one bin,
atomicCAS(, Priority=2, NodeId=23) -> is successful
atomicCAS(, Priority=23, NodeId=56) -> would be successful, but should not be!!!
How about some variations on this idea,
VARIATION 1 : SINGLE PASS --- If Priority and NodeId can be packed into a single 32-bit word,
then only one pass is required. Just scatter {(Priority << ?) + NodeId}. Then later mask out NodeId when reading from the new domain. Unfortunately CUDA doesn't have a 64-bit atomicMin()
so this trick is limited to 32-bit words. This will work quite fast in the case of local shared-memory only scatter.
VARIATION 1B : GLOBAL SINGLE PASS + 3-BITS OF PRIORITY --- For case 1, one can get an extra 3-bits for Priority for the same global memory scatter cost at the expense of using 8 times the memory, and some extra ALU work. Scatter {(Priority << ?) + (NodeId >> 3)} to {(BinAddress << 3) + (NodeId & 7)}. The location of the word within each 32-byte bin provides the lower 3-bits of NodeId.
VARIATION 2 : NO COLLISION SPLIT WORD --- This case will only work correctly for scatter if you can insure no Priority collisions in a given bin. Also requires Priority to use a less than 32-bits (will varry). The idea is to reduce the number of required passes, in trade for more ALU work, and more memory consumption. Break up NodeId into multiple pieces and atomicMax scatter pairs of words in the form of something like this, {(Priority << 20) + (NodeId >> 12), (Priority << 20) + (NodeID & 4095)}. Then later fetch the pair and bit extract the NodeId. Make sure the pairs are aligned to go in one memory bus transaction. Clearly a bin collision of NodeIds associated with the same priority will cause this method to fail in cases where the low and hi halves of the bin+priority colliding NodeIds don't sort in the same order!
VARIATION 2B : SINGLE COLLISION DETECTION --- Case 2 can be made a little more robust
by scattering a second pair with the bits of NodeId inverted. Single bin collision can be detected and corrected with this extra information.
VARIATION 3 : MAX AND MIN PRIORITY BINNING --- One can get both the maximum and minimum Priority NodeIds per bin at the same time, by grouping a second set with priority inverted
(the Max of an inverted Priority is the Min). This will of course require a little more ALU work and double the memory, but will use the same bandwidth for scatter in the global scatter case!
VARIATION 3A : MULTI-PRIORITY BINNING --- One can take variation 3 and extend it to more than one type of Priority. An example case could be scattering particles out into a spacial hash and getting particles with the highest X, Y, and Z component of velocity or momentum all in one bin during the same global scatter operation.
VARIATION 4 : 8 IDs/BIN by RANDOM SCATTER PERMUTATION --- One can end up with up to 8 NodeIDs per bin with the same scatter global bandwidth cost, very low ALU cost, and without atomic scatter, by scattering {NodeID} to {(Bin << 3) + (HashFunction(Node) & 7)}. The HashFunction() should provide some semi-random value from 0-7 based on node properties and some other random factors. Clearly in this case each bin is 32-bytes wide.
Dynamic Global Parallel Tree Structures
Persistent dynamic structures can be built. Here is an example of a dynamic tree which matches some of the needs I have for Atom.
Constraints,
- Nodes have a fixed number of children.
- Nodes have an ID to parent.
- Nodes have a child index in parent (0 to max_children-1).
- Nodes are likely bit-packed into many integers (multiple of 32-bytes).
- Nodes do not directly link to children (not needed in my case).
- Nodes have a bitmask of child occupancy.
Usage,
- Leaf nodes (bitmask=0) can self prune by atomicOr() into parent bitmask.
- Free nodes can attach to new parent by atomicAnd() into new parent bitmask.
- Free nodes attaching also have to self update with new parent ID.
The most important rule when using a structure like this is to insure that the algorithm does not have more than one thread adding to the same child bin in the parent. Also nodes shouldn't be added to a node which is going to be freed.
... more next time ...
20090501
32-bpp HDR Blending Idea
The goal is to get 16-bit per channel HDR with only 32-bit per pixel and still have an additive blendable framebuffer. This enables proper linear HDR blending, something which isn't possible with 8-bits per channel without seeing horrid banding. One alternative option would be a 10:10:10:2 format, however this format often haves 1/2 the blend speed of RGBA8, and 10:10:10:2 formats don't offer much in dynamic range before banding. One obvious application of this is the standard forward multipass lighting.
The trick is to output to a single FP16 RG or INT16 RG (depending on hardware support) dual channel framebuffer (32-bit per pixel), but every other scan line swap the red and blue channels before writing to the frame buffer. This swap can be done conditional using VPOS in the pixel shader. Output will then be logically like this for a 4x4 pixel block.
RG RG RG RG
BG BG BG BG
RG RG RG RG
BG BG BG BG
The result is one missing color component per pixel. Note green has the higher visual importance in terms of luminance, so green is kept for each pixel. The result is about a 1/2 area reduction in chrominance, but effectively full resolution luminance.
At some point later in the pipeline apply a demosaicing algorithm of your choice to compute the missing color component. Often the conversion point will be prior or during post processing, likely to an 8-bit RGBM format for fast filterable sampling.
Variation For Alpha Blending
If alpha is needed (for example doing particle blending), then a different color channel decimation will be required such as this one.
GA RA GA RA
BA GA BA GA
GA RA GA RA
BA GA BA GA
This is now much closer to the Bayer grid seen on common digital cameras. Remember to keep 2 greens per one blue and one red. The result here is 1/4 area chrominance.
Not For Older Hardware
Unfortunately this trick requires a render target format (16-bit dual channel) not available on a bunch of older hardware (like Geforce 7 series, PS3, etc). Will work on 360 however using an 16-bit integer format.
The trick is to output to a single FP16 RG or INT16 RG (depending on hardware support) dual channel framebuffer (32-bit per pixel), but every other scan line swap the red and blue channels before writing to the frame buffer. This swap can be done conditional using VPOS in the pixel shader. Output will then be logically like this for a 4x4 pixel block.
RG RG RG RG
BG BG BG BG
RG RG RG RG
BG BG BG BG
The result is one missing color component per pixel. Note green has the higher visual importance in terms of luminance, so green is kept for each pixel. The result is about a 1/2 area reduction in chrominance, but effectively full resolution luminance.
At some point later in the pipeline apply a demosaicing algorithm of your choice to compute the missing color component. Often the conversion point will be prior or during post processing, likely to an 8-bit RGBM format for fast filterable sampling.
Variation For Alpha Blending
If alpha is needed (for example doing particle blending), then a different color channel decimation will be required such as this one.
GA RA GA RA
BA GA BA GA
GA RA GA RA
BA GA BA GA
This is now much closer to the Bayer grid seen on common digital cameras. Remember to keep 2 greens per one blue and one red. The result here is 1/4 area chrominance.
Not For Older Hardware
Unfortunately this trick requires a render target format (16-bit dual channel) not available on a bunch of older hardware (like Geforce 7 series, PS3, etc). Will work on 360 however using an 16-bit integer format.
Subscribe to:
Posts (Atom)