Been busy... too much interesting stuff, too little time.
Logarithmic Z Buffer
Journal of Lethargic Programmers Post thanks to tweet from Jerome at Q-Games (@blackjero). Provides large Z-buffer range in 24-bits at the expense of destroying correct interpolation, might be something to revisit with DX11 tessellation (tessellate to the point where interpolation artifacts are not a problem).
Assembly 2009 Seminars
Direct link to all. Watched the Technology Behind Love presentation. Interesting tech, smart use of OpenGL buffer objects, etc, but I'm not wild about the min filter as a post process (looks better with those thin fine details).
Carmack Keynote
Watch vid here. Refreshing to hear someone just speak their mind and not care, even if I don't agree with all of it. Not knowing people on the bottom floor, was priceless! Seems like they are taking a big hit with a mess of single thread OO game code forcing adoption of full frame of latency on all parallel job results. Sounds awfully painful to me.
Best way forward IMO is to go as wide (parallel) as possible with everything including game code! Tight pipeline of independent jobs ordered by common sets of a small number of dependencies, enables low enough latency to make use of parallel results in-frame. If one is doing this right, jobs won't look like this (example with 5 cores),
AAA.BBBCDD.FFFF.
AAA.BB.CDE.FFFF.
AA..BB.CDEEFFF..
AA..B..CDE.FFFF.
GGGGGGGGGGG.HIIJ
Note the bubbles in the pipeline, and the clear sync points between groups of jobs, and one large serial operation (G) which isn't latency friendly. Instead the ideal result would look something like this,
AABCDDEEFIJJ
AAACDDEEFIJJ
ABBCCDEEGIIJ
AABBCDDEEEIJ
ABBCCDEEHIIJ
Note how enough independent work is interleaved to handle run-time variation. For example job B can start before A is finished (D might be the first job depending on results from A). Also note the task parallel G and H, as well as how jobs go as wide as possible to keep the latency down! Dead simple, dead fast!
I know, can be tough (and impossible) for some to do given the constrains of legacy code base. Plus integration of 3rd party middleware can be a real problem (and with regards to this, clearly I'm not a fan of SPURS). But sooner or later, everything will need to work this way to scale on future hardware.
Light Propagation Volumes in CryEngine 3
Grab the presentation off the Crytek Technology Page. Really appreciate the PS3 and 360 run-time numbers. Impressive in the trade-offs made for real-time performance. Locking the volume to world grid points (CCSM style) is a great idea. This hierarchical irradiance volume method might very well just be the beginning of all sorts of interesting approximations ... just wait until someone revisits this with DX11!
GDC Europe: Cevat on The Future of Gaming Graphics
Sweeney HPG Keynote
Slides here. Like functional programming, but NOT into either software transactional memory and cache coherency. Also the cost slide is way misleading: STMD shader programming model is one of the most productive and easy programming models ever, traditional C/C++ and vector intrinsics is many times more costly given a programmer who actually know how to solve problems using both methods! Lets not forget that the STMD shader/kernel model has been critical for the rapid evolution of the GPU: everything scales, everything is portable across quite different implementations!
Giga Voxels
Siggraph 2009 Slides. It was nice to briefly talk to Cyril at Siggraph. Cyril keep up the great work!
20090819
20090813
Siggraph 2009: Thursday - Frostbite
Beyond Programmable Shading I Talk,
Parallel Graphics in Frostbite - Current & Future by Johan Andersson
The deferred lighting part of the presentation provides a glimpse into the types of things possible with DX11. Running the light intersection step in the same kernel as the lighting is an excellent example of task parallel computation in a data parallel kernel. Note this has the advantage of keeping intermediate data on-chip!
Optimization
In the presentation, the depth min/max computation and the parallel lighting tests, are good examples of the types of problems one will need to solve using group share memory (or shared memory for those in CUDA). At some point one will want to optimize the interactions between threads communicating through this shared memory.
Below I've included some simple CUDA test results on a GTX 275 which provides an example on how to go about optimizing shared memory interaction between threads.
Note, I don't have details on DX11 hardware, so the logic below may or may not carry over to actual DX11 hardware!
MIN/MAX REDUCTION TESTS
Numbers below represent overall run time in GPU clock cycles as reported by the CUDA clock() function (took the minimum of a few tests since run-time even on the GPU varies), so lower is better. The kernel returns the GPU clock() at the end point in the kernel, so I use a CPU side min/max reduction to get a rough idea of GPU time. The results are good for general trends, but one would have to test each in a physical application to really know (and to fully verify actual results, which I only did roughly via eyeballing a few blocks).
32x8 block size (256 threads, width matches CUDA warp length).
32x32 block grid (1024x256 threads).
CLOCKS .... TEST
----------------
88286 ..... no min/max reduction, but syncthreads (overhead of test)
113880 .... non-atomic log-time horz only
174712 .... non-atomic log-time horz, with atomic min/max vertical
338372 .... non-atomic linear-time horz only
397648 .... non-atomic linear-time horz, with atomic min/max vert
408048 .... non-atomic linear-time horz, with non-atomic linear-time vert
1028708 ... full colliding atomic min/max
I'm going to stress this again: it is important to note here that run time does vary (the difference between atomic and non-atomic linear-time vertical reductions should be logically the same), and the test reporting itself can effect run times (recording GPU clock cycles requires writing results back to memory, uses registers, as well as has some ALU overhead, for example a 64-bit to 32-bit conversion if I remember right). The only thing gathered from these numbers should be general trends, programmers will have to test in an actual application to see side effects of varring shared memory requirements and register usage!
The "full colliding atomic min/max" represents the baseline case in the presentation. It was neat to see even this initial "not production tested or optimized" (see page 21) deferred lighting demo running with 1000 very large intersecting lights running quite a bit faster than the needs for a real-time game...
With the tests, I did not try all the possibilities (haven't had the time), for example doing the vertical stages first, and also I did not try a log-time vertical reduction (because it would have required lots of syncthreads).
TERMS AND RESULTS
horizontal - means across threads in a warp
vertical - means across warps
Trends verify that colliding shared memory atomics are indeed serialized on NVidia's GT200 chipset (it is only logical that they would be). Best performance I found was a non-atomic log-time parallel reduction horizontally (5 steps for a 32-wide block, each step writes results back to shared memory for use in the next step), followed by a barrier and then a vertical reduction. Knowing the hardware, it should be possible to do the vertical reduction steps without atomic operations, and without group share scatter. Without knowing the DX11 in detail, I can only guess that this path would work in CS4 and also be the fast path in on CS5 hardware.
Parallel Graphics in Frostbite - Current & Future by Johan Andersson
The deferred lighting part of the presentation provides a glimpse into the types of things possible with DX11. Running the light intersection step in the same kernel as the lighting is an excellent example of task parallel computation in a data parallel kernel. Note this has the advantage of keeping intermediate data on-chip!
Optimization
In the presentation, the depth min/max computation and the parallel lighting tests, are good examples of the types of problems one will need to solve using group share memory (or shared memory for those in CUDA). At some point one will want to optimize the interactions between threads communicating through this shared memory.
Below I've included some simple CUDA test results on a GTX 275 which provides an example on how to go about optimizing shared memory interaction between threads.
Note, I don't have details on DX11 hardware, so the logic below may or may not carry over to actual DX11 hardware!
MIN/MAX REDUCTION TESTS
Numbers below represent overall run time in GPU clock cycles as reported by the CUDA clock() function (took the minimum of a few tests since run-time even on the GPU varies), so lower is better. The kernel returns the GPU clock() at the end point in the kernel, so I use a CPU side min/max reduction to get a rough idea of GPU time. The results are good for general trends, but one would have to test each in a physical application to really know (and to fully verify actual results, which I only did roughly via eyeballing a few blocks).
32x8 block size (256 threads, width matches CUDA warp length).
32x32 block grid (1024x256 threads).
CLOCKS .... TEST
----------------
88286 ..... no min/max reduction, but syncthreads (overhead of test)
113880 .... non-atomic log-time horz only
174712 .... non-atomic log-time horz, with atomic min/max vertical
338372 .... non-atomic linear-time horz only
397648 .... non-atomic linear-time horz, with atomic min/max vert
408048 .... non-atomic linear-time horz, with non-atomic linear-time vert
1028708 ... full colliding atomic min/max
I'm going to stress this again: it is important to note here that run time does vary (the difference between atomic and non-atomic linear-time vertical reductions should be logically the same), and the test reporting itself can effect run times (recording GPU clock cycles requires writing results back to memory, uses registers, as well as has some ALU overhead, for example a 64-bit to 32-bit conversion if I remember right). The only thing gathered from these numbers should be general trends, programmers will have to test in an actual application to see side effects of varring shared memory requirements and register usage!
The "full colliding atomic min/max" represents the baseline case in the presentation. It was neat to see even this initial "not production tested or optimized" (see page 21) deferred lighting demo running with 1000 very large intersecting lights running quite a bit faster than the needs for a real-time game...
With the tests, I did not try all the possibilities (haven't had the time), for example doing the vertical stages first, and also I did not try a log-time vertical reduction (because it would have required lots of syncthreads).
TERMS AND RESULTS
horizontal - means across threads in a warp
vertical - means across warps
Trends verify that colliding shared memory atomics are indeed serialized on NVidia's GT200 chipset (it is only logical that they would be). Best performance I found was a non-atomic log-time parallel reduction horizontally (5 steps for a 32-wide block, each step writes results back to shared memory for use in the next step), followed by a barrier and then a vertical reduction. Knowing the hardware, it should be possible to do the vertical reduction steps without atomic operations, and without group share scatter. Without knowing the DX11 in detail, I can only guess that this path would work in CS4 and also be the fast path in on CS5 hardware.
20090812
Stochastic Visibility in Distorted Fisheye
Talked about this before, "Screenshot would have been included if computer was alive". Finally got around to doing a few screen grabs.
Quick refresher, the engine is a modified version of the one in this video. The screens below do NOT show a rendered view, instead they show a direct rendering of the {x,y,z} coordinates (modified by a transform to color them) of the nodes themselves. Yes the scene tree is physically stored in the distorted fisheye projection, parent nodes in the hierarchy show up as the salt and pepper noise in the screen grabs.
The persistent scene is 1024x1024 or 1M nodes (minus the edges of the fisheye projection) of which their 8M children are processed each frame. I bumped up the number of roots in the scene tree to 64K (from the 4K I mentioned in the previous blog post).
It was awesome to see the stochastic visibility algorithm work on all sorts of horrid cases like insane amounts of object to object overlap (overlap being both from rooted objects overlapping, as well as the root's tree nodes self overlapping).
I've been distracted with the CUDA version, so I haven't had time to try the collision improvement. I am keeping this alive however, just in case I get surprised by the triangle rate on DX11 hardware...
View of 64K nodes from outside the structure,

BTW, background color is a function of eye position,

Inside,

View of nodes self intersecting from outside,

And from inside the structure (note the corners of the distorted fisheye are not part of the image, this is a 360 degree view),
Other shots also with different fractal rules (different amounts of overlap, etc),

Quick refresher, the engine is a modified version of the one in this video. The screens below do NOT show a rendered view, instead they show a direct rendering of the {x,y,z} coordinates (modified by a transform to color them) of the nodes themselves. Yes the scene tree is physically stored in the distorted fisheye projection, parent nodes in the hierarchy show up as the salt and pepper noise in the screen grabs.
The persistent scene is 1024x1024 or 1M nodes (minus the edges of the fisheye projection) of which their 8M children are processed each frame. I bumped up the number of roots in the scene tree to 64K (from the 4K I mentioned in the previous blog post).
It was awesome to see the stochastic visibility algorithm work on all sorts of horrid cases like insane amounts of object to object overlap (overlap being both from rooted objects overlapping, as well as the root's tree nodes self overlapping).
I've been distracted with the CUDA version, so I haven't had time to try the collision improvement. I am keeping this alive however, just in case I get surprised by the triangle rate on DX11 hardware...
View of 64K nodes from outside the structure,

BTW, background color is a function of eye position,

Inside,

View of nodes self intersecting from outside,

And from inside the structure (note the corners of the distorted fisheye are not part of the image, this is a 360 degree view),
Other shots also with different fractal rules (different amounts of overlap, etc),

GPU Address Space Mapping II
This is somewhat a continuation of the topics from this prior post.
Decided to look at the issues of GPU memory from a different perspective, from the perspective of the packets of information which need to be routed around on chip. I'm using the CUDA Compute 1.3 model as a rough guide, and building an oversimplified hypothetical GPU model which probably is different from the actual hardware.
The aim of this post is to cover some of the reasons GPUs have a different architecture than CPUs, and to provide a natural intuition on how to go about solving problems on the GPU.
Hypothetical GPU Model
The GPU modeled in this post has a bunch of independent ALU units connected to a bunch of independent MC (memory controllers) via a very fast on-chip interconnect network. If there are N MCs then in this example, global memory is banked N-way, so that memory requests can be load-balanced between the MCs. The individual bank width is large (perhaps 32 bytes) to amortize the bits used for address per read/write transaction (for efficient on-chip routing and GDDR access).
INTERCONNECT PACKETS FOR GLOBAL MEMORY ACCESS
Since Compute 1.3 supports 32B (B=byte) segments, the hypothetical hardware model routes 32B data segments to and from the MCs. Each routed data packet also includes extra bits such as segment address, type of transaction (read, write, or atomic operation), mask bits (to mask words of the operation), packet routing header, etc.
Note I'm working under the assumption that global atomic ALU ops are done at the MC (it might not be, or it might be done at the ROP in real hardware).
Using this model, larger segment transactions (and atomics which have 2 operands, or work on 64-bit values), require multiple packets to service a single request. Below is a rough idea of packets,
transaction ........ {send} -- {receive}
-------------------- -------------------
write .............. {extra bits, 32B segment} -- {}
read ............... {extra bits} -- {return extra bits, 32B segment}
atomic ............. {extra bits, 32B atomic operand(s)} -- {}
atomic w/ return ... {extra bits, 32B atomic operand(s)} -- {extra, 32B old values}
FINE VS COARSE ROUTING
Under the simple model above it becomes easy to see why fine granularity scatter/gather is not efficient.
1. Want to stick to large packets to amortize data payload per extra bits. Having an address per word of data would likely cut the amount of data in a fixed size packet in half.
2. Even if the packet did include an address per data word, the packet itself is routed to one MC. Given memory is banked to MCs, scatter/gather to random addresses (even if this random set of addresses contains good data locality) would likely need to go to different MCs, again cutting the amount of data per packet.
Given a wide interconnect bus (large packets), an attempt to solve (2) would require coalescing buffers per MC on each ALU unit (buffer space to collect enough requests per MC to fill a large packet). As the number of MCs increase the coalescing buffer space per MC would likely decrease (unfriendly scaling for a set buffer size). Fixed function buffer space would likely trade area from ALU, local store or register area.
Note here that if packets switch from being routed to/from MCs to ALUs (which might be the case if the GPU had hardware support for queues for a programmable pipeline), the coalescing buffer problem remains. Attempting to work around this problem might require either a smaller packet size, or some kind of hierarchical network (more on this below in GPU Problem Solving).
TEXTURES, FRAMEBUFFERS, AND FIXED FUNCTION
Texture units provide fine granularity gather. Continuing the above model, the compromise of having a huge amount of addressing/extra bits per result payload, is partly mitigated by only locally routing requests to a single cluster shared texture unit per ALU. However the cost of the texture units in area is relatively large.
Framebuffer ROP units provide fine granularity scatter. The compromise of high address/extra to payload data ratio is partly mitigated by only routing requests to a single ROP for a full tile of data computed at one ALU unit. The cost of ROP units in area is also relatively large!
See the common pattern, fine granularity gather/scatter, but only efficient for tight data locality. Both solve the network problem by routing to/from one location. For the ROPs, having an address mapping which maps 2D tiles to ROPs key.
ROPs in conjunction with fixed function raster hardware, also provide a combined atomic operation (min or max) conditional scatter of a packet of data (think in terms of multiple render targets with Z and stencil). This is a tremendously powerful construct which is very hard to emulate efficiently in compute.
TANGENT ON DX11
A good amount of chip area is used for fixed function, might as well use both fixed function and compute to maximize what can be done with the GPUs of the next generation. DX11 API offers some interesting possibilities depending on what decisions IVHs made with the hardware (assuming I'm understanding DX11 correctly)...
Two possibilities include,
1. Higher fixed function triangle setup throughput. The fast path with the low to mid-end cards of the DX10 generation for data scatter was to do point scatter with all the computation done in the vertex shader and use the fixed function hardware to scatter via the ROP units. Towards higher end cards, triangle rates did NOT scale enough to make this the fast path.
2. RT UAVs. Render Target Unordered Access Views could be an interesting construct to play with, if this path accesses memory via a tiled mapping (possibly backed by a little cache). Meaning in theory RT UAVs might support data local scatter/gather better than via global memory access. Better could mean either lower bandwidth or lower ALU work (which would be required to simulate scatter via compute methods).
Going to have to wait for DX11 cards to go public to know for sure.
Essence of the GPU
MAKING THE COMMON CASE FAST
For a program to scale, the majority of work must be independently parallel.
1. The CPU does not provide an avenue to solve problems which require high parallel throughput.
2. However the GPU provides both parallel throughput and the option to trade parallel throughput for emulation of non-parallel constructs.
With high parallel throughput, other less used constructs (less used because they would not scale, for example memory coherency, fine granularity random access, etc) can be emulated in software on parallel hardware. This also provides the programmer ability to use extra program specific knowledge to tailor a better (than general purpose CPU) solution to the problem.
THE LIE THAT GPU PROGRAMMING IS INHERENTLY HARD
Had the situation been reversed, and a majority of programmers were programming on highly parallel machines, given a serial machine, the common programmer would have a really hard time solving problems!
Intuition on GPU Problem Solving
BUILD A SCALABLE NETWORK, AND YOU HAVE THE SOLUTION TO YOUR PROBLEM!
Best technique I've found is to describe the problem in terms of a network (or graph) of transforms on the data. Common problems, such as the following, all have a similar solution.
1. Data transfer between nodes in the network is too fine grained.
2. Need to scatter to too many nodes, or gather from too many nodes.
3. The node requires too much local store space.
The solution is to build some kind of hierarchical network and divide the problem into smaller parts. This type of problem solving is the parallel version of the divide and conquer method common to solving classic serial CPU (problems like sorting).
Parallel machines force you to factor your problem. How the problem is factored (or how the cost of computation or communication is amortized) can be seen directly in the network (or graph) that is the solution to the problem!
With serial CPU programming, for the most part, programmers do not even think about the cost of gathering data to process because of the giant fixed function CPU and memory hierarchy. With a parallel machine, the cost to gather data to process is the cost to route that data through the network. With current GPUs, the data transferred between distant nodes in the network (or graph) gets transferred to and from global memory. Near nodes (threads) communicate through a local store (small shared memory).
LINK TO BRAIDED PARALLELISM
Given the ability for the hierarchical elements of the network to scale in width (amount of parallelism) and height (of a reduction tree for example), then the solution can scale to either problem size or machine width.
Two core options for the network/graph (can mix/match these),
1. Fixed routing, jobs are draw calls (actual draws or compute invocations). For a fixed graph, run time can be bound to a relatively fixed cost (easy to meet frame rate requirements for example).
2. Braided parallelism via variable routing, uber-kernel with "locked" threads. Enables the network/graph to adapt at run-time to the varying needs of the program.
In Closing
HYPOTHETICAL GPU MODEL VS REAL HARDWARE
Quick CUDA tests on a GTX 275 of a simple non-atomic global read-increment-write kernel, with varying amounts of possible segment bank collisions show quite different run times. Performance when all blocks access the same segment is many times worse than when blocks access different memory segments. This result is predicted well by the hypothetical GPU model described at the start of this post.
Note actual global memory performance on the GTX 275 is quite complex. For example, when running more than one block per multiprocessor, in at least one test it appeared that it is more important that the blocks access different segment banks, than if the individual warps in a given block access different banks (suggests scheduling effects).
Decided to look at the issues of GPU memory from a different perspective, from the perspective of the packets of information which need to be routed around on chip. I'm using the CUDA Compute 1.3 model as a rough guide, and building an oversimplified hypothetical GPU model which probably is different from the actual hardware.
The aim of this post is to cover some of the reasons GPUs have a different architecture than CPUs, and to provide a natural intuition on how to go about solving problems on the GPU.
Hypothetical GPU Model
The GPU modeled in this post has a bunch of independent ALU units connected to a bunch of independent MC (memory controllers) via a very fast on-chip interconnect network. If there are N MCs then in this example, global memory is banked N-way, so that memory requests can be load-balanced between the MCs. The individual bank width is large (perhaps 32 bytes) to amortize the bits used for address per read/write transaction (for efficient on-chip routing and GDDR access).
INTERCONNECT PACKETS FOR GLOBAL MEMORY ACCESS
Since Compute 1.3 supports 32B (B=byte) segments, the hypothetical hardware model routes 32B data segments to and from the MCs. Each routed data packet also includes extra bits such as segment address, type of transaction (read, write, or atomic operation), mask bits (to mask words of the operation), packet routing header, etc.
Note I'm working under the assumption that global atomic ALU ops are done at the MC (it might not be, or it might be done at the ROP in real hardware).
Using this model, larger segment transactions (and atomics which have 2 operands, or work on 64-bit values), require multiple packets to service a single request. Below is a rough idea of packets,
transaction ........ {send} -- {receive}
-------------------- -------------------
write .............. {extra bits, 32B segment} -- {}
read ............... {extra bits} -- {return extra bits, 32B segment}
atomic ............. {extra bits, 32B atomic operand(s)} -- {}
atomic w/ return ... {extra bits, 32B atomic operand(s)} -- {extra, 32B old values}
FINE VS COARSE ROUTING
Under the simple model above it becomes easy to see why fine granularity scatter/gather is not efficient.
1. Want to stick to large packets to amortize data payload per extra bits. Having an address per word of data would likely cut the amount of data in a fixed size packet in half.
2. Even if the packet did include an address per data word, the packet itself is routed to one MC. Given memory is banked to MCs, scatter/gather to random addresses (even if this random set of addresses contains good data locality) would likely need to go to different MCs, again cutting the amount of data per packet.
Given a wide interconnect bus (large packets), an attempt to solve (2) would require coalescing buffers per MC on each ALU unit (buffer space to collect enough requests per MC to fill a large packet). As the number of MCs increase the coalescing buffer space per MC would likely decrease (unfriendly scaling for a set buffer size). Fixed function buffer space would likely trade area from ALU, local store or register area.
Note here that if packets switch from being routed to/from MCs to ALUs (which might be the case if the GPU had hardware support for queues for a programmable pipeline), the coalescing buffer problem remains. Attempting to work around this problem might require either a smaller packet size, or some kind of hierarchical network (more on this below in GPU Problem Solving).
TEXTURES, FRAMEBUFFERS, AND FIXED FUNCTION
Texture units provide fine granularity gather. Continuing the above model, the compromise of having a huge amount of addressing/extra bits per result payload, is partly mitigated by only locally routing requests to a single cluster shared texture unit per ALU. However the cost of the texture units in area is relatively large.
Framebuffer ROP units provide fine granularity scatter. The compromise of high address/extra to payload data ratio is partly mitigated by only routing requests to a single ROP for a full tile of data computed at one ALU unit. The cost of ROP units in area is also relatively large!
See the common pattern, fine granularity gather/scatter, but only efficient for tight data locality. Both solve the network problem by routing to/from one location. For the ROPs, having an address mapping which maps 2D tiles to ROPs key.
ROPs in conjunction with fixed function raster hardware, also provide a combined atomic operation (min or max) conditional scatter of a packet of data (think in terms of multiple render targets with Z and stencil). This is a tremendously powerful construct which is very hard to emulate efficiently in compute.
TANGENT ON DX11
A good amount of chip area is used for fixed function, might as well use both fixed function and compute to maximize what can be done with the GPUs of the next generation. DX11 API offers some interesting possibilities depending on what decisions IVHs made with the hardware (assuming I'm understanding DX11 correctly)...
Two possibilities include,
1. Higher fixed function triangle setup throughput. The fast path with the low to mid-end cards of the DX10 generation for data scatter was to do point scatter with all the computation done in the vertex shader and use the fixed function hardware to scatter via the ROP units. Towards higher end cards, triangle rates did NOT scale enough to make this the fast path.
2. RT UAVs. Render Target Unordered Access Views could be an interesting construct to play with, if this path accesses memory via a tiled mapping (possibly backed by a little cache). Meaning in theory RT UAVs might support data local scatter/gather better than via global memory access. Better could mean either lower bandwidth or lower ALU work (which would be required to simulate scatter via compute methods).
Going to have to wait for DX11 cards to go public to know for sure.
Essence of the GPU
MAKING THE COMMON CASE FAST
For a program to scale, the majority of work must be independently parallel.
1. The CPU does not provide an avenue to solve problems which require high parallel throughput.
2. However the GPU provides both parallel throughput and the option to trade parallel throughput for emulation of non-parallel constructs.
With high parallel throughput, other less used constructs (less used because they would not scale, for example memory coherency, fine granularity random access, etc) can be emulated in software on parallel hardware. This also provides the programmer ability to use extra program specific knowledge to tailor a better (than general purpose CPU) solution to the problem.
THE LIE THAT GPU PROGRAMMING IS INHERENTLY HARD
Had the situation been reversed, and a majority of programmers were programming on highly parallel machines, given a serial machine, the common programmer would have a really hard time solving problems!
Intuition on GPU Problem Solving
BUILD A SCALABLE NETWORK, AND YOU HAVE THE SOLUTION TO YOUR PROBLEM!
Best technique I've found is to describe the problem in terms of a network (or graph) of transforms on the data. Common problems, such as the following, all have a similar solution.
1. Data transfer between nodes in the network is too fine grained.
2. Need to scatter to too many nodes, or gather from too many nodes.
3. The node requires too much local store space.
The solution is to build some kind of hierarchical network and divide the problem into smaller parts. This type of problem solving is the parallel version of the divide and conquer method common to solving classic serial CPU (problems like sorting).
Parallel machines force you to factor your problem. How the problem is factored (or how the cost of computation or communication is amortized) can be seen directly in the network (or graph) that is the solution to the problem!
With serial CPU programming, for the most part, programmers do not even think about the cost of gathering data to process because of the giant fixed function CPU and memory hierarchy. With a parallel machine, the cost to gather data to process is the cost to route that data through the network. With current GPUs, the data transferred between distant nodes in the network (or graph) gets transferred to and from global memory. Near nodes (threads) communicate through a local store (small shared memory).
LINK TO BRAIDED PARALLELISM
Given the ability for the hierarchical elements of the network to scale in width (amount of parallelism) and height (of a reduction tree for example), then the solution can scale to either problem size or machine width.
Two core options for the network/graph (can mix/match these),
1. Fixed routing, jobs are draw calls (actual draws or compute invocations). For a fixed graph, run time can be bound to a relatively fixed cost (easy to meet frame rate requirements for example).
2. Braided parallelism via variable routing, uber-kernel with "locked" threads. Enables the network/graph to adapt at run-time to the varying needs of the program.
In Closing
HYPOTHETICAL GPU MODEL VS REAL HARDWARE
Quick CUDA tests on a GTX 275 of a simple non-atomic global read-increment-write kernel, with varying amounts of possible segment bank collisions show quite different run times. Performance when all blocks access the same segment is many times worse than when blocks access different memory segments. This result is predicted well by the hypothetical GPU model described at the start of this post.
Note actual global memory performance on the GTX 275 is quite complex. For example, when running more than one block per multiprocessor, in at least one test it appeared that it is more important that the blocks access different segment banks, than if the individual warps in a given block access different banks (suggests scheduling effects).
20090811
20090810
Macton’s Posterous
Mike Acton: Recent Sketches on Concurrency ... almost lost this set of awesome presentations in the twitter sphere!
Siggraph 2009: Thursday - State
Continuing Beyond Programmable Shading topics...
GPU vs CPU State
TLB/etc on CPU provide virtual memory via address mapping. GPU state (such as buffers, textures and render targets) serves a similar purpose as the TLB/etc in terms of providing a special address mapping, however GPU state is more useful in that GPU state provides the information which enables an address mapping designed for efficient 2D and 3D data locality.
Not sure if GPUs will someday get virtual memory, I can see great arguments both ways.
CPU provides ISA which enables changing TLB, page tables, and other system state. In contrast GPUs provide a fixed function command buffer interface to change GPU state.
At some point the graphics API (and possibly the GPU) will need the ability to change GPU state without the CPU in order to keep on evolving...
GPU vs CPU State
TLB/etc on CPU provide virtual memory via address mapping. GPU state (such as buffers, textures and render targets) serves a similar purpose as the TLB/etc in terms of providing a special address mapping, however GPU state is more useful in that GPU state provides the information which enables an address mapping designed for efficient 2D and 3D data locality.
Not sure if GPUs will someday get virtual memory, I can see great arguments both ways.
CPU provides ISA which enables changing TLB, page tables, and other system state. In contrast GPUs provide a fixed function command buffer interface to change GPU state.
At some point the graphics API (and possibly the GPU) will need the ability to change GPU state without the CPU in order to keep on evolving...
20090809
Siggraph 2009: Thursday - Braided Parallelism II
Continuing Beyond Programmable Shading topics...
Braided Parallelism II
Along with braided parallelism is the issue of programs with more variable run-times. Variable run-time programs implies the need to run more than one kind of program at the same time on the same core (perhaps through some future MPMD machine, or emulated MPMD via an uber-binary in a SPMD machine).
Core to both these issues is the possible problem of register allocation between running programs. Static allocation of chunks of registers to a program could become inefficient when simultaneously running multiple different programs which have different amounts of register occupancy or running programs which vastly vary in the number of live registers over the course of execution.
Not sure how much of a problem this will be in practice on typical usage cases.
The GPU wants to maximize the number of running programs which fit in the local register and shared-register space (also called group share or shared memory). This is important to maximize the amount of latency hiding capacity.
Larrabee skirts this issue with a smaller register file and a cache hierarchy. The program's stack tends to provide a good description of data locality (top of the stack is more likely to get used). I'm going to assume that register files are less area efficient than a block accessed local memory (because of the crossbar(s), etc). However having a coherent cache adds other complexities such as a TLB (TLB issues covered in another post).
Will be interesting to see how ATI and NVIDIA GPUs evolve with regards to this issue...
Braided Parallelism II
Along with braided parallelism is the issue of programs with more variable run-times. Variable run-time programs implies the need to run more than one kind of program at the same time on the same core (perhaps through some future MPMD machine, or emulated MPMD via an uber-binary in a SPMD machine).
Core to both these issues is the possible problem of register allocation between running programs. Static allocation of chunks of registers to a program could become inefficient when simultaneously running multiple different programs which have different amounts of register occupancy or running programs which vastly vary in the number of live registers over the course of execution.
Not sure how much of a problem this will be in practice on typical usage cases.
The GPU wants to maximize the number of running programs which fit in the local register and shared-register space (also called group share or shared memory). This is important to maximize the amount of latency hiding capacity.
Larrabee skirts this issue with a smaller register file and a cache hierarchy. The program's stack tends to provide a good description of data locality (top of the stack is more likely to get used). I'm going to assume that register files are less area efficient than a block accessed local memory (because of the crossbar(s), etc). However having a coherent cache adds other complexities such as a TLB (TLB issues covered in another post).
Will be interesting to see how ATI and NVIDIA GPUs evolve with regards to this issue...
Siggraph 2009: Thursday - Issue of Coherent Cache
Continuing Beyond Programmable Shading topics...
Coherent Caches
1. Given equal estimated {area, process, power}, where is the performance advantage of switching chip area from ALU to CACHE, and what is the right balance?
2. Given a performance advantage is found in (1), is the advantage from simply having a larger local store, or is cache coherency a critical factor?
3. How do algorithms which make use of cache coherency scale as GPUs continue to grow ever wider and more parallel?
Anyone else notice that a majority the examples Intel has shown on their rendering pipeline treats the cache as a program managed local store?
Coherent Caches
1. Given equal estimated {area, process, power}, where is the performance advantage of switching chip area from ALU to CACHE, and what is the right balance?
2. Given a performance advantage is found in (1), is the advantage from simply having a larger local store, or is cache coherency a critical factor?
3. How do algorithms which make use of cache coherency scale as GPUs continue to grow ever wider and more parallel?
Anyone else notice that a majority the examples Intel has shown on their rendering pipeline treats the cache as a program managed local store?
Siggraph 2009: Thursday - Braided Parallelism
For me Thursday was a full day of Beyond Programmable Shading I and II, slides can be found here. There were a bunch of rather interesting talks and topics, and I'm going to attempt to elaborate on some of the core themes in a bunch of posts starting with this one...
Braided Parallelism
Scalability of a data and task parallel work mix is limited by data dependencies. It is the programmers responsibility to insure there is enough independent work to do between dependencies.
One important difference between task and data parallel work is that task parallel work cannot fill the entire width (all cores across the entire chip) of the machine at one time, and often data parallel work can.
Given this crucial observation, the real meaning of braided parallelism becomes clear: efficient braided parallelism implies that a machine can pipeline and execute (without idle ALU units) jobs which do NOT fill the entire parallel width of the machine.
Task parallel work likely involves data parallel elements, so perhaps a new term "task data parallel" should be used. As a machine gets wider, more and more work will likely fall into the task data parallel category.
A CPU like architecture is NOT needed for braided parallelism, and NVidia's OptiX ray engine is a good example of this. Seems safe to predict that future GPUs (and perhaps DX11 GPUs) will continue to get better at pipelining small jobs, and thus more efficient at braided parallelism.
Will be covering issues regarding the efficiency of braided parallelism on GPUs in later posts...
Braided Parallelism
Scalability of a data and task parallel work mix is limited by data dependencies. It is the programmers responsibility to insure there is enough independent work to do between dependencies.
One important difference between task and data parallel work is that task parallel work cannot fill the entire width (all cores across the entire chip) of the machine at one time, and often data parallel work can.
Given this crucial observation, the real meaning of braided parallelism becomes clear: efficient braided parallelism implies that a machine can pipeline and execute (without idle ALU units) jobs which do NOT fill the entire parallel width of the machine.
Task parallel work likely involves data parallel elements, so perhaps a new term "task data parallel" should be used. As a machine gets wider, more and more work will likely fall into the task data parallel category.
A CPU like architecture is NOT needed for braided parallelism, and NVidia's OptiX ray engine is a good example of this. Seems safe to predict that future GPUs (and perhaps DX11 GPUs) will continue to get better at pipelining small jobs, and thus more efficient at braided parallelism.
Will be covering issues regarding the efficiency of braided parallelism on GPUs in later posts...
Siggraph 2009: Thursday - Id Tech 5
Id Tech 5 Challenges
Highlights of the Beyond Programmable Shading presentation from Id.
Completely off topic, but I really like the sky-box seen in the screen shots. Great atmospheric diffusion, and a contrast range which photographers get through careful use of a polarizing filter to darken the sky in combination with proper exposure control (perhaps with a graduated neutral density filter) to keep the highlights from burning out too much.
VIRTUAL TEXTURING
The screen shots show how they map their UV texture space into the 128K x 128K texels. One can see a mix of world textures and character face textures in the virtual texture pages. Surely there is a relatively high importance in keeping good spacial locality for large multi-tile reads to minimize seek time.
They are not using tri-linear filtering, and instead use bi-linear only. In order to handle the LOD pop issue (caused by optical media seek latency going from a low resolution to high resolution tile), they first up-sample the lower mip, then continuously re-transcode (blend up-sampled tile with decoded tile from higher disc compression, then re-compress to DXT format and update the physical tile on the GPU). Should provide a visually seemless experience, in which detail smoothly blends in.
Over-subscription of the physical texture space is solved via adjusting a global LOD bias slider until the entire working set fits.
JOB SYSTEM
Their job system has job lists (which of which are fully independent) with signal and block on signal constructs to enforce scheduling dependencies. By design, job processing is given 1 frame of latency to complete.
They provided some approximate job costs,
2 ms - animation blending
2 ms - sorting for transparency
4 ms - collision detection
4 ms - obstacle avoidance
4 ms - misc
4 ms - audio
8 ms - virtual texturing
10 ms - rendering
RENDERING
They have sorted alpha blended foliage and particles. They also have a system for detail model generation which items such as rocks and pebbles (perhaps similar to a more advanced shrub system). This is an interesting blend between unique texturing, and instanced stuff to add tiny details.
Not in the presentation, but mentioned after the talk, they ultimately decided not to use Edge or SPU geometry processing. Seems as if the trade off was between higher performance (especially for a vertex bound smaller virtual texture feedback render target) at the cost of needing to keep vertex and index data CPU side for SPU processing, and likely they had free GPU memory (the physical texture for virtual texturing is limited by GPU texture size limits) but were pushing the limit on CPU memory.
GPU JOBS
They mentioned that they are starting to look into running some jobs GPU side, investing in some forward looking usage of the compute capacity of current and new hardware.
Highlights of the Beyond Programmable Shading presentation from Id.
Completely off topic, but I really like the sky-box seen in the screen shots. Great atmospheric diffusion, and a contrast range which photographers get through careful use of a polarizing filter to darken the sky in combination with proper exposure control (perhaps with a graduated neutral density filter) to keep the highlights from burning out too much.
VIRTUAL TEXTURING
The screen shots show how they map their UV texture space into the 128K x 128K texels. One can see a mix of world textures and character face textures in the virtual texture pages. Surely there is a relatively high importance in keeping good spacial locality for large multi-tile reads to minimize seek time.
They are not using tri-linear filtering, and instead use bi-linear only. In order to handle the LOD pop issue (caused by optical media seek latency going from a low resolution to high resolution tile), they first up-sample the lower mip, then continuously re-transcode (blend up-sampled tile with decoded tile from higher disc compression, then re-compress to DXT format and update the physical tile on the GPU). Should provide a visually seemless experience, in which detail smoothly blends in.
Over-subscription of the physical texture space is solved via adjusting a global LOD bias slider until the entire working set fits.
JOB SYSTEM
Their job system has job lists (which of which are fully independent) with signal and block on signal constructs to enforce scheduling dependencies. By design, job processing is given 1 frame of latency to complete.
They provided some approximate job costs,
2 ms - animation blending
2 ms - sorting for transparency
4 ms - collision detection
4 ms - obstacle avoidance
4 ms - misc
4 ms - audio
8 ms - virtual texturing
10 ms - rendering
RENDERING
They have sorted alpha blended foliage and particles. They also have a system for detail model generation which items such as rocks and pebbles (perhaps similar to a more advanced shrub system). This is an interesting blend between unique texturing, and instanced stuff to add tiny details.
Not in the presentation, but mentioned after the talk, they ultimately decided not to use Edge or SPU geometry processing. Seems as if the trade off was between higher performance (especially for a vertex bound smaller virtual texture feedback render target) at the cost of needing to keep vertex and index data CPU side for SPU processing, and likely they had free GPU memory (the physical texture for virtual texturing is limited by GPU texture size limits) but were pushing the limit on CPU memory.
GPU JOBS
They mentioned that they are starting to look into running some jobs GPU side, investing in some forward looking usage of the compute capacity of current and new hardware.
20090805
Siggraph 2009: Wednesday
Alternative Rendering Pipelines in CUDA
Slides Off This Page
The first talk was about Ray tracing and REYES in CUDA. The second on OptiX.
There was one very important overwhelming theme NVidia presented which marks a distinct shift in how to program in CUDA (and get at the fast path on current hardware). The shift is the idea of persistent threads pulling work from queues and the usage of uber-kernels!
This has been hinted at prior to Siggraph in some GRAMPS discussion, and has been detailed in the Understanding the Efficiency of Ray Traversal on GPUs paper. The core problem with hardware block scheduling, is that one single long-running thread stalls the entire block. The solution to this problem is to issue only enough threads to fill the GPU once, and then pull work from queues (use work stealing to balance cores). Stalls can be limited to warp granularity, and warps can run separate "kernels" from one giant uber-kernel. Note is this effectively almost a form of software MPMD on a SPMD machine.
Cost of warp-coherent dynamic branching to different sub-kernels can be either linear or a log cost binary search (since CUDA doesn't currently have branch by register). Also register count for the uber-kernel is going to be the maximum register count used by any sub-kernel, but the advantage in ALU utilization from software scheduling greatly out-weighs the possible less efficient register usage for some sub-kernels!
The talk did very briefly cover the best way to fetch from a queue in CUDA (but not in enough detail IMO, likely because of a lack of time to go there). Anyway I've got a 2nd huge CUDA mega post in progress which covers many parallel programming constructs in detail which can be useful in the new age of alternative and conventional rendering pipelines which use "compute". Covering things like the performance of shared memory atomics (serialization cases, when to avoid them, etc), CS4 vs CS5, and more. This is going to have to wait for next week, but it is going to be a great follow up to details which are either too technical or too time consuming to place in these great Siggraph talks.
OptiX
OptiX is NVidia's new interactive ray tracing engine, designed to provide a generic shader interface which will scale by GPU and scale to future NVidia architectures.
Seems like the core CUDA of OptiX is shader and ray software scheduling via uber-kernel and persistent threads. Where each ray has a shader defined state packet, and shaders can spawn rays.
There are two very important programming constructs likely being used here (arg, missed what I think was a HPG talk on this before Siggraph):
1. CONTINUATION BASED PROGRAMMING. Looks like the shader defined ray state packet gets passed between sub-kernels. Likely ray creation places ray on queue with packet.
2. PERSISTENT THREADS WITH CO-OPERATIVE MULTI-TASKING? Looks like shaders can have local data and spawn rays. So likely at some set points while shader is running (ie after it needs to wait on a ray result), local state is returned to continuation queue, and uber-kernel thread scheduler pulls another warp collection of coherent jobs to run? I need to think this through in detail and perhaps try it before I will know for sure.
Something Awesome Is Brewing
Look at what constructs are core to OptiX, and think about how OptiX is designed for both current and future GPUs! Everyone is in the race towards redefining graphics in terms of a "general purpose parallel machine" and NVidia is showing an elegant, easy to program model which scales forward via OptiX.
A little future hardware acceleration to efficiently support thread continuations (and dynamic grouping for efficient parallel execution), and you've got something truly wonderful.
That might not be it, "it" could be something totally different, but whatever "it" is, "it" is going to be awesome, and I'm overflowing with ideas on how to build things with "it".
Via Siggraph you get a glimpse into the minds of some of those involved in shaping the future of graphics. There is something magical about the insane competition going on between Intel, AMD, and NVidia. Something which I believe is going to move us into the age of fully programmable graphics in the not so distant future!
3D Display
First I don't see 3D as a gimic, in fact I just about only goto the movie theater for 3D movies, everything else is Blue-ray or DVD for me, so I'm very biased. I'd like the technology to take off.
Had a chance to try out NVidia's 3D glasses, and was impressed. The high refresh rate was very smooth. Unfortunately (for me, but great for their target audience) the example they had was an interactive three screen volume raycast of a scanned human body, so actual display update was 10 fps or so. Didn't get a chance to see it under high motion, or high depth range (between ultra foreground and horizon). Apparently you render both frames then do the swap, so one gets 60Hz per eye of physical scan out, with rendered frames at or under 60Hz.
Other vendor had a real 3D display, no glasses required (perhaps something over the display to insure interleaved vertical lines only hit the left or right eye). Another vendor had a 2 monitor solution, one upright, the other table top, and a half angle physical sheet to combine both polarized views.
OpenGL BOF
Missed it, got busy after the GRAMPS talk, did walk in at the end and got rewarded with a great 3.2 API Quick Reference Card.
Didn't realize it at the time, since I don't really know people yet, but likely gave Barthold at NVidia trouble about the divergent DX11 and GL paths (how DX11 mixes compute and graphics, and how since GL and CL are separate, that it is going to be a problem to port apps, and things like tessellation which likely could benefit from compute features, seems like a problem).
What I got in return was some very insightful responses, and good humor even given my very direct comments! Thanks Barthold, or whoever at Khronos/NVidia, for putting up with me!
One Day Down
Finished off Wednesday at Siggraph, and realized I'm a lot more scatter brained in person than I thought I would be when meeting many people I've never talked to before in person. Must be the programmers curse, great at language through code, traded off for bad at language through speech?
Slides Off This Page
The first talk was about Ray tracing and REYES in CUDA. The second on OptiX.
There was one very important overwhelming theme NVidia presented which marks a distinct shift in how to program in CUDA (and get at the fast path on current hardware). The shift is the idea of persistent threads pulling work from queues and the usage of uber-kernels!
This has been hinted at prior to Siggraph in some GRAMPS discussion, and has been detailed in the Understanding the Efficiency of Ray Traversal on GPUs paper. The core problem with hardware block scheduling, is that one single long-running thread stalls the entire block. The solution to this problem is to issue only enough threads to fill the GPU once, and then pull work from queues (use work stealing to balance cores). Stalls can be limited to warp granularity, and warps can run separate "kernels" from one giant uber-kernel. Note is this effectively almost a form of software MPMD on a SPMD machine.
Cost of warp-coherent dynamic branching to different sub-kernels can be either linear or a log cost binary search (since CUDA doesn't currently have branch by register). Also register count for the uber-kernel is going to be the maximum register count used by any sub-kernel, but the advantage in ALU utilization from software scheduling greatly out-weighs the possible less efficient register usage for some sub-kernels!
The talk did very briefly cover the best way to fetch from a queue in CUDA (but not in enough detail IMO, likely because of a lack of time to go there). Anyway I've got a 2nd huge CUDA mega post in progress which covers many parallel programming constructs in detail which can be useful in the new age of alternative and conventional rendering pipelines which use "compute". Covering things like the performance of shared memory atomics (serialization cases, when to avoid them, etc), CS4 vs CS5, and more. This is going to have to wait for next week, but it is going to be a great follow up to details which are either too technical or too time consuming to place in these great Siggraph talks.
OptiX
OptiX is NVidia's new interactive ray tracing engine, designed to provide a generic shader interface which will scale by GPU and scale to future NVidia architectures.
Seems like the core CUDA of OptiX is shader and ray software scheduling via uber-kernel and persistent threads. Where each ray has a shader defined state packet, and shaders can spawn rays.
There are two very important programming constructs likely being used here (arg, missed what I think was a HPG talk on this before Siggraph):
1. CONTINUATION BASED PROGRAMMING. Looks like the shader defined ray state packet gets passed between sub-kernels. Likely ray creation places ray on queue with packet.
2. PERSISTENT THREADS WITH CO-OPERATIVE MULTI-TASKING? Looks like shaders can have local data and spawn rays. So likely at some set points while shader is running (ie after it needs to wait on a ray result), local state is returned to continuation queue, and uber-kernel thread scheduler pulls another warp collection of coherent jobs to run? I need to think this through in detail and perhaps try it before I will know for sure.
Something Awesome Is Brewing
Look at what constructs are core to OptiX, and think about how OptiX is designed for both current and future GPUs! Everyone is in the race towards redefining graphics in terms of a "general purpose parallel machine" and NVidia is showing an elegant, easy to program model which scales forward via OptiX.
A little future hardware acceleration to efficiently support thread continuations (and dynamic grouping for efficient parallel execution), and you've got something truly wonderful.
That might not be it, "it" could be something totally different, but whatever "it" is, "it" is going to be awesome, and I'm overflowing with ideas on how to build things with "it".
Via Siggraph you get a glimpse into the minds of some of those involved in shaping the future of graphics. There is something magical about the insane competition going on between Intel, AMD, and NVidia. Something which I believe is going to move us into the age of fully programmable graphics in the not so distant future!
3D Display
First I don't see 3D as a gimic, in fact I just about only goto the movie theater for 3D movies, everything else is Blue-ray or DVD for me, so I'm very biased. I'd like the technology to take off.
Had a chance to try out NVidia's 3D glasses, and was impressed. The high refresh rate was very smooth. Unfortunately (for me, but great for their target audience) the example they had was an interactive three screen volume raycast of a scanned human body, so actual display update was 10 fps or so. Didn't get a chance to see it under high motion, or high depth range (between ultra foreground and horizon). Apparently you render both frames then do the swap, so one gets 60Hz per eye of physical scan out, with rendered frames at or under 60Hz.
Other vendor had a real 3D display, no glasses required (perhaps something over the display to insure interleaved vertical lines only hit the left or right eye). Another vendor had a 2 monitor solution, one upright, the other table top, and a half angle physical sheet to combine both polarized views.
OpenGL BOF
Missed it, got busy after the GRAMPS talk, did walk in at the end and got rewarded with a great 3.2 API Quick Reference Card.
Didn't realize it at the time, since I don't really know people yet, but likely gave Barthold at NVidia trouble about the divergent DX11 and GL paths (how DX11 mixes compute and graphics, and how since GL and CL are separate, that it is going to be a problem to port apps, and things like tessellation which likely could benefit from compute features, seems like a problem).
What I got in return was some very insightful responses, and good humor even given my very direct comments! Thanks Barthold, or whoever at Khronos/NVidia, for putting up with me!
One Day Down
Finished off Wednesday at Siggraph, and realized I'm a lot more scatter brained in person than I thought I would be when meeting many people I've never talked to before in person. Must be the programmers curse, great at language through code, traded off for bad at language through speech?
20090804
Off to Siggraph
Off to Siggraph 2009 tomorrow, be in around Wednesday at noon.
Any of you run into this ugly dude below and want to talk about graphics, please do!
Any of you run into this ugly dude below and want to talk about graphics, please do!
OpenGL 3.2
Link to GL and GLSL specs.
The Khronos Group continues their steady stride on OpenGL improvements! Really appreciate the hard work, dedication, and the ability to download spec with changes highlighted!
- 64-bit integer support
- geometry shader
- BGRA support
- Draw*BaseVertex(), provides index[i]+base_vertex indexing
- support for binding buffer objects to index targets
- texture fetch from multi-sample textures
- texture fetch from integer buffer texture
- texture fetch from integer 2D rect texture
- seamless cube map filtering
- SampleMaski() for multi-sample coverage mask
- new simple FramebufferTexture() call
- fence and wait on fence support
The Khronos Group continues their steady stride on OpenGL improvements! Really appreciate the hard work, dedication, and the ability to download spec with changes highlighted!
- 64-bit integer support
- geometry shader
- BGRA support
- Draw*BaseVertex(), provides index[i]+base_vertex indexing
- support for binding buffer objects to index targets
- texture fetch from multi-sample textures
- texture fetch from integer buffer texture
- texture fetch from integer 2D rect texture
- seamless cube map filtering
- SampleMaski() for multi-sample coverage mask
- new simple FramebufferTexture() call
- fence and wait on fence support
Aras's Compact Normal Storage
Aras PranckeviÄius : Compact Normal Storage - Impressive method of compressing view space normals into 2 bytes using a sphere-map transform! Page compares with other methods and includes GPU assembly output, timing, and error screen shots.
20090803
EAA Airshow


Had a great time at the annual EAA Airventure in Oshkosh Wisconsin this weekend. Each year pyro junkies get to experience the instant wave of heat and delayed boom of the "wall of flame". However, the best thing at the show each year IMO is the talk given by Burt Rutan, in which you get a glimpse into the mind the most important air and space visionary of our time!
As Seen on Twitter and More
@mike_acton: A quick sketch on why qsort is not a concurrent algorithm. Feedback welcome, as always! link
I'm a big fan of Mike's presentations, the above covering parallel merge sort. I'd also highly suggest, Typical C++ BS, also posted on Twitter (in fact it was after seeing this only on twitter that I decided to get an account).
@bkaradzic: Watching winning 4kb intro from Evoke09... Lunaquatic by BluFlame
Blade Runner Meets CryEngine 2 - Great indirect advertising for Cry Engine 2.
I'm a big fan of Mike's presentations, the above covering parallel merge sort. I'd also highly suggest, Typical C++ BS, also posted on Twitter (in fact it was after seeing this only on twitter that I decided to get an account).
@bkaradzic: Watching winning 4kb intro from Evoke09... Lunaquatic by BluFlame
Blade Runner Meets CryEngine 2 - Great indirect advertising for Cry Engine 2.
20090802
GPU Address Space Mapping
This post addresses some of the issues with texture and 2D/3D array address mappings when mixing compute and fixed function as well as dealing with cache lines and the higher granularity of global memory access of GPUs.
For those developers moving on to compute (CUDA, OpenCL, DX11 CS4/CS5), this is going to be an important thing to have a full understanding of!
I would suggest reading Jake Cannell's Winning my own little battle against Cuda Post for a great example of the problems with mixing fixed function texturing and compute.
Review
CACHE LINES
I'm using the term cache line to refer to both the cached and un-cached case. In the un-cached case, it would be the optimal size of a global memory operation. Cache line size for example, is often 64-bytes for current PCs, nearly sure 64-bytes for Larrabee, and at least for later model NVidia GPUs, global memory segment size is a mix of 32/64/128-bytes depending on coalescing (not sure what actual texture cache line size is).
LINEAR MAPPING
2D array elements (or image pixels) are addressed in linear order. A 64-byte cache line of 32-bit elements provides the following,
0123456789ABCDEF
Often array width is a power of two, and then element address for this 32-bit/element array can be computed using the following equation (3 OPS : 2 SHIFTs, 1 ADD),
address = (x << 2) + (y << constant)
Of which,
constant = 2 + log_base_2(width)
SPACE FILLING CURVE AND/OR TILED MAPPING
Elements are addressed using some kind of space filling curve. This space filling curve in should result in a cache line representing a tile of elements in the array. The tiles themselves can be addressed in linear order (easier for software to do the address translation). Or perhaps the tiles are also addressed in a space filling curve (which might provide benefits in the case of hardware predictive cache line pre-fetching). Or perhaps the entire array is accessed using a consistent space filling curve (no set tiles), in the case where the hardware provides a full bit-interleave instruction (Larrabee provides 2 such instructions).
The space filling curve might be Z-order (or full bit-interleaved), which would be a very expensive transform to do in software without special instructions (following the 16 elements per cache line example),
0145
2367
89CD
ABEF
Or perhaps the tiles are wide Z-order (multi-bit-interleaved), which is much easier to compute in software when the ISA is lacking a bit-interleave instruction,
0123
4567
89AB
CDEF
Still address translation can be expensive (11 OPS : 4 ANDs, 4 SHIFTs, 3 ADDs) even for just linear tile address with multi-bit-interleave for the tiles and power of two sized array width (note only need 3 shifts if elements are one byte, or if one can make use of a ISA which provides a "free" shift in the instruction addressing, like x86),
hiX = x & mask0
hiY = y & mask1
loX = x & mask2
loY = y & mask3
address = hiX << shift0 + hiY << shift1 + loX << shift2 + loY << shift3
The Problems
2D/3D LOCALITY AND JOB DISTRIBUTION
GPUs group pixels and compute jobs into 2D tiles (or blocks) for efficient computation (to maintain good 2D data locality). DX11 provides 3D blocks for CS5. Job or block distribution across cores in the GPU is NDA material, but it is possible to reverse engineer this via careful programming (in some cases). There are all sorts of complexities here, such as a cluster of cores sharing a texture cache.
Linear addressing of 2D data often leads to cache line utilization being low, and cache line thrashing.
COMPUTE ACCESS TO TEXTURES LIMITED COMPARED TO FF/ROP
To my knowledge, texture access in compute is limited to linear addressed textures. This either may or might NOT change for DX11. And if it does, it might not change for all vendors. In theory it should be possible for Larrabee, all depends on what the drivers do!
ROP units provide coalescing, might be in the form of write-combine caching, and/or in perhaps in the form of queuing ROP requests. Either way fixed function render targets are rather impressively fast, even for fine granularity scatter (try on a middle-end NVidia DX10 card to point scatter via drawing points in GL vs global memory scatter in CUDA). Fixed function ROP also provides write to MSAA render targets.
Compute global accesses are coarse granularity accesses, and likely might stay un-cached on some GPUs.
The point here is that compute access to render targets might NOT be as fast as the fixed function rendering path. Clearly Larrabee could be an exception here.
MIXING CS AND PS
If a CS pass builds something for a PS pass, then the PS pass might have to sample from a linear addressed texture. If texture cache access or memory bandwidth becomes a bottleneck, and only IF, there might be a few possible work-arounds.
If the texture is not filtered then one can work in a linear texture and do address translation manually. Even if one doesn't know the actual GPU texture address to texel mapping, one could test different software efficient mappings to see if a good ALU to TEX trade-off can be made.
Often thread groups (or blocks) in compute passes will all be accessing a shared "tile" of data anyway, so tile address mapping in CS might not be as bad if one computes a linear tile address, and then uses constants for the in-tile pixel ordering.
In the filtered case, CS producing data for filtered TEX lookups, the graphics and compute APIs don't currently provide a good solution (unless that solution is hidden behind the driver and hardware).
BYTES PER PIXEL VS BYTES PER CACHE LINE
In the physical cached case, as the number of bytes per pixel approaches bytes per cache line, linear vs tiled address order "should" become less important. For example with 64-byte cache lines,
4... 8... 16 32 --- bytes per pixel
---- ---- -- --
0123 0123 01 01 --- possible software cache line tile mapping
4567 4567 23
89AB
CDEF
A G-buffer or some other fat "logical texture", might have a logical 16-bytes/pixel, but this is divided perhaps across 4 textures with 4-bytes/pixel physically, resulting in the problem case again.
DISCLOSURE OF CACHE DESIGN
In the CPU space, cache details are mostly fully public knowledge. Often the OS even provides a way for software to query cache line details of all the cache levels. Some algorithms actually adapt themselves to cache line size (cache-aware), and there is growing research into cache-oblivious algorithms which naturally work well regardless of cache details.
In the GPU space, cache details (cache as in TEX) are to my knowledge all NDA material, and also likely a bit part of the secret sauce which enables some vendor advantage. GPUs caches likey have various important differences from CPU caches, such as in N-way where N is larger because DX/GL provide access to a much larger number of simultaneous arrays (textures) than would be typical in CPU space, or perhaps different pre-fetching logic (if there is even any automatic pre-fetching at all).
Note a smart programmer can reverse engineer cache design using careful programming, however often those who don't have NDA access also don't have a full set of graphics cards representing the market to test.
In terms of cache as program managed local store, NVidia's CUDA docs provide great documentation on most of the important issues regarding global memory access on NVidia's architecture. Exceptions such as how memory requests are parallalized to the memory controllers can be reversed engineered via software, but would again require having a large number of cards to test (cards have different numbers of MCs).
ATI also likely provides great docs, but I haven't looked into detail on their latest R600/R700 memory architecture yet.
DISCLOSURE OF TEXTURE ADDRESS MAPPING
All NDA material, sometimes possible to reverse engineer. Details such as tile size, space filling curve, etc, all hidden, and likely for good reason. Many programmers would cry themselves to sleep at night attempting to deal with all these details on all GPUs.
Possible Future Solutions To This Mess
The largest problem here is that vendors would all have to agree on a good solution to make it a "good" solution. Vendors opening up their texture address mapping would result in a mess of lots of vendor or card specific code in engine.
Larrabee's 2 bit-interleave instructions and NO ROP design, is an interesting solution to the problem. Yet I don't know anything about Larrabee's texture address mapping to know if this actually is a solution to the problem of CS stages writing to filterable cache friendly textures.
I've got another idea in mind for a general solution, but it requires a change in hardware, which likely is NOT going to be adopted via all vendors unless some central controlling entity like Microsoft decided to force it! So never going to happen!
For those developers moving on to compute (CUDA, OpenCL, DX11 CS4/CS5), this is going to be an important thing to have a full understanding of!
I would suggest reading Jake Cannell's Winning my own little battle against Cuda Post for a great example of the problems with mixing fixed function texturing and compute.
Review
CACHE LINES
I'm using the term cache line to refer to both the cached and un-cached case. In the un-cached case, it would be the optimal size of a global memory operation. Cache line size for example, is often 64-bytes for current PCs, nearly sure 64-bytes for Larrabee, and at least for later model NVidia GPUs, global memory segment size is a mix of 32/64/128-bytes depending on coalescing (not sure what actual texture cache line size is).
LINEAR MAPPING
2D array elements (or image pixels) are addressed in linear order. A 64-byte cache line of 32-bit elements provides the following,
0123456789ABCDEF
Often array width is a power of two, and then element address for this 32-bit/element array can be computed using the following equation (3 OPS : 2 SHIFTs, 1 ADD),
address = (x << 2) + (y << constant)
Of which,
constant = 2 + log_base_2(width)
SPACE FILLING CURVE AND/OR TILED MAPPING
Elements are addressed using some kind of space filling curve. This space filling curve in should result in a cache line representing a tile of elements in the array. The tiles themselves can be addressed in linear order (easier for software to do the address translation). Or perhaps the tiles are also addressed in a space filling curve (which might provide benefits in the case of hardware predictive cache line pre-fetching). Or perhaps the entire array is accessed using a consistent space filling curve (no set tiles), in the case where the hardware provides a full bit-interleave instruction (Larrabee provides 2 such instructions).
The space filling curve might be Z-order (or full bit-interleaved), which would be a very expensive transform to do in software without special instructions (following the 16 elements per cache line example),
0145
2367
89CD
ABEF
Or perhaps the tiles are wide Z-order (multi-bit-interleaved), which is much easier to compute in software when the ISA is lacking a bit-interleave instruction,
0123
4567
89AB
CDEF
Still address translation can be expensive (11 OPS : 4 ANDs, 4 SHIFTs, 3 ADDs) even for just linear tile address with multi-bit-interleave for the tiles and power of two sized array width (note only need 3 shifts if elements are one byte, or if one can make use of a ISA which provides a "free" shift in the instruction addressing, like x86),
hiX = x & mask0
hiY = y & mask1
loX = x & mask2
loY = y & mask3
address = hiX << shift0 + hiY << shift1 + loX << shift2 + loY << shift3
The Problems
2D/3D LOCALITY AND JOB DISTRIBUTION
GPUs group pixels and compute jobs into 2D tiles (or blocks) for efficient computation (to maintain good 2D data locality). DX11 provides 3D blocks for CS5. Job or block distribution across cores in the GPU is NDA material, but it is possible to reverse engineer this via careful programming (in some cases). There are all sorts of complexities here, such as a cluster of cores sharing a texture cache.
Linear addressing of 2D data often leads to cache line utilization being low, and cache line thrashing.
COMPUTE ACCESS TO TEXTURES LIMITED COMPARED TO FF/ROP
To my knowledge, texture access in compute is limited to linear addressed textures. This either may or might NOT change for DX11. And if it does, it might not change for all vendors. In theory it should be possible for Larrabee, all depends on what the drivers do!
ROP units provide coalescing, might be in the form of write-combine caching, and/or in perhaps in the form of queuing ROP requests. Either way fixed function render targets are rather impressively fast, even for fine granularity scatter (try on a middle-end NVidia DX10 card to point scatter via drawing points in GL vs global memory scatter in CUDA). Fixed function ROP also provides write to MSAA render targets.
Compute global accesses are coarse granularity accesses, and likely might stay un-cached on some GPUs.
The point here is that compute access to render targets might NOT be as fast as the fixed function rendering path. Clearly Larrabee could be an exception here.
MIXING CS AND PS
If a CS pass builds something for a PS pass, then the PS pass might have to sample from a linear addressed texture. If texture cache access or memory bandwidth becomes a bottleneck, and only IF, there might be a few possible work-arounds.
If the texture is not filtered then one can work in a linear texture and do address translation manually. Even if one doesn't know the actual GPU texture address to texel mapping, one could test different software efficient mappings to see if a good ALU to TEX trade-off can be made.
Often thread groups (or blocks) in compute passes will all be accessing a shared "tile" of data anyway, so tile address mapping in CS might not be as bad if one computes a linear tile address, and then uses constants for the in-tile pixel ordering.
In the filtered case, CS producing data for filtered TEX lookups, the graphics and compute APIs don't currently provide a good solution (unless that solution is hidden behind the driver and hardware).
BYTES PER PIXEL VS BYTES PER CACHE LINE
In the physical cached case, as the number of bytes per pixel approaches bytes per cache line, linear vs tiled address order "should" become less important. For example with 64-byte cache lines,
4... 8... 16 32 --- bytes per pixel
---- ---- -- --
0123 0123 01 01 --- possible software cache line tile mapping
4567 4567 23
89AB
CDEF
A G-buffer or some other fat "logical texture", might have a logical 16-bytes/pixel, but this is divided perhaps across 4 textures with 4-bytes/pixel physically, resulting in the problem case again.
DISCLOSURE OF CACHE DESIGN
In the CPU space, cache details are mostly fully public knowledge. Often the OS even provides a way for software to query cache line details of all the cache levels. Some algorithms actually adapt themselves to cache line size (cache-aware), and there is growing research into cache-oblivious algorithms which naturally work well regardless of cache details.
In the GPU space, cache details (cache as in TEX) are to my knowledge all NDA material, and also likely a bit part of the secret sauce which enables some vendor advantage. GPUs caches likey have various important differences from CPU caches, such as in N-way where N is larger because DX/GL provide access to a much larger number of simultaneous arrays (textures) than would be typical in CPU space, or perhaps different pre-fetching logic (if there is even any automatic pre-fetching at all).
Note a smart programmer can reverse engineer cache design using careful programming, however often those who don't have NDA access also don't have a full set of graphics cards representing the market to test.
In terms of cache as program managed local store, NVidia's CUDA docs provide great documentation on most of the important issues regarding global memory access on NVidia's architecture. Exceptions such as how memory requests are parallalized to the memory controllers can be reversed engineered via software, but would again require having a large number of cards to test (cards have different numbers of MCs).
ATI also likely provides great docs, but I haven't looked into detail on their latest R600/R700 memory architecture yet.
DISCLOSURE OF TEXTURE ADDRESS MAPPING
All NDA material, sometimes possible to reverse engineer. Details such as tile size, space filling curve, etc, all hidden, and likely for good reason. Many programmers would cry themselves to sleep at night attempting to deal with all these details on all GPUs.
Possible Future Solutions To This Mess
The largest problem here is that vendors would all have to agree on a good solution to make it a "good" solution. Vendors opening up their texture address mapping would result in a mess of lots of vendor or card specific code in engine.
Larrabee's 2 bit-interleave instructions and NO ROP design, is an interesting solution to the problem. Yet I don't know anything about Larrabee's texture address mapping to know if this actually is a solution to the problem of CS stages writing to filterable cache friendly textures.
I've got another idea in mind for a general solution, but it requires a change in hardware, which likely is NOT going to be adopted via all vendors unless some central controlling entity like Microsoft decided to force it! So never going to happen!
Subscribe to:
Posts (Atom)