20090429

PixelJunk 1-4



Definitely going to get PixelJunk 1-4. There is something highly addictive about interactive fluid, no doubt this is going to be a very enjoyable game, just like the other games in the Q-Games PixelJunk series! If they keep this up, I'm going to have to get a 1080p display ... Monsters at home is scary at NTSC, especially after playing a ton of co-op with my brother on a huge 1080p HDTV.

RGBM

Brian Karis has posted an excellent write up on practical usage of RGBM Color Encoding.

NVidia's Bindless Graphics

NVidia's page on Bindless Graphics and the PDF Presentation

Two awesome NV GL extensions which can enable a 7x draw call speed up by removing the GL bind pattern which causes the drivers to be cache miss bound. Until ARB gets around to API fixes that remove this problem for good, it is great that NVidia is supporting a work around!

20090428

Dithering : Dead Space Extraction Wii


Interesting dithering transparency on Dead Space for the Wii. First I thought it was bloom through dithered transparency effect, but seems (from the image above) that dithering is done via billboard sheets. Looks like additive blending. Not sure if it is for the effect or as an optimization (I know nothing about Wii hardware...). See the Shack gallery for more images.

20090427

Hierarchical Processor Array

Being a software guy, I don't look at software patents, because I'd rather not know, but I do make an exception on hardware patents. Decided to start going through GPU hardware patents to get a better understanding on lower levels of GPU hardware and/or what to possibly expect in future hardware. If talking about hardware patents is taboo for you then I'd suggest skipping this blog post...

One such interesting patent is NVidia's Hierarchical Processor Array filed in Nov 2007.

Not sure how much this reflects what is possibly in the GT300 or not, but there are a few things that I gathered from the patent about possible GPU evolution,

1. Parallel Fixed Function Units - This patents refers to optional duplication of all fixed function hardware (setup, raster, etc) for an "increase in the processing power or redundancy of the system".

2. Possible Writable on Chip Memory - CUDA's shared memory is referred to as shared register file in the patent. On chip memory refers to cached read-only uniforms and constants in a previous patent, but this patent adds, "On-chip memory is advantageously used to store data that is expected to be used in multiple threads, such as coefficients of attribute equations, which are usable in pixel shader programs, and/or other program data, such as results produced by executing general purpose computing program instructions". Might not mean writable, the results of GP computing could have been from a past kernel call, however does certainly leave the door open for a writable cache.

3. General Purpose Instruction Issue - Current (or perhaps older) GL|DX to CUDA required an expensive "context switch" to my knowledge. Could be a driver issue, or could be that older hardware wasn't able to interleave execution of programs beyond vert/geo/pixel shaders, I don't know. DX11 has a good number of different shader types in the rendering pipeline, so no doubt the hardware is going to have to be able to efficiently issue instructions from multiple different programs/kernels. One wouldn't want too many running at the same time on a given core (because cache pressure increases), but enough to keep the pipeline going, to avoid bubbles during draw call transitions, and possibly support interleaving a mix of ALU and MEM heavy programs to keep hardware utilization high in all ways. Seems as if this patent supports this.

4. No MIMD - Do NOT see anything direct in the patent about Multiple Instruction Multiple Data execution. Instruction issue is SIMD to PEs, "for any given processor cycle the same instruction issued to all P processing engines [PEs]". Where P is the SIMD width. However this patent adds "instruction unit may issue multiple instructions per processing cycle". Could multiple issue enable PE's to keep higher ALU efficiency when PE's are predicated? The patent says, "a SIMD group may include fewer than P threads, in which case some of the processing engines will be idle during cycles when that SIMD group is being processed", perhaps not.

5. Supergroup SIMD -> Possible Lower Overhead or Dynamic Warp Formation? - From the patent, "SIMD groups containing more than P threads (supergroups) can be defined. A supergroup is defined by associating the group index values of two (or more) of the SIMD groups with each other. When issue logic selects a supergroup, it issues the same instruction twice on two successive cycles". This can better amortize the instruction issue cost over more instructions, and can be used in combination with double clocking the PEs. I see three possible cases here. (a) That this is in the GT200 (and maybe previous generations), 8-wide SIMD doubled clocked to 16-wide half-warp, and hardware supergroup to 32-wide warp. Maybe with limitations that groups need to be sequential in order for the supergroup. (b) Or that the half-warp to 32-wide warp was done with paired instructions on current hardware, and this is just a more efficient hardware path. (c) Or that with future hardware (possibly GT3xx) this is a way to increase SIMD efficiency through dynamic warp formation. The idea being if keeping with 8-wide SIMD and a double clocked ALU, instruction issue could pick a hi/lo pair of 16-wide SIMD half-warps to issue based on common instruction. Clearly other granularities and options are possible.

6. Work Distribution - Patent leaves open many options, "shaders of a certain type may be restricted to executing in certain processing clusters or in certain cores", or "in some embodiments, each core might have its own texture pipeline or might leverage general-purpose functional units to perform texture computations". Data distribution covered in multiple forms. Broadcast, "receive information indicating the availability of processing clusters or individual cores to handle additional threads of various types and select a destination". Ring, "input data is forwarded from one processing cluster to the next until a processing cluster with capacity to process the data accetps the data". Function of input, "processing clusters are selected based on properties of the input data".

Still digesting this one. Patent seems to support a huge number of options, cannot wait to see what we get in the hardware later hopefully later in 2009!

Irony in PS3 vs Larrabee

I find some irony in developers not liking the PS3 but liking LRB.

Larrabee Model per Core

- Vector width of 16.
- Programs operate on a set of vector registers plus x86 scalar registers.
- Efficient branching for programs (not lanes of a vector).
- Hardware fine grain scheduling of 4 programs to one ALU unit.
- Software coarse scheduling to hide additional latency.
- Dual issue vector ALU + vector STORE.
- Most operations have single cycle throughput.
- 4-9 cycle vector latency.
- Guessing 4-9 cycles latency on branch misprediction as well.
- Guessing ~10 cycle on core L2 latency (cannot remember source of that info).
- Guessing ~100 cycle off core L2 latency (very rough guess for 32 cores).
- Guessing ~1000 cycle main memory latency (very rough guess).
- Scatter/gather can only access a single L1 line per clock.
- 8KB register set (32 vectors x 4 hardware hyperthreads).
- 32KB L1 (logical extended register file and shared memory cache).
- 256KB L2 (shared memory)
- 64 byte (512-bit cache line)
- 512-bit each way, ring bus to connect the cores.

Larrabee vs Cell

- Same amount of register space.
- With LRB larger vectors and hyperthreading takes place of loop unrolling.
- Same amount of shared memory space.
- But with LRB you get a cache.
- Both have a ring bus.
- Likely similar amount of latency and latency scaling between memory levels.

Irony in that likely to get excellent performance on both platforms (Larrabee and Cell) you'd have to be using a very similar data flow. Both platforms have advantages over the other. Cell having a smaller more flexible vector size, and ability to get to the metal to insure good pipelining. With LRB you'd be doing something similar, but attempting to find ways of "tricking" (pinning L1 lines, prefetching) the caches and instruction scheduling to do what you want. However, clearly Larrabee has a huge advantage in features supported by the vector ISA. Larrabee also offers the ability for programmers to get better performance out of lazy code (via cache), and often lazy code offers the advantage in development time to expend more resources optimizing the critical parts of the code.

Either LRB or PS3, you know you are going to have to get dirty with vector intrinsics or assembly to get the performance you want out of the code...

LRB Shared L2 vs SPU to SPU DMA

Seems like a lot of PS3 programmers really bypassed the SPU to SPU latency and throughput advantage (an interesting related SPU to SPU communication paper), and instead focused on task level parallelism doing direct SPU to global memory DMAs. Clearly if the algorithm's ALU to MEM ratio is high enough, there is no need to complicate matters with SPU to SPU communication.

Will be interesting to see if with LRB if developers intend to take advantage of throughput and latency advantage of L2 sharing compared to direct global memory access. Or will the complexity of insuring the proper L2 cache control and job scheduling make this a prohibitably complex task for many developers. Attempting adhoc address space sharing between threads is likely not a good option because of cache coherency issues.

Shared L2 | SPU to SPU DMA vs GPUs

Larrabee per Core (grouped by latency of access),

- 8KB register set (32 vectors x 4 hardware hyperthreads).
- 32KB L1 (extended register file + shared memory cache).
----------------
- 256KB L2 (shared memory)
----------------
- 8MB shared L2 (expensive shared memory, assuming 32 cores)

GT200 per Core (grouped by latency of access, to my knowledge),

- 64KB registers
- 16KB shared memory (increasing to at least 32KB with DX11 hardware)
- 8KB constants
----------------
- 24KB texture L1 shared by three cores (I think)
----------------
- 256KB (guessing) or ? shared texture L2 by all cores

All of this is to deal with the 10x or so ALU to MEM ratio (computation vs global memory bandwidth).

The programming model differences get really clear when looking at those numbers. GPU model is more along the lines of offering 2x fast access space (registers + fast memory) and requiring the programmer to get the data reuse in explicit blocks of say 256 or more threads. There effectively isn't a LRB like shared huge L2 or core2core DMA level in the GPU memory hierarchy. In contrast the Larrabee model offers less fast access space, but much more in the memory hierarchy.

BTW, made a lot of assumptions about Larrabee here, if I'm wrong, feel free to correct me in the comments...

20090424

A Favorate Website : Iñigo Quilez's Site

http://iquilezles.org/www/

An awesome website. Really looking forward to the pages "STILL UNDER DEEP RECONSTRUCTION". Was looking over the Budhabrot page and came across something rather interesting, "using the Metropolis integration method to the Budhabrot rendering", then googled and found Alex Steckles Page on the topic.

Seems as if my stochastic visibility is a little similar in form to the Metropolis-Hastings Budhabrot rendering. From Alex's page describing Metropolis Budhabrot rendering, "Pretend we have some function F that has a very difficult to sample probability distribution. We start with some value of x and propose small changes to it. Two tentative samples at F(x) and F(x') are drawn, and the "better" sample is chosen." What I'm doing is similar but with a different kind of fractal. In my case x is a traversal of an l-system, so F(x) yields a new child node (my "orbit" is an l-system traversal). At each step I take an array of x's and scatter all F(x')'s for all next generation children into another array using the z-buffer to choose the "better" sample (or x). Position to draw is a random point in the projected bounding volume of x, projected into an octahedon representing a full 360 degree view. My fitness function (ie the "better") is spherical projected depth minus bounding radius times a constant (z = depth - rad*c). Also have to redraw the root node(s) each frame to insure that the process can be iterated.

Likely there is also a way to use Metropolis-Hastings to solve zoomed-in 3D IFS rendering via the chaos game on the GPU. I still think L-Systems are easier to shape into what you want than an IFS. Landon Gray provides some great examples of what animation of a 3D L-System fractal could look like if one had coarse representation of nodes,



Or this,



Or some stills...

I haven't really ever thought seriously about more traditional, ie like Julia sets, 3D fractals until seeing Iñigo Quilez's page on Orbit Traps and thinking about his Kindernoiser demo,



The beauty of this form of raycasting is that traversal is done without any texture lookups (read ungodly fast, ALU bound and thus scales insanely well on highly parallel machines like GPUs), and a distance estimation can be done to quickly converge onto the surface. If something like a 3D version of Jock Cooper's Art could be done in realtime, that would indeed be quite awesome.

20090423

CUDA Sorting

Designing Efficient Sorting Algorithms for Manycore GPUs by Nadathur Satish, Mark Harris, and Michael Garland

An awesome paper. Some numbers which stick out are 256 thread block size (multiple blocks running in parallel on each SM), and minimum of 5000-10000 threads for peak performance on current hardware ... to do things really fast on GPUs you need to break them down into thousands of parallel pieces. Results on GT280 are 175-200M radix sorted keys/sec for sequences of 1M-100M keys, and 120-150M radix sorted key value pairs/sec for sequences of 1M-10M keys.

Also worth a look, Investigation of Uncoalesced vs Coalesced Read and Write Speeds in CUDA. Tests for 8800GT, wonder how the GT280 would compare.

BTW, I'm still on the lookout for good papers related to doing TREE type data structures in CUDA, if you know of some please share in the comments!

Kerosen 3D Flying / Fractal Terrain on iPhone

Interesting tech for an iPhone game,

20090421

OpenCL Alpha Available

NVIDIA Releases OpenCL Driver To Developers. Thanks rbrune for the comment, OpenCL Early Access Program registered developers have access to an OpenCL alpha.

20090420

Humm, all Current Mac's Have NVidia GPUs?

I'm sure this is old news which I just missed but, no sign of OpenCL yet, was at the Apple Store over the weekend and seems like all current Mac's have NVidia internals by default (for lowend, MCP79 right?).

And from this post by tmurray on the Cuda Forums, "MCP79: Zero-copy here implies two things--zero-copy and copy elimination. MCP79 will use any memory on the host directly, so this is really good for low-latency applications. There's no PCIe traffic or anything like that now, sysmem is used directly because MCP79 is the chipset. It's absolutely ridiculous. The only reason to not use zero-copy on an MCP79 is because of the 32-bit address space limitation, so in reality, you will pretty much always use zero-copy on MCP79. If you are an audio guy, please write something using zero-copy on MCP79--I've really wanted to do this, but I haven't had time. I expect its perf compared to other things in this segment to be mind-blowing."

Sounds like nothing but awesome OpenCL opportunities in future Apple Hardware!

Caustic Graphics

More info has surfaced on the Caustic Graphics Hardware.



According to the article, prototype with 2 FPGAs at 100 MHz each with a single DDR2 SO-DIMM for memory access. First ASIC should be 350 MHz with about 14x the performance of the prototype. The prototype was doing 640x480 with 3.6M triangles in the form of 2 cars, with 4 rays per pixel (?) at 5-6 FPS. Seems like to scale this up to 1280x720x30fps they are going to need 15x the performance.

20090417

Untraceable by TBC

Nice 1K PC demo of what looks to be voxel raycasting into simple fractals. Link on Pouet.

20090415

ASD's Rupture

Pouet link.

4K Intro with Realtime Reflections

Atomic Operations and Scheduling

Atomic Operations

On traditional CPUs (GPUs covered later) atomic operations are not free. For example old x86 machines (pre P6) have bus locking (sounds bad because it is bad). PPC (such as the 360) has a cache line reservation and software retry loop if an atomic store to the reserved line fails (do to loosing the reservation, for example another processor writing to the reserved cache line). Not to mention the issues of multiple cores going into grid lock attempting to share cache lines. In terms of cycle cost, atomic operations on the PPC can be upwards of 200+ cycles.

My personal favorite way to deal with atomic operations (or other locking mechanisms) on the traditional multi-core CPU is to schedule the problem away. And I am NOT referring to placing locks (such as critical sections, or mutexes) which force task switching. I'm talking about factoring all that cost into just insuring that tasks which would have to modify and share data structures don't run at the same time period!

Scheduling

What I use at home (personal projects) for job scheduling is dead simple (it is about 400 lines of C code including profiler drawing). One thread per logical core (or hyperthread) locked down by setting thread affinity. I've got a single list of jobs which I schedule one frame ahead (very important) into per thread job lists. Each thread pulls jobs from its own list and runs them.

Jobs record their runtime for the scheduler to use (free profiling). Jobs are all programmed such that they will use much less than a full frame (30 or 60Hz per frame). If I needed long jobs I'd have them manually check run time and end after a set time limit. Jobs are of two forms, a task and a batch. Task jobs run on only one of the threads, and can optionally be forced to run on one specific thread (say for using one OpenGL context). Batch jobs get scheduled to run on all threads simultaneously and get given a thread index and count of threads on the machine. It is then up to each thread to use the thread index to divide up the work domain. Each job has an associated counter which gets atomically decremented as thread(s) for the job finish.

To handle job dependencies, each job has an index to a previous job's counter. Jobs are not allowed to start until that dependent counter is zero. The trick is to insure that jobs never have to wait by setting up the job order by breaking up the program into a pipeline which has groups of independent jobs working in parallel with no dependencies (no sharing of writable data).

With this in mind, the wait case doesn't happen, and thus is tackled by a smart spinloop. Smart as in first spins until a set limit, then if dependency still isn't zero, it yields to the processor, then repeats until dependency count is zero.

On good operating systems (meaning not Windows because of it's awful scheduler), I have the threads play nice and if there is enough extra time after all jobs are finished for a frame, each thread sleeps a little short of the time left in that frame.

Questions

What about the problem of variable jobs running each frame and variable job runtime? I have a per job flag which marks if the job needs to run, and all possible jobs are permanently in the job list. Variable job runtime can be tackled by splitting the job into multiple parts,
and turning on and off jobs, or jobs manually ending after a set amount of time.

What about blocking IO? Separate threads which can block and lock free message passing.

Current and Near Future of Parallel Programming

Undoubtly this post is going to end in something about why I think CUDA is awesome. I know some people are scared off by the rigid programming model, but IMO a similar programming model is what you should have been using on multicore CPUs from the beginning! End rant.

Programming Model

The results of all this massively parallel madness is machines with many cores, each core which has a small bit of local memory (32KB, from DX11 CS5) and runs a huge number of threads. Local memory is really a software managed cache. Add in an extra ~8KB or so of local L1 read-only memory backed by a shared L2 (for read only vector gather, ie texture fetch). Add in an extra ~8KB or so for read only constants backed by a shared 64KB (those numbers from CUDA). I expect those read-only caches to get flushed between dependent jobs. On each core a set of N threads run at the same time (N = SIMD width), with other thread batches getting time sliced.

Banked Memory vs Cached Memory

Or rather one of the reasons I have a huge NVidia bias! With the N threads which do execute in parallel (N=16 on Larrabee, N=32 logically on NVidia's hardware), there are limitations as to what those threads can do. First they all are going to be doing the same instruction, so they are effectively all running the same program. Second, they as one unit can access this local memory. With Larrabee the group of threads can only access one cache line at a time. Vector scatter/gather cost (assuming all cache hits) is a function of the number of cache lines required to service the request. If all addresses diverge (hit a different cache line), then the cost is 16 times a single access. With NVidia's current hardware, the group of threads can access 16 banks of this local memory at the same time. So the cost of address divergence can be much less with NVidia's approach.

Banked memory opens a huge number of really cool options. If N is the number of banks for the local memory (16 with current CUDA), one can divide up memory into N regions and have each of the N threads do random read/write to a corresponding region (or bank) at full speed. So effectively all the single thread tools we've learned over the years like hash tables, queues, linked lists, etc, simply work at full speed if each thread sticks to a separate bank of memory. Threads can access other banks of memory at full speed as long as each thread agrees to access a different bank. Local memory performance is a function of bank conflicts. Atomic operations to local memory are fast as well! Learning to program such a system is really just learning how to map data into this model, and learning how to manage your own local store of memory.

With Larrabee local memory access is a function of the number of cache lines accessed, so vector gather/scatter is marginalized in performance. There is no way to use the traditional single thread tools at full performance unless they apply to a vector at a time. IMO this is going to result in the same kind of impossible to maintain code that we have with SSE.

Ouch!

So with OpenCL there are a lot of things I'm planning on doing in the future which simply would not run fast on Larrabee because Larrabee doesn't have banked memory. I'm somewhat torn now that I've really started looking into Larrabee about my thoughts on the future. Seems like fully portable OpenCL code is going to really be tied to the hardware in terms of performance when you want to do really awesome stuff. Thinking I might just have to go CUDA only for all my independent projects.

20090414

Tessellation Performance

Interesting post on repilogue included this presentation.



And from slide 20, tessellation is "3 times faster [than rendering the high polygon count geometry without the tessellator] with 1/100th the size!" What isn't included is what exactly was done for this test, but I am guessing just the standard FP32 xyz per vert high poly mesh vs tessellation. One thing that occurred to me was that for a long time I was thinking that the net result of tessellation would be a higher pressure on triangle count (ie smaller triangles), but in the context of apples to apples quality comparison, tessellation should actually reduce triangle size|count pressure compared to just rendering at the same quality using a DX9 path (in the context of view and normal dependent tessellation). Which would in turn provide better pixel quad utilization as we increase the quality of our models.

I'd still like to know what provided the 3x speed up. Was it mostly a bandwidth reduction (fetch vertex attributes) in the cost to transform the triangles, or a triangle setup cost reduction (less triangles), or a quad utilization increase (more efficient pixel shading)? If it was primarily a memory bandwidth reduction, then how does tessellation compare to some really smart vertex compression?

Voxels on DS

Gamedev image of the day doing software rendering of voxels on the Nintendo DS.

New Killzone 2 Maps

Killzone 2, enjoyed most of the single player, addicted to the multiplayer. Vids of the new maps below.



20090413

New Blog Home

Switching the blog over to Blogger. Mostly because I don't have time to add support for comments (old site is programmed in a custom forth like script interpreted in PHP to generate static html files). Previous blog will remain at www.farrarfocus.com/atom.