20090630

Yes 18 hour days but no bugs.

Back today from Alaska via overnight flight, will have a blog post on the trip once the photos are sorted through (and BTW I think Wisconsin has worse mosquitos). Playing catchup with everything missed over the past 9 days.

What looks (edit) NOT (edit, thanks Cyril) to be some DX11 realtime GI screenshots, one with indirect specular highlights. Found from the AMD DX11 Whitepaper Article, but with no link to the actual white paper, no new news as to DX11 (give us something real like limits on Append/Consume buffers mapped per shader). I'm also still disapointed at the CS4.x spec in how it doesn't provide support for features in current shipping NVidia DX10 cards (referring to G80 support for shared memory write access and later model atomic operations). Perhaps went to ATI's favor, or will ultimately help NVidia force DX11 upgrades?

Cuda 2.3 Beta is Available for Registered Developers. According to the forum post, provides float2half and half2float which I believe matches the DX11 functionality. Also some interesting unanswered questions regarding __synchronous_start(int s) and __synchronous_end() which looked to be found in the intermediate nvcc output.

Interesting stuff in the Shack's John Carmack Interview, "Now I am very excited about what I can do from a hardware and graphics standpoint with the 3GS. With vertex fragment shaders and OpenGL 2.0, I'm pretty convinced that I can actually run the MegaTexture id Tech 5 content creation pipeline on there. And I'm not sure what game I want to do that with yet, but the combination of seeing people download 700mb files of Myst on there, and the new capabilities, I could do some mind-blowingly cool stuff on there."

Battle Field Heroes is live.

20090619

18 Hour Days, Lots of Bugs, but No Computer?

My wife and I are going up to Alaska for next week. Packing light, not going to bring the pro camera this time (sticking to a real vacation), between the two of us, just two carry-on backpacks. Going to be awesome.

20090617

Requesting Siggraph 2009 Advice

I'm looking for any advice as to what to see and do this Siggraph 2009, since I've never been to a Siggraph before, and I've got to book everything this week (going on vacation next week before the early deadline ends). Also looking to meet up with any of you who read this blog and want to talk about tech and stuff! Please either email my gmail address (farrarfocus) or post comments below.

Currently planning on flying in 10am Wednesday and doing two day passes for the following,

Wed 5th: 1:45pm-5:30pm - Efficient Substitutes for Subdivision Surfaces
Thr 6th: 8:30am-12:15pm - Beyond Programmable Shading I
Thr 6th: 1:45pm-5:30pm - Beyond Programmable Shading II

20090616

Stream Compaction for Deferred Shading

Stream Compaction for Deferred Shading
Jared Hoberock, Victor Lu, Yuntao Jia, John C. Hart
Proc. High Performance Graphics, Aug. 2009

Paper addresses the issue of SIMD shader divergence when deferred shading on Geforce 8/9 series hardware using the following tests,

UNORDERED IMPLICIT - All N shaders in one program, use possible divergent branching (at pixel level) to shade pixel. In the case of a small number of shaders (or low run-time shaders) the overhead of stream compaction or sort was higher than divergent branching.

UNORDERED EXPLICIT - All N shaders in separate programs, run one full pass for each shader (N passes). Was always slower than the UNORDERED IMPLICIT path, even though max register usage (controls max warps running in parallel for latency hiding) would be smaller for explicit shaders. Overhead of fetching shader type just to mask threads was too high.

COMPACTION IMPLICIT - Overhead proportional to the number of required shaders. Use stream compaction to group pixels by shader and use possible divergent branching (at beginning and end of groups) to shade. They run N passes of a masked parallel scan (one per shader) to create offsets for stream compaction. Then one shading pass using possibly divergent branching (at the N shader boundaries in the compacted stream) to shade. The shading pass fetches pixel offset from the scan results, gathers the pixel data, shades, and scatters the results back.

COMPACTION EXPLICIT - Same as above, but run a pass per shader when shading. This case looks to always be faster than COMPACTION IMPLICIT because of better register usage when shading (better latency hiding).

SORT IMPLICIT - Overhead proportional to the log of the number of shaders. Instead of running N scans, radix sort by shader for the first pass. Shading is the same as COMPACTION IMPLICIT except using the radix sorted offsets instead the scan offsets.

SORT EXPLICIT - Save as above, but run a pass per shader when shading. This case provided the best results overall when overhead of running the sort was smaller than the cost of divergent branching in the UNORDERED IMPLICIT case.

LOCAL IMPLICIT - Do COMPACTION IMPLICIT in small tiles and only use shared memory. They were only able to support a working size of 8 warps (256 pixels). I think they ment 8 warps total per block and one block per multiprocessor, which would not be enought to hide memory latency, and would stall the multiprocessor on sync threads. Given the small tile size, shading costs alone were always higher than the UNORDERED IMPLICIT case.


About the Results

The paper results are in milliseconds on 2 types of 9800 cards. They don't really provide enough information (no number of samples shaded). My guess working backwards from 9800 GTX+ specs and average instructions per shader numbers in the paper, is about ~0.2G (G=billion) shader invocations in the Car scene (for a 512x512 image estimated from the PDF, also assuming ALU bound, which probably isn't the case) for 337ms of shading time (note sort overhead was an extra 244ms). I believe a 8800 GTX can do a 1M (million) entry parallel scan in under 1ms according to a previous paper I read. So 244ms overhead on a 9800 GTX+ likely collaborates my estimation that they are working in numbers greater than 50M at a time?

There is much more to talk about with regards to this paper, deferred shading, and compute/DX11, but too busy now...

Braid for the Mac

Old news which I never got, Braid is Available for the Mac. Woohoo (and yes I'm sans PC or 360 at home).

20090611

Factoring Out the Job Scheduler

Last night I got another idea on how I might be able to both simplify my job scheduler (in fact remove it) and better handle a dynamic CPU load. All while keeping a very simple design with minimal sync points. Be warned, I have NOT had time to try this yet, it should work in theory...

Threading

(1.) The program entry starts N threads then goes to sleep forever.
(2.) N threads to match the N hardware threads of the machine.
(3.) N threads are locked down via thread affinity.
(4.) Each of these N threads runs the exact same function.
(5.) Each thread starts with function(X, N).
(6.) Where X is the thread index ranging {0 to N-1}.

The basis of parallelism here is running N instances of a program. However since the N instances can have completely divergent branching, the program can use X to mix and match task and data parallel computation anywhere in the program. Note all threads share the same memory.


Synchronization

For efficient computation only very coarse grain synchronization is supported between the N threads. Coarse as in you can count the number of sync points per screen refresh on your fingers. Examples below show lines for each thread and letters as follows,

- -> thread doing work
x -> thread blocking
E -> event
S -> sync point, thread blocks until all threads finished E


During normal program operation sync points don't ever block because enough work is provided in between by a good program architect. Note in this model threads have a lot of slop and don't ever actually fully sync up in time unless something goes really wrong.

--------E-------S-------
-----------E------S-----
------E--------S--------
-------E------------S---


When something goes wrong, threads do block.

--------E-------S-------
---------------E--S-----
------E-----Sxxx--------
---E-----Sxxxxxx--------


The program would be composed of a sequence of E and S points per frame.

-----E1----E2S1-----S2--

In the above example, after S1, data generated prior to E1 is free to be read-only shared between all threads. Likewise after S2 data generated prior to E2 (ie between E1 and E2) is now safe to share as read-only.

Data outside sync points is considered "unordered" (just as UAVs are in DX11) and can be shared only with DX11 or CUDA like unordered atomic operations. Data which would otherwise require fine-grain synchronization to share, is simply re-computed in each thread.


Dynamic Load Balancing

This is where things get interesting. The goal case is where threads reach E at the same time (E is in the same point in code on all threads). There is enough slop in the system so that the E point in time can be a bit different between threads. To load balance one would want to dynamically adjust work distribution such that at the next E point, that the threads would be closer together in timing.

To enable dynamic balancing, all threads have a secondary work distribution factor Y (the primary work distribution factor is the thread index X). This Y factor provides a segment {start,end} along the range of {0 to 1}. This Y segment can be used by the thread to choose the ranges of data parallel work to operate on.

(1.) At E, each thread stores CPU clock cycle time.
(2.) This requires CPU clock cycle counter bases to be re-synced periodically.
(3.) Also at E, each thread records diff in cycle time to the previous E.
(4.) At S, threads read all the E times for all the threads.
(5.) Note the E times are read-only.
(6.) Each thread uses the E times to compute a new future Y segment.
(7.) This new future Y segment stored to be used at the next sync point.


Example Forms of Task and Data Parallel Programming

To have one thread handle an OpenGL interface, the program would just dynamic branch on X==0 (X being the thread index). Note, a GL context is usually tied to one thread. With DX11 there is also the concept of the primary thread, secondary threads build deferred commands. Note it is ok to lock down a task to just one thread because the dynamic load balancing would redistribute the data parallel work via the Y segment (from above).

To process M objects, each thread would compute an exclusive set of the M objects using the Y segment (something like Y.start * M to Y.end * M, with fixup to insure no overlap).


Advantages

With this system, there isn't a worklet system or a scheduler, there is just the program running multiple instances of itself. Synchronization is flexible, and rarely used (very parallel friendly). Read-only data would likely be well shared on a shared L1 or L2 or L3 because the same program is running on all threads and the dynamic load balancing works to bring together the phase (the time) of the threads.

iPhone 3GS

AnandTech: The iPhone 3GS Hardware Exposed & Analyzed

The new iPhone 3GS looks to be awesome, according to AnandTech,

- PowerVR SGX GPU
- OpenGL ES 2 (shaders)
- likely a 3.5x to 7x triangle throughput increase
- ARM Cortex A8
- 600 MHz
- dual issue instructions
- 32KB L1 I$, 32KB L1 D$, 256KB L2
- Neon
- likely a 2x processor performance increase

Yeah really cool! As long as VBOs and other SGX GPU hardware features work in iPhone OS 3.0, developers are bound to get really happy.

20090610

The Last Guardian Images

The Shack has images of The Last Guardian. Looks very much like SotC to me in muted color palette choice.

20090609

Some Assembly Required

This mornings googling turned up a really great blog, Some Assembly Required, written by Elan Ruskin at Valve. Great topics + disassembly + timings = awesome!

20090605

DX11 Binning

BTW, I over simplified my DX11 binning example in a previous post. In reality there are some important considerations for speed. For example on Larrabee since scatter time is a function of the number of cache lines, it might NOT be wise to bin into queues arranged in memory like this,

int Queue4Pixel0[16], int Queue4Pixel1[16], int Queue4Pixel2[16], ...

offset = (pixel << 4) + bin


Because only 1 queue would be on a cache line, so the scatter at best would be 16 cycles. Read back of the data later also might not be ideal in this format. Other option is this which has better cache locality at the expense of more complex offset logic,

int QueueBin0ForPixels0to16[16], int QueueBin1ForPixels0to16[16], ...

offset = (pixel & 15) + (((pixel & (~15)) + bin) << 4);


So for each grouping of 16 pixels, each queue bin is on the same cache line. Which in the completely coherent case would take a clock cycle to scatter to, and would degrade in performance as a function of how many different queue lengths there are in the groupings of 16 pixels.

The same above logic goes for global scatter in CUDA (NVidia) since performance is a function of the number of segments touched. Similar logic could work for shared memory (or shared register) scatter but there are other options in CUDA since performance is a function of bank conflicts (and not "cache lines" or "segments").

Clearly this is assuming via ATI and NVidia DX11 hardware that texture UAVs are built from CUDA/OpenCL/DX11CS global memory accesses. This assumption could be dead wrong, and if another functional unit is involved (like UAV enabled ROPs) then I would need to re-explore this...


CS vs PS

Oops, another important consideration I left out is the assumption that the binning is happening from the CS stage (so block arrangement insures vector alignment and coalescing).

PS stage provides some complex problems in that pixel grouping in the SIMD units isn't likely always coalesce/vector friendly with the address calculations above. Full SIMD width number of pixels (full course raster tile) is likely to get packed in a very vector cache line friendly way on Larrabee, but you'd have to know if the output addressing was Morton order (bit interleaved or not) and adjust the offset calculation. Grouping of 2x2 pixel quads (or just pixels) into SIMD vector groups for small triangles probably breaks down the binning into something vector scatter unfriendly on Larrabee (extra different cache lines). Banked memory could lessen this limitation (so who knows what NVidia's GT300 performance will be).

Thread Scheduling Part 3

My Personal Views on Solutions to This Mess

*** Personal views for the independent developer. Clearly things are different with the constraints a huge legacy code bases, etc...

First a little background, back when I started programming, is was practical to write your own operating systems and drivers (I did this in my teens). I would boot DOS, then switch into protected mode with the DOS interrupt table and low memory intact so if necessary I could switch into virtual x86 mode and use DOS for things I didn't feel like programming my self (like access to the FAT files system on the main harddrive). Everything else, sound, video, etc was done accessing the hardware directly. Actually things where much EASIER back then (IMO). Since then, the industry had added layer upon layer to abstract out the hardware...

WHAT DO I MEAN BY EASIER? - Back in the early days you got control over everything. Computers were single processor, no virtual memory, no threading. Applications (at least the stuff I did) would work with non-blocking asynchronous IO requests and coarse polling with a small set of interrupt service routines to handle hardware interrupts. It was rather easy to poll with one bitmask read to check for completion of any IO request, then cooperatively "multi-task" in application. Things were actually more interactive and responsive back then as well. Polling was cheep (read from memory, no kernel call), task switch was cheap because it was cooperative, no state to save and restore. Programs were tiny, systems were tiny. Check out Contiki for a modern example!

I hold no illusions that somehow we could go back to the way things were before, however there are a huge number of lessons from back then which should be applied now.



What I am Doing Circa 2009

First I have one thread per hyperthread in the system (or per processor in a non-hyperthreaded system). These threads are locked down via setting thread affinity. These threads are my "compute threads" which don't ever do blocking IO. Whenever possible I use low-level as possible non-blocking IO interfaces (this covers networking and raw file IO on most systems).

For all the other cases, meaning when the operating system or API doesn't provide a 100% non-blocking interface, I resort to blocking IO threads to which I non-locking queue requests to. Note because scheduling granularity is 1ms or so at best, queues get groups of requests at one time (to lower atomic operation overhead and overhead of kernel call to wake a IO thread which was blocking on an empty queue). Blocking IO threads run at a higher priority so they will be insured to preempt a "compute thread" whenever they get marked as runnable. Blocking IO threads are designed to have very short runtime and thus a minor effect on the "compute thread" which it prempted.

My job scheduler is quite different from what others do (it is modeled directly on how GPUs process draw calls and what is now the OpenCL model). I break my program down into an array/list of small jobs (like draw calls). I handle handle different sections of the program by turning on/off jobs in this list.

Each job has an index to a prior dependent job (which controls when the job can be scheduled). I keep track of job runtime (free profiling) and use this to pre-schedule all jobs into one queue per "compute thread" a frame in advance. So "compute threads" simply pull jobs in order off their queue and spin if in the unlikely case the job's dependency hasn't finished execution.

It is up to the program designer (ie me) to insure that the program pipeline doesn't ever need to stall and thus has enough non-dependent jobs to handle the variability of the preemption by the blocking IO threads.

Note that with this system I don't ever need any other thread-2-thread synchronization other than atomic operations. All that "work" and "mess" is factored out into proper program "pipeline" design!

To handle scalability to machines of varying number of processors, I have two types of jobs: (BATCH) This job is attempted to be scheduled on all "compute threads" at the same time. Each job entry point gets the "compute thread" index and the number of "compute threads". It is then up to the job to carve up the data and process its group. Batches are for data parallel work. (TASK) This job gets scheduled on only one processor. Tasks are for task parallel work.

For fast development cycles, I have all my code in a single dynamic (or shared on unix platforms) library, so I can re-compile, unload the old library, and load the new library to instantly test code changes without needing to restart the program. This is my re-attachable code model. BTW, I don't use the debugger ever at home, I don't need to!

20090604

Thread Scheduling Part 2

Part 2 is all about the IO side of Thread Scheduling, CPU IO in a parallel world.

Review, Blocking vs Non-Blocking

BLOCKING - Thread which issues the IO call is put to sleep until the operation is "finished". At some point in the IO call, a kernel call is made (to service the IO request) which results in a run-level task switch to kernel mode. In kernel mode, the kernel queues the IO request, sleeps the calling thread and does a task switch to another runnable thread. At some point later after the kernel services the IO request, the original thread (which issued the IO) is again scheduled to run (latency can be a function of IO latency plus how often the blocking thread gets rescheduled to run).

LIKELY BLOCKING, POSSIBLY NON-BLOCKING - In the case of reads, sometimes the data is cached (perhaps disk cache) and thus the IO call can be serviced right away. This can also be the case for writes when the kernel call copies the write request to a queue (or to the page cache) and immediately returns (note when queue is full or page isn't in page cache, kernel call can become a blocking call).

MOST LIKELY NON-BLOCKING, POSSIBLY BLOCKING - This is the troubling case where an interface is almost completely non-blocking, but has some caveats. Caveats in that the interface doesn't return some error code indicating that the call would block. Often this is the case when buffer or queue space is full. The v-sync section in the previous post is a good example of this!

NON-BLOCKING - Correct fully non-blocking API, if a call would block, the call returns with an error code. Non-blocking APIs might provide a polling function to return the status of pending operations. Non-blocking APIs might also provide a callback interface. In this case a user supplied function is called on pending IO. Callback interface can be lower latency at the expense of loosing control over job synchronization.


Disk Access Re-ordering To Reduce Seek Time

All operating systems which have good file IO performance on rotating disc based storage do this (else performance would suck worse than it does now). The operating system queues up the disk accesses required to service the file IO (virtual memory IO, etc) from all processes and threads, and orders the accesses to reduce seek time.

This should be kept in mind when thinking about application file IO. It is often smart to queue up file IO requests from many files at once, instead of just following the read->process->read->process model. Likewise, having just one asych thread servicing file requests using a blocking interface might not yield the best throughput depending on access patterns.


Read Ahead

The operating system provides a read-ahead mechanism which attempts to predict which pages will be needed in the future (based on past access patterns) and effectively adds future requests to current requests when reading from the device. In some cases (Linux) it can get various patterns like sequential + stride, etc.


Why Your Desktop User Experience Is Often a Waiting Game

(1.) Same thread processing the application "GUI" event queue does file IO. Epic Fail!

(2.) Application is creating a huge data structure, it writes into memory which was just allocated, but hasn't yet been mapped to physical pages by the OS. System is under heavy load, and not enough non-dirty pages are available to discard. Operating system needs to write out dirty pages to disc to provide pages for the application to write to. Application blocks until disc IO is finished on the dirty pages. Epic Fail!

(3.) Application suffers from code bloat. App could even be tiny but suffer from a bloated hierarchy of dynamic link libraries. Non-accessed code likely isn't resident (or if loaded, was paged out to provide space for something else). Example, user clicks on menu and opens up a seldom used dialog, paged-out code is accessed, and App blocks on file IO. Epic Fail!

(4.) Application access a huge number of small files, all with long path names (think bloated file structure with huge number of files per directory). Seek nightmare for two reasons, first just accessing the (possibly uncached) directory structure to find the files, and second accessing the (likely uncached) files. Epic Fail!

Solving the first case is just good application design. Solving the third case is expert application design. One has to insure all GUI or user-input processing is in a small core amount of common code (and data) which will always be resident (often accessed likely equals always resident). Possibly blocking functions (ie ones which might access code which is seldom called, ie which is paged out) needs to be done in an asynchronous thread. This way the app is always responsive!

Point here is anything which might result (directly or indirectly) in file IO must be on a separate thread from whatever thread or thread(s) interact with the user.


Flavors of File IO

CACHED STREAM FILE IO - Think the standard C library interface: open(), close(), read(), write(), etc. Reads and writes end up as a mix of blocking and non-blocking kernel calls. Nothing new here. I avoid this stuff.

MEMORY MAPPED FILE IO - The case where the operating system provides file IO directly using the virtual memory system. All the mapping kernel call has to do is to setup the page tables to invalid and then setup the os pages mapping to map to the file. At this point it can return (however it might prefetch based on mapping hints). When the app then attempts to first access memory in the mapped range, a page fault will happen, the kernel will then queue the disk IO, and sleep the thread (assuming the access isn't beyond the end of the file, also assuming the disk page isn't cached in memory). Depending on access patterns, the OS might prefetch N extra pages on fault (read-ahead). Nothing says that the entire file is loaded.

A scheduling pitfall of memory mapped IO is possible lengthy page faults even after the first access. So if using memory mapped IO in a blocking service thread, that thread would have to manually touch (read a integer from) all the pages to insure they are all actually loaded (skipping pages isn't an option because the read-ahead might detect that an then only service the skipped pages). This way when the async service thread marks the IO as complete, the program can assume a compute thread will not block. Problem with a blocking async thread touching pages to insure they are loaded is that each read will be a full (wasted) cache line miss out to main memory (if each miss is 1000 cycles, on a 2 GHz machine with 4KB pages, 1 ms of miss time is only good for at best ~7.8 MB of touched blocks, costing you 1/16 of your 60Hz frame). So totally NOT practical!

Pages modified are automatically marked by the CPU so on unmap the OS can only write out modified pages for free. Also in theory, memory mapped IO could provide the operating system a way to avoid an extra copy on read and write from the file system. This is assuming the OS does direct file DMA to/from the pages. Avoiding the copy would be avoiding the cache pollution and stalls caused by huge CPU to CPU copies. Avoiding the extra copy sitting around in the disk cache also saves memory.

Memory mapped file IO is available on Windows/Unix/Mac/Etc, but not on consoles, and other embedded systems!

RAW DIRECT ASYNCHRONOUS FILE IO - Given the problems with cached or memory mapped file IO, for databases and other critical high performance applications, operating systems often provide an interface for raw file IO. Interfaces for raw file IO are also commonly non-blocking and asynchronous (for obvious reasons) and likely batch request friendly.


Windows Asynchronous Disk I/O

Quick refresher, I'm not personally interested in things like accessing a huge number of random files. IMO that is just broken by design. What I do want is background IO with simple polling for completion (no heavy weight kernel call), and no duplication of memory (ie copy of the file block in the file cache). So I'm not going to cover things like I/O completion ports...

Microsoft has a great article on this which should be read first. Windows AIO has one key design flaw in that the operating system can choose to go synchronous for an operation without giving the application a choice. This IMO is an Epic Fail case for any asynchronous interface.

Cases of going synchronous are: NTFS compression, NTFS encryption, extending a file's length, and large number of cached file IO requests. FILE_FLAG_NO_BUFFERING (raw direct un-cached IO) best insures an actual asynchronous call. However from their example (of no-buffering async calls), only 500 requests got queued in 0.22 seconds. Which shows a 0.45ms average kernel call time. Yikes, that is either certainly not asynchronous, or just spin locking, or the thread is getting heavily preempted!!!

Windows does provide fail cases with ERROR_INVALID_USER_BUFFER or ERROR_NOT_ENOUGH_MEMORY whenever there are too many outstanding asynchronous I/O requests.

Windows also does provide a HasOverlappedIoCompleted() macro to check for AIO completion without using a kernel call.


MacOSX Asynchronous Disk I/O

POSIX asynchronous IO wasn't supported until MacOSX 10.4. POSIX AIO provides a true non-blocking interface, if a request cannot be enqueued, then functions fail. POSIX AIO requires a aio_error() "system call" to check on status of an async operation. I believe that POSIX AIO on OSX is a wrapper for kqueue() so polling does indeed require a system call (yuck). According to some internet sources a process is limited to 16 AIO requests at a time on OSX. Probably better to just use kqueue() directly on OSX.


Linux Asynchronous Disk I/O

Included in 2.6 kernel (patch for 2.4). Works on files or block devices opened with O_DIRECT or raw devices. Doesn't work on sockets or pipes. One option is to directly interface with the io_*() system calls through the libaio userspace wrapper library. Basically,

#include < libaio.h >

// setup context
aio_context_t ctx;
bzero(&ctx, sizeof(aio_context_t));
io_setup(max_events, &ctx);

// submit events
struct iocb tab[events];
io_prep_pread(tab+0, fd, buf, bytes, offset);
io_prep_pwrite(tab+1, fd, buf, bytes, offset);
...
io_submit(ctx, events, tab);

// poll on completion using syscall
// will want to wrap this in something which disables signals
struct io_event tab2[events];
num_events_or_error = io_getevents(ctx, 0, events, tab2, 0);
if(tab2[0].obj == tab+0) // then that event is finished
// tab2[0].res2 == 0 then ok, else error
// tab2[0].res == amount read/written


Links: Linux System Calls, Neat Interactive Map of Linux Kernel

Other option would be to use libposix-aio to get the POSIX AIO interface (which uses io syscalls). The link above is to the source which is a good reference to the syscalls (for example they wrap io_getevents() in something which masks interrupts, stuff like that likely learned the hard way through trial and error!).

BTW, Linux too fails on the most basic level of simply directly updating the iocb with the completion status so a kernel call isn't needed to poll. Arg.


Non-Blocking Network Interface

I personally don't use TCP, and only do UDP. Via UDP, the problem of the "connection" interface to the operating system goes away, and a single socket can gather all incoming packets from all clients and/or send packets to all clients (recvfrom()/sendto()). A quick ioctl(), or ioctlsocket() (with Windows WSA emulation of BSD sockets), can be used to set non-blocking operation.


Audio

Going to leave this for another blog post...

More DX11 Thoughts

Was going through Bill Bilodeau's Presentation off the repiloque blog, and realized I had missed some important info from slide 20 saying that the "3 times faster and 1/100th the size" is comparing ATI DX9 (actually DX10, with vertex texture fetch) tessellation (requiring at least one pass to export per edge tessellation factors, and a second to tessellate) to running the high poly mesh through the standard DX9 pipeline, also that "DX11 Tessellator algorithms can usually be done in one pass". Guessing this "usually" is suggesting that hardware might still optionally spill intermediate results to main memory under a few rare conditions?

Given that tessellation is a data amplification process and that pixel shading is (for the most part) also a data amplification process, it isn't too hard to understand that the hardware could likely keep intermediate data all on chip. GS is the wild card there, and I don't plan on using GS at all anyway.


UAV Limits

- Pixel Shader limited to 8 RTVs+UAVs total
- Compute Shader limited to 8 UAVs

So the limit of the union of Render Targets and Unordered Resource Views is the same as the limit of Render Targets... says to me that at least one DX11 vendor will be using the same interface for both RTVs and UAVs. Also UAV texture access is still unordered, I'm thinking any deep buffer magic (k-buffer, etc) is going to require atomic operations...

Speaking of a combined RTV+UAV ROP unit. So does the new DX11 path support better fine granularity (32-bit) scattering (assuming the scatter has good data locality)? Nick Thibieroz's Shader Model 5.0 and Compute Shader slides describe "Order-Independent Transparency" and "Data Binning Operations" as applications of UAVs. Is this really going to be fast enough make binning pixels in a single pass useful? With Larrabee the answer is YES (I think), what about ATI and NVidia? Currently with UAVs emulated via CUDA on a DX10 GT200, the answer is a clear NO.

Binning literally means something like this (in C like code, not HLSL),

bin = atomic_add(queue_head + pixel_offset, 1);
if(bin < max_bin) { queue[pixel_offset * max_bin + bin] = value; }


Note the first line is a fully coalesced vector friendly GLOBAL atomic add (this is an unordered view). In theory this could be fast (coalesced) but latency unfriendly because of the needed return value. The second line scatter requires a fast 32-bit scatter (value is a 32-bit pixel) to a "cache" or ROP write-combining buffer to be bandwidth friendly. If binning of 32-bit values with good locality gets fast, then an amazing amount of algorithms become possible on DX11. I've been waiting for this since the CUDA PTX .surf space has been documented but never implemented in hardware...


Append / Consume

No limits posted yet that I can find (I haven't tried/installed the DX11 software reference), but ATI presentations seem to show only Raw or Structured Buffers can be mapped as Append or Consume in a shader but seems like NOT as both Append and Consume at the same time? I'm also guessing Append output or Consume input is to/from main memory since it seems most likely that the Append to Consume boundary is at the draw call. I'm also thinking the limits on Append/Consume buffers mapped in a shader will be small.

Pixel Junk Shooter

E3 has provided a higher quality version of Q-Games Pixel Junk Shooter video,

20090603

AMD/ATI's DX11 at Computex

VR-Zone, Anandtech

Awesome, June 2009 ISV (unfortunately not me) getting actual hardware. One slide has "Dynamic Branching" as a tech innovation, so possible improvements on branch granularity?

20090602

Thread Scheduling Part 1

Decided to re-look at the problems of IO and scheduling across all platforms to see what has changed since the last time I checked (turns out not much at all)! Beginning with Windows, Charles Bloom's Post is a great place to start.


Windows Thread Scheduling Refresher

Scheduling Priorities - Priority range 1 (lowest) to 31 (highest). Threads in each priority are scheduled round-robin. Scheduler runs highest priority thread possible at all times, so it is possible for higher priority threads to starve low priority threads. Base priority is set from priority class and priority level (in the class).

If a higher-priority thread becomes available to run, system stops lower-priority thread (before it finishes it's time slice), and runs the higher priority thread for a full time slice.

That is what the docs say. Here is my interpretation: Threads are scheduled to run for a "quantum" number of timer interrupts (which increases BTW when the dynamic priority gets bumped). If a higher-priority thread becomes available to run, it could in theory get scheduled at two places: (1.) the timer interrupt, or (2.) after an IO interrupt (or kernel call) is processed which results in a higher priority thread getting marked as runnable. *** I have a feeling case (2.) doesn't happen on Windows. Which means that scheduling interrupt timer granularity is of huge importance for any blocking IO worker threads! Default Windows interrupt timer granularity was previously no better than 10-15ms (I don't know what is now). Good idea to fix this with timeBeginPeriod(1) for games to get a 1ms granularity.

SwitchToThread() can be used to yield to the next thread in line to run on the current processor only. Thread is re-queued for the next pass of round-robin. Note if no other equal priority thread (higher priority threads will automatically preempt a lower priority thread) is available to run, then this does nothing.

Priority Boosts - Actual priority is base priority + dynamic priority. Windows boosts priority in the following cases, (1.) when a NORMAL_PRIORITY_CLASS process becomes the foreground window, (2.) when events are sent to a windows event queue the thread associated with the queue gets a boosted priority, and (3.) when wait conditions are satisfied for a blocked thread. Set(Process|Thread)PriorityBoost() can be used to control this.


To Place Scheduling Granularity In Perspective

Quantums vary 20-120ms for NT4/2000/XP/2003 depending on priority boost, workstation or server OS, etc. Note with a 60Hz refresh rate, timeBeginPeriod(1) 1ms scheduling granularity, and assumption *** above, this results in only 16.67 possible chances for a thread to get involuntarily preempted (by a higher priority thread) per frame! Mix in background tasks/IO and it is easy to see wild flux in run-time. The only way to hit higher task switches per frame is to have threads manually blocking (on IO or object) in a way that isn't a spinlock. Note other applications might not play nice and higher task switches per frame means more time doing nothing.

If there is anything to be learned from this,

(1.) Any IO blocking worker threads need to be given a high enough priority to insure they always preempt "compute" threads which would otherwise run for their entire quantum. With any luck in any of these 16.67 chances/frame for a compute thread to get preempted, the higher priority of IO blocking threads would then insure any available unblocked blocking IO threads get run before another compute thread gets reissued.

(2.) Interactive in-focus high FPS applications pushing CPU limits cannot really afford to play nice and Sleep() free CPU time before the next frame. However, (and I haven't tried this yet), there might be a way to manually reduce priority to simulate sleep and then allow a higher priority timer thread to "wake" the sleeper thread by increasing its priority.

(3.) Likely upper limit for number of possible dependent stages in a 60Hz game pipeline probably best to be well under 16 if possibly blocking.

(4.) Threading sync interfaces which use double (or more) possibly blocking sync primitives to enter and leave can be very bad.

(5.) There is a good point to non-blocking IO!


Thread Scheduling and V-Sync

First forget the notion that it is OK for the GPU to wait when one is rendering faster than v-sync. Frame rates are variable, being able to get an early start on the next frame could be quite useful (lowers the chance of dropped frames)!

There are multiple ways hardware could possibly handle sync:

(1.) CPU adds a GPU command buffer command which stalls the GPU until v-sync, followed by a command which swaps buffers. CPU then wouldn't have to stall and could continue adding commands for the next frame until command buffer queue space fills up (at which case the draw call would stall). Clearly in this case the GPU stalls, and cannot begin processing commands for the next frame, this is very bad!

(2.) CPU adds a GPU command which sets up for the swap. Then later the CPU uses something besides a command buffer command to trigger the actual swap (memory mapped IO to hardware register, etc, likely from a v-blank interrupt CPU side). So neither CPU or GPU needs to stall. CPU could queue draw calls and issue those draw calls to the GPU as long as they didn't write into the front buffer (used before the swap), or fill up command buffer queues. At some point the CPU would get notification that the swap was completed and could issue dependent (as in write to the previous front buffer, ie the new back buffer) calls for the next frame. This is the (only) sane way to do things.

I'm going to assume GPU supports (and driver is using) case (2.).

If the application is GPU bound, meaning the CPU can issue draw calls faster than the GPU could ever complete them. At some point the graphics API function call will block. The API call will block for two possible reasons, first buffer space is all used up, or second, it blocks on swap to insure the CPU side doesn't get too many frames ahead. So if one wants to push the CPU too, don't do that from the same thread issuing graphics API function calls!

Alan's 2D Fluid Wake



I sense 2D CFD fluid effects (similar to Little Big Planet) going on as a post processes in Alan Wake. Subtle but definitely there. Grab the 720p vid on Gamersyde to see, 3:10 or so look at the birds. Also looks like they are reusing motion vectors for both 4 tap directional "duplication" and motion blur. Very nice tech indeed.

Strange E3 Rendering Artifact

I've been looking forward to Insominiac Games's A Crack in Time after finishing A Quest for Booty. It will be very interesting to see how they make use of their light pre-pass engine updates with RCF : ACIT. Seems like the water effects from R2 have been possibly improved (better high frequency detail?) and made their way into ACIT. Also in this work-in-progress gameplay trailer, something strange is going on with the water shader which only effects the lower right of the frame?... oops.



20090601

MT Summary of the Evolution Framework is Unstoppable!

Yes, the title was inspired by google's Japanese to English translation of,

part 1 : google translate version
part 2 : google translate version

MT Framework and Resident Evil 5

Hard to understand the translation, here is what (I think) I gathered from it.

- 180-200MB textures per scene
- 15000 tri/character for main character
- 4000 tri/face in cut close-up
- 4000:500:200 tri/character LODs
- 91:57:35 bone/joint LODs
- Hovok for physics ragroll, IK in house?
- Does HDR to LDR mapping when drawing (Valve method) to 32bpp targets.
- IBL on characters.
- Vertex color baked shadowing for some geometry.
- Texture baked shadowing for some geometry.
- Dynamic shadows multiply against baked, double shadowing :(
- 1024x1024 shadow map for the scene.
- Water updates.
- Movable shrubs on player interaction.

Speaking of MT. Lost Planet 2,



Has awesomely overdone motion blur (which I like),