<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-1089702811248465710</id><updated>2011-08-16T02:38:43.834-05:00</updated><title type='text'>Atom</title><subtitle type='html'>Description? Read the blog.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default?start-index=101&amp;max-results=100'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>152</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7427175221435200552</id><published>2010-01-03T00:05:00.002-06:00</published><updated>2010-01-03T01:49:22.627-06:00</updated><title type='text'>PC CPU Task Parallelism Limits</title><content type='html'>&lt;i&gt;All numbers below are from a Windows XP + 2.39 GHz Intel Core 2 Duo laptop. 
Performance on other machines and configurations will vary (duh).&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Some Numbers&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Histogram of the measured clock cycles/task of one thread processing the same ~10K clock cycle task many times (in a loop) on the above laptop:&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;835330 ... 0-16K cycles &lt;br /&gt;8005 ..... 16K-64K cycles&lt;br /&gt;1178 ..... 64K-256K cycles &lt;br /&gt;66 ....... 256K-1M cycles&lt;br /&gt;20 ....... 1M-4M cycles&lt;br /&gt;5 ........ greater than 4M cycles&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;This specific task is a fully ALU-bound dummy task which spins doing pure math (all values in registers). Run time variability is a function of hyper-threading (minor effect) and operating system preemption (major effect). Time scale here is somewhere around 4 seconds of total run time (for 0.8M tasks), which could be 120 frames of a 30 Hz game on this laptop. Note the variability.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Preemption&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;A major problem for any task parallel system on the CPU is preemption at non-task boundaries. The net result is that tasks at random could easily get stalled for over 1 millisecond (at least 5 times in 4 seconds in the above example: the 5 tasks over 4M cycles each took more than roughly 1.7 ms at 2.39 GHz). Any tasks which depend on the result of those stalled tasks also get stalled. &lt;br /&gt;&lt;br /&gt;&lt;i&gt;The visible result is variable frame rates and a screwed over player/user experience when the application attempts to have more than a few task parallel dependencies per video frame.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Workarounds&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The best solution is cooperative multitasking: tasks release the CPU when they are finished, and programmers ensure task run time stays within the desired program response latency. 
As desktop operating systems have de-evolved over the years, we have lost the ability to provide a correct solution.&lt;br /&gt;&lt;br /&gt;Are there any workarounds?&lt;br /&gt;&lt;br /&gt;UNDER-UTILIZE CPU - In order to always hit v-sync, the program can under-utilize the CPU such that the time taken to compute a frame, plus the worst case preemption delay it is designed to tolerate, still fits within the frame interval. Clearly preemption is not a bounded problem, so a program will always have to accept some kind of frame drop.&lt;br /&gt;&lt;br /&gt;DUPLICATE TASKS - With a painful case of over-engineering, one could build a task system with transactional memory, such that if a currently needed task dependency is not completed due to preemption, a running thread could manually duplicate and run that dependency itself.&lt;br /&gt;&lt;br /&gt;Anything less painful?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Manual Self Preemption?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Seemed like a great idea: keep track of thread run time and yield execution to another task worker thread before the operating system would preempt the thread's time-slice. 
This way the program could simulate cooperative multitasking, switching at task boundaries.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Epic fail on the XP Laptop!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;There are two ways to do this with fixed thread affinity: a pair of threads yielding execution, or a pair of threads blocking/releasing each other via signals. Both have high overhead and neither fully solves the variability problem.&lt;br /&gt;&lt;br /&gt;So anything less painful?&lt;br /&gt;&lt;br /&gt;No!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7427175221435200552?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7427175221435200552/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2010/01/pc-cpu-task-parallelism-limits.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7427175221435200552'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7427175221435200552'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2010/01/pc-cpu-task-parallelism-limits.html' title='PC CPU Task Parallelism Limits'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-5044521748178680442</id><published>2009-12-29T20:07:00.002-06:00</published><updated>2009-12-29T20:13:09.265-06:00</updated><title type='text'>DirectCompute 3D Fluid Simulation 
Tube</title><content type='html'>&lt;a href="http://users.skynet.be/fquake/"&gt;Download Page for Jan Vlietinck's Simulation Demo&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/z_7DgpJK-eI&amp;hl=en_US&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/z_7DgpJK-eI&amp;hl=en_US&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-5044521748178680442?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/5044521748178680442/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/directcompute-3d-fluid-simulation-tube.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5044521748178680442'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5044521748178680442'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/directcompute-3d-fluid-simulation-tube.html' title='DirectCompute 3D Fluid Simulation Tube'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-9142629893075237143</id><published>2009-12-13T13:32:00.002-06:00</published><updated>2009-12-20T20:37:09.257-06:00</updated><title type='text'>Random Pre-Christmas</title><content type='html'>&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/QqRuhSWhGg4&amp;hl=en_US&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/QqRuhSWhGg4&amp;hl=en_US&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf"&gt;Tony Albrecht (SCEE): Pitfalls of Object Oriented Programming&lt;/a&gt;&lt;br /&gt;One of the best I've seen on this topic. 
See more SCEE presentations on &lt;a href="http://research.scee.net/presentations"&gt;SCEE's Presentation Page&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.cgw.com/Publications/CGW/2009/Volume-32-Issue-12-Dec-2009-/CG-In-Another-World.aspx"&gt;CGW on Avatar Production&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.zbrushcentral.com/showthread.php?t=79141&amp;page=1&amp;pp=15"&gt;Uncharted 2 Art on ZBrushCentral&lt;/a&gt;&lt;br /&gt;Thanks to &lt;a href="http://twitter.com/BrianKaris"&gt;@BrianKaris&lt;/a&gt; for this link.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://pixeljunk.jp/library/Shooter/"&gt;PixelJunk Shooter&lt;/a&gt;&lt;br /&gt;Is out now and is quite fun/awesome!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-9142629893075237143?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/9142629893075237143/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/random-pre-christmas.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/9142629893075237143'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/9142629893075237143'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/random-pre-christmas.html' title='Random Pre-Christmas'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-5973677177461586662</id><published>2009-12-08T22:48:00.005-06:00</published><updated>2009-12-09T23:16:03.885-06:00</updated><title type='text'>Parallel Beast</title><content type='html'>Been too long since I've posted anything about parallel programming, considering my addiction to the topic. Guess I should also make clear that what I post here is my (sometimes radical) personal view and not necessarily representative of my employer.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Perpetual "Computational" Burnout&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Burnout as in data moves around and around in a circle, but just cannot seem to put the power to any useful computation. &lt;a href="http://www.nvidia.com/content/GTC/documents/SC09_Dally.pdf"&gt;Bill Dally's SC09: The Future of GPU Computing Presentation&lt;/a&gt; puts numbers to this on page 14:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;"Moving a word across die = 10FMAs"&lt;br /&gt;"Moving a word off chip = 20FMAs"&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Data Routing and Work Scheduling Problem&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Efficient parallel programming is effectively a data routing and work scheduling problem, balancing the cost of moving data against the cost of computation. A chip is effectively a statically routed network; going parallel involves reappropriating chip area towards computation at the cost of something else. Massively parallel machines will continue to enable algorithms which limit data movement to scale in performance.&lt;br /&gt;&lt;br /&gt;Scheduling is of critical importance because it ensures data is reused, instead of being moved to and from a holding area as a result of an inefficient ordering of computation. 
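&lt;br /&gt;&lt;br /&gt;A minimal sketch of that point, not taken from the Dally presentation or from any real hardware: a toy Python model where the "holding area" is a small LRU store, and the only thing that changes between the two runs is the order in which the same work is scheduled. The names count_loads, row_order and tiled_order, and the sizes (64x64 tasks, a 16 word holding area, 8x8 tiles), are assumptions made up for illustration.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
from collections import OrderedDict


def count_loads(schedule, capacity):
    """Count words fetched into an LRU holding area of `capacity` words."""
    held = OrderedDict()
    loads = 0
    for task in schedule:
        for word in task:
            if word in held:
                held.move_to_end(word)        # reuse: no data movement
            else:
                loads += 1                    # miss: the word must be moved in
                held[word] = True
                if len(held) == capacity + 1:
                    held.popitem(last=False)  # evict the least recently used word
    return loads


def row_order(n):
    """Tasks (a[i], b[j]) visited row by row."""
    return [(("a", i), ("b", j)) for i in range(n) for j in range(n)]


def tiled_order(n, tile):
    """The same tasks, grouped into tile x tile blocks that share operands."""
    return [(("a", i), ("b", j))
            for i0 in range(0, n, tile)
            for j0 in range(0, n, tile)
            for i in range(i0, min(i0 + tile, n))
            for j in range(j0, min(j0 + tile, n))]


n, capacity, tile = 64, 16, 8
naive = count_loads(row_order(n), capacity)
tiled = count_loads(tiled_order(n, tile), capacity)
print(naive, tiled)
```
&lt;/pre&gt;Both schedules run exactly the same tasks. In this toy model the row by row order refetches every b word on every row, while the tiled order fetches at most 16 words per 64 task tile, because operands get reused while they are still in the holding area.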
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Memory Hierarchy&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Furthermore, as the balance of chip area enables more computation, the area devoted to holding data is used more for holding the working state of active computation than for simply storing data. Area devoted to holding data includes registers, cache, and/or program managed local store. &lt;br /&gt;&lt;br /&gt;The difference between a massively scalar processor and a massively parallel processor can be visualized by thinking of the on-chip memory hierarchy pyramid. At the tip of the massively scalar processor pyramid are the registers, and at the bottom is L3. In contrast, on a massively parallel processor the registers might be the widest part of the pyramid (inverted pyramid)!&lt;br /&gt;&lt;br /&gt;The lifetime of the data in this holding area is greatly reduced as throughput increases. This places even more importance on scheduling: data must be reused in the ever smaller window of opportunity. The role played by the memory hierarchy changes from holding data for a long time for use in much later computations, to more of a role of smoothing out a short window of access patterns.&lt;br /&gt;&lt;br /&gt;What this literally means is that to be efficiently parallel at a fine granularity, work must be explicitly scheduled in groups that reuse data. Anything fine-grained and dynamic comes at a cost. The reason for vectorization and grouping vectors into blocks is partly to ensure the tight scheduling required to group accesses to common data.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Medium Granularity Scheduling: OS Hand of Perf Death&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Preemption is a nightmare for the latency of task parallel programming dependencies. When the scheduler preempts your worker thread while it is in the middle of processing a task, dependent tasks wait on the rescheduling of that thread so the task can finish. 
Often that thread was just placed on the end of a queue, so there is going to be a good delay before that thread will run again. What is desired in this case is cooperative or task-aware scheduling, so that a thread will yield to another thread at a task boundary. Working around these system level problems causes a cascade of design complexities.&lt;br /&gt;&lt;br /&gt;What I want personally in the future of Operating Systems and massively parallel machines is an OS which can allocate entire cores and memory to an application for exclusive access! Take your awful mess and get out of the way. Let me return to the days of microsecond and nanosecond response times instead of the abysmal peaks of milliseconds of latency on scheduling.&lt;br /&gt;&lt;br /&gt;Actually no need to wait, can do this right now.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Raw Programming a Parallel Beast&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Hidden under the guise of a GPU is an awesome parallel machine, and a driver which provides chunks of exclusive access to the entire machine, and during those periods of exclusive access, no operating system or driver to screw anything up. &lt;br /&gt;&lt;br /&gt;&lt;i&gt;Getting Naked With the Parallel Beast!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Dead simple to play with in &lt;a href="http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/modules.html"&gt;CUDA&lt;/a&gt; on a GT200 series GPU: launch enough warps via 32xN sized CTAs to fill the machine just once using a Cx1 size grid, where C is the number of CTAs needed to fill the machine. The result is N*C "hardware threads" (as in a traditional CPU) with 32-wide vector units, texture units, and unified memory! 
Use nvcc's "--maxrregcount" to set the maximum number of registers per thread (that is, per lane of each warp) such that C CTAs will correctly fill the machine (and be careful with shared memory usage).&lt;br /&gt;&lt;br /&gt;Use clock() to ensure the warps exit at a task-aware boundary at some Hz, then simply repeatedly relaunch the same kernel to keep the machine running. Allocate a huge chunk of GPU memory to work within, without fear that accessing that memory might ever incur a page fault. Use page-locked host memory to roll your own queue to communicate to and from the CPU for keyboard and mouse IO. Do a little OpenGL interop for drawing to the display, and you have a 100% software massively parallel machine to prototype or code on.&lt;br /&gt;&lt;br /&gt;Scalar code in CUDA is easy: simply write C code that does not use "threadIdx.x", and use "threadIdx.x" only where the code should adapt to the lane of the 32-wide vector. Use a mix of scalar and vector code at will.&lt;br /&gt; &lt;br /&gt;The launched kernel becomes the binary of the "operating system" and "application". Use "threadIdx.y + N*blockIdx.x" to obtain an index to the "hardware thread". The CUDA 2.3 PTX ISA does NOT provide branch or call by register, so the workaround to provide a data dependent branch is a giant switch (the GPU is actually a lot better at it than you might think). Use global memory atomics to communicate between warps or CTAs.&lt;br /&gt;&lt;br /&gt;Could run N*C instances of something like &lt;a href="http://www.sics.se/contiki/"&gt;Contiki&lt;/a&gt; and do message passing through global memory. Or could test how data structures scale across hundreds of "cpu threads". Perfect for those who want to return to the glory days of 64KB but with 1Tflop and 1GB of memory ... 320x200 at 30 fps provides roughly 0.5 Mflop per pixel per frame (1 Tflop / (320 x 200 x 30)) on a GTX 275. 
What's not to love about that!&lt;br /&gt;&lt;br /&gt;Imagination is your limit.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-5973677177461586662?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/5973677177461586662/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/parallel-beast.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5973677177461586662'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5973677177461586662'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/parallel-beast.html' title='Parallel Beast'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3400832865573518223</id><published>2009-12-08T00:52:00.002-06:00</published><updated>2009-12-08T02:42:11.125-06:00</updated><title type='text'>Demo Tube 2</title><content type='html'>&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/WzSnnHHq0gA&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/WzSnnHHq0gA&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" 
width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/82wGnCD0Saw&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/82wGnCD0Saw&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/KpHvvbvmAbM&amp;hl=en_US&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/KpHvvbvmAbM&amp;hl=en_US&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/n4jqULNzVSY&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/n4jqULNzVSY&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/_E060acbpow&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed 
src="http://www.youtube.com/v/_E060acbpow&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/pWxftUTMbRQ&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/pWxftUTMbRQ&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/ayv5Y6DAtvs&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/ayv5Y6DAtvs&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3400832865573518223?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3400832865573518223/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/demo-tube-2.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3400832865573518223'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3400832865573518223'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/demo-tube-2.html' title='Demo Tube 2'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-5576574257328215803</id><published>2009-12-03T13:07:00.002-06:00</published><updated>2009-12-03T13:09:16.514-06:00</updated><title type='text'>PixelJunk Shooter Trailer</title><content type='html'>&lt;a href="http://pixeljunk.jp/library/Shooter/#movie"&gt;PixelJunk Shooter trailer shows off more fluid effects and in-game awesomeness!&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-5576574257328215803?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/5576574257328215803/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/pixeljunk-shooter-trailer.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5576574257328215803'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5576574257328215803'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/pixeljunk-shooter-trailer.html' title='PixelJunk Shooter Trailer'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1761533108563635519</id><published>2009-12-02T11:36:00.002-06:00</published><updated>2009-12-02T11:43:42.834-06:00</updated><title type='text'>GigaVoxels Mandelbulb</title><content type='html'>&lt;a href="http://www.icare3d.org/blog_techno/gpu/gigabroccoli_the_mandelbulb_into_gigavoxels.html"&gt;GigaBroccoli: The Mandelbulb into GigaVoxels Blog Post&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Really cool work by Cyril, this time run-time on demand generation of 3D geometry into the GigaVoxels brick cache. See the blog link above for more info, videos, and screen shots.&lt;br /&gt;&lt;br /&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/xF4mTGyskr0&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/xF4mTGyskr0&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1761533108563635519?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1761533108563635519/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/gigavoxels-mandelbulb.html#comment-form' title='0 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1761533108563635519'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1761533108563635519'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/gigavoxels-mandelbulb.html' title='GigaVoxels Mandelbulb'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8163640914037944153</id><published>2009-12-02T11:31:00.002-06:00</published><updated>2009-12-02T11:36:20.395-06:00</updated><title type='text'>PixelJunk Shooter Dec 10th!</title><content type='html'>&lt;a href="http://blog.us.playstation.com/2009/12/tis-the-pixeljunk-season/"&gt;PixelJunk Shooter Thursday, December 10th&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8163640914037944153?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8163640914037944153/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/pixeljunk-shooter-dec-10th.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8163640914037944153'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8163640914037944153'/><link rel='alternate' type='text/html' 
href='http://farrarfocus.blogspot.com/2009/12/pixeljunk-shooter-dec-10th.html' title='PixelJunk Shooter Dec 10th!'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7787436473585569091</id><published>2009-12-01T19:44:00.006-06:00</published><updated>2009-12-01T19:56:55.165-06:00</updated><title type='text'>Real-Life "CFD Visualization"</title><content type='html'>&lt;img src="http://farm3.static.flickr.com/2485/4150830256_877d97908b.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Longer exposure night shot of the sparks from a backyard bonfire visiting family on Thanksgiving. Thanks to my wife Kathryn for the photograph, and my nephew Joe for motivation required to take the time to burn wet firewood,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2493/4147675463_4d0e53bf3b_m.jpg"&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7787436473585569091?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7787436473585569091/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/real-life-cfd-visualization.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7787436473585569091'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7787436473585569091'/><link rel='alternate' type='text/html' 
href='http://farrarfocus.blogspot.com/2009/12/real-life-cfd-visualization.html' title='Real-Life &quot;CFD Visualization&quot;'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2485/4150830256_877d97908b_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6095903952991824082</id><published>2009-12-01T13:06:00.003-06:00</published><updated>2009-12-01T13:15:05.989-06:00</updated><title type='text'>PixelJunk Shooter Youtubes</title><content type='html'>&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/DIC2Z21Zmj0&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/DIC2Z21Zmj0&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="480" height="295"&gt;&lt;param name="movie" value="http://www.youtube.com/v/MgMIEovdXqU&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/MgMIEovdXqU&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="295"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6095903952991824082?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6095903952991824082/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/pixeljunk-shooter-youtubes.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6095903952991824082'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6095903952991824082'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/12/pixeljunk-shooter-youtubes.html' title='PixelJunk Shooter Youtubes'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1864090446186342381</id><published>2009-11-24T03:04:00.003-06:00</published><updated>2009-11-24T04:26:06.002-06:00</updated><title type='text'>Battlefield: Bad Company 2</title><content type='html'>Been playing the PS3 beta because it is an awfully addictively fun game. Don't think I will ever want to go back to a game with a trivially static play area again, no matter how pretty the developer makes it. Baked global illumination, forget it, just not good enough any more. 
Nothing compares to the fun factor of being able to leave your mark, to be able to interact with your surroundings, and to do that in a tank!&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/5T-X1eDrE88&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/5T-X1eDrE88&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;The point of excellence in game design is when a game provides more than just the sum of canned content generated in production. Otherwise a game is simply an interactive movie: you play to see the next scene, and when the end is found, there is not anything more to it. Sure there is the market for the interactive novel, but games since the dawn of humans have been about enjoying the interaction between people in a system engineered to produce fun in the process. By this definition, BBC2 multi-player beta definitely captures the essence of what it is to be a game.&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/8xttxsuSOq0&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/8xttxsuSOq0&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;Other devs out there should be taking notes. 
&lt;br /&gt;&lt;br /&gt;You've got greater than 1 Tflop to play with on high-end GPUs in the PC space, over 100,000 times more processing power than many of us started programming with, and more to come in future years.&lt;br /&gt;&lt;br /&gt;What are you going to do with that?&lt;br /&gt;&lt;br /&gt;Many of you are simply going to increase your resolution and provide a prettier interactive novel. Others are going to push forward with something new, provide gamers with an interactive experience which brings renewed energy into the industry, and learn to wield the scaling of the parallel machine for something beyond just graphics!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1864090446186342381?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1864090446186342381/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/battlefield-bad-company-2.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1864090446186342381'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1864090446186342381'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/battlefield-bad-company-2.html' title='Battlefield: Bad Company 2'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6377683025692150587</id><published>2009-11-20T03:10:00.002-06:00</published><updated>2009-11-20T03:27:50.473-06:00</updated><title type='text'>Real Time Global Illumination Using Temporal Coherence</title><content type='html'>Martin Knecht has &lt;a href="http://martinknecht.blogspot.com/2009/07/real-time-global-illumination-using.html"&gt;posted a video and his thesis&lt;/a&gt; on Real Time Global Illumination Using Temporal Coherence.&lt;br /&gt;&lt;br /&gt;&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/CBnJagqPlpc&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/CBnJagqPlpc&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6377683025692150587?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6377683025692150587/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/real-time-global-illumination-using.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6377683025692150587'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6377683025692150587'/><link rel='alternate' type='text/html' 
href='http://farrarfocus.blogspot.com/2009/11/real-time-global-illumination-using.html' title='Real Time Global Illumination Using Temporal Coherence'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1759883601691993831</id><published>2009-11-19T18:39:00.005-06:00</published><updated>2009-11-20T03:01:21.202-06:00</updated><title type='text'>Time Flys and the Blog is Lonely</title><content type='html'>Too busy to write but links of interesting stuff,&lt;br /&gt;&lt;br /&gt;&lt;a href="http://repi.blogspot.com/2009/11/parallel-futures-of-game-engine.html"&gt;Johan Andersson's Parallel Futures of a Game Engine&lt;/a&gt;&lt;br /&gt;&lt;a href="http://directtovideo.wordpress.com/2009/11/13/deferred-rendering-in-frameranger/"&gt;Direct To Video on Deferred Rendering in Frameranger&lt;/a&gt;&lt;br /&gt;&lt;a href="http://sandervanrossen.blogspot.com/"&gt;Sander van Rossen's Progress on Virtual Texturing&lt;/a&gt;&lt;br /&gt;&lt;a href="http://aras-p.info/blog/2009/11/04/deferred-cascaded-shadow-maps/"&gt;Aras's Post on Deferred Shadow Maps&lt;/a&gt;&lt;br /&gt;&lt;a href="http://graphics.ucsd.edu/~toshiya/"&gt;Stochastic Progressive Photon Mapping by T. Hachisuka and H. W. 
Jensen&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.facebook.com/photo.php?pid=2883759&amp;id=8409118252"&gt;Fermi spotted on Twitter&lt;/a&gt;&lt;br /&gt;&lt;a href="http://scyourway.supercomputing.org/conference/selection"&gt;SC09 Papers&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.gameenginebook.com/coursemat.html"&gt;State-Based Scripting in Uncharted 2: Among Thieves&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.fractalforums.com/profile/iq/?sa=showPosts"&gt;Quilez's Mandelbulb&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1759883601691993831?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1759883601691993831/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/time-flys-and-blog-is-lonely.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1759883601691993831'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1759883601691993831'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/time-flys-and-blog-is-lonely.html' title='Time Flys and the Blog is Lonely'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-5719877073557787098</id><published>2009-11-07T18:38:00.002-06:00</published><updated>2009-11-07T18:43:43.302-06:00</updated><title type='text'>Demo Tube</title><content 
type='html'>&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/GuFrubIT_OQ&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/GuFrubIT_OQ&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/30M4GME4GZ0&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/30M4GME4GZ0&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/rQEw_p6cI9M&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/rQEw_p6cI9M&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/eKX1wKUIzyc&amp;hl=en&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/eKX1wKUIzyc&amp;hl=en&amp;fs=1&amp;" type="application/x-shockwave-flash" 
allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/XE43jSV_63U&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/XE43jSV_63U&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-5719877073557787098?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/5719877073557787098/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/demo-tube.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5719877073557787098'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5719877073557787098'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/demo-tube.html' title='Demo Tube'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3129736733674531547</id><published>2009-11-01T18:48:00.002-06:00</published><updated>2009-11-01T21:24:56.672-06:00</updated><title type='text'>Link Soup</title><content type='html'>&lt;a href="http://meshula.net/wordpress/?p=381"&gt;meshula.net : Stone Soup&lt;/a&gt; - Thanks Nick for this post reminiscing about the exotic time before the common C compiler!&lt;br /&gt;&lt;br /&gt;&lt;a href="http://idav.ucdavis.edu/~dfalcant/research/hashing.php"&gt;Real-time Parallel Hashing on the GPU&lt;/a&gt; - Neat paper, and poster at GTC. Builds upon cuckoo hashing (using N&gt;=2 hash functions instead of one; a lookup checks at most N places, and an insertion into a filled bin evicts the existing entry and recursively re-inserts it using its other hash functions). Cuckoo hashing with N=3 hash functions can achieve almost 90% hash table occupancy. The paper presents parallel cuckoo hashing. The first step is to use a high level hash function to divide the input into bins sized to fit into the local store, followed by a parallel version of cuckoo hashing for each bin. 
Read the paper for more details.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://graphics.ucsd.edu/~toshiya/"&gt;Stochastic Progressive Photon Mapping&lt;/a&gt; - PPM extended to compute the average radiance over a region instead of a point.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.cse.ust.hk/~yanglei/"&gt;Amortized Supersampling&lt;/a&gt; - Interesting, still problems with rapidly changing shading such as specular highlights, but great progress towards a complete solution.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Rudebox by Alcatrax&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/itqUd8nxxfc&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/itqUd8nxxfc&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Jellyfish&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Visited the Monterey Bay Aquarium the week before Halloween,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2678/4055893279_f324d2dbcd.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2499/4056648034_bde41d9aef.jpg"&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3129736733674531547?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3129736733674531547/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/link-soup.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3129736733674531547'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3129736733674531547'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/11/link-soup.html' title='Link Soup'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2678/4055893279_f324d2dbcd_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1758408842172127753</id><published>2009-10-28T22:56:00.002-05:00</published><updated>2009-10-28T23:18:30.529-05:00</updated><title type='text'>Random End of October</title><content type='html'>Looks like Insomniac updated their website: &lt;a href="http://www.insomniacgames.com/research_dev"&gt;Link to R&amp;D Page&lt;/a&gt;. Going to have to finish Uncharted 2 before I get onto A Crack in Time however. 
Been meaning to watch &lt;a href="http://www.nvidia.com/object/3D_Vision_3D_Movies.html"&gt;Nürburgring 24 Hour Race&lt;/a&gt; in 3D.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1758408842172127753?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1758408842172127753/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/10/random-end-of-october.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1758408842172127753'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1758408842172127753'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/10/random-end-of-october.html' title='Random End of October'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2470845989565610280</id><published>2009-10-23T16:11:00.003-05:00</published><updated>2009-10-23T16:16:42.432-05:00</updated><title type='text'>GPU Technology Conference</title><content type='html'>&lt;a href="http://www.nvidia.com/object/gpu_technology_conference.html"&gt;GPU Technology Conference Main Page&lt;/a&gt; | &lt;a href="http://blogs.nvidia.com/gtc/"&gt;GTC Blog&lt;/a&gt;&lt;br /&gt;Screen-casts plus audio recordings for the sessions are now starting to get posted. 
Look in the session catalog/calendar for the links.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2470845989565610280?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2470845989565610280/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/10/gpu-technology-conference.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2470845989565610280'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2470845989565610280'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/10/gpu-technology-conference.html' title='GPU Technology Conference'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7309195280583605250</id><published>2009-10-20T10:14:00.002-05:00</published><updated>2009-10-20T10:48:34.029-05:00</updated><title type='text'>1489% on 8-core machine</title><content type='html'>No not 14.89x faster, but rather 14.89x slower!&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.bluebytesoftware.com/blog/PermaLink,guid,c261af7d-d737-4802-b8af-3809124fb61e.aspx"&gt;False sharing is no fun&lt;/a&gt; from Joe Duffy's Weblog. 
Really like this post because of the simplicity with which it shows the problem of sharing cache lines on the typical multi-core CPU.&lt;br /&gt;&lt;br /&gt;If one were to assume that you'd always be exactly 14.89 times slower when sharing cache lines (clearly this would vary) across 8 cores, and you wanted this slowdown to only add 6.25% (or 1/16) time overhead to your code, you would want to only share cache lines about 0.45% of the time (0.0625 / 13.89). &lt;br /&gt;&lt;br /&gt;The actual number here is not the point (and wouldn't be correct anyway), but rather that on the CPU, parallel performance is found by doing a majority of computation in effectively isolated memory regions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7309195280583605250?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7309195280583605250/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/10/1489-on-8-core-machine.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7309195280583605250'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7309195280583605250'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/10/1489-on-8-core-machine.html' title='1489% on 8-core machine'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3265013891110112337</id><published>2009-10-13T10:19:00.003-05:00</published><updated>2009-10-13T10:42:21.792-05:00</updated><title type='text'>Naumachia</title><content type='html'>Found this thanks to a twitter post from Jerome Liard (&lt;a href="http://twitter.com/blackjero"&gt;@blackjero&lt;/a&gt;) at Q-Games. &lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/mpgkqPaEelM&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/mpgkqPaEelM&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://naumachia.aureasection.com/"&gt;naumachia.aureasection.com&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The "Kill Zone 2" effect (particle and transparent effects really make the engine) applied to 3D space combat. 
Impressive work from a team of just three developers!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3265013891110112337?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3265013891110112337/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/10/naumachia.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3265013891110112337'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3265013891110112337'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/10/naumachia.html' title='Naumachia'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6293674894525897427</id><published>2009-09-30T23:56:00.004-05:00</published><updated>2009-10-01T11:06:49.689-05:00</updated><title type='text'>Fermi</title><content type='html'>&lt;a href="http://www.nvidia.com/object/fermi_architecture.html"&gt;NVIDIA Fermi Architecture Page&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Download the white paper for some architectural details. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Personal View&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Yeah the architecture is awesome! For HPC, double precision fused multiply add at 1/2 the single precision rate, ECC memory support, and 40-bit address space support! 
For compute, configurable shared memory/L1 split, function and data pointer support via a unified address space, faster global atomics, unified L2 and more...&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6293674894525897427?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6293674894525897427/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/fermi.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6293674894525897427'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6293674894525897427'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/fermi.html' title='Fermi'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6830771917809138801</id><published>2009-09-25T00:11:00.002-05:00</published><updated>2009-09-25T02:57:09.246-05:00</updated><title type='text'>Last Night in Wisconsin</title><content type='html'>Yeah, last night in Wisconsin. 
Could talk about how I've actually physically moved out of Wisconsin and back to my Illinois Condo the weekend before September 11th when my lease was up, and how that has resulted in not having the ability to do any interesting CUDA and GPU computing programming for the past few weeks, but likely that would be too boring!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Prior Dormant Hobby&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Had a 1969 911E which I started to build into a track car years ago and never finished. Sold off the car, trailer, and related stuff this Wednesday. Since I had this huge stalled project (full tear down and rebuild of the entire car), the racing bug was effectively squashed; the wife would have killed me if I bought another car without selling the old project. Actually I ended up doing just that! It made matters worse that the 911E body and trailer were in her parents' barn. Also bad that I took her WRX to the track a few times and destroyed the rotors, which I didn't fix for a year, which ultimately resulted in her taking the car in to get the brakes done locally. Kathryn, you really are the best wife ever! Yes, she reads this blog once in a while ... I'll get an email saying, "I actually understood that post" ;)&lt;br /&gt;&lt;br /&gt;With the project car gone, and a very track-friendly daily driver to replace it, the racing bug is biting back hard. &lt;br /&gt;&lt;br /&gt;&lt;embed id=VideoPlayback src=http://video.google.com/googleplayer.swf?docid=-8299660899942404832&amp;hl=en&amp;fs=true style=width:400px;height:326px allowFullScreen=true allowScriptAccess=always type=application/x-shockwave-flash&gt; &lt;/embed&gt;&lt;br /&gt;&lt;br /&gt;Found that video via &lt;a href="http://www.rx7club.com/showthread.php?t=777978"&gt;this rx7club forum post&lt;/a&gt; while looking for someone who is using a &lt;a href="http://www.gforcetransmissions.com/default.html"&gt;G-Force Transmission&lt;/a&gt; dog box in a road racing track car. 
&lt;br /&gt;&lt;br /&gt;Need to do the brakes, brake lines, harness, and more first, but a G-Force dog box conversion of the Tremec T-56 in my car, is on the list long before adding more HP. Yes it is still going to be my daily driver, no the sound doesn't bother me, and yes one of my old daily drivers didn't have heat, air conditioning, or an interior (so safe to classify me as crazy with respect to what I drive).&lt;br /&gt;&lt;br /&gt;Looks like &lt;a href="http://www.thunderhill.com/index.html"&gt;Thunder Hill Raceway&lt;/a&gt; is less than a 3 hour drive from where I will be living...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6830771917809138801?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6830771917809138801/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/last-night-in-wisconsin.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6830771917809138801'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6830771917809138801'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/last-night-in-wisconsin.html' title='Last Night in Wisconsin'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-263370140991679298</id><published>2009-09-24T22:52:00.002-05:00</published><updated>2009-09-24T22:58:05.299-05:00</updated><title type='text'>Function 09: Behind Elevated</title><content type='html'>&lt;a href="http://www.iquilezles.org/www/material/function2009/function2009.pdf"&gt;Function 09 - Behind Elevated&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/_YWMGuh15nE&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/_YWMGuh15nE&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-263370140991679298?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/263370140991679298/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/function-09-behind-elevated.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/263370140991679298'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/263370140991679298'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/function-09-behind-elevated.html' title='Function 09: Behind Elevated'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1495142503968580977</id><published>2009-09-24T22:35:00.002-05:00</published><updated>2009-09-24T22:51:56.630-05:00</updated><title type='text'>Pixel Junk Shooter: TGS 09 Gameplay!</title><content type='html'>&lt;div style="width: 480px;"&gt;&lt;object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=8,0,0,0" id="gtembed" width="480" height="392"&gt; &lt;param name="allowScriptAccess" value="sameDomain" /&gt; &lt;param name="allowFullScreen" value="true" /&gt; &lt;param name="movie" value="http://www.gametrailers.com/remote_wrap.php?mid=56598"/&gt;&lt;param name="quality" value="high" /&gt; &lt;embed src="http://www.gametrailers.com/remote_wrap.php?mid=56598" swLiveConnect="true" name="gtembed" align="middle" allowScriptAccess="sameDomain" allowFullScreen="true" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" type="application/x-shockwave-flash" width="480" height="392"&gt;&lt;/embed&gt; &lt;/object&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://translate.google.com/translate?prev=hp&amp;hl=en&amp;js=y&amp;u=http%3A%2F%2Fgigazine.net%2Findex.php&amp;sl=ja&amp;tl=en&amp;history_state0="&gt;More videos and images can be found on Gigazine&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1495142503968580977?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1495142503968580977/comments/default' title='Post 
Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/pixel-junk-shooter-tgs-09-gameplay.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1495142503968580977'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1495142503968580977'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/pixel-junk-shooter-tgs-09-gameplay.html' title='Pixel Junk Shooter: TGS 09 Gameplay!'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8707038222785928466</id><published>2009-09-23T14:41:00.009-05:00</published><updated>2009-09-24T23:22:30.688-05:00</updated><title type='text'>R800</title><content type='html'>R800 info is out, architecture looks similar to R700. Hard to get a clear answer on whether texture filtering has indeed moved completely to the SIMD units (EDIT: perhaps not, see new comments/edits). Also have not seen any definitive answers with regard to &lt;s&gt;triangle setup performance and&lt;/s&gt; info on append/consume performance. HD 5870 specs appear to be as follows,&lt;br /&gt;&lt;br /&gt;EDIT: &lt;a href="http://translate.google.com/translate?prev=hp&amp;hl=en&amp;js=y&amp;u=http%3A%2F%2Fwww.hardware.fr%2Farticles%2F770-6%2Fdossier-amd-radeon-hd-5870.html&amp;sl=fr&amp;tl=en&amp;history_state0="&gt;hardware.fr&lt;/a&gt; tests show same triangle setup rate as R700. No parallel triangle setup, but perhaps double samples rasterized per triangle per clock. 
Very interesting indeed!&lt;br /&gt;&lt;br /&gt;EDIT: &lt;a href="http://forum.beyond3d.com/showpost.php?p=1338791&amp;postcount=12"&gt;ATI's Dave Baumann on B3D post&lt;/a&gt;: &lt;i&gt;"Texture interpolators have been removed from the design and is done on the shader core. In general we are seeing this as a performance improvement - its also the reason why one of the Vantage feature test gets a disproportionate increase over the previous gen."&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;OVERALL&lt;br /&gt;- 2.7 Tflops single precision.&lt;br /&gt;- 544 Gflops double precision.&lt;br /&gt;- 153.6 GB/sec bandwidth.&lt;br /&gt;- 20 SIMD cores.&lt;br /&gt;&lt;br /&gt;TEXTURE&lt;br /&gt;- 272 Gtex/sec 32-bit pixel unfiltered samples.&lt;br /&gt;- 68 Gtex/sec bilinear filtering.&lt;br /&gt;- 1 TB/sec L1 texture cache.&lt;br /&gt;- 16KB TEX L1 per SIMD core (320KB total).&lt;br /&gt;- Texture units able to read compressed AA color buffers.&lt;br /&gt;- Texture filtering done in SIMD units (according to AnandTech)???&lt;br /&gt;&lt;br /&gt;COMPUTE&lt;br /&gt;- 8KB extra compute L1 per SIMD core (160KB total).&lt;br /&gt;- 32KB local data store (640KB total).&lt;br /&gt;- 64KB global data store.&lt;br /&gt;&lt;br /&gt;MC/RBE/ROP&lt;br /&gt;- 435 GB/sec L2 cache bandwidth.&lt;br /&gt;- 128KB L2 cache per memory controller (512KB total).&lt;br /&gt;- 4 64-bit memory controllers (MCs).&lt;br /&gt;- Render Back-Ends (RBEs) can process 32 pixel/clk.&lt;br /&gt;- RBEs look to be divided across the 4 MCs.&lt;br /&gt;- Fast color clears.&lt;br /&gt;&lt;br /&gt;SIMD&lt;br /&gt;- EDIT: vertex attribute interpolation done in SIMD units.&lt;br /&gt;- EDIT: Support for full speed 24-bit integer math.&lt;br /&gt;- Dedicated Sum of Absolute Differences instruction.&lt;br /&gt;- Faster Dot Product instruction.&lt;br /&gt;- Required support for DX11 instructions.&lt;br /&gt;- Full speed denormals?&lt;/tt&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8707038222785928466?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8707038222785928466/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/r800.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8707038222785928466'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8707038222785928466'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/r800.html' title='R800'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-4698362636589733769</id><published>2009-09-21T21:56:00.008-05:00</published><updated>2009-09-22T00:36:47.778-05:00</updated><title type='text'>Bad Industry Humor: To 12 MPix and Beyond!</title><content type='html'>Second post in the awfully-bad-industry-humor or would-have-been-better-off-sleeping category,&lt;br /&gt;&lt;br /&gt;&lt;i&gt;"At first we thought it would be a great idea, the new ATI GPU was to have something like 10x the bandwidth and 20x the compute power of the Xbox 360. So what was working at 720p on the 360 should work about the same on a six 1080p display array off the new GPU. 
Napkin math said peak capacity amortized to around 8 Kflops/pixel and 800 bytes/pixel bandwidth at 720p/30fps on the 360 and about the same targets on the 12 MPix (multi-monitor) output of the new GPU.&lt;br /&gt;&lt;br /&gt;The project would be simple ... (just use nearly the same engine, add tessellation and up the texture quality) ... or so we thought.&lt;br /&gt;&lt;br /&gt;Early on, decided to go with virtual texturing. With deferred shading, needed about 380 MB alone for the G-buffer (2xMSAA 16B/pixel). Another minimum of 1 GB would be needed for four layers of a compressed 16Kx16K virtual texture. This virtual texture would provide 21 four layer texels per screen pixel (a great ratio).&lt;br /&gt;&lt;br /&gt;That is when the problems started. &lt;br /&gt;&lt;br /&gt;Artists used to work with 2Kx2K source stamp textures at maximum size. Now they were required to work at 8Kx8K just to be safe. This required an upgrade of all the artists' machines to 64-bit Photoshop CS4 to support enough address space so they could have more than 16 layers active in Photoshop without swapping to disk (we use 16-bit per channel source images). Had to get rid of all the Mac machines because 64-bit wasn't supported on the Mac with Photoshop.&lt;br /&gt;&lt;br /&gt;Old source material just would not cut it with the new 64 Mpix texture resolution requirement. We had to purchase a few 40 Mpix Phase One digital camera backs and special medium format camera gear just to gather good source photos (at a cost of about $30-40K US per group of camera/back/lenses/etc). Also had to hire a special photographer to help train the artists on the new gear (teach details like how to choose the proper aperture for sharp photos, etc).&lt;br /&gt;&lt;br /&gt;Furthermore our publisher would only allow us to use four dual-layer Blu-ray discs, which is a measly 200 GB of data. 
The art department was outraged: what is the point of having 1 to 2TB desktop drives when the game can only use 200 GB of space? Had something to do with the publisher saying that 12x speed BR drives would require a little over a one hour install to the HDD, and that was the maximum time that the average user would stand when every 16 to 20 minutes they had to put in another BR disc to complete the install.&lt;br /&gt;&lt;br /&gt;And don't get me started on the required upgrade to our in-house GI farm, storage solution, and network ... let's just say re-baking the lighting and virtual texture was almost measured in weeks not hours.&lt;br /&gt;&lt;br /&gt;In the end the project got canned and we ran out of money, going for 10x the resolution and quality on 2x the budget of a console title just was not possible yet. We never should have bet on technology that was going to only sustain or reduce the bandwidth (or compute) to pixel ratio ... or worse yet, require hanging an array of monitors from the ceiling. &lt;br /&gt;&lt;br /&gt;Later we heard of another post DX11 launch title going the opposite direction, doing something VGA retro, real-time photon mapping at 320x200 at 30Hz with motion blur and a real camera lens model. The graphics engine was called something like The Turing Engine because most viewers could not tell the difference between the real-time 320x200 rendering and a VHS tape of similar real life source video. 
Something like 2 Mflops/pixel and 100 KB/pixel of bandwidth at 30 fps, damn, why didn't I think of that!"&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;If for a second you took any of that seriously, you really need to fill up on coffee before hitting the early morning blog run!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-4698362636589733769?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/4698362636589733769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/bad-industry-humor-to-12-mpix-and.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4698362636589733769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4698362636589733769'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/bad-industry-humor-to-12-mpix-and.html' title='Bad Industry Humor: To 12 MPix and Beyond!'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2207295593147195133</id><published>2009-09-18T02:10:00.003-05:00</published><updated>2009-09-18T02:23:01.391-05:00</updated><title type='text'>CEDEC 2009: Anti-Downsized Buffer Artifacts</title><content type='html'>&lt;a href="http://www.daionet.gr.jp/~masa/column/2009-09-12.html"&gt;CEDEC 2009: Anti-Downsized Buffer Artifacts&lt;/a&gt; &lt;br /&gt;&lt;i&gt;"The page you requested 
was too large to translate."&lt;/i&gt;&lt;br /&gt;Anyone have a good summary?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2207295593147195133?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2207295593147195133/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/cedec-2009-anti-downsized-buffer.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2207295593147195133'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2207295593147195133'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/cedec-2009-anti-downsized-buffer.html' title='CEDEC 2009: Anti-Downsized Buffer Artifacts'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8822646749408840036</id><published>2009-09-17T21:16:00.005-05:00</published><updated>2009-09-18T02:02:59.962-05:00</updated><title type='text'>Food-For-Thought Supermarket Aisle</title><content type='html'>&lt;i&gt;This post is to keep the twitter peanut gallery happy ;) Thanks Ivan-Assen Ivanov for the inspiration, sure was a lot more fun than doing something productive like continuing my search for a place to live when I move...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Starting With Some Simple Examples&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Q: If you 
are playing chess with a distant opponent via snail mail, and want to play the game faster than the US postal service can deliver messages, how do you do it?&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A: You send not only your next move, but also your response to all of your opponent's possible next moves, under agreement neither player will cheat.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;In the above case, latency is the bottleneck, compute time is abused with redundant work to solve the latency problem.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Q: A pre-gunpowder age general wants to communicate real-time updates of highly complex battlefield commands to distant field officers but only has a limited set of flags, how is this done?&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A: General and field officers pre-agree upon a complex battle plan option (or meaning) per flag.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Bandwidth is the bottleneck, solved by taking advantage of local memory using the ultimate form of compression.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Q: A manager has 12 phone meetings in a given very busy day, how is this handled efficiently?&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A: A secretary pre-schedules each phone meeting.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;This is a rough example of the problem of task preemptive scheduling (interrupted phone calls) when there are dependencies (communication) between nodes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Humans Are Natural Parallel Programmers&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Ironically, humans as a group are great at solving problems with distributed memory, as individuals have only a subset of the knowledge stored in the continuation of the species. Many of us, myself included, couldn't possibly remember everything we need to know on a daily basis. In fact, often we get better at remembering where we can find information (indexes or addresses) than we are at remembering the information itself. 
Humans organize into hierarchies to solve problems, as well as pipeline processes, think about manufacturing (thanks Mike Acton), supply chain management, etc.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;The point I am getting at here, is that going parallel is natural and is easy to think about when grounding the problem in terms of something real.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Human Species Machine&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;How about something more crazy...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Massively parallel programming gets rather interesting when thinking about distributed memory, non-coherent redundancy, the problem of node failure, message passing, latency, and compression, all together.&lt;br /&gt;&lt;br /&gt;One fantastic quality of humans is that we all have different memories of similar events or ideas, a kind of distributed non-coherent redundancy. The memory of an event (or idea) is physically distributed across many nodes, with great overlap in the "important" data and less overlap in details which are more important to individual nodes. &lt;br /&gt;&lt;br /&gt;People live and die (node failure) and pass on ideas through ripples left in the fabric of this non-coherent distributed memory. Human language and culture provide a huge distributed context against which to compress data for message passing. A single phrase can draw upon this giant context for meaning.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Stability Without Coherent Memory&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Simple patterns can serve to generate intrinsic randomness (divergence), or can serve to maintain and safely transmit information (convergence). Digital computation is a great example of this (given the underlying "analog" flow of electrons). &lt;br /&gt;&lt;br /&gt;With the human species machine, constant message passing serves to shape and change the distributed memory. Ideas and information become stable when replicated across many nodes. 
The more parallel the machine, and the more non-coherent, distributed, and redundant the memory gets, the less a single node failure matters. In fact, given the amount of message passing, the information is constantly in flux at the scale of individual nodes, but as a global whole, a stable idea is highly coherent collectively.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;BTW, if you are still reading this post, I would highly suggest reading Stephen Wolfram's New Kind of Science (at least twice)!&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8822646749408840036?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8822646749408840036/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/food-for-thought-supermarket-aisle.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8822646749408840036'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8822646749408840036'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/food-for-thought-supermarket-aisle.html' title='Food-For-Thought Supermarket Aisle'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7473434936317928599</id><published>2009-09-16T07:37:00.005-05:00</published><updated>2009-09-16T09:32:30.241-05:00</updated><title type='text'>NVIDIA and 
Moving</title><content type='html'>Moving out to Silicon Valley soon to start a new job at NVIDIA!&lt;br /&gt;&lt;br /&gt;Clearly the blog will be changing, should be obvious what I will NOT be writing about. Also more excited about what I will be working on at my new job than anything I would have worked on at home. Future "at home / free time" focus will likely be shifting towards actual technical papers or presentations or demos.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Soon Farewell to Wisconsin&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Had a great time working with many of the people at &lt;a href="http://humanhead.com/?pagename=Jobs"&gt;Humanhead&lt;/a&gt;: Brian, Jimmy, Ash, Giovanni, Michael, Norm, and more. &lt;br /&gt;&lt;br /&gt;Also have a lot of great memories of Madison and Wisconsin: &lt;a href="http://www.menupix.com/madison/restaurants.php?id=730306"&gt;Lao Laan Xang&lt;/a&gt; (favorite local restaurant), &lt;a href="http://www.fordsgym.com/"&gt;Ford's Gym&lt;/a&gt;, &lt;a href="http://www.dnr.state.wi.us/Org/land/parks/specific/devilslake/"&gt;Devils Lake&lt;/a&gt; (hiking, climbing), &lt;a href="http://dnr.wi.gov/org/land/parks/specific/bluemound/"&gt;Blue Mound&lt;/a&gt;, &lt;a href="http://www.dcfm.org/"&gt;Farmer's Market on the Square&lt;/a&gt;, &lt;a href="http://www.airventure.org/"&gt;EAA Airshow&lt;/a&gt;, &lt;a href="http://www.nps.gov/apis/index.htm"&gt;Apostle Islands&lt;/a&gt;, and &lt;a href="http://www.roadamerica.com/"&gt;Road America&lt;/a&gt; (race track). &lt;br /&gt;&lt;br /&gt;Did not even scratch the surface of everything I could have done when in Wisconsin. For instance really did not take advantage of the good proximity to two great road courses: Road America and &lt;a href="http://www.blackhawkfarms.com/"&gt;Black Hawk Farms&lt;/a&gt;. Last Road America visit warped the rotors on the WRX (which is ironically my wife's daily driver). 
Did not get around to either replacing the rotors (done) or getting myself a track friendly car (also done) until it was too late (too busy, and now moving). Hopefully will rectify this problem at Infineon Raceway or Laguna Seca after I move.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7473434936317928599?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7473434936317928599/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/nvidia-and-moving.html#comment-form' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7473434936317928599'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7473434936317928599'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/nvidia-and-moving.html' title='NVIDIA and Moving'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7548392562724510542</id><published>2009-09-15T02:33:00.001-05:00</published><updated>2009-09-15T02:36:46.252-05:00</updated><title type='text'>Quite "yes we can"</title><content type='html'>&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/_zc9ShGmkSU&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" 
value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/_zc9ShGmkSU&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.pouet.net/prod.php?which=53785"&gt;Quite "yes we can", 1st at Chaos Constructions 2009 combined 4k compo.&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7548392562724510542?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7548392562724510542/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/quite-yes-we-can.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7548392562724510542'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7548392562724510542'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/quite-yes-we-can.html' title='Quite &quot;yes we can&quot;'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2376999322513487041</id><published>2009-09-13T12:53:00.004-05:00</published><updated>2009-09-13T14:37:18.736-05:00</updated><title type='text'>Selling My Truck</title><content type='html'>&lt;i&gt;In case anyone in the Chicago area is interested (the truck is garaged in Westmont 
IL),&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lh3.ggpht.com/_bIKhQVqvDtc/Sq0xIeUg0fI/AAAAAAAAADU/di_9J1A1SwQ/s800/LargeAdS10.jpg"&gt;&lt;img src="http://lh3.ggpht.com/_bIKhQVqvDtc/Sq0xIeUg0fI/AAAAAAAAADU/di_9J1A1SwQ/s400/LargeAdS10.jpg"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.kbb.com/KBB/UsedCars/PricingReport.aspx?YearId=2002&amp;Mileage=98000&amp;VehicleClass=UsedCar&amp;ManufacturerId=9&amp;ModelId=329&amp;PriceType=Private+Party&amp;VehicleId=3600&amp;SelectionHistory=3600|31416|60559|0|0|138403|true|138436|true|138638|true|138626|true&amp;Condition=Good&amp;QuizConditions="&gt;Kelley Blue Book Value: $5210&lt;/a&gt;&lt;br /&gt;Selling For: $5000 or best offer&lt;br /&gt;&lt;br /&gt;Contact: Timothy Farrar &lt;br /&gt;Phone: 630-430-6656 &lt;br /&gt;Gmail: farrarfocus&lt;br /&gt;&lt;br /&gt;Had this truck custom made from the factory with special options like the close-ratio New Venture Gear transmission with single shift rail rated to 300 ft lbf of torque, and the Steel wheels (actually durable compared to crappy Aluminum). Great for towing, great for weekend trips, fun to drive. 
Was not planning on ever selling this truck, but moving across the country in two weeks and I have too many vehicles (truck, project car, and trailer need to go).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Chevy S-10 2WD Extended Cab&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Original Owner&lt;br /&gt;4.3 L V6&lt;br /&gt;Heavy duty 5 speed manual transmission option&lt;br /&gt;Limited slip differential&lt;br /&gt;Bed liner and topper&lt;br /&gt;Hitch&lt;br /&gt;98K highway miles&lt;br /&gt;New tires&lt;br /&gt;Heavy duty suspension option&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Specific Options&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;G80 : AXLE POSITRACTION, LIMITED SLIP&lt;br /&gt;GU4 : AXLE REAR, 3.08 RATIO&lt;br /&gt;K60 : GENERATOR, 100 AMP&lt;br /&gt;L35 : ENGINE, GAS, 6 CYL, 4.3L, CPI, V6, 90 DEG&lt;br /&gt;M50 : TRANSMISSION, MAN 5 SPD, NVG&lt;br /&gt;PH1 : WHEEL, 15 X 7, STEEL&lt;br /&gt;QCE : TIRE ALL, P205/75R15/N BL R/PE ST TL ALS&lt;br /&gt;Z85 : CHASSIS PACKAGE, HIGH PAYLOAD FIRM RIDE&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Note to Self&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.s10forum.com/classifieds/showproduct.php?product=53154"&gt;S10 Forum Ad&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2376999322513487041?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2376999322513487041/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/selling-my-truck.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2376999322513487041'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2376999322513487041'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/selling-my-truck.html' title='Selling My Truck'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh3.ggpht.com/_bIKhQVqvDtc/Sq0xIeUg0fI/AAAAAAAAADU/di_9J1A1SwQ/s72-c/LargeAdS10.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8243610753391441025</id><published>2009-09-12T11:32:00.001-05:00</published><updated>2009-09-12T11:33:24.936-05:00</updated><title type='text'>DirectX SDK August 2009</title><content type='html'>&lt;a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=b66e14b8-8505-4b17-bf80-edb2df5abad4&amp;displaylang=en"&gt;DirectX SDK August 2009&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8243610753391441025?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8243610753391441025/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/directx-sdk-august-2009.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8243610753391441025'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8243610753391441025'/><link rel='alternate' type='text/html' 
href='http://farrarfocus.blogspot.com/2009/09/directx-sdk-august-2009.html' title='DirectX SDK August 2009'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8164926991183481580</id><published>2009-09-12T11:27:00.002-05:00</published><updated>2009-09-12T11:32:08.536-05:00</updated><title type='text'>Hotchips 2009 OpenCL Presentations</title><content type='html'>&lt;a href="http://www.khronos.org/library/detail/2009-hotchips/"&gt;Hotchips 2009 OpenCL Presentations&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8164926991183481580?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8164926991183481580/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/hotchips-2009-opencl-presentations.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8164926991183481580'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8164926991183481580'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/hotchips-2009-opencl-presentations.html' title='Hotchips 2009 OpenCL Presentations'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6967146197260626612</id><published>2009-09-09T20:34:00.002-05:00</published><updated>2009-09-09T20:45:59.843-05:00</updated><title type='text'>LLVM - Block Implementation Specification</title><content type='html'>&lt;a href="http://clang.llvm.org/docs/BlockImplementation.txt"&gt;http://clang.llvm.org/docs/BlockImplementation.txt&lt;/a&gt;&lt;br /&gt;&lt;a href="http://clang.llvm.org/docs/BlockLanguageSpec.txt"&gt;http://clang.llvm.org/docs/BlockLanguageSpec.txt&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Interesting read for anyone looking to get a better idea of the low-level details of GCD (Grand Central Dispatch). Have not yet had the time to look at GCD in enough detail myself. However, it is very exciting for Apple to take a step in the right direction and both (1) make the OS smaller and faster, and (2) start to address issues of parallel programming.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6967146197260626612?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6967146197260626612/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/llvm-block-implementation-specification.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6967146197260626612'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6967146197260626612'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/llvm-block-implementation-specification.html' title='LLVM - Block Implementation 
Specification'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3999122153561797078</id><published>2009-09-09T20:08:00.003-05:00</published><updated>2009-09-09T20:33:23.302-05:00</updated><title type='text'>L1 Misses</title><content type='html'>&lt;a href="http://x264dev.multimedia.cx/?p=149"&gt;Dark Shikari has a really great blog post talking about L1 misses&lt;/a&gt;. Quoting from that post,&lt;br /&gt;&lt;br /&gt;&lt;i&gt;"And yet in many cases–such as in x264–much more time is wasted on L1 misses than L2 misses.&lt;br /&gt;&lt;br /&gt;The AMD processor documentation says that the L2-&gt;L1 prefetcher is not strided, and tests on Intel chips suggest the same.  This means that if we are performing, for example, an access of a block of image data that is in L2 but not L1 cache, every single line of data will cause an L1 cache miss. The benchmarks seem to agree; the first chroma motion compensation during qpel in x264 takes more than twice as long as the others!"&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;This is really a fantastic example of some of the less talked about limitations of using CPU caches and processors optimized for low latency serial computation. 
Will be interesting to see if this becomes more or less of a problem as CPU style architectures have ever larger vector units and end up using L1 more as a virtual register file.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3999122153561797078?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3999122153561797078/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/l1-misses.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3999122153561797078'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3999122153561797078'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/l1-misses.html' title='L1 Misses'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7486935103587947855</id><published>2009-09-09T19:22:00.002-05:00</published><updated>2009-09-09T20:02:03.011-05:00</updated><title type='text'>Erlang</title><content type='html'>Never really looked at Erlang, so just watched &lt;a href="http://www.infoq.com/presentations/erlang-software-for-a-concurrent-world"&gt;Erlang : Software for a Concurrent World&lt;/a&gt; to get a general overview. Lots of interesting bits in that presentation. One thing I think might be a possible weakness of Erlang is the idea of "location transparency". 
It seems that when going parallel with message passing, the location of computation and the location of data become critical factors in algorithm design (perhaps less critical when nodes are interconnected by high-radix routers). That presentation referenced &lt;a href="http://www.tilera.com/pdf/ProductBrief_TILEPro64_Web_v2.pdf"&gt;Tilera Tile64&lt;/a&gt; hardware, which I had not seen before. Interesting that the above Erlang presentation argues that shared memory doesn't scale, and yet the Tilera hardware dedicates area to a &lt;a href="http://arstechnica.com/hardware/news/2007/08/MIT-startup-raises-multicore-bar-with-new-64-core-CPU.ars"&gt;giant shared L2&lt;/a&gt;. That last link from Ars references &lt;a href="http://groups.csail.mit.edu/cag/raw/"&gt;The Raw Project&lt;/a&gt;, which I also had never seen ... &lt;i&gt;starting to feel like I have been living in a vacuum!&lt;/i&gt; Anyway, for those interested in what others are doing with parallel systems and programming, lots of interesting reading on many of the above topics is only a Google search away.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7486935103587947855?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7486935103587947855/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/erlang.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7486935103587947855'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7486935103587947855'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/erlang.html' title='Erlang'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2112585761153794620</id><published>2009-09-03T01:20:00.005-05:00</published><updated>2009-09-06T17:54:42.085-05:00</updated><title type='text'>Early September Link Dump</title><content type='html'>&lt;i&gt;Still no time for anything but links and videos...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://mygdc.gdconf.com/vault/1337"&gt;GDC 2009 Europe and Canada&lt;/a&gt;&lt;br /&gt;&lt;a href="http://highperformancegraphics.org/program.html"&gt;HPG 2009&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.mpi-inf.mpg.de/~ritschel/Microrendering/"&gt;Micro Rendering&lt;/a&gt;&lt;br /&gt;&lt;a href="http://developer.apple.com/mac/library/documentation/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html"&gt;Grand Central Dispatch&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Krakatoa particle renderings, links to these videos were posted on Twitter. 
&lt;br /&gt;&lt;br /&gt;&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/7c4WYzr30B0&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/7c4WYzr30B0&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/izKMqINXE-w&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/izKMqINXE-w&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2112585761153794620?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2112585761153794620/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/early-september-link-dump.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2112585761153794620'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2112585761153794620'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/09/early-september-link-dump.html' title='Early September 
Link Dump'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2925421307596958899</id><published>2009-08-19T22:17:00.004-05:00</published><updated>2009-08-20T01:12:13.706-05:00</updated><title type='text'>Mid-August Core Dump</title><content type='html'>&lt;i&gt;Been busy... too much interesting stuff, too little time.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Logarithmic Z Buffer&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=503094"&gt;Journal of Lethargic Programmers Post&lt;/a&gt; thanks to tweet from Jerome at Q-Games (&lt;a href="http://twitter.com/blackjero"&gt;@blackjero&lt;/a&gt;). Provides large Z-buffer range in 24-bits at the expense of destroying correct interpolation, might be something to revisit with DX11 tessellation (tessellate to the point where interpolation artifacts are not a problem).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Assembly 2009 Seminars&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://media.assembly.org/vod/2009/Seminars/"&gt;Direct link to all.&lt;/a&gt; Watched the &lt;a href="http://media.assembly.org/vod/2009/Seminars/879_Developing-the-technology-behind-Love_h264_1856kbit.mp4"&gt;Technology Behind Love&lt;/a&gt; presentation. 
Interesting tech, smart use of OpenGL buffer objects, etc, but I'm not wild about the min filter as a post process (looks better with those thin fine details).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Carmack Keynote&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.quakeunity.com/file=2919"&gt;Watch vid here.&lt;/a&gt; Refreshing to hear someone just speak their mind and not care, even if I don't agree with all of it. Not knowing people on the bottom floor was priceless! Seems like they are taking a big hit with a mess of single thread OO game code forcing adoption of a full frame of latency on all parallel job results. Sounds awfully painful to me.&lt;br /&gt;&lt;br /&gt;Best way forward IMO is to go as wide (parallel) as possible with everything including game code! A tight pipeline of independent jobs, ordered by common sets of a small number of dependencies, enables low enough latency to make use of parallel results in-frame. If one is doing this right, jobs won't look like this (example with 5 cores),&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;AAA.BBBCDD.FFFF.&lt;br /&gt;AAA.BB.CDE.FFFF.&lt;br /&gt;AA..BB.CDEEFFF..&lt;br /&gt;AA..B..CDE.FFFF.&lt;br /&gt;GGGGGGGGGGG.HIIJ&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;Note the bubbles in the pipeline, and the clear sync points between groups of jobs, and one large serial operation (G) which isn't latency friendly. Instead, the ideal result would look something like this,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;AABCDDEEFIJJ&lt;br /&gt;AAACDDEEFIJJ&lt;br /&gt;ABBCCDEEGIIJ&lt;br /&gt;AABBCDDEEEIJ&lt;br /&gt;ABBCCDEEHIIJ&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;Note how enough independent work is interleaved to handle run-time variation. For example job B can start before A is finished (D might be the first job depending on results from A). Also note the task parallel G and H, as well as how jobs go as wide as possible to keep the latency down! 
Dead simple, dead fast!&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I know, this can be tough (or even impossible) for some to do given the constraints of a legacy code base. Plus, integration of 3rd party middleware can be a real problem (and with regards to this, clearly I'm not a fan of SPURS). But sooner or later, everything will need to work this way to scale on future hardware.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Light Propagation Volumes in CryEngine 3&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.crytek.com/technology/presentations/"&gt;Grab the presentation off the Crytek Technology Page&lt;/a&gt;. Really appreciate the PS3 and 360 run-time numbers. Impressive in the trade-offs made for real-time performance. Locking the volume to world grid points (CCSM style) is a great idea. This hierarchical irradiance volume method might very well just be the beginning of all sorts of interesting approximations ... just wait until someone revisits this with DX11!&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.incrysis.com/index.php?option=com_content&amp;task=view&amp;id=818&amp;Itemid=1"&gt;GDC Europe: Cevat on The Future of Gaming Graphics&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Sweeney HPG Keynote&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://graphics.cs.williams.edu/archive/SweeneyHPG2009/"&gt;Slides here.&lt;/a&gt; Like functional programming, but NOT into either software transactional memory or cache coherency. Also, the cost slide is way misleading: the STMD shader programming model is one of the most productive and easy programming models ever; traditional C/C++ with vector intrinsics is many times more costly, given a programmer who actually knows how to solve problems using both methods! 
Let's not forget that the STMD shader/kernel model has been critical for the rapid evolution of the GPU: everything scales, everything is portable across quite different implementations!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Giga Voxels&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://artis.imag.fr/Publications/2009/CNLSE09/GigaVoxels_Siggraph09_Slides.pdf"&gt;Siggraph 2009 Slides.&lt;/a&gt; It was nice to briefly talk to Cyril at Siggraph. Cyril, keep up the great work!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2925421307596958899?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2925421307596958899/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/mid-august-core-dump.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2925421307596958899'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2925421307596958899'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/mid-august-core-dump.html' title='Mid-August Core Dump'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8117921861282781543</id><published>2009-08-13T09:28:00.002-05:00</published><updated>2009-08-14T09:53:21.820-05:00</updated><title type='text'>Siggraph 2009: Thursday - Frostbite</title><content type='html'>Beyond 
Programmable Shading I Talk,&lt;br /&gt;&lt;a href="http://s09.idav.ucdavis.edu/talks/04-JAndersson-ParallelFrostbite-Siggraph09.pdf"&gt;Parallel Graphics in Frostbite - Current &amp; Future by Johan Andersson&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The deferred lighting part of the presentation provides a glimpse into the types of things possible with DX11. &lt;u&gt;Running the light intersection step in the same kernel as the lighting is an excellent example of task parallel computation in a data parallel kernel. Note this has the advantage of keeping intermediate data on-chip!&lt;/u&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Optimization&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;In the presentation, the depth min/max computation and the parallel lighting tests are good examples of the types of problems one will need to solve using group shared memory (or shared memory for those using CUDA). At some point one will want to optimize the interactions between threads communicating through this shared memory. &lt;br /&gt;&lt;br /&gt;Below I've included some simple CUDA test results on a GTX 275, which provide an example of how to go about optimizing shared memory interaction between threads. &lt;br /&gt;&lt;br /&gt;&lt;i&gt;Note, I don't have details on DX11 hardware, so the logic below may or may not carry over to actual DX11 hardware!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;MIN/MAX REDUCTION TESTS&lt;br /&gt;&lt;br /&gt;Numbers below represent overall run time in GPU clock cycles as reported by the CUDA clock() function (took the minimum of a few tests since run time even on the GPU varies), so lower is better. The kernel returns the GPU clock() at the end point in the kernel, so I use a CPU-side min/max reduction to get a &lt;u&gt;rough&lt;/u&gt; idea of GPU time. 
The results are good for general trends, but one would have to test each in a real application to really know (and to fully verify actual results, which I only did roughly via eyeballing a few blocks).&lt;br /&gt;&lt;br /&gt;32x8 block size (256 threads, width matches CUDA warp length).&lt;br /&gt;32x32 block grid (1024x256 threads).&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;CLOCKS .... TEST&lt;br /&gt;----------------&lt;br /&gt;88286 ..... no min/max reduction, but syncthreads (overhead of test)&lt;br /&gt;113880 .... non-atomic log-time horz only&lt;br /&gt;174712 .... non-atomic log-time horz, with atomic min/max vertical&lt;br /&gt;338372 .... non-atomic linear-time horz only&lt;br /&gt;397648 .... non-atomic linear-time horz, with atomic min/max vert&lt;br /&gt;408048 .... non-atomic linear-time horz, with non-atomic linear-time vert&lt;br /&gt;1028708 ... full colliding atomic min/max&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I'm going to stress this again: it is important to note here that run time does vary (the atomic and non-atomic linear-time vertical reductions should logically take the same time), and the test reporting itself can affect run times (recording GPU clock cycles requires writing results back to memory, uses registers, and has some ALU overhead, for example a 64-bit to 32-bit conversion if I remember right). The only thing gathered from these numbers should be general trends; programmers will have to test in an actual application to see the side effects of varying shared memory requirements and register usage!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The "full colliding atomic min/max" represents the baseline case in the presentation. 
It was neat to see even this initial &lt;i&gt;"not production tested or optimized" (see page 21)&lt;/i&gt; deferred lighting demo running with 1000 very large intersecting lights running quite a bit faster than the needs for a real-time game...&lt;br /&gt;&lt;br /&gt;With the tests, I did not try all the possibilities (haven't had the time), for example doing the vertical stages first, and also I did not try a log-time vertical reduction (because it would have required lots of syncthreads).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;TERMS AND RESULTS&lt;br /&gt;&lt;br /&gt;horizontal - means across threads in a warp&lt;br /&gt;vertical - means across warps&lt;br /&gt;&lt;br /&gt;Trends verify that colliding shared memory atomics are indeed serialized on NVidia's GT200 chipset (it is only logical that they would be). Best performance I found was a non-atomic log-time parallel reduction horizontally (5 steps for a 32-wide block, each step writes results back to shared memory for use in the next step), followed by a barrier and then a vertical reduction. Knowing the hardware, it should be possible to do the vertical reduction steps without atomic operations, and without group share scatter. 
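&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A minimal sketch of the reduction order described above, simulated on the CPU in Python rather than written as a CUDA kernel. The function names, the sample depth values, and the sequential loops standing in for parallel threads are all assumptions made for illustration; this is not the code behind the timings above, and it says nothing about performance.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
def horizontal_log_reduce(row):
    """Log-time min/max across one warp-wide row (length must be a power of two)."""
    lo = list(row)  # stands in for a shared-memory array of running minimums
    hi = list(row)  # stands in for a shared-memory array of running maximums
    steps = 0
    stride = len(row) // 2
    while stride != 0:
        # On a GPU each lane i would do this at the same time; here it is a loop.
        # Lane i only writes slot i and only reads slots i and i + stride,
        # so no lane reads a slot that another lane writes in the same step.
        for i in range(stride):
            lo[i] = min(lo[i], lo[i + stride])
            hi[i] = max(hi[i], hi[i + stride])
        stride //= 2
        steps += 1
    return lo[0], hi[0], steps


def block_min_max(block):
    """Horizontal log-time pass per row, then a linear vertical pass across rows."""
    partial = [horizontal_log_reduce(row) for row in block]
    # A barrier (syncthreads in CUDA) would sit here before the vertical pass.
    block_lo = min(p[0] for p in partial)
    block_hi = max(p[1] for p in partial)
    return block_lo, block_hi, partial[0][2]


# Hypothetical 32x8 block of depth values, matching the block size used in the tests.
block = [[((x * 37 + y * 101) % 997) / 997.0 for x in range(32)] for y in range(8)]
lo, hi, steps = block_min_max(block)
print(lo, hi, steps)
```
&lt;/pre&gt;
For a 32-wide row the horizontal pass takes 5 steps, matching the step count mentioned above. The vertical pass here is the simple linear one; a real kernel would still need to decide which threads perform it and how they read the per-warp results.&lt;br /&gt;&lt;br /&gt;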
Without knowing DX11 in detail, I can only guess that this path would work in CS4 and also be the fast path on CS5 hardware.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8117921861282781543?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8117921861282781543/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-frostbite.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8117921861282781543'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8117921861282781543'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-frostbite.html' title='Siggraph 2009: Thursday - Frostbite'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-96695824894049356</id><published>2009-08-12T17:22:00.002-05:00</published><updated>2009-08-12T17:44:29.855-05:00</updated><title type='text'>Stochastic Visibility in Distorted Fisheye</title><content type='html'>Talked about this &lt;a href="http://farrarfocus.blogspot.com/2009/07/weak-computer-took-one-for-team.html"&gt;before&lt;/a&gt;, "Screenshot would have been included if computer was alive". Finally got around to doing a few screen grabs. 
&lt;br /&gt;&lt;br /&gt;Quick refresher: the engine is a modified version of the one in &lt;a href="http://farrarfocus.blogspot.com/2009/07/video-of-particles-via-l-system-and.html"&gt;this video&lt;/a&gt;. The screens below do NOT show a rendered view; instead they show a direct rendering of the {x,y,z} coordinates (modified by a transform to color them) of the nodes themselves. Yes, the scene tree is physically stored in the distorted fisheye projection; parent nodes in the hierarchy show up as the salt and pepper noise in the screen grabs. &lt;br /&gt;&lt;br /&gt;The persistent scene is 1024x1024, or 1M nodes (minus the edges of the fisheye projection), whose 8M children are processed each frame. I bumped up the number of roots in the scene tree to 64K (from the 4K I mentioned in the previous blog post).&lt;br /&gt;&lt;br /&gt;It was awesome to see the stochastic visibility algorithm work on all sorts of horrid cases like insane amounts of object to object overlap (overlap being both from rooted objects overlapping, as well as the root's tree nodes self overlapping).&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I've been distracted with the CUDA version, so I haven't had time to try the collision improvement. 
I am keeping this alive however, just in case I get surprised by the triangle rate on DX11 hardware...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;View of 64K nodes from outside the structure,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh4.ggpht.com/_bIKhQVqvDtc/SoL9vgZbzNI/AAAAAAAAABw/EYhzncwzQA4/s800/sq3.jpg"&gt;&lt;br /&gt;&lt;br /&gt;BTW, background color is a function of eye position,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh3.ggpht.com/_bIKhQVqvDtc/SoL9vXsHfdI/AAAAAAAAABs/aTSIrSbj3dY/s800/sq2.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Inside,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh3.ggpht.com/_bIKhQVqvDtc/SoL9vbOTViI/AAAAAAAAABo/BDxDHWSpsRc/s800/sq1.jpg"&gt;&lt;br /&gt;&lt;br /&gt;View of nodes self intersecting from outside,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh6.ggpht.com/_bIKhQVqvDtc/SoL965vIciI/AAAAAAAAACI/s6iuA3RI-Fs/s800/sq7.jpg"&gt;&lt;br /&gt;&lt;br /&gt;And from inside the structure (note the corners of the distorted fisheye are not part of the image, this is a 360 degree view), &lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh4.ggpht.com/_bIKhQVqvDtc/SoL96vMzKdI/AAAAAAAAACE/TDZ8_cXQBLk/s800/sq6.jpg"&gt; &lt;br /&gt;&lt;br /&gt;Other shots also with different fractal rules (different amounts of overlap, etc),&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh4.ggpht.com/_bIKhQVqvDtc/SoL-Ad9cPzI/AAAAAAAAACY/e7t5kJZ0zQM/s800/sq11.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh5.ggpht.com/_bIKhQVqvDtc/SoL97Z114wI/AAAAAAAAACU/2OLz6VckxaI/s800/sq10.jpg"&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-96695824894049356?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/96695824894049356/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/stochastic-visibility-in-distorted.html#comment-form' 
title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/96695824894049356'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/96695824894049356'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/stochastic-visibility-in-distorted.html' title='Stochastic Visibility in Distorted Fisheye'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh4.ggpht.com/_bIKhQVqvDtc/SoL9vgZbzNI/AAAAAAAAABw/EYhzncwzQA4/s72-c/sq3.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2052269239650323669</id><published>2009-08-12T11:28:00.003-05:00</published><updated>2009-08-12T12:06:07.605-05:00</updated><title type='text'>GPU Address Space Mapping II</title><content type='html'>&lt;a href="http://farrarfocus.blogspot.com/2009/08/gpu-address-space-mapping.html"&gt;This is somewhat a continuation of the topics from this prior post.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Decided to look at the issues of GPU memory from a different perspective, from the perspective of the packets of information which need to be routed around on chip. 
&lt;u&gt;I'm using the CUDA Compute 1.3 model as a rough guide, and building an oversimplified hypothetical GPU model which probably is different from the actual hardware.&lt;/u&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;The aim of this post is to cover some of the reasons GPUs have a different architecture than CPUs, and to provide a natural intuition on how to go about solving problems on the GPU.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Hypothetical GPU Model&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The GPU modeled in this post has a bunch of independent ALU units connected to a bunch of independent MCs (memory controllers) via a very fast on-chip interconnect network. If there are N MCs, then in this example global memory is banked N-way, so that memory requests can be load-balanced between the MCs. The individual bank width is large (perhaps 32 bytes) to amortize the bits used for address per read/write transaction (for efficient on-chip routing and GDDR access).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;INTERCONNECT PACKETS FOR GLOBAL MEMORY ACCESS&lt;br /&gt;&lt;br /&gt;Since Compute 1.3 supports 32B (B=byte) segments, the hypothetical hardware model routes 32B data segments to and from the MCs. Each routed data packet also includes extra bits such as segment address, type of transaction (read, write, or atomic operation), mask bits (to mask words of the operation), packet routing header, etc.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Note I'm working under the assumption that global atomic ALU ops are done at the MC (they might not be, or they might be done at the ROP in real hardware).&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Using this model, larger segment transactions (and atomics which have 2 operands, or work on 64-bit values) require multiple packets to service a single request. Below is a rough idea of the packets,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;transaction ........ 
{send} -- {receive}&lt;br /&gt;-------------------- -------------------&lt;br /&gt;write .............. {extra bits, 32B segment} -- {}&lt;br /&gt;read ............... {extra bits} -- {return extra bits, 32B segment}&lt;br /&gt;atomic ............. {extra bits, 32B atomic operand(s)} -- {}&lt;br /&gt;atomic w/ return ... {extra bits, 32B atomic operand(s)} -- {return extra bits, 32B old values}&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;FINE VS COARSE ROUTING&lt;br /&gt;&lt;br /&gt;Under the simple model above it becomes easy to see why fine granularity scatter/gather is not efficient.&lt;br /&gt;&lt;br /&gt;1. Want to stick to large packets to amortize the extra bits over the data payload. Having an address per word of data would likely cut the amount of data in a fixed size packet in half.&lt;br /&gt;&lt;br /&gt;2. Even if the packet did include an address per data word, the packet itself is routed to one MC. Given memory is banked to MCs, scatter/gather to random addresses (even if this random set of addresses contains good data locality) would likely need to go to different MCs, again cutting the amount of data per packet.&lt;br /&gt;&lt;br /&gt;Given a wide interconnect bus (large packets), an attempt to solve (2) would require coalescing buffers per MC on each ALU unit (buffer space to collect enough requests per MC to fill a large packet). As the number of MCs increases, the coalescing buffer space per MC would likely decrease (unfriendly scaling for a set buffer size). Fixed function buffer space would likely trade area from ALU, local store or register area.&lt;br /&gt;&lt;br /&gt;Note here that if packets switch from being routed to/from MCs to being routed to/from ALUs (which might be the case if the GPU had hardware support for queues for a programmable pipeline), the coalescing buffer problem remains. 
Attempting to work around this problem might require either a smaller packet size, or some kind of hierarchical network (more on this below in Intuition on GPU Problem Solving).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;TEXTURES, FRAMEBUFFERS, AND FIXED FUNCTION&lt;br /&gt;&lt;br /&gt;Texture units provide fine granularity gather. Continuing the above model, the compromise of having a huge amount of addressing/extra bits per result payload is partly mitigated by only locally routing requests to a single cluster shared texture unit per ALU. However the cost of the texture units in area is relatively large.&lt;br /&gt;&lt;br /&gt;Framebuffer ROP units provide fine granularity scatter. The compromise of high address/extra to payload data ratio is partly mitigated by only routing requests to a single ROP for a full tile of data computed at one ALU unit. The cost of ROP units in area is also relatively large!&lt;br /&gt;&lt;br /&gt;&lt;i&gt;See the common pattern: fine granularity gather/scatter, but only efficient for tight data locality. Both solve the network problem by routing to/from one location. For the ROPs, having an address mapping which maps 2D tiles to ROPs is key.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;ROPs, in conjunction with fixed function raster hardware, also provide a combined atomic operation (min or max) conditional scatter of a packet of data (think in terms of multiple render targets with Z and stencil). This is a tremendously powerful construct which is very hard to emulate efficiently in compute.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;TANGENT ON DX11&lt;br /&gt;&lt;br /&gt;A good amount of chip area is used for fixed function, so might as well use both fixed function and compute to maximize what can be done with the GPUs of the next generation. The DX11 API offers some interesting possibilities depending on what decisions IHVs made with the hardware (assuming I'm understanding DX11 correctly)...&lt;br /&gt;&lt;br /&gt;Two possibilities include,&lt;br /&gt;&lt;br /&gt;1. 
Higher fixed function triangle setup throughput. The fast path with the low to mid-end cards of the DX10 generation for data scatter was to do point scatter with all the computation done in the vertex shader and use the fixed function hardware to scatter via the ROP units. Towards higher end cards, triangle rates did NOT scale enough to make this the fast path.&lt;br /&gt;&lt;br /&gt;2. RT UAVs. Render Target Unordered Access Views could be an interesting construct to play with, if this path accesses memory via a tiled mapping (possibly backed by a little cache). Meaning in theory RT UAVs might support data local scatter/gather better than via global memory access. Better could mean either lower bandwidth or lower ALU work (which would be required to simulate scatter via compute methods).&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Going to have to wait for DX11 cards to go public to know for sure.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Essence of the GPU&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;MAKING THE COMMON CASE FAST&lt;br /&gt;&lt;br /&gt;For a program to scale, the majority of work must be independently parallel.&lt;br /&gt;&lt;br /&gt;1. The CPU does not provide an avenue to solve problems which require high parallel throughput.&lt;br /&gt;&lt;br /&gt;2. However the GPU provides both parallel throughput and the option to trade parallel throughput for emulation of non-parallel constructs.&lt;br /&gt;&lt;br /&gt;With high parallel throughput, other less used constructs (less used because they would not scale, for example memory coherency, fine granularity random access, etc) can be emulated in software on parallel hardware. 
This also gives the programmer the ability to use extra program specific knowledge to tailor a better (than general purpose CPU) solution to the problem.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;THE LIE THAT GPU PROGRAMMING IS INHERENTLY HARD&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Had the situation been reversed, with a majority of programmers programming on highly parallel machines, then given a serial machine the common programmer would have a really hard time solving problems!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Intuition on GPU Problem Solving&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;BUILD A SCALABLE NETWORK, AND YOU HAVE THE SOLUTION TO YOUR PROBLEM!&lt;br /&gt;&lt;br /&gt;Best technique I've found is to describe the problem in terms of a network (or graph) of transforms on the data. Common problems, such as the following, all have a similar solution.&lt;br /&gt;&lt;br /&gt;1. Data transfer between nodes in the network is too fine grained.&lt;br /&gt;&lt;br /&gt;2. Need to scatter to too many nodes, or gather from too many nodes.&lt;br /&gt;&lt;br /&gt;3. The node requires too much local store space.&lt;br /&gt;&lt;br /&gt;The solution is to build some kind of hierarchical network and divide the problem into smaller parts. This type of problem solving is the parallel version of the divide and conquer method common to solving classic serial CPU problems (like sorting). &lt;br /&gt;&lt;br /&gt;&lt;i&gt;Parallel machines force you to factor your problem. How the problem is factored (or how the cost of computation or communication is amortized) can be seen directly in the network (or graph) that is the solution to the problem!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;With serial CPU programming, for the most part, programmers do not even think about the cost of gathering data to process because of the giant fixed function CPU and memory hierarchy. With a parallel machine, the cost to gather data to process is the cost to route that data through the network. 
With current GPUs, the data transferred between distant nodes in the network (or graph) gets transferred to and from global memory. Near nodes (threads) communicate through a local store (small shared memory).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;LINK TO BRAIDED PARALLELISM&lt;br /&gt;&lt;br /&gt;Given the ability for the hierarchical elements of the network to scale in width (amount of parallelism) and height (of a reduction tree for example), the solution can scale to either problem size or machine width.&lt;br /&gt;&lt;br /&gt;Two core options for the network/graph (can mix/match these),&lt;br /&gt;&lt;br /&gt;1. Fixed routing, jobs are draw calls (actual draws or compute invocations). For a fixed graph, run time can be bounded to a relatively fixed cost (easy to meet frame rate requirements for example).&lt;br /&gt;&lt;br /&gt;2. Braided parallelism via variable routing, uber-kernel with "locked" threads. Enables the network/graph to adapt at run-time to the varying needs of the program.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;In Closing&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;HYPOTHETICAL GPU MODEL VS REAL HARDWARE&lt;br /&gt;&lt;br /&gt;Quick CUDA tests on a GTX 275 of a simple non-atomic global read-increment-write kernel, with varying amounts of possible segment bank collisions, show quite different run times. Performance when all blocks access the same segment is many times worse than when blocks access different memory segments. This result is predicted well by the hypothetical GPU model described at the start of this post.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Note actual global memory performance on the GTX 275 is quite complex. 
For example, when running more than one block per multiprocessor, in at least one test it appeared that it is more important that the blocks access different segment banks than that the individual warps in a given block access different banks (suggests scheduling effects).&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2052269239650323669?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2052269239650323669/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/gpu-address-space-mapping-ii.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2052269239650323669'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2052269239650323669'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/gpu-address-space-mapping-ii.html' title='GPU Address Space Mapping II'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1693036085720675830</id><published>2009-08-11T15:18:00.000-05:00</published><updated>2009-08-11T15:24:21.198-05:00</updated><title type='text'>V-Ray RT and CUDA</title><content type='html'>&lt;a href="http://www.cgarchitect.com/news/SIGGRAPH-2009-CHAOS-GROUP-GPU.shtml"&gt;Video from Siggraph, CUDA V-Ray RT demo on a GTX 285, thanks to cgarchitect.&lt;/a&gt;&lt;div 
class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1693036085720675830?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1693036085720675830/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/v-ray-rt-and-cuda.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1693036085720675830'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1693036085720675830'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/v-ray-rt-and-cuda.html' title='V-Ray RT and CUDA'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3863207912651898020</id><published>2009-08-10T00:50:00.003-05:00</published><updated>2009-08-10T00:58:06.806-05:00</updated><title type='text'>Macton’s Posterous</title><content type='html'>&lt;a href="http://macton.posterous.com/roundup-recent-sketches-on-concurrency-data-d"&gt;Mike Acton: Recent Sketches on Concurrency&lt;/a&gt; ... 
almost lost this set of awesome presentations in the twitter sphere!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3863207912651898020?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3863207912651898020/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/mactons-posterous.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3863207912651898020'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3863207912651898020'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/mactons-posterous.html' title='Macton’s Posterous'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6288055922266055594</id><published>2009-08-10T00:33:00.002-05:00</published><updated>2009-08-10T00:39:44.243-05:00</updated><title type='text'>Assembly 2009</title><content type='html'>&lt;object width="640" height="378"&gt;&lt;param name="movie" value="http://capped.micksam7.com/playeralt.swf?vid=cncd_orange_fairlight-frameranger" /&gt;&lt;param name="wmode" value="transparent" /&gt;&lt;param name="allowFullScreen" value="true" /&gt;&lt;param name="bgcolor" value="#000000" /&gt;&lt;embed src="http://capped.micksam7.com/playeralt.swf?vid=cncd_orange_fairlight-frameranger" wmode="transparent" bgcolor="#000000" allowFullScreen="true" 
width="640" height="378" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="640" height="380"&gt;&lt;param name="movie" value="http://capped.micksam7.com/playeralt.swf?vid=portal_process-hotel_bar_heroes" /&gt;&lt;param name="wmode" value="transparent" /&gt;&lt;param name="allowFullScreen" value="true" /&gt;&lt;param name="bgcolor" value="#000000" /&gt;&lt;embed src="http://capped.micksam7.com/playeralt.swf?vid=portal_process-hotel_bar_heroes" wmode="transparent" bgcolor="#000000" allowFullScreen="true" width="640" height="380" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="640" height="380"&gt;&lt;param name="movie" value="http://capped.micksam7.com/playeralt.swf?vid=youth_uprising_mlat_design_out-muon_baryon" /&gt;&lt;param name="wmode" value="transparent" /&gt;&lt;param name="allowFullScreen" value="true" /&gt;&lt;param name="bgcolor" value="#000000" /&gt;&lt;embed src="http://capped.micksam7.com/playeralt.swf?vid=youth_uprising_mlat_design_out-muon_baryon" wmode="transparent" bgcolor="#000000" allowFullScreen="true" width="640" height="380" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="640" height="378"&gt;&lt;param name="movie" value="http://capped.micksam7.com/playeralt.swf?vid=sqny-dollop" /&gt;&lt;param name="wmode" value="transparent" /&gt;&lt;param name="allowFullScreen" value="true" /&gt;&lt;param name="bgcolor" value="#000000" /&gt;&lt;embed src="http://capped.micksam7.com/playeralt.swf?vid=sqny-dollop" wmode="transparent" bgcolor="#000000" allowFullScreen="true" width="640" height="378" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6288055922266055594?l=farrarfocus.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6288055922266055594/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/assembly-2009.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6288055922266055594'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6288055922266055594'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/assembly-2009.html' title='Assembly 2009'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3062828578342494219</id><published>2009-08-10T00:08:00.003-05:00</published><updated>2009-08-10T00:20:46.685-05:00</updated><title type='text'>Siggraph 2009: Thursday - State</title><content type='html'>&lt;i&gt;Continuing Beyond Programmable Shading topics...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;GPU vs CPU State&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;TLB/etc on CPU provide virtual memory via address mapping. GPU state (such as buffers, textures and render targets) serves a similar purpose as the TLB/etc in terms of providing a special address mapping, however GPU state is more useful in that GPU state provides the information which enables an address mapping designed for efficient 2D and 3D data locality. 
&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Not sure if GPUs will someday get virtual memory, I can see great arguments both ways.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;CPU provides ISA which enables changing TLB, page tables, and other system state. In contrast GPUs provide a fixed function command buffer interface to change GPU state.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;At some point the graphics API (and possibly the GPU) will need the ability to change GPU state without the CPU in order to keep on evolving...&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3062828578342494219?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3062828578342494219/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-state.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3062828578342494219'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3062828578342494219'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-state.html' title='Siggraph 2009: Thursday - State'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1038658983417941243</id><published>2009-08-09T23:06:00.003-05:00</published><updated>2009-08-10T00:05:25.495-05:00</updated><title type='text'>Siggraph 2009: 
Thursday - Braided Parallelism II</title><content type='html'>&lt;i&gt;Continuing Beyond Programmable Shading topics...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Braided Parallelism II&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Along with braided parallelism comes the issue of programs with more variable run-times. Variable run-time programs imply the need to run more than one kind of program at the same time on the same core (perhaps through some future MPMD machine, or emulated MPMD via an uber-binary on an SPMD machine).&lt;br /&gt;&lt;br /&gt;Core to both of these issues is the possible problem of register allocation between running programs. Static allocation of chunks of registers to a program could become inefficient when simultaneously running multiple different programs which have different amounts of register occupancy or running programs which vastly vary in the number of live registers over the course of execution.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Not sure how much of a problem this will be in practice in typical use cases.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The GPU wants to maximize the number of running programs which fit in the local register and shared-register space (also called group share or shared memory). This is important to maximize the amount of latency hiding capacity.&lt;br /&gt;&lt;br /&gt;Larrabee skirts this issue with a smaller register file and a cache hierarchy. The program's stack tends to provide a good description of data locality (top of the stack is more likely to get used). I'm going to assume that register files are less area efficient than a block accessed local memory (because of the crossbar(s), etc). 
However having a coherent cache adds other complexities such as a TLB (TLB issues covered in another post).&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Will be interesting to see how ATI and NVIDIA GPUs evolve with regards to this issue...&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1038658983417941243?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1038658983417941243/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-braided_09.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1038658983417941243'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1038658983417941243'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-braided_09.html' title='Siggraph 2009: Thursday - Braided Parallelism II'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-4054317644822620766</id><published>2009-08-09T15:46:00.004-05:00</published><updated>2009-08-09T16:21:13.967-05:00</updated><title type='text'>Siggraph 2009: Thursday - Issue of Coherent Cache</title><content type='html'>&lt;i&gt;Continuing Beyond Programmable Shading topics...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Coherent Caches&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;1. 
Given equal estimated {area, process, power}, where is the performance advantage of switching chip area from ALU to CACHE, and what is the right balance?&lt;br /&gt;&lt;br /&gt;2. Given a performance advantage is found in (1), is the advantage from simply having a larger local store, or is cache coherency a critical factor?&lt;br /&gt;&lt;br /&gt;3. How do algorithms which make use of cache coherency scale as GPUs continue to grow ever wider and more parallel?&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Anyone else notice that a majority of the examples Intel has shown on their rendering pipeline treat the cache as a program managed local store?&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-4054317644822620766?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/4054317644822620766/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-issue-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4054317644822620766'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4054317644822620766'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-issue-of.html' title='Siggraph 2009: Thursday - Issue of Coherent Cache'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3783337302510255524</id><published>2009-08-09T14:52:00.004-05:00</published><updated>2009-08-10T01:23:16.487-05:00</updated><title type='text'>Siggraph 2009: Thursday - Braided Parallelism</title><content type='html'>&lt;i&gt;For me Thursday was a full day of Beyond Programmable Shading I and II, &lt;a href="http://s09.idav.ucdavis.edu/"&gt;slides can be found here&lt;/a&gt;. There were a bunch of rather interesting talks and topics, and I'm going to attempt to elaborate on some of the core themes in a bunch of posts starting with this one...&lt;/i&gt; &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Braided Parallelism&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Scalability of a data and task parallel work mix is limited by data dependencies. It is the programmer's responsibility to ensure there is enough independent work to do between dependencies.&lt;br /&gt;&lt;br /&gt;One important difference between task and data parallel work is that task parallel work cannot fill the entire width (all cores across the entire chip) of the machine at one time, while data parallel work often can. &lt;br /&gt;&lt;br /&gt;Given this crucial observation, the real meaning of braided parallelism becomes clear: &lt;u&gt;efficient braided parallelism implies that a machine can pipeline and execute (without idle ALU units) jobs which do NOT fill the entire parallel width of the machine.&lt;/u&gt; &lt;br /&gt;&lt;br /&gt;Task parallel work &lt;a href="http://macton.smugmug.com/gallery/8936708_T6zQX#593421933_6jT9t"&gt;likely involves data parallel elements&lt;/a&gt;, so perhaps a new term "task data parallel" should be used. 
As a machine gets wider, more and more work will likely fall into the task data parallel category.&lt;br /&gt;&lt;br /&gt;A CPU like architecture is NOT needed for braided parallelism, and NVidia's OptiX ray engine is a good example of this. Seems safe to predict that future GPUs (and perhaps DX11 GPUs) will continue to get better at pipelining small jobs, and thus more efficient at braided parallelism.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Will be covering issues regarding the efficiency of braided parallelism on GPUs in later posts...&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3783337302510255524?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3783337302510255524/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-braided.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3783337302510255524'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3783337302510255524'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-braided.html' title='Siggraph 2009: Thursday - Braided Parallelism'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1547908641001288113</id><published>2009-08-09T14:41:00.003-05:00</published><updated>2009-08-09T14:49:11.146-05:00</updated><title type='text'>Siggraph 2009: Thursday - Id Tech 5</title><content type='html'>&lt;b&gt;Id Tech 5 Challenges&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Highlights of the Beyond Programmable Shading &lt;a href="http://s09.idav.ucdavis.edu/talks/05-JP_id_Tech_5_Challenges.pdf"&gt;presentation&lt;/a&gt; from Id.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Completely off topic, but I really like the sky-box seen in the screen shots. Great atmospheric diffusion, and a contrast range which photographers get through careful use of a polarizing filter to darken the sky in combination with proper exposure control (perhaps with a graduated neutral density filter) to keep the highlights from burning out too much.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;VIRTUAL TEXTURING&lt;br /&gt;&lt;br /&gt;The screen shots show how they map their UV texture space into the 128K x 128K texels. One can see a mix of world textures and character face textures in the virtual texture pages. Surely there is a relatively high importance in keeping good spatial locality for large multi-tile reads to minimize seek time.&lt;br /&gt;&lt;br /&gt;They are not using tri-linear filtering, and instead use bi-linear only. In order to handle the LOD pop issue (caused by optical media seek latency going from a low resolution to high resolution tile), they first up-sample the lower mip, then continuously re-transcode (blend up-sampled tile with decoded tile from higher disc compression, then re-compress to DXT format and update the physical tile on the GPU). This should provide a visually seamless experience, in which detail smoothly blends in. 
&lt;br /&gt;&lt;br /&gt;Over-subscription of the physical texture space is solved by adjusting a global LOD bias slider until the entire working set fits.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;JOB SYSTEM&lt;br /&gt;&lt;br /&gt;Their job system has job lists (each of which is fully independent) with signal and block-on-signal constructs to enforce scheduling dependencies. By design, job processing is given 1 frame of latency to complete.&lt;br /&gt;&lt;br /&gt;They provided some approximate job costs,&lt;br /&gt;&lt;br /&gt;2 ms - animation blending&lt;br /&gt;2 ms - sorting for transparency&lt;br /&gt;4 ms - collision detection&lt;br /&gt;4 ms - obstacle avoidance&lt;br /&gt;4 ms - misc&lt;br /&gt;4 ms - audio&lt;br /&gt;8 ms - virtual texturing&lt;br /&gt;10 ms - rendering&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;RENDERING&lt;br /&gt;&lt;br /&gt;They have sorted alpha blended foliage and particles. They also have a system for detail model generation which adds items such as rocks and pebbles (perhaps similar to a more advanced shrub system). This is an interesting blend between unique texturing, and instanced stuff to add tiny details.&lt;br /&gt;&lt;br /&gt;Not in the presentation, but mentioned after the talk, they ultimately decided not to use Edge or SPU geometry processing. 
Seems as if the trade off was between higher performance (especially for a vertex bound smaller virtual texture feedback render target) at the cost of needing to keep vertex and index data CPU side for SPU processing, and likely they had free GPU memory (the physical texture for virtual texturing is limited by GPU texture size limits) but were pushing the limit on CPU memory.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;GPU JOBS&lt;br /&gt;&lt;br /&gt;They mentioned that they are starting to look into running some jobs GPU side, investing in some forward looking usage of the compute capacity of current and new hardware.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1547908641001288113?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1547908641001288113/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-id-tech-5.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1547908641001288113'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1547908641001288113'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-thursday-id-tech-5.html' title='Siggraph 2009: Thursday - Id Tech 5'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-4954134944328444394</id><published>2009-08-05T18:31:00.008-05:00</published><updated>2009-08-06T01:47:57.905-05:00</updated><title type='text'>Siggraph 2009: Wednesday</title><content type='html'>&lt;b&gt;Alternative Rendering Pipelines in CUDA&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://developer.nvidia.com/object/siggraph-2009.html"&gt;Slides Off This Page&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The first talk was about ray tracing and REYES in CUDA, the second on &lt;a href="http://www.nvidia.com/object/optix.html"&gt;OptiX&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;u&gt;There was one very important overwhelming theme NVidia presented which marks a distinct shift in how to program in CUDA (and get at the fast path on current hardware). The shift is the idea of persistent threads pulling work from queues and the usage of uber-kernels!&lt;/u&gt;&lt;br /&gt;&lt;br /&gt;This has been hinted at prior to Siggraph in some GRAMPS discussion, and has been detailed in the &lt;a href="http://farrarfocus.blogspot.com/2009/07/gpu-ray-traversal-efficiency_09.html"&gt;Understanding the Efficiency of Ray Traversal on GPUs&lt;/a&gt; paper. The core problem with hardware block scheduling is that one single long-running thread stalls the entire block. The solution to this problem is to issue only enough threads to fill the GPU once, and then pull work from queues (use work stealing to balance cores). Stalls can be limited to warp granularity, and warps can run separate "kernels" from one giant uber-kernel. Note this is effectively almost a form of software MPMD on an SPMD machine.&lt;br /&gt;&lt;br /&gt;The cost of warp-coherent dynamic branching to different sub-kernels is either linear, or logarithmic with a binary search (since CUDA doesn't currently have branch by register). 
Also register count for the uber-kernel is going to be the maximum register count used by any sub-kernel, but the advantage in ALU utilization from software scheduling greatly out-weighs the possible less efficient register usage for some sub-kernels!&lt;br /&gt;&lt;br /&gt;The talk did very briefly cover the best way to fetch from a queue in CUDA (but not in enough detail IMO, likely because of a lack of time to go there). Anyway I've got a 2nd huge CUDA mega post in progress which covers many parallel programming constructs in detail which can be useful in the new age of alternative and conventional rendering pipelines which use "compute". Covering things like the performance of shared memory atomics (serialization cases, when to avoid them, etc), CS4 vs CS5, and more. This is going to have to wait for next week, but it is going to be a great follow up to details which are either too technical or too time consuming to place in these great Siggraph talks.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;OptiX&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;OptiX is NVidia's new interactive ray tracing engine, designed to provide a generic shader interface which will scale by GPU and scale to future NVidia architectures.&lt;br /&gt;&lt;br /&gt;&lt;u&gt;Seems like the core CUDA of OptiX is shader and ray software scheduling via uber-kernel and persistent threads. Where each ray has a shader defined state packet, and shaders can spawn rays.&lt;/u&gt;&lt;br /&gt;&lt;br /&gt;There are two very important programming constructs likely being used here (arg, missed what I think was a HPG talk on this before Siggraph): &lt;br /&gt;&lt;br /&gt;1. CONTINUATION BASED PROGRAMMING. Looks like the shader defined ray state packet gets passed between sub-kernels. Likely ray creation places ray on queue with packet.&lt;br /&gt;&lt;br /&gt;2. PERSISTENT THREADS WITH CO-OPERATIVE MULTI-TASKING? Looks like shaders can have local data and spawn rays. 
So likely at some set points while a shader is running (i.e. after it needs to wait on a ray result), local state is returned to the continuation queue, and the uber-kernel thread scheduler pulls another warp collection of coherent jobs to run? I need to think this through in detail and perhaps try it before I will know for sure.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Something Awesome Is Brewing&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Look at what constructs are core to OptiX, and think about how OptiX is designed for both current and future GPUs! Everyone is in the race towards redefining graphics in terms of a "general purpose parallel machine" and NVidia is showing an elegant, easy to program model which scales forward via OptiX.&lt;br /&gt;&lt;br /&gt;A little future hardware acceleration to efficiently support thread continuations (and dynamic grouping for efficient parallel execution), and you've got something truly wonderful.&lt;br /&gt;&lt;br /&gt;That might not be it, "it" could be something totally different, but whatever "it" is, "it" is going to be awesome, and I'm overflowing with ideas on how to build things with "it".&lt;br /&gt;&lt;br /&gt;Via Siggraph you get a glimpse into the minds of some of those involved in shaping the future of graphics. There is something magical about the insane competition going on between Intel, AMD, and NVidia. Something which I believe is going to move us into the age of fully programmable graphics in the not so distant future!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3D Display&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;First, I don't see 3D as a gimmick. In fact I just about only go to the movie theater for 3D movies (everything else is Blu-ray or DVD for me), so I'm very biased. I'd like the technology to take off.&lt;br /&gt;&lt;br /&gt;Had a chance to try out NVidia's 3D glasses, and was impressed. The high refresh rate was very smooth. 
Unfortunately (for me, but great for their target audience) the example they had was an interactive three screen volume raycast of a scanned human body, so actual display update was 10 fps or so. Didn't get a chance to see it under high motion, or high depth range (between ultra foreground and horizon). Apparently you render both frames then do the swap, so one gets 60Hz per eye of physical scan out, with rendered frames at or under 60Hz. &lt;br /&gt;&lt;br /&gt;One other vendor had a real 3D display, no glasses required (perhaps something over the display to ensure interleaved vertical lines only hit the left or right eye). Another vendor had a 2 monitor solution, one upright, the other table top, and a half angle physical sheet to combine both polarized views.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;OpenGL BOF&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Missed it, got busy after the GRAMPS talk, did walk in at the end and got rewarded with a great 3.2 API Quick Reference Card. &lt;br /&gt;&lt;br /&gt;Didn't realize it at the time, since I don't really know people yet, but likely gave Barthold at NVidia trouble about the divergent DX11 and GL paths (how DX11 mixes compute and graphics while GL and CL are separate, which is going to make porting apps a problem, as well as things like tessellation which likely could benefit from compute features). &lt;br /&gt;&lt;br /&gt;What I got in return were some very insightful responses, and good humor even given my very direct comments! Thanks Barthold, or whoever at Khronos/NVidia, for putting up with me!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;One Day Down&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Finished off Wednesday at Siggraph, and realized I'm a lot more scatter brained in person than I thought I would be when meeting many people I've never talked to before in person. 
Must be the programmers curse, great at language through code, traded off for bad at language through speech?&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-4954134944328444394?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/4954134944328444394/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-wednesday.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4954134944328444394'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4954134944328444394'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/siggraph-2009-wednesday.html' title='Siggraph 2009: Wednesday'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6060284303259261904</id><published>2009-08-04T14:47:00.003-05:00</published><updated>2009-08-04T15:22:15.116-05:00</updated><title type='text'>Off to Siggraph</title><content type='html'>Off to Siggraph 2009 tomorrow, be in around Wednesday at noon. 
&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Any of you run into this ugly dude below and want to talk about graphics, please do!&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3288/2958955288_6a7a6cb47e_m.jpg"&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6060284303259261904?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6060284303259261904/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/off-to-siggraph.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6060284303259261904'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6060284303259261904'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/off-to-siggraph.html' title='Off to Siggraph'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3288/2958955288_6a7a6cb47e_t.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7360545525554009627</id><published>2009-08-04T09:42:00.000-05:00</published><updated>2009-08-04T09:43:34.561-05:00</updated><title type='text'>OpenGL 3.2</title><content type='html'>&lt;a href="http://www.opengl.org/registry/"&gt;Link to GL and GLSL specs.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The Khronos Group continues their 
steady stride on OpenGL improvements! Really appreciate the hard work, dedication, and the ability to download spec with changes highlighted!&lt;br /&gt;&lt;br /&gt;- 64-bit integer support&lt;br /&gt;- geometry shader&lt;br /&gt;- BGRA support&lt;br /&gt;- Draw*BaseVertex(), provides index[i]+base_vertex indexing&lt;br /&gt;- support for binding buffer objects to index targets&lt;br /&gt;- texture fetch from multi-sample textures&lt;br /&gt;- texture fetch from integer buffer texture&lt;br /&gt;- texture fetch from integer 2D rect texture&lt;br /&gt;- seamless cube map filtering&lt;br /&gt;- SampleMaski() for multi-sample coverage mask&lt;br /&gt;- new simple FramebufferTexture() call&lt;br /&gt;- fence and wait on fence support&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7360545525554009627?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7360545525554009627/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/opengl-32.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7360545525554009627'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7360545525554009627'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/opengl-32.html' title='OpenGL 3.2'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-4719221662002775447</id><published>2009-08-04T08:14:00.002-05:00</published><updated>2009-08-04T08:27:14.805-05:00</updated><title type='text'>Aras's Compact Normal Storage</title><content type='html'>&lt;a href="http://aras-p.info/texts/CompactNormalStorage.html"&gt;Aras Pranckevičius : Compact Normal Storage&lt;/a&gt; - Impressive method of compressing view space normals into 2 bytes using a sphere-map transform! Page compares with other methods and includes GPU assembly output, timing, and error screen shots.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-4719221662002775447?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/4719221662002775447/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/arass-compact-normal-storage.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4719221662002775447'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4719221662002775447'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/arass-compact-normal-storage.html' title='Aras&apos;s Compact Normal Storage'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8151684110366480116</id><published>2009-08-03T09:14:00.002-05:00</published><updated>2009-08-03T09:29:15.176-05:00</updated><title type='text'>EAA Airshow</title><content type='html'>&lt;img src="http://farm3.static.flickr.com/2639/3785186096_4674b931af.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2632/3784383801_887cf29442.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Had a great time at the annual EAA Airventure in Oshkosh, Wisconsin this weekend. Each year pyro junkies get to experience the instant wave of heat and delayed boom of the "wall of flame". However, the best thing at the show each year IMO is the talk given by &lt;a href="http://en.wikipedia.org/wiki/Burt_Rutan"&gt;Burt Rutan&lt;/a&gt;, in which you get a glimpse into the mind of the most important air and space visionary of our time!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8151684110366480116?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8151684110366480116/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/eaa-airshow.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8151684110366480116'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8151684110366480116'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/eaa-airshow.html' title='EAA Airshow'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2639/3785186096_4674b931af_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8406839429400530441</id><published>2009-08-03T07:32:00.003-05:00</published><updated>2009-08-03T08:49:52.532-05:00</updated><title type='text'>As Seen on Twitter and More</title><content type='html'>&lt;i&gt;&lt;a href="http://twitter.com/mike_acton"&gt;@mike_acton&lt;/a&gt;: A quick sketch on why qsort is not a concurrent algorithm. Feedback welcome, as always! &lt;a href="http://macton.smugmug.com/gallery/9114809_C9awM#607513208_xqWYf"&gt;link&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I'm a big fan of Mike's presentations, the above covering parallel merge sort. I'd also highly suggest, &lt;a href="http://macton.smugmug.com/gallery/8936708_T6zQX/1/593426709_ZX4pZ#593426709_ZX4pZ"&gt;Typical C++ BS&lt;/a&gt;, also posted on Twitter (in fact it was after seeing this only on twitter that I decided to get an account).&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;a href="http://twitter.com/bkaradzic"&gt;@bkaradzic&lt;/a&gt;: Watching winning 4kb intro from Evoke09... 
Lunaquatic by BluFlame&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/nlSIrjx8AXE&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/nlSIrjx8AXE&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.pcgameshardware.com/aid,684381/Blade-Runner-meets-Cryengine-2-Amazing-Sci-Fi-images/News/"&gt;Blade Runner Meets CryEngine 2&lt;/a&gt; - Great indirect advertising for Cry Engine 2.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8406839429400530441?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8406839429400530441/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/as-seen-on-twitter-and-more.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8406839429400530441'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8406839429400530441'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/as-seen-on-twitter-and-more.html' title='As Seen on Twitter and More'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8570142190677291933</id><published>2009-08-02T15:43:00.002-05:00</published><updated>2009-08-02T15:51:32.296-05:00</updated><title type='text'>GPU Address Space Mapping</title><content type='html'>&lt;i&gt;This post addresses some of the issues with texture and 2D/3D array address mappings when mixing compute and fixed function as well as dealing with cache lines and the higher granularity of global memory access of GPUs. &lt;br /&gt;&lt;br /&gt;For those developers moving on to compute (CUDA, OpenCL, DX11 CS4/CS5), this is going to be an important thing to have a full understanding of!&lt;br /&gt;&lt;br /&gt;I would suggest reading Jake Cannell's &lt;a href="http://enterthesingularity.blogspot.com/2009/07/winning-my-own-little-battle-against.html"&gt;Winning my own little battle against Cuda Post&lt;/a&gt; for a great example of the problems with mixing fixed function texturing and compute.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Review&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;CACHE LINES&lt;br /&gt;&lt;br /&gt;I'm using the term cache line to refer to both the cached and un-cached case. In the un-cached case, it would be the optimal size of a global memory operation. Cache line size for example,  is often 64-bytes for current PCs, nearly sure 64-bytes for Larrabee, and at least for later model NVidia GPUs, global memory segment size is a mix of 32/64/128-bytes depending on coalescing (not sure what actual texture cache line size is). &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;LINEAR MAPPING&lt;br /&gt;&lt;br /&gt;2D array elements (or image pixels) are addressed in linear order. 
A 64-byte cache line of 32-bit elements holds 16 elements, laid out as follows,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;0123456789ABCDEF&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;Often the array width is a power of two, in which case the element address for this 32-bit/element array can be computed using the following equation (3 OPS : 2 SHIFTs, 1 ADD),&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;address = (x &lt;&lt; 2) + (y &lt;&lt; constant)&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;where,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;constant = 2 + log_base_2(width)&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;SPACE FILLING CURVE AND/OR TILED MAPPING&lt;br /&gt;&lt;br /&gt;Elements are addressed using some kind of space filling curve. This space filling curve should result in a cache line representing a tile of elements in the array. The tiles themselves can be addressed in linear order (easier for software to do the address translation). Or perhaps the tiles are also addressed in a space filling curve (which might provide benefits in the case of hardware predictive cache line pre-fetching). 
Or perhaps the entire array is accessed using a consistent space filling curve (no set tiles), in the case where the hardware provides a full bit-interleave instruction (Larrabee provides 2 such instructions).&lt;br /&gt;&lt;br /&gt;The space filling curve might be Z-order (or full bit-interleaved), which would be a very expensive transform to do in software without special instructions (following the 16 elements per cache line example),&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;0145&lt;br /&gt;2367&lt;br /&gt;89CD&lt;br /&gt;ABEF&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;Or perhaps the tiles are wide Z-order (multi-bit-interleaved), which is much easier to compute in software when the ISA is lacking a bit-interleave instruction,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;0123&lt;br /&gt;4567&lt;br /&gt;89AB&lt;br /&gt;CDEF&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;Still, address translation can be expensive (11 OPS : 4 ANDs, 4 SHIFTs, 3 ADDs) even for just linear tile address with multi-bit-interleave for the tiles and power of two sized array width (note only need 3 shifts if elements are one byte, or if one can make use of an ISA which provides a "free" shift in the instruction addressing, like x86),&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;hiX = x &amp; mask0&lt;br /&gt;hiY = y &amp; mask1&lt;br /&gt;loX = x &amp; mask2&lt;br /&gt;loY = y &amp; mask3&lt;br /&gt;&lt;br /&gt;address = (hiX &lt;&lt; shift0) + (hiY &lt;&lt; shift1) + (loX &lt;&lt; shift2) + (loY &lt;&lt; shift3)&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Problems&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;2D/3D LOCALITY AND JOB DISTRIBUTION&lt;br /&gt;&lt;br /&gt;GPUs group pixels and compute jobs into 2D tiles (or blocks) for efficient computation (to maintain good 2D data locality). DX11 provides 3D blocks for CS5. Job or block distribution across cores in the GPU is NDA material, but it is possible to reverse engineer this via careful programming (in some cases). 
There are all sorts of complexities here, such as a cluster of cores sharing a texture cache.&lt;br /&gt;&lt;br /&gt;Linear addressing of 2D data often leads to low cache line utilization and cache line thrashing. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;COMPUTE ACCESS TO TEXTURES LIMITED COMPARED TO FF/ROP&lt;br /&gt;&lt;br /&gt;To my knowledge, texture access in compute is limited to linear addressed textures. This may or may NOT change for DX11. And if it does, it might not change for all vendors. In theory it should be possible for Larrabee; it all depends on what the drivers do!&lt;br /&gt;&lt;br /&gt;ROP units provide coalescing, perhaps in the form of write-combine caching and/or in the form of queuing ROP requests. Either way fixed function render targets are rather impressively fast, even for fine granularity scatter (on a mid-range NVidia DX10 card, try comparing point scatter via drawing points in GL against global memory scatter in CUDA). Fixed function ROP also provides write to MSAA render targets.&lt;br /&gt;&lt;br /&gt;Compute global accesses are coarse granularity accesses, and might stay un-cached on some GPUs.&lt;br /&gt;&lt;br /&gt;The point here is that compute access to render targets might NOT be as fast as the fixed function rendering path. Clearly Larrabee could be an exception here.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;MIXING CS AND PS&lt;br /&gt;&lt;br /&gt;If a CS pass builds something for a PS pass, then the PS pass might have to sample from a linear addressed texture. If texture cache access or memory bandwidth becomes a bottleneck, and only IF, there might be a few possible work-arounds.&lt;br /&gt;&lt;br /&gt;If the texture is not filtered then one can work in a linear texture and do address translation manually. 
Even if one doesn't know the actual GPU texture address to texel mapping, one could test different software efficient mappings to see if a good ALU to TEX trade-off can be made.&lt;br /&gt;&lt;br /&gt;Often thread groups (or blocks) in compute passes will all be accessing a shared "tile" of data anyway, so tile address mapping in CS might not be as bad if one computes a linear tile address, and then uses constants for the in-tile pixel ordering. &lt;br /&gt;&lt;br /&gt;In the filtered case, CS producing data for filtered TEX lookups, the graphics and compute APIs don't currently provide a good solution (unless that solution is hidden behind the driver and hardware). &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;BYTES PER PIXEL VS BYTES PER CACHE LINE&lt;br /&gt;&lt;br /&gt;In the physical cached case, as the number of bytes per pixel approaches bytes per cache line, linear vs tiled address order "should" become less important. For example with 64-byte cache lines, &lt;br /&gt;&lt;br /&gt;&lt;tt&gt;4... 8... 16 32 --- bytes per pixel&lt;br /&gt;---- ---- -- --&lt;br /&gt;0123 0123 01 01 --- possible software cache line tile mapping&lt;br /&gt;4567 4567 23&lt;br /&gt;89AB&lt;br /&gt;CDEF&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;A G-buffer or some other fat "logical texture", might have a logical 16-bytes/pixel, but this is divided perhaps across 4 textures with 4-bytes/pixel physically, resulting in the problem case again.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;DISCLOSURE OF CACHE DESIGN&lt;br /&gt;&lt;br /&gt;In the CPU space, cache details are mostly fully public knowledge. Often the OS even provides a way for software to query cache line details of all the cache levels. 
Some algorithms actually adapt themselves to cache line size (cache-aware), and there is growing research into cache-oblivious algorithms which naturally work well regardless of cache details.&lt;br /&gt;&lt;br /&gt;In the GPU space, cache details (cache as in TEX) are to my knowledge all NDA material, and also likely a big part of the secret sauce which enables some vendor advantage. GPU caches likely have various important differences from CPU caches, such as higher associativity (N-way where N is larger) because DX/GL provide access to a much larger number of simultaneous arrays (textures) than would be typical in CPU space, or perhaps different pre-fetching logic (if there is even any automatic pre-fetching at all). &lt;br /&gt;&lt;br /&gt;Note a smart programmer can reverse engineer cache design using careful programming. However, often those who don't have NDA access also don't have a full set of graphics cards representing the market to test.&lt;br /&gt;&lt;br /&gt;In terms of cache as program managed local store, NVidia's CUDA docs provide great documentation on most of the important issues regarding global memory access on NVidia's architecture. Exceptions such as how memory requests are parallelized to the memory controllers can be reverse engineered via software, but would again require having a large number of cards to test (cards have different numbers of MCs).&lt;br /&gt;&lt;br /&gt;ATI also likely provides great docs, but I haven't looked in detail at their latest R600/R700 memory architecture yet. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;DISCLOSURE OF TEXTURE ADDRESS MAPPING&lt;br /&gt;&lt;br /&gt;All NDA material, sometimes possible to reverse engineer. Details such as tile size, space filling curve, etc. are all hidden, and likely for good reason. Many programmers would cry themselves to sleep at night attempting to deal with all these details on all GPUs. 
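&lt;br /&gt;&lt;br /&gt;To make "tile size" and "space filling curve" a bit more concrete, here is a minimal Python sketch of the two software mappings from the Review section: linear, and 4x4 tiles in linear order with row-major elements inside each tile. This is NOT any vendor's real mapping; the function names, the 64-byte line, and the 4-byte element are assumptions for illustration, and multiplies and divides stand in for the shifts and masks.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
ELEMENT_BYTES = 4      # 32-bit elements, as in the Review section
CACHE_LINE_BYTES = 64  # assumed cache line size
TILE = 4               # a 4x4 tile of elements fills one 64-byte line


def linear_address(x, y, width):
    """Byte address with rows stored one after another."""
    return (y * width + x) * ELEMENT_BYTES


def tiled_address(x, y, width):
    """Byte address with 4x4 tiles in linear order, row-major inside a tile."""
    tile_x, in_x = divmod(x, TILE)
    tile_y, in_y = divmod(y, TILE)
    tiles_per_row = width // TILE
    tile_index = tile_y * tiles_per_row + tile_x
    return (tile_index * TILE * TILE + in_y * TILE + in_x) * ELEMENT_BYTES


def cache_lines_touched(address_fn, x0, y0, size, width):
    """Count distinct cache lines hit by a size x size block of elements."""
    lines = set()
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            lines.add(address_fn(x, y, width) // CACHE_LINE_BYTES)
    return len(lines)
```
&lt;/pre&gt;
With a 256 element wide array, cache_lines_touched() reports that a tile-aligned 4x4 block touches 4 cache lines under linear_address and 1 under tiled_address, which is the utilization gap the hidden hardware mappings are there to close.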
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Possible Future Solutions To This Mess&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The largest problem here is that vendors would all have to agree on a good solution to make it a "good" solution. Vendors opening up their texture address mapping would result in a mess of lots of vendor or card specific code in engine.&lt;br /&gt;&lt;br /&gt;Larrabee's 2 bit-interleave instructions and NO ROP design, is an interesting solution to the problem. Yet I don't know anything about Larrabee's texture address mapping to know if this actually is a solution to the problem of CS stages writing to filterable cache friendly textures.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I've got another idea in mind for a general solution, but it requires a change in hardware, which likely is NOT going to be adopted via all vendors unless some central controlling entity like Microsoft decided to force it! So never going to happen!&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8570142190677291933?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8570142190677291933/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/gpu-address-space-mapping.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8570142190677291933'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8570142190677291933'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/08/gpu-address-space-mapping.html' title='GPU Address Space Mapping'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-4146119382118798338</id><published>2009-07-29T23:36:00.002-05:00</published><updated>2009-07-29T23:49:13.633-05:00</updated><title type='text'>Arch Linux</title><content type='html'>&lt;a href="http://www.archlinux.org/"&gt;Arch Linux Website&lt;/a&gt; | &lt;a href="http://www.archlinux.org/packages/?sort=&amp;arch=x86_64&amp;repo=&amp;q=&amp;last_update=&amp;limit=all"&gt;x86-64 Package List&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;At home for OpenGL and CUDA development I now have Arch Linux on my new AMD64 machine. Lots of thanks to Judd Vinet and Aaron Griffin for the best Linux distro out there! &lt;br /&gt;&lt;br /&gt;&lt;i&gt;pacman -S awesome&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-4146119382118798338?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/4146119382118798338/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/arch-linux.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4146119382118798338'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4146119382118798338'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/arch-linux.html' title='Arch Linux'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3862388136335214565</id><published>2009-07-29T23:07:00.002-05:00</published><updated>2009-07-29T23:30:53.002-05:00</updated><title type='text'>MinWM</title><content type='html'>&lt;i&gt;Not sure why I didn't do this years ago, but I have now open sourced my tiny X window manager on Google Code.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Minimal Window Manager [MinWM]&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://code.google.com/p/minwm/"&gt;Project Hosted on Google Code&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Simple and tiny (10KB) single screen tiling X Window Manager. No messy configuration files. Provides the following keyboard interface to manage windows,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;ALT+ESC .......... Close window.&lt;br /&gt;ALT+TAB .......... Cycle through window list (just like Windows).&lt;br /&gt;ALT+` ............ Cycle window shape between full, and left/right halves.&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;And that is it!&lt;br /&gt;&lt;br /&gt;MinWM requires that the user start a terminal prior to running MinWM; that terminal can then be used to start other applications.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;History&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Moving windows around the screen is a waste of time. I got sick of traditional X Window Managers and decided to write my own. 100% of the time I either wanted one window full screen, or two windows split screen, and a very simple way to manage windows without the mouse. From this MinWM was born.&lt;br /&gt;&lt;br /&gt;I have been using it for maybe a decade, and still the current tiny tile based X Window Managers are too complicated and get it all wrong. 
So MinWM is now open source so anyone can enjoy working fast!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Compile From Source&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Grab &lt;a href="http://minwm.googlecode.com/files/minwm.c"&gt;minwm.c&lt;/a&gt; from google code. Might have to modify your include and library path, but something like this should work,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;gcc minwm.c -Os -o minwm -I/usr/X11/include -L/usr/X11/lib -lX11&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Install&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Compile yourself a binary. Setup your &lt;tt&gt;.xinitrc&lt;/tt&gt; file to startup a terminal and then minwm,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;xterm -rv -ls +sb -sl 4096 &amp;&lt;br /&gt;$HOME/minwm&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;Then run &lt;tt&gt;xinit&lt;/tt&gt; to start X.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3862388136335214565?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3862388136335214565/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/minwm.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3862388136335214565'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3862388136335214565'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/minwm.html' title='MinWM'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1093620650551480211</id><published>2009-07-29T08:31:00.003-05:00</published><updated>2009-07-29T09:35:08.113-05:00</updated><title type='text'>Morphological AA Part III - GPU Ideas</title><content type='html'>&lt;b&gt;Filtering Via Fractional Pixel Offset&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Seems like a filtering pass could be done using one bilinear fetch per pixel using a computed fractional pixel offset to blend between the neighborhood of 4 pixels. This should be enough to remove aliasing. One could look at the simple case of a horizontal span,&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;        XXXX           .,xXX -- example horizontal span&lt;br /&gt;    XXXXXXXX       .,xXXXXXX&lt;br /&gt;XXXXXXXXXXXX to ,xXXXXXXXXXX&lt;br /&gt;XXXXXXXXXXXX    XXXXXXXXXXXX&lt;/pre&gt;&lt;br /&gt;In this case (horizontal) the fractional pixel offset would be mostly a vertical adjustment. This offset would always be on the side of the pixel center with the highest intensity gradient. As drawn in ascii below, offset would be highest in the pixels CC, would taper off to low offset by the Zs, and would reverse direction in area CC (left side would point down, right side would point up). 
The CC area is where the edge crosses the pixel center.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;   Z  CC  Z  &lt;br /&gt;   |  ||  |   &lt;br /&gt;       XXXXXXX -- talking about this span&lt;br /&gt;XXXXXXXXXXXXXX&lt;br /&gt;XXXXXXXXXXXXXX&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Temporal Amortizing Searching?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;After re-thinking this a little: not worth it!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The expensive part is finding the "CC" points on the rows and columns of pixels in the image, and then the "Z" to "Z" length, which is used to compute the sub pixel offset to filter with.&lt;br /&gt;&lt;br /&gt;The first part, finding "CC" points, is NOT going to have much temporal coherence other than being close to edges in the image. The span length ("Z" to "Z" length) however should have some good temporal coherence, but this span length can be easily computed from the "CC" point data.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Making This Fast on the GPU&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;This is just a bunch of notes for some later point when I have both some free time to try stuff and a desire to use this...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I have a feeling that using the offset method one could fully separate this into vertical and horizontal passes (if one wanted to do this, which I wouldn't). Meaning fractional pixel offset in X direction is dependent on Y spans only, and Y direction is dependent on X spans only.&lt;br /&gt;&lt;br /&gt;Problem with the GPU is that one needs tens of thousands of threads to be fast. So the trivial case of one scan line (or vertical line) per thread isn't going to work. The algorithm will have to be tiled.&lt;br /&gt;&lt;br /&gt;Searching for "CC" points in parallel might be possible with the following process. &lt;br /&gt;&lt;br /&gt;1. 
First map the image into X and Y intensity gradients (fold this into the first reduction pass), then reduce by highest gradient, keeping position of highest gradient during the reduction. With CUDA/CS/CL one might be able to do the entire reduction in one kernel pass (unlike with traditional GPGPU methods). &lt;br /&gt;&lt;br /&gt;2. Then run a full screen pass parallelized in tiles, use the reduction tree to search for a set of important "CC" points per the vertical and horizontal spans in the tile (note the search will have to extend outside the area of the tile to support AA of near vertical and horizontal edges). Use the diff in closest "CC" coordinates between lines to compute the "Z" to "Z" length per span. From this information, the fractional pixel offset could be computed...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1093620650551480211?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1093620650551480211/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/morphological-aa-part-iii-gpu-ideas.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1093620650551480211'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1093620650551480211'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/morphological-aa-part-iii-gpu-ideas.html' title='Morphological AA Part III - GPU Ideas'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8794049048161854665</id><published>2009-07-28T16:19:00.006-05:00</published><updated>2009-07-29T09:37:03.171-05:00</updated><title type='text'>Morphological AA Part II</title><content type='html'>&lt;i&gt;This is a follow up to Justin's question on the &lt;a href="http://farrarfocus.blogspot.com/2009/07/morphological-antialiasing.html"&gt;Morphological Antialiasing Post&lt;/a&gt; and more crazy AA ideas.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;What I think is most interesting about these types of post filtering techniques, is that it likely might help with things like specular, cubemap, and alpha test aliasing which can be tough to solve using other methods.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Filtering via Fractional Pixel Offset&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;EDIT. Nope that idea needed a lot of work... so I killed it!&lt;/i&gt;&lt;br /&gt;&lt;i&gt;EDIT. Actually, idea is sound, see &lt;a href="http://farrarfocus.blogspot.com/2009/07/morphological-aa-part-iii-gpu-ideas.html"&gt;third post&lt;/a&gt;!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Temporal Feedback&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I did a prior test on a polygon based renderer in which I only used motion vector based framebuffer reprojection feedback and a little bit of smooth camera breathing motion even when still (as you often get in a FPS) to remove aliasing. Results were promising, but I never did have the opportunity to use it anywhere (yet).&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Note, any method which is amortized temporally will require some form of data feedback, which will at a minimum be another RT, and thus be costly like a full screen post pass. 
Another important thing to note is that under motion the eye won't see over-filtering or noise as much, as long as it converges to a good enough looking sharp but not aliased edge. One could take advantage of noise or excess filtering under motion when temporal feedback likely won't help as much.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Crazy Depth Pre-Pass for AA&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://enterthesingularity.blogspot.com/2009/06/deferred-rendering-w-msaa-msaa-z.html"&gt;Jake Cannell's Deferred Rendering Without MSAA but With MSAA Z Prepass Idea Post&lt;/a&gt; talks about doing a 4x MSAA depth only pass, followed by careful (to keep accelerated coarse Z cull working) Z restore at non-MSAA resolution for shading, followed by bilateral up-sample based on MSAA depth for antialiasing. This would be crazy expensive (Z pre-pass with deferred, and bilateral up-sample), but with DX11 CS maybe not as bad.&lt;br /&gt;&lt;br /&gt;Could use depth information to compute an X,Y sub-pixel offset instead of the full bilateral up-sample.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8794049048161854665?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8794049048161854665/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/morphological-aa-part-ii.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8794049048161854665'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8794049048161854665'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/morphological-aa-part-ii.html' title='Morphological AA 
Part II'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-4610000237120693512</id><published>2009-07-28T10:59:00.006-05:00</published><updated>2009-07-28T11:30:37.628-05:00</updated><title type='text'>Twitter: Database Lag or Account Broken?</title><content type='html'>Decided to start using Twitter: &lt;a href="http://twitter.com/TimothFarrar"&gt;@TimothFarrar&lt;/a&gt; (Timothy sans y)&lt;br /&gt;&lt;br /&gt;Except the interface seems rather broken right now (at least for me)?&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Sorry if a few of you got multiple follow notifications!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Follow links act as if they work. Got emails back, yet nothing shows in my following list. Tweets from those I followed showed in my home page until today and everything vanished. Nice. Help says "database replication lag"... also Firefox likes to spin forever on twitter.com. 
I smell multidimensional lag.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-4610000237120693512?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/4610000237120693512/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/twitter-database-lag-or-account-broken.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4610000237120693512'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4610000237120693512'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/twitter-database-lag-or-account-broken.html' title='Twitter: Database Lag or Account Broken?'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6300988596411975579</id><published>2009-07-28T01:15:00.003-05:00</published><updated>2009-07-29T08:27:49.372-05:00</updated><title type='text'>Morphological Antialiasing</title><content type='html'>Linked off &lt;a href="http://www.realtimerendering.com/blog/morphological-antialiasing/"&gt;Real-Time Rendering&lt;/a&gt;, Alexander Reshetov's &lt;a href="http://visual-computing.intel-research.net/publications/mlaa.pdf"&gt;Morphological Antialiasing Paper&lt;/a&gt; is quite awesome. Unoptimized approach is about 150 cycles/pixel in a CPU implementation. 
&lt;b&gt;For real-time performance, I bet you could amortize the cost of something like this over multiple frames!&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;EDIT: Yeah, after a second look, amortizing this over multiple frames would be tough! See second post...&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6300988596411975579?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6300988596411975579/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/morphological-antialiasing.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6300988596411975579'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6300988596411975579'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/morphological-antialiasing.html' title='Morphological Antialiasing'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-738621932899797553</id><published>2009-07-27T09:02:00.003-05:00</published><updated>2009-07-27T10:44:14.855-05:00</updated><title type='text'>Final DX11 Hardware Predictions</title><content type='html'>&lt;i&gt;My last few round(s) of hardware predictions (or better guesses) was quite wild, this one is perhaps more conservative! 
BTW, I don't ask or receive any info about yet to be released hardware but what I gather from public sources which anyone has access to, otherwise I wouldn't be writing any of this.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;A little on the progression of NVidia hardware. Naming might be a little off because I don't follow NVidia's wild naming changes all too well...&lt;br /&gt;&lt;br /&gt;G80 - base GPU arch&lt;br /&gt;G92 - global atomics &lt;br /&gt;GT200 - shared memory atomics, global coalescing, double precision&lt;br /&gt;G210 - added DX10.1 support&lt;br /&gt;G300 - ?&lt;br /&gt;&lt;br /&gt;Each line of GPUs sees a steady progression of improvements on the base arch set in the G80 line. Many of these improvements are things required for DX11. Almost like each generation of hardware was a chance to test and later improve upon the design of individual components required for DX11. NVidia was early on CS, while ATI was early on framebuffer and tessellation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;IMO, DX11 = DX10.1 + NVidia's CUDA + ATI's Tessellation&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Seems like nothing revolutionary will be required this DX round, perhaps both primary vendors will be running with a similar design as for DX10. I would NOT be surprised if some DX11 features are "emulated" in software instead of having dedicated hardware. For example, append/consume, or maybe part of the tessellation pipeline pipes through memory instead of having direct GPU routing, or even interleaved MPMD via software emulated dynamic branching (link a bunch of pipeline stage kernels into one binary)?&lt;br /&gt;&lt;br /&gt;Also not surprised that ATI is ready with DX11 in the R800 series for desktop GPUs, but perhaps &lt;a href="http://forum.beyond3d.com/showpost.php?p=1315303&amp;postcount=779"&gt;DX11 notebook GPUs not until next year&lt;/a&gt;. 
R900 is likely to be the really interesting arch from ATI for the DX11 generation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Likely DX11 Games Will be Designed for DX9/DX10 Hardware First&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Let's rewind to DX10 generation history. It seems as if the DX10 generation was largely a generation of faster DX9 cards in the eyes of the average user. We can place the blame on many factors, such as a Vista upgrade requirement, consoles based on mostly DX9 generation hardware, developers not being in a situation conducive to pushing the DX10 feature set, etc. &lt;br /&gt;&lt;br /&gt;Not much of this has changed. DX11 is way beyond consoles, and the next generation of consoles is a long way off. Many PC gamers are still using XP (about 50% on the Valve Survey). Hardcore PC devs are leaving the PC market behind (Id, Epic, Crytek all now have a console focus). When DX10 and DX11 support is used only to improve a core engine which must run on DX9+ consoles, this is crippling for tech. &lt;br /&gt;&lt;br /&gt;&lt;i&gt;Given this, I'd bet that DX11 supporting engines will be mostly designed for DX11 CS4 for some time. Which means that many games will likely run just fine on current generation DX10 hardware under DX11 drivers...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;What about NVidia and DX11?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;My current guess is that the 300 series will be all about increased ALU performance and power efficiency.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Not sure if we will see an arch revision or an arch derived from G80 evolution. Could be no core arch changes from a programming perspective, simply better DX9 and DX10 performance all around, with a tiny amount of DX11 stuff added (maybe FF tessellation unit). 
Looks like &lt;a href="http://translate.google.com/translate?u=http%3A%2F%2Fpctuning.tyden.cz%2Fcomponent%2Fcontent%2Farticle%2F1-aktualni-zpravy%2F14552-geforce-g300-jiz-za-par-tydnu&amp;sl=cs&amp;tl=en&amp;hl=en&amp;ie=UTF-8"&gt;at least some rumors&lt;/a&gt; agree with this speculation. Speculation that NVidia adopted a G80 derived arch for DX11 launch, and the next major revolution arch happens post DX11 launch GPUs (perhaps like ATI getting R800 early, with a later DX11 arch in R900)?&lt;br /&gt;&lt;br /&gt;What about CUDA 3.0?&lt;br /&gt;&lt;br /&gt;If the above speculation is true, something must be new for 3.0 and 300 series hardware. Maybe now the hardware can run different kernels simultaneously instead of in series? Maybe CUDA 3.0 is this plus something like DX11's better dynamic linking via better JIT and shader patching?&lt;br /&gt;&lt;br /&gt;Or perhaps 300 series is a completely new arch?&lt;br /&gt;&lt;br /&gt;&lt;i&gt;It is all a very intriguing mystery!&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-738621932899797553?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/738621932899797553/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/final-dx11-hardware-predictions.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/738621932899797553'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/738621932899797553'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/final-dx11-hardware-predictions.html' title='Final DX11 Hardware Predictions'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1526647120125614021</id><published>2009-07-23T13:22:00.002-05:00</published><updated>2009-07-23T13:28:37.793-05:00</updated><title type='text'>Pixel Junk Shooter Video with Ice</title><content type='html'>&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/ZaBvV7Odd_0&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/ZaBvV7Odd_0&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;Lots of really nice attention to detail in this game: shooting water out of pools from under the surface, ice melting and cooling your player, dunking to quench overheating.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1526647120125614021?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1526647120125614021/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/pixel-junk-shooter-video-with-ice.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1526647120125614021'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1526647120125614021'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/pixel-junk-shooter-video-with-ice.html' title='Pixel Junk Shooter Video with Ice'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8774723231817281037</id><published>2009-07-23T08:43:00.002-05:00</published><updated>2009-07-23T09:10:20.964-05:00</updated><title type='text'>Gather Centric Parallel Programming</title><content type='html'>The tool box of ideas to efficiently tackle complex problems on single threaded serial machines is rather well explored. IMO one of the core themes is scatter centric design. Many linear time algorithms depend on efficient scatter. One common use for scatter is binning (linear pass through N objects, output object pointer or index into 1 or more of M bins). A step up from binning is full sorting at which point when N is 1000 (or larger), radix sort is used. Radix sort of 32-bit integers on PCs is often fastest with one parallel histogram generation pass (binning), followed by 3 scatter passes. Binning or sorting (both of which depend on scatter) is crucial for many forms of optimization. For example factoring out branching by sorting objects into categories and processing all similar cases together. &lt;br /&gt;&lt;br /&gt;Given that general purpose APIs for GPU parallel programming are relatively new, IMO there is very much left to discover (or re-discover) in terms of methods and tools to solve complex problems on highly parallel GPUs. 
IMO one of the core themes in GPU programming is gather centric design, which contrasts greatly with serial optimization.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Why Gather Centric?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;(1.) LIMITED ATOMIC OPERATION FUNCTIONALITY. Atomic operations are designed for unordered usage because fine grain ordered atomic operations simply do NOT scale to massively parallel machines. Scattering data via global atomics is limited to a small subset of single ALU operations, and these operations are relatively expensive. So global communication via global atomic operations is limited in usefulness. Instead data must be gathered to an independent in-order thread for complex logic.&lt;br /&gt;&lt;br /&gt;(2.) READ ONLY CACHES. Excluding Larrabee, current GPUs effectively have only read-only caches. Write-able caches used for scatter centric communication do NOT scale because of cache contention. The most important feature caches provide on GPUs is more bandwidth efficient fine granularity gather (at the cost of high latency). Global vector gather is expensive because performance is a function of the number of cache lines or memory segments required to service the gather.&lt;br /&gt;&lt;br /&gt;(3.) SCATTER IS EXPENSIVE. Global scatter is not bandwidth efficient as its cost is a function of the number of cache lines or memory segments touched. In the case of un-cached global masked writes (non-full writes), hardware would have to both read and write back all touched lines or segments. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Gather Centric Rules for SIMD GPUs&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;(1.) Algorithm split into independent stages.&lt;br /&gt;(2.) Coarse gather from vector sized and aligned global reads.&lt;br /&gt;(3.) Fine granularity gather from texture reads.&lt;br /&gt;(4.) Reuse read data in a program managed local store.&lt;br /&gt;(5.) 
All threads do the same amount of work.&lt;br /&gt;&lt;br /&gt;Likely this means that a tree structure is accessed via TEX reads but object data is carefully setup in vector sized packets (for efficient global memory access). Parallel methods where all threads of a vector process the same data are fast on all platforms. Parallel methods where all threads of a vector gather from different local objects require an extra transpose pass on Larrabee but on CUDA only require that objects be loaded with different bank offsets (or that threads access data from a different bank given objects loaded on the same bank offset).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Gather Centric Toolbox&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Note, some serial programming techniques live on inside the thread group (or CUDA block). More serial program techniques live on within a Larrabee thread because of the larger write-able cache.&lt;br /&gt;&lt;br /&gt;What to do with no scatter? Scatter via gather.&lt;br /&gt;&lt;br /&gt;Scatter via gather means emulating scatter by having the scatter destination search for the result of the scatter. This REQUIRES data in a prior pass to be efficiently searchable. Which means one of two things,&lt;br /&gt;&lt;br /&gt;(1.) Either do a hierarchical parallel reduction in a prior pass to generate some kind of acceleration structure to reduce search cost to log time. If the data is static then this prior pass was done offline.&lt;br /&gt;&lt;br /&gt;(2.) Or have the data kept and updated at run-time in a hierarchical tree structure such that no extra parallel reduction is required to generate the acceleration structure for searching.&lt;br /&gt;&lt;br /&gt;Gather centric optimization has more to do with the following,&lt;br /&gt;&lt;br /&gt;(1.) Limiting the amount of gather searching by "good for the GPU" acceleration structures. Often this means having fixed N-ary trees where N packs well with the SIMD vector size of the machine.&lt;br /&gt;&lt;br /&gt;(2.) 
Keeping thread groups doing the searching grouped by good data locality, and maintaining a high amount of local store data reuse (to avoid getting bandwidth bound). &lt;br /&gt;&lt;br /&gt;(3.) Keeping all threads doing a fixed (instead of variable) amount of work. To me this screams "amortize sorting and/or searching over multiple passes".&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The "Holy Grail" of GPU Ray Tracing&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;This post wouldn't be complete without some talk about GPU ray tracing, given that ray tracing is a perfect parallel example of gather centric design.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;GPU ray tracing breaks rule (5) which states that all threads do the same amount of work. GPUs lack the ability to efficiently do fine granularity regrouping of divergent threads. However a recent CUDA paper found a 2x speed up from bringing down the cost of long running threads from thread block granularity to warp granularity. This is the current state of the art.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;The problems of dynamic objects and performance have not yet been solved for ray tracing (or ray casting). This fact alone has made the topic a worthy background obsession of mine.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Temporal Coherence&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I'm convinced that a solution can be found in use of temporal (in combination with spatial) coherence to lower search cost. I've managed to get part way to a solution using a scatter centric design in the form of what I call stochastic visibility. However if I want something which will scale on GPUs and something which I can port to OpenCL to run on any GPU, I need to re-formulate a gather centric solution to the problem of real-time tree-traversal of dynamic data. 
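&lt;br /&gt;&lt;br /&gt;&lt;i&gt;To make the scatter versus gather distinction concrete, here is a minimal Python sketch of binning done both ways. It is an illustration under stated assumptions, not code from this project: the function names and the tiny key list are made up, and a plain serial sort stands in for the parallel reduction pass that would build a searchable structure on a real GPU.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
from bisect import bisect_left, bisect_right


def bin_by_scatter(keys, num_bins):
    """Serial style: each source object writes (scatters) its index into a bin."""
    bins = [[] for _ in range(num_bins)]
    for index, key in enumerate(keys):
        bins[key].append(index)
    return bins


def bin_by_gather(keys, num_bins):
    """Gather style: each destination bin searches for the objects it owns."""
    # Prior pass: make the data searchable. A stable serial sort is only a
    # stand-in (assumption) for a parallel pass building an acceleration structure.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    sorted_keys = [keys[i] for i in order]
    bins = []
    # Each loop iteration is independent, like one thread per destination bin.
    for b in range(num_bins):
        lo = bisect_left(sorted_keys, b)   # log time search for the first match
        hi = bisect_right(sorted_keys, b)  # log time search for the end of matches
        bins.append(order[lo:hi])
    return bins


keys = [2, 0, 2, 1, 0, 2]  # hypothetical bin id per object
print(bin_by_scatter(keys, 3))
print(bin_by_gather(keys, 3))
```
&lt;/pre&gt;
Both functions produce the same bins. In the gather version each destination only reads shared data and writes its own output, so no atomics or write-able caches are needed; the price is the prior pass that makes the keys searchable.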
&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A solution &lt;b&gt;has&lt;/b&gt; been eluding me for some time!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Last night / this morning, a new possible solution became evident thanks to exploring a combination of unrelated topics (including thinking about doing a hierarchical cellular automata to increase the "speed of sound" in a fluid computation), writing out the above rules again of gather centric design for SIMD GPUs, and rethinking how to solve reprojection without scatter. The solution involves maintaining a GPU side scene tree structure of which I use the previous frame's tree traversal to accelerate the search for the solution to the next frame's visibility...&lt;br /&gt;&lt;br /&gt;More on this later, if it works!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8774723231817281037?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8774723231817281037/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/gather-centric-parallel-programming.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8774723231817281037'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8774723231817281037'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/gather-centric-parallel-programming.html' title='Gather Centric Parallel Programming'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7671327437997890070</id><published>2009-07-20T10:39:00.002-05:00</published><updated>2009-07-20T10:46:47.580-05:00</updated><title type='text'>My Ideal Parallel Machine</title><content type='html'>&lt;i&gt;Clearly I am NOT a hardware engineer, but I am seeing a very consistent pattern of what I want in massively parallel hardware from what I attempt to create in software, and from this I've derived my ideal parallel machine below.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The machine is a very dense 3D computing/memory matrix of which a 2D slice looks somewhat like this, &lt;br /&gt;&lt;br /&gt;&lt;tt&gt;(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)&lt;br /&gt;(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)&lt;br /&gt;(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)&lt;br /&gt;(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)&lt;br /&gt;(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)&lt;br /&gt;.................................&lt;br /&gt;.............SCALES_IN_X_Y_Z.....&lt;br /&gt;.................................&lt;br /&gt;(CNM)(CNM)(CNM)(CNM)(CNM)...(CNM)&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;- Each (CNM) is a compute/network/memory node in the matrix.&lt;br /&gt;- Each node is connected to neighboring nodes in the matrix.&lt;br /&gt;- All memory is distributed among the nodes.&lt;br /&gt;- There is no external memory.&lt;br /&gt;- There is no hardware cache coherency between nodes.&lt;br /&gt;&lt;br /&gt;The network (N) part of each node would route data from local memory (M) in the background of computation (C). Total interconnect bandwidth would be tremendous, due to the highly parallel neighbor to neighbor routing. 
Network latency would be mostly a function of the distance between the sending and receiving nodes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Crucial Change from Current Practice&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I am just starting to really understand how I would program for this kind of machine in a general purpose way, and have algorithms scale to a variable number of compute nodes.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The crucial change is that memory is no longer external to the CPU, and we abandon the "load, compute, store" model of programming at larger scales.&lt;br /&gt;&lt;br /&gt;Parallel computation on current hardware is either distributed to fixed locations or variable locations based on core load balancing, and data is either distributed among local and distant caches, or distant banks of memory. Effectively computation becomes a fine grain loop of loading data from distant locations, doing work, and then storing the results back to other distant locations. This does NOT scale well at all.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;The problem is that the physical data flow in the machine does NOT match the needs of the program, and computation does NOT follow the required data flow.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Micro Scale Example&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;When looking at the current model of CPU atomic operations, parallel computation interfacing with shared data is distributed to many nodes, requiring an expensive serialization of computation because the atomic logic plays out finely interleaved across nodes in which there is a long communication latency between nodes.&lt;br /&gt;&lt;br /&gt;For example, with CPU nodes fetching or writing to a shared queue, the queue pointer ends up on a shared cache line, and each add to the pointer happens on a different node requiring a synchronous transfer of the full cache line over long distances before an operation can finish.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Data and computation are 
NOT flowing through the machine in sync with the needs of the algorithm. Data and computational locality should be in parallel with each other!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I believe GPUs, excluding Larrabee, have traditionally had atomic operations on globally shared memory computed at the point of communication synchronization near the memory controller (MC) or ROP, with no need to transfer "cache lines" back and forth around the machine for the ALU operations which compose the atomic operation. &lt;br /&gt;&lt;br /&gt;&lt;i&gt;In this case data is flowing through the machine in sync with the needs of the algorithm. Computation on data which needs to be joined physically happens at the point where data is routed to a common location.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The DX11 CS, OpenCL, and CUDA parallel computing models (on non-Larrabee hardware) logically terminate here, and are vastly limited by a small microcosm of transfer of "compute locality" expressed in the set of supported atomic operations. Any other transfer of "compute locality" requires complex routing to/from distant memory and node-to-node synchronization. Clearly in the case of Larrabee, this routing might be of lower cost (or lower latency) because of the coherent caches, but still suffers from the same problem, just less so!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Macro Scale Example&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A more macro example would be the non-general GPU fixed function ROP and framebuffer logic. Data is transferred on chip, results pass from compute units to distributed serialization points (ROP units) for a given tile of the framebuffer. Work distribution is to my knowledge a fixed layout to provide good load balancing and to better parallelize framebuffer memory requests. Note that a complex set of logic happens with respect to Z, stencil, and blending in the ROP units themselves. Compute locality transfers from ALU units to ROP units as does the data. 
&lt;br /&gt;&lt;br /&gt;This is a macro example because the entire set of ROP blend operations is conditionally dependent on fine granularity Z and stencil checks. This sort of thing cannot be done efficiently in software via the small subset of unordered atomic operations provided by DX11 CS, OpenCL, or CUDA.&lt;br /&gt;&lt;br /&gt;An attempt to emulate this ROP unit in software on non-Larrabee GPUs is a disaster. The problem results from high overhead in all the many possible ways to do the required node to node communication to provide framebuffer functionality. At some point node to node communication must go through global memory, or data must be fetched from global memory (or TEX cached) many times (gather/searching) to avoid the node communication (from scatter, etc). Either way, performance is screwed compared to the fixed function hardware. Note, Larrabee should be better at software ROP emulation because of lower overhead (lower communication latency) due to caches, but looks to me as if it degrades into worst case behavior as the granularity of writes to the framebuffer becomes small (either the binning costs grow huge, or the synchronization on Z buffer check/write grows huge, choose your poison).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Big Picture&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;The programming model on my suggested compute/memory matrix becomes one of direct data flow. Data flows through the machine based on the needs of program synchronization and communication, sync points become physical nodes where data is transferred to be jointly processed (or serialized), the program's next step of computation (the joint processing) happens at the destination node, instead of the compute node where the data originated (as with traditional programming models).&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;In the ROP example, compute nodes would process fragment shader kernels, and then output a network packet of fragment shader intermediate results. 
This network packet gets background routed to the node(s) assigned to the corresponding ROP tile(s) hit by the output samples. These ROP nodes do any programmable blending and can even continue doing more programmable shading (which might include transferring new packets between other nodes for computation). The ROP tile nodes can route back coarse Z/stencil data packets to raster nodes to early cull.&lt;br /&gt;&lt;br /&gt;For load balancing, nodes themselves could easily service many different destination compute domains at the same time. Also note with this programming model, data need not be transferred to and from fixed long term set locations of memory, data structures can instead continuously be dynamically moved around in the distributed memory.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Undoubtedly more posts on this later when I get more time...&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7671327437997890070?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7671327437997890070/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/my-ideal-parallel-machine.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7671327437997890070'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7671327437997890070'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/my-ideal-parallel-machine.html' title='My Ideal Parallel Machine'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1581652350695992369</id><published>2009-07-19T12:22:00.004-05:00</published><updated>2009-07-19T17:05:16.420-05:00</updated><title type='text'>Windows SDK, Windows from eSATA, ETC</title><content type='html'>&lt;i&gt;This post is an open request for comments from anyone who wants to share advice regarding the topics below.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Visual Studio Express 2008 or Windows SDK&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;For personal development I really need just one thing, a compiler/linker binary which runs from the command line. Clearly I also need the system headers, but I absolutely DO NOT need an IDE or debugger, those just slow me down.&lt;br /&gt;&lt;br /&gt;Anyone have experience with the free Visual Studio Express 2008?&lt;br /&gt;&lt;br /&gt;I didn't even know it existed until today, and it seems easier to just get Express than attempting to maintain a manual install of MinGW/MSYS for GCC.&lt;br /&gt;&lt;br /&gt;What about just installing the Windows SDK without Express?&lt;br /&gt;&lt;br /&gt;Looks like I really only need the SDK and NOT Express. If this is the case, damn, thanks Microsoft! Love to get any feedback anyone has on this. For example, are there any limitations as to the Windows SDK compilers? Like hidden missing support for assembly or something crippling like that? What about with Windows 7 Beta? Are Windows 7 GPU drivers good or buggy?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Bootable Thumb/USB or eSATA&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The dead machine is helping me rethink my development environment. 
I'm now looking at the option of building bootable external drives for development, to make it easy to switch between operating systems and physical machines (I used to switch internal drives). I made sure to get eSATA on the new motherboard. Too bad the iMac requires boot-camp. &lt;br /&gt;&lt;br /&gt;XP from USB : &lt;a href="http://ccollomb.free.fr/blog/?p=64"&gt;Cedrick Collomb Blog&lt;/a&gt; | &lt;a href="http://www.ngine.de/index.jsp?pageid=4176"&gt;Ngine.de&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Anyone run XP or Vista from eSATA?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Switching Linux Distro&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;My first Linux experience I think was Slackware, and post that I did my own from scratch because I wanted pthreads before it was available and later wanted everything to run from a ZIP drive. I've also played with NetBSD, OpenBSD, and FreeBSD. Settled with Crux for a long while (KISS), but went to Gentoo on the last machine for easy 64-bit support (big mistake). 
Now Slackware is just getting 64-bit support (in current branch only), and FreeBSD needs some kernel changes to get 64-bit NVidia drivers, so those are out for me.&lt;br /&gt;&lt;br /&gt;Turns out that &lt;a href="http://wiki.archlinux.org/index.php/Arch_vs_Others"&gt;Arch Linux&lt;/a&gt; is a very Crux like (KISS) distro with good 64-bit support, so I'm going to move to it on the new machine.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1581652350695992369?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1581652350695992369/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/windows-sdk-windows-from-esata-etc.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1581652350695992369'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1581652350695992369'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/windows-sdk-windows-from-esata-etc.html' title='Windows SDK, Windows from eSATA, ETC'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-9096348726234747755</id><published>2009-07-18T18:15:00.002-05:00</published><updated>2009-07-18T20:08:59.021-05:00</updated><title type='text'>Weak Computer Took One for the Team</title><content type='html'>&lt;a 
href="http://michelgagne.blogspot.com/"&gt;Insanely Twisted Shadow Planet&lt;/a&gt; looks really cool. &lt;br /&gt;&lt;br /&gt;Had a weekend of freedom to really work (wife is out of town), and the damn PC Linux dev machine died today. After testing 3 different GPUs and 3 different power supplies I've ruled out everything but one or more of motherboard, memory, or CPU. So instead my brother is driving up to visit and I'm catching up on PS3 games. It is amazing how much waiting on cheap parts costs in opportunity cost. &lt;br /&gt;&lt;br /&gt;I'd really like to just immediately transition development to my older iMac (with its 8800M GTS GPU), but STILL waiting on September for Snow Leopard for OpenGL 3 support (I'm not an ADC Premier or Select Member so don't have access to the beta, assuming GL3 was in the beta, which I don't really know). Going back to pre-GL3 isn't really an option.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Atom Update&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Tested out having 4K unique scene tree roots with heavy overlap and sparse l-systems. Visibility worked with everything moving. Drawing is mostly constant regardless of how many root nodes I have (as long as I stay within a low percent of root nodes to total nodes). Will have no problem with the 64K dynamic object target. Also pushed the GTX275 to 8M nodes (4x my high-end GPU target), but couldn't hit 60Hz. BTW, I'm now targeting 60Hz on low end, 30 just isn't smooth enough for me.&lt;br /&gt;&lt;br /&gt;I have a new better way to store the node tree in a modified 360 fisheye projection mapping alone instead of having to reproject between an octahedron (scene tree storage) and fisheye mapping (drawn for player). The new mapping is a spherical 360 degree fisheye with resulting xy coordinates modified by sin() to push the projection into half way between a rectangle and sphere. 
Screenshot would have been included if computer was alive.&lt;br /&gt;&lt;br /&gt;This was an accidental discovery, I didn't expect it to work well. It has the side effect of a more rectangular (normal) view in the center, and actually performs better with the stochastic visibility. The view expansion near the end of projection (the edge of the projection is from one "point" behind the viewer) actually helps leave room for more of the tree structure without collisions, and the view "compression" under fast forward motion as objects fly behind the viewer requires fewer frames to fully hole fill. &lt;br /&gt;&lt;br /&gt;The result of all this is that the scene tree itself is stored directly in a very thick "frame buffer", and it works due to having some kind of 360 degree view mapping.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;If this sounds alien, or like I'm in crazy land, it should. I spend more than 40 hours a week programming in reality (at Humanhead), so going off the deep end is quite refreshing indeed!&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-9096348726234747755?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/9096348726234747755/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/weak-computer-took-one-for-team.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/9096348726234747755'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/9096348726234747755'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/weak-computer-took-one-for-team.html' title='Weak Computer Took One for the 
Team'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6285635118423961047</id><published>2009-07-18T18:04:00.002-05:00</published><updated>2009-07-18T18:12:34.721-05:00</updated><title type='text'>Console Art Detailed</title><content type='html'>Neat to see how others build things, awesome character artist at work shared these links from Ben Sprout Artworks : &lt;a href="http://bensprout.com/killzone.html"&gt;Prefabs in KZ2&lt;/a&gt; and &lt;a href="http://bensprout.com/mp3.html"&gt;Environments in MP3&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6285635118423961047?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6285635118423961047/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/console-art-detailed.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6285635118423961047'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6285635118423961047'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/console-art-detailed.html' title='Console Art Detailed'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-237884862424332713</id><published>2009-07-18T16:27:00.002-05:00</published><updated>2009-07-18T17:55:24.985-05:00</updated><title type='text'>Alaska : Juneau</title><content type='html'>We ended our trip in Juneau to catch our flight back to Chicago (which went indirectly through Anchorage). Juneau is one of the very common ports for Alaska cruise ships so be warned. I'd suggest skipping the Mount Roberts Tramway and instead hiking up/down the deadly path (very dangerous in the rain) which leads to/from the top (it is only 1800 feet, and the local kids run up the path). &lt;br /&gt;&lt;br /&gt;Best local food we found in a very short amount of time was the Sandpiper Cafe. Other places of interest included the &lt;a href="http://www.museums.state.ak.us/asm/asmhome.html"&gt;Alaska State Museum&lt;/a&gt; (where they had a great Science on a Sphere exhibit with hundreds of different visualizations of planetary data), the &lt;a href="http://www.alaskanbeer.com/main.html"&gt;Alaskan Brewery Company&lt;/a&gt; tour, and the salmon hatchery sporting a really nice set of aquariums to keep people visiting outside of spawning season,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2527/3723260641_7c91f0b2b5.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2676/3724071376_138f202a1b.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;This is it From Alaska Folks&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;After this I will stick to the usual crazy programming stuff for a while. By the way, nearly all the photos thanks to the lovely lady below,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3471/3699281084_82813abc00.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Hope you all enjoyed a glimpse of Alaska! 
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;One Last Departing Gift : Alaska Stream Crossing Epic Fail!&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;An attempt to do something really stupid for the camera turned out horribly right, we laughed non-stop for about an hour after this... &lt;br /&gt;&lt;br /&gt;&lt;object type="application/x-shockwave-flash" width="400" height="300" data="http://www.flickr.com/apps/video/stewart.swf?v=71377" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"&gt; &lt;param name="flashvars" value="intl_lang=en-us&amp;photo_secret=a6aaa5a5fb&amp;photo_id=3703823149"&gt;&lt;/param&gt; &lt;param name="movie" value="http://www.flickr.com/apps/video/stewart.swf?v=71377"&gt;&lt;/param&gt; &lt;param name="bgcolor" value="#000000"&gt;&lt;/param&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;embed type="application/x-shockwave-flash" src="http://www.flickr.com/apps/video/stewart.swf?v=71377" bgcolor="#000000" allowfullscreen="true" flashvars="intl_lang=en-us&amp;photo_secret=a6aaa5a5fb&amp;photo_id=3703823149" height="300" width="400"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-237884862424332713?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/237884862424332713/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-juneau.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/237884862424332713'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/237884862424332713'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-juneau.html' 
title='Alaska : Juneau'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2527/3723260641_7c91f0b2b5_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-4273540252737587846</id><published>2009-07-17T14:34:00.003-05:00</published><updated>2009-07-17T14:46:48.820-05:00</updated><title type='text'>Where is my Raytraced Physics Toy?</title><content type='html'>Someone really needs to join GPU physics with GPU raytracing into an interactive physics toy. The scene has shown the way, interesting 11k Intro by FRequency,&lt;br /&gt;&lt;br /&gt;&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/hYD58OdL_bA&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/hYD58OdL_bA&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;Another 4k by FRequency,&lt;br /&gt;&lt;br /&gt;&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/mswPy5bt3TQ&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/mswPy5bt3TQ&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" 
height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;Another 1k by FRequency,&lt;br /&gt;&lt;br /&gt;&lt;object width="480" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/viqkXLxaVxo&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/viqkXLxaVxo&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-4273540252737587846?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/4273540252737587846/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/where-is-my-raytraced-physics-toy.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4273540252737587846'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4273540252737587846'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/where-is-my-raytraced-physics-toy.html' title='Where is my Raytraced Physics Toy?'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7058296408716606612</id><published>2009-07-16T16:06:00.000-05:00</published><updated>2009-07-16T16:07:55.558-05:00</updated><title type='text'>DX10/GL3 GPGPU : Stream Out with Scatter</title><content type='html'>In regards to DX10/OpenGL3 level hardware and point scatter, moving PS work to the VS stage was a huge win. In the case where there isn't enough VS work to amortize the setup cost, perhaps using stream out (or GL transform feedback) would enable adding a secondary task to be done in the VS stage. So VS would stream out a set of outputs which are not passed along to the PS stage (PS stage gets separate VS outputs).&lt;br /&gt;&lt;br /&gt;This might be perfect for the case where an object needs to self update (do that via stream out), and then scatter its index into another data structure (via point drawing).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7058296408716606612?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7058296408716606612/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/dx10gl3-gpgpu-stream-out-with-scatter.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7058296408716606612'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7058296408716606612'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/dx10gl3-gpgpu-stream-out-with-scatter.html' title='DX10/GL3 GPGPU : Stream Out with Scatter'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-5174394562761724843</id><published>2009-07-14T11:39:00.003-05:00</published><updated>2009-07-14T12:01:49.434-05:00</updated><title type='text'>Wow.</title><content type='html'>Damn, Naughty Dog wizards have set the graphics bar again! Particles colored to match scene tint, particles getting shadowed, perhaps sub-surface scattering, huge numbers of triangles, and more...&lt;br /&gt;&lt;br /&gt;&lt;div style="width: 480px;"&gt;&lt;object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=8,0,0,0" id="gtembed" width="480" height="392"&gt; &lt;param name="allowScriptAccess" value="sameDomain" /&gt; &lt;param name="allowFullScreen" value="true" /&gt; &lt;param name="movie" value="http://www.gametrailers.com/remote_wrap.php?mid=52604"/&gt;&lt;param name="quality" value="high" /&gt; &lt;embed src="http://www.gametrailers.com/remote_wrap.php?mid=52604" swLiveConnect="true" name="gtembed" align="middle" allowScriptAccess="sameDomain" allowFullScreen="true" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" type="application/x-shockwave-flash" width="480" height="392"&gt;&lt;/embed&gt; &lt;/object&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-5174394562761724843?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/5174394562761724843/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://farrarfocus.blogspot.com/2009/07/wow.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5174394562761724843'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5174394562761724843'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/wow.html' title='Wow.'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2014674128983329770</id><published>2009-07-14T10:52:00.004-05:00</published><updated>2009-07-14T10:58:12.302-05:00</updated><title type='text'>Hybrid Cached Raycast and Reproject Raster</title><content type='html'>&lt;i&gt;I've been suggesting this idea and permutations of this idea partly because I'd like to see someone try it, and I'm busy "traversing" a different solution to a similar problem (just geared towards dynamic rather than more static geometry). Undoubtedly there are those who have thought of this, I just haven't seen a paper on it yet.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The rough idea is to amortize the ray traversal cost over many frames and to make use of both spatial and temporal coherence. This has a lot in common with video compression: reprojection is like motion prediction in B/P-frames, and the tile cache is like bits of I-frames.&lt;br /&gt;&lt;br /&gt;The idea uses a traversal tile cache. A tile is a warp- or vector- (SIMD) friendly packet of the results of raycasts into an object in the scene. For example, for CUDA, a tile might be an 8x4 grid of raycasts. 
An object might be the entire static world, or a character, or a dynamic object. Tiles are associated with an object ID so that the tile can be correctly reprojected as the object and view move. &lt;br /&gt;&lt;br /&gt;Tracing into a skinned character is a separate problem for which I don't have a good direct solution. However, given a solution to this problem, one could use the per-tile object ID to "re-skin" the tile during reprojection. Actually one would want to store the tile in the base pose coordinate space, and simply apply the bones to the fragments in the tile to do the reprojection.&lt;br /&gt;&lt;br /&gt;Tiles in the cache become less valid as objects move or the view changes. Each frame the cache gets updated to fill areas of the view(s) (note, there might be multiple views such as shadowmaps, etc.) which have sparse coverage. Tile cache update rate would scale to match the capacity of the GPU.&lt;br /&gt;&lt;br /&gt;Drawing the frame would be done via a composite pass which re-projects, lights, and blends all the cached tiles (and fills in any holes). Part of lighting could very well be done in tile space, and at a frequency lower than the screen refresh for low frequency lighting, and perhaps in image space at the screen refresh rate for high frequency lighting like specular and hard shadows. 
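&lt;br /&gt;&lt;br /&gt;A minimal sketch of the tile cache idea above, in Python purely for illustration (a real version would live on the GPU). Everything named here is an assumption and not from any existing code: the Tile class, the reproject and translation helpers, and the use of plain rigid 4x4 matrices in place of a real projection. It only shows the core bookkeeping: cached hits are stored per object, and each frame they are moved into the current view using that object's current transform.

```python
from dataclasses import dataclass

TILE_W, TILE_H = 8, 4  # hypothetical tile size: one warp-sized packet of raycasts


@dataclass
class Tile:
    object_id: int  # object the rays hit, so the tile can follow that object
    hits: list      # TILE_W * TILE_H cached hit points (x, y, z) in object space


def transform(m, p):
    """Apply a row-major 4x4 matrix to a 3D point (w is assumed to be 1)."""
    x, y, z = p
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(3))


def reproject(tile, object_to_world, world_to_view):
    """Move cached hits into the current view space using this frame's matrices."""
    to_world = object_to_world[tile.object_id]
    return [transform(world_to_view, transform(to_world, p)) for p in tile.hits]


def translation(tx, ty, tz):
    """Build a 4x4 translation matrix (stand-in for a full object or view matrix)."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


# Usage: one cached tile on object 7. Since the tile was traced, the object
# has moved +2 in x and the view now shifts everything by -5 in z.
tile = Tile(object_id=7, hits=[(0.0, 0.0, 0.0)] * (TILE_W * TILE_H))
view_hits = reproject(tile, {7: translation(2, 0, 0)}, translation(0, 0, -5))
```

The composite pass would then light and blend the reprojected hits, and any screen area left with sparse coverage would be queued for fresh raycasts.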
&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Ultimately the success or failure of this type of system depends on whether the composite/reprojection cost is lower than the cost of a full traversal.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Compared to What I'm Doing&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;With classic ray traversal the structure is geared towards answering the question &lt;i&gt;"if a ray intersects the object, what does it hit?".&lt;/i&gt; With my structure the traversal is geared towards answering the question, &lt;i&gt;"given a lower quality version of the object in this 3D region, what does the next higher quality level of detail look like".&lt;/i&gt; In my case the traversal could be of any tree, even if the tree self-intersects. In the case of a skinned object, classic ray traversal is a complex problem; in my case it is trivial. &lt;br /&gt;&lt;br /&gt;I go directly down the tree without divergence, because my traversal doesn't go by ray intersection. However, I depend on a reprojection scatter pass (which can be redone as a coarse binned scatter, local gather, or whatever form is fastest on the hardware). This reprojection scatter pass is used to prune out traversals of the tree which are not adding to visible nodes.&lt;br /&gt;&lt;br /&gt;My scene is that of a forest of l-systems (64K roots, 2M nodes). 
Each traversal step is amazingly expensive (decompress rule, apply rule) compared to a triangle or voxel raycast or raytrace, and in my case the composite/reprojection step isn't as costly as traversal, so the hybrid traverse/raster works.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Not sure if the hybrid cached trace/raster approach will work well for classic raycasting, but I'd bet one of you out there will be answering this question in the next year or so!&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2014674128983329770?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2014674128983329770/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/hybrid-cached-raycast-and-reproject.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2014674128983329770'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2014674128983329770'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/hybrid-cached-raycast-and-reproject.html' title='Hybrid Cached Raycast and Reproject Raster'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6498464227206045457</id><published>2009-07-13T10:51:00.003-05:00</published><updated>2009-07-13T10:54:02.936-05:00</updated><title type='text'>Amusement Machine Video</title><content 
type='html'>The Amusement Machine guy posted an interesting video (a month ago), real-time gpu raycasting on ATI 4850.&lt;br /&gt;&lt;br /&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/idkevMDw08I&amp;hl=en&amp;fs=1&amp;hd=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/idkevMDw08I&amp;hl=en&amp;fs=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6498464227206045457?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6498464227206045457/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/amusement-machine-video.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6498464227206045457'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6498464227206045457'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/amusement-machine-video.html' title='Amusement Machine Video'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2993529674655022165</id><published>2009-07-13T08:12:00.004-05:00</published><updated>2009-07-13T09:41:53.730-05:00</updated><title type='text'>Alaska : Gustavus</title><content type='html'>Gustavus, with its somewhere over 300 residents (you know a town has its priorities right when the airport is larger than the town itself), is a launching point for all things done in Glacier Bay. We stayed at &lt;a href="http://www.gustavus.com/guesthouse/"&gt;Aimee's Guest House&lt;/a&gt; and would highly recommend Aimee to anyone interested in staying in Gustavus,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2449/3704608442_652e42a4d9.jpg"&gt;&lt;br /&gt;&lt;br /&gt;The most dangerous animal in Alaska (2nd to people of course) viewed from the deck of the guest house during breakfast,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2480/3704602510_e41a7f47eb.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Dinner was mostly a combination of biking to &lt;a href="http://gustavus.com/peps/"&gt;Pep's Packing&lt;/a&gt; for some smoked salmon and the local grocery. The best dinner in town was the &lt;a href="http://gustavusinn.fotki.com/your_table/"&gt;Gustavus Inn&lt;/a&gt; where you have to reserve your seat before noon (so they know how much fish to get), food is served family style, and the highlight of dinner is a mix of local fish/crab and greens/fruit from the owner's extensive garden. &lt;br /&gt;&lt;br /&gt;Eight mile bike ride from the guest house is Glacier Bay National Park which only really sports 3 hiking trails from the lodge (and only one sized for a day trip). 
Aimee's Guest House provides vintage 1-speeds for the journey,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2481/3682013519_bf9d0d6313.jpg"&gt;&lt;br /&gt;&lt;br /&gt;And en route is a path to a military plane crash site (sans the awesome radial motors),&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3468/3703983673_07b9507b73.jpg"&gt;&lt;br /&gt;&lt;br /&gt;In the other direction is the road into town; from there, en route to the pier, is the Mount Fairweather Golf Course,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3613/3704554328_9fb04fe1dd.jpg"&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2993529674655022165?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2993529674655022165/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-gustavus.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2993529674655022165'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2993529674655022165'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-gustavus.html' title='Alaska : Gustavus'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2449/3704608442_652e42a4d9_t.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-271783370738319449</id><published>2009-07-10T10:13:00.003-05:00</published><updated>2009-07-10T10:50:11.741-05:00</updated><title type='text'>Video of Particles via L-System and Stochastic Visibility</title><content type='html'>I talked about this a while back and never did post a video, and I had to post something cool before SIGGRAPH 2009, so below are screen shots and a quick video of an old test of stochastic visibility with a particle generation effect. The OpenGL2 (NVidia-specific) GLSL demo was built a long time ago for an 8600 GTS, and runs way too fast on the GTX275 (even bottlenecked by triangle setup in point scatter). Unfortunately all I had time to do was add a very slow serial frame capture and write MJPEG, which when turned on slows the program down too much to make a good video (it is hard to think in slow motion). The conversion from MJPEG to MPEG4 isn't ideal either (it is a huge file) at 60fps 720p.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;There are a lot of things I never bothered fixing in the demo, such as the boundary between the black background and the fractal (you will see a blurry 16x16 block pattern outline late in the video), or the pops when a bad floating point number gets into the particle buffer. This only shows the result of the tree expansion limited to one level per frame (fill artifacts from edges of occluders). Also hole filling was broken. 
So it isn't really representative of what I'm planning on for Atom, but it was a huge milestone in learning GPGPU techniques!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Video&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.farrarfocus.com/atom/mov/sg2009.mov"&gt;Link to MPEG4 Video - 720p 60fps (115 MB)&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Other Screen Shots&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.farrarfocus.com/atom/img/sg2009a.jpg"&gt;&lt;img src="http://www.farrarfocus.com/atom/img/sg2009a2.jpg"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.farrarfocus.com/atom/img/sg2009b.jpg"&gt;&lt;img src="http://www.farrarfocus.com/atom/img/sg2009b2.jpg"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;What is it?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I'd qualify it as bad coder art used to test GPGPU tree data structures (everything is computed on the GPU, the CPU just sends in view position and view direction). The full fractal structure is huge, something like 2^24 pixels wide before precision problems set in?&lt;br /&gt;&lt;br /&gt;The scene tree is an 8-ary tree with one simple l-system rule for the 8 children based on the parent (position/scale/quaternion). It isn't limited to just one rule (I'm just lazy); it can look up any rule from a texture. The engine maintains the scene tree on the GPU (256K coarse nodes, 2M fine nodes). The projection is a 360-degree fisheye, and the particle effect runs in projected fisheye space based on motion vectors and parent tree position relative to child tree position. The visibility is computed in an octahedron space (different mapping from the view), so the edges of the fisheye projection lose quality fast. 
Color is instanced via a direct visualization of eye-relative world position,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;// x, y, z are the components of the eye-relative world position&lt;br /&gt;vec3 rgb = vec3(0.0);&lt;br /&gt;rgb += vec3(0.7,0.625,0.5) * vec3(pow(abs(sin(sqrt(abs(y/4096.0))*2.0)),16.0));&lt;br /&gt;rgb += vec3(0.3,0.4,0.5) * vec3(pow(abs(sin(sqrt(abs(x/4096.0))*32.0)),4.0));&lt;br /&gt;rgb += vec3(0.5,0.4,0.3) * vec3(pow(abs(sin(z/4096.0*256.0)),4.0));&lt;br /&gt;return rgb * rgb;&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Future&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I'm still fascinated with the problem of solving visibility for fully dynamic geometry requiring very expensive ray traversals (if ray cast or traced), but without ray casting (or tracing) and instead using what I've termed "stochastic visibility" and blogged about. The idea is to keep a consistent tree structure of the scene as required by visibility (and shading), then expand and contract the tree structure only enough each frame to service rendering to the quality level required for human frame-to-frame perception. Stochastic visibility collides a point per scene tree node in a view space mapping to both help compute tree updates and solve for visibility. It is stochastic because the points collided are randomly positioned inside the bounding volume of the node. Collisions help prune out nodes which yield non-visible geometry. Collisions also directly solve the tree node memory allocation problem (there is no memory allocation).&lt;br /&gt;&lt;br /&gt;Vacation has provided some new ideas to improve upon my old non-CUDA/OpenCL "stochastic visibility" which I will be trying soon,&lt;br /&gt;&lt;br /&gt;(1.) IMPROVED TEMPORAL CONSISTENCY VS HISTORY BUFFER. Collisions would sometimes prune out a part of the required scene tree branch, and keeping persistent nodes was way too costly to do in OpenGL3 with GPGPU methods on an 8600-level graphics card. 
One trivial solution to this problem would be to run a pass which checks the source data for collisions and does a reduction of the source nodes, keeping the highest priority node which had a collision. If I had a set of reducing resolution history buffers, I could likely ensure good temporal consistency (solving the random pruning problem).&lt;br /&gt;&lt;br /&gt;(2.) ADD PARENT LINKS. My scene tree nodes didn't have links to the parents. This was a problem for animated l-systems and particle fluid dynamics, because I wanted a physical constraint which pulled nodes back to the static position in the parent (as defined by a possibly animated l-system rule). It turns out that the history buffer pass enables me to correct parent links in child nodes, even though parents move to different memory locations each frame. If this works, this GPU data structure truly becomes awesome because it solves the 1M node memory allocation problem per frame with fully dynamic trees, automatic "defragmentation" of memory, automatic regrouping for good data/branch locality and cache performance, and now nodes maintaining parent links even though all memory locations change per frame.&lt;br /&gt;&lt;br /&gt;(3.) MULTI-LEVEL TREE UPDATE PER FRAME. With LOD transparent blend-in and a triangle-based scene, I've found in previous results that I can add/prune nodes at only one LOD level per frame and effectively service visibility if I have a conservative amount of overlap. However, when the scene tree goes down to the pixel or near-pixel level, the tree must be able to expand by more than one level per frame to fill visibility gaps in dynamic geometry. 
I'm planning on a new method where I first re-project the current node set, then do a hierarchical image space reduction to choose the highest priority nodes for varying level multi-level tree update per frame.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-271783370738319449?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/271783370738319449/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/video-of-particles-via-l-system-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/271783370738319449'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/271783370738319449'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/video-of-particles-via-l-system-and.html' title='Video of Particles via L-System and Stochastic Visibility'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2674942925989838833</id><published>2009-07-09T03:25:00.003-05:00</published><updated>2009-07-09T04:36:31.266-05:00</updated><title type='text'>Cool Blog Posts</title><content type='html'>I'm starting to really like the Followers tool on the blog,&lt;br /&gt;&lt;br /&gt;&lt;a href="http://sandervanrossen.blogspot.com/"&gt;Sanders' Blog : Deferred Virtual Texture Shading Post Series&lt;/a&gt; - Sander van Rossen has a really 
neat idea to combine virtual texturing with texture space shading and do shading in the physical pages of the texture cache itself. Dynamic geometry would require updating normals (and coordinates) of the texels in the physical texture page cache. But lighting calculations could be cached (or not done every frame) and transparent rendering is simple. The challenge is efficiently rendering lights into physical texture page cache. I think this would be a highly worthy problem to solve in DX11 with CS5. With unique texturing, for each texture page in the page cache, have an associated bounding geometry to intersect with lights. Bin lights GPU side per texture page in the CS. Do highly coherent shading in the page cache with the lights in the page bin. This could be done by running enough threads to fill the hardware and just extracting work from the bins to load balance all the shading.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://enterthesingularity.blogspot.com/"&gt;EnterTheSingularity&lt;/a&gt; - Jake Cannell has a bunch of cool ideas, such as this, &lt;i&gt;"Taking the spatial-temporal coherence optimization idea to its logically conclusion results in a very interesting radical way to approach a renderer. Really what you want is something like a real-time video compression loop, but without the encoding step. When you need new fresh image macrotiles, you invoke the rasterizer, sometimes rendering them at reduced spatial resolution depending on the amount of motion blur, and then these are fed into a forward reprojection system which composites most of the image, along with some holes and errors introduced by new/unpredicted moving objects, lighting changes, etc. 
But that would be an entire new line of research, and is more suited to a ray tracer or really tight tree-based micro-rasterizer that can effeciently handle micro draw call invocations."&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2674942925989838833?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2674942925989838833/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/cool-blog-posts.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2674942925989838833'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2674942925989838833'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/cool-blog-posts.html' title='Cool Blog Posts'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8709763556064968310</id><published>2009-07-09T01:21:00.000-05:00</published><updated>2009-07-09T01:22:23.818-05:00</updated><title type='text'>GPU Ray Traversal Efficiency</title><content type='html'>&lt;a href="http://www.tml.tkk.fi/~timo/"&gt;Understanding the Efficiency of Ray Traversal on GPUs&lt;/a&gt;&lt;br /&gt;by Timo Aila and Samuli Laine of NVidia Research&lt;br /&gt;&lt;br /&gt;I really like this HPG2009 paper which explores the primary factors affecting performance of ray traversal on NVidia GPUs. 
They compared actual results with results from a simulator which had zero-cost memory performance. So the comparison is actuals versus an upper bound on the best (unobtainable) GPU performance, counting only ALU cycles.&lt;br /&gt;&lt;br /&gt;They found work distribution yielding lower SIMD efficiency was the primary bottleneck for ray traversal and NOT the memory system. Quote from the paper, &lt;i&gt;"We have also shown that, contrary to conventional wisdom, ray tracing performance of GTX285 is not significantly hampered by the lack of cache hierarchy. In fact, we can also expect good scaling to more complex scenes as a result of not relying on caches."&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Upper End Result in the Paper on a GTX285&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;- 140M rays/sec for primary rays (coherent)&lt;br /&gt;- 40M rays/sec for secondary randomly shuffled GI rays (non-coherent)&lt;br /&gt;&lt;br /&gt;They found around (and often better than) a 2x performance increase via two primary optimizations. With these changes they managed between 77% and 91% of the theoretical maximum GPU performance assuming infinite memory speed.&lt;br /&gt;&lt;br /&gt;(1.) PERSISTENT THREADS. Instead of launching a thread per pixel, launch enough threads to fill the GPU once, and have the threads pull work from a global pool. The problem is that ray traversal has varying execution time. The advantage of threads grabbing their own work is that a long-running warp only stalls itself instead of stalling the entire block (another block would have to wait until the longest running warp finished before being issued). Also note the paper pulls work from the global pool into small local (shared memory) pools to reduce pressure on a single global atomic counter.&lt;br /&gt;&lt;br /&gt;(2.) PER-RAY TRAVERSAL. Typical CPU SIMD tracers do packet traversal partly because of lack of vector gather and because CPUs don't automatically handle branch divergence. 
The paper found that per-ray traversal is faster than packet traversal on the GTX285 in all cases, and for diffuse primary rays about 2x faster (even though per-ray traversal decreases memory coherency)!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Implementation&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;- 192 threads/block&lt;br /&gt;- 21-25 registers/thread&lt;br /&gt;- node data fetched via 1D texture fetch (cached)&lt;br /&gt;- triangle data fetched via direct global memory access (non-cached)&lt;br /&gt;- per-ray stacks via global memory access (non-cached)&lt;br /&gt;- binary BVH traversal with maximum leaf size of 8&lt;br /&gt;- tree built using greedy surface-area heuristic&lt;br /&gt;- 64 bytes per pair of BVH nodes (one full global memory segment access)&lt;br /&gt;- pair of BVH nodes always fetched and tested together&lt;br /&gt;- Woop's unit triangle intersection test&lt;br /&gt;- 48 bytes per triangle (1 to 2 global memory segments per access)&lt;br /&gt;- rays assigned in Morton order (good node texture coherency on primary rays)&lt;br /&gt;&lt;br /&gt;I'm making some assumptions here (which I didn't see in the paper directly but should be obvious?), that the pairs of BVH nodes and 48-byte triangle leaf data (referred to below as a "data block") are fetched using 1-2 global memory transactions into shared memory. &lt;br /&gt;&lt;br /&gt;(1.) Threads write address of data block into shared memory.&lt;br /&gt;(2.) All threads share in loading the data block for one thread at a time.&lt;br /&gt;(3.) Data blocks are loaded with increasing bank offsets.&lt;br /&gt;&lt;br /&gt;This ensures very fast memory transactions globally (perfect coalescing on GT200), and no bank conflicts locally afterward when threads move from coordinated loading to individual parallel computation. &lt;br /&gt;&lt;br /&gt;Notice that stack data is also thread-local storage, which means non-cached global memory accesses. 
I'm going to assume again that they ensured good global memory segment coalescing (GT200) via keeping per thread stacks spaced out by a stride of 16. This way global memory segments required for stack update would be a function of divergence in stack size. So stacks of similar size would get serviced in a single global memory transaction (aside, this is exactly the same thing required on Larrabee to do fast parallel stacks since gather performance is a function of cache lines accessed). &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Dedicated Warp-Wide Scan or Prefix Sum Instruction&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This enables lower software overhead for fine thread granularity work regrouping on the GPU. IMO NVidia needs this (or something like Larrabee's 2 related instructions). I covered it in prior blog posts, and have hopes that this might be in NVidia's GT300 arch. The paper covers this in the section on &lt;i&gt;"Improving the SIMD efficiency further"&lt;/i&gt;. &lt;br /&gt;&lt;br /&gt;The paper found that current software overhead for replacing terminated rays didn't yield a consistent performance improvement. 
However, the simulator suggested a max 20% improvement with the addition of two warp-wide hardware instructions (prefix sum and population count) assuming no memory bottleneck.&lt;br /&gt;&lt;br /&gt;The paper also has some very rough speculation towards going to fully dynamic "Work queues" using the simulated new warp-wide instructions showing a max potential of 40% and 80% performance improvements in two cases assuming no memory bottleneck.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;SIMD Width : 32-wide vs 16-wide vs 1-wide (Scalar)&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Given impossibly infinitely fast memory, and identical computational power, simulator results showed 16-wide SIMD to be 6-19% faster, and scalar to be 30-144% faster than 32-wide SIMD.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8709763556064968310?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8709763556064968310/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/gpu-ray-traversal-efficiency_09.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8709763556064968310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8709763556064968310'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/gpu-ray-traversal-efficiency_09.html' title='GPU Ray Traversal Efficiency'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-5485535263024267819</id><published>2009-07-08T16:11:00.003-05:00</published><updated>2009-07-08T17:08:28.901-05:00</updated><title type='text'>Alaska : Glacier Bay by Air</title><content type='html'>A 75 minute flight-seeing tour of Glacier Bay Area is the best way to see the park. We waited for a clear day before doing the tour and got rewarded by the ability to fly over the Brady Ice Field. Flight-seeing provides a way to see many glaciers up close and personal, or choose specific areas of the park to look at, such as the peaks of mountains.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2438/3698232835_fd32254cf9.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Or the glaciers themselves. &lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2513/3698259627_acedc8a6da.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Here are some shots of Margerie Glacier,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2607/3698247351_4929b8a355.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3436/3698255425_e38dd75674.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3498/3698251061_10315e6048.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Unfortunately a majority of the shots taken didn't turn out well (a mixture of a new camera whose auto-exposure didn't adapt to the snow and overexposed everything, and me preferring to look at stuff rather than take photos). 
The experience was hugely better than what you will ever see from a set of photos or even videos.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-5485535263024267819?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/5485535263024267819/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-glacier-bay-by-air.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5485535263024267819'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5485535263024267819'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-glacier-bay-by-air.html' title='Alaska : Glacier Bay by Air'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2438/3698232835_fd32254cf9_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-4207585249362614015</id><published>2009-07-08T12:10:00.001-05:00</published><updated>2009-07-08T12:12:56.916-05:00</updated><title type='text'>Giga Voxels at Siggraph 2009</title><content type='html'>&lt;a href="http://artis.imag.fr/Membres/Cyril.Crassin/"&gt;Beyond Triangles : GigaVoxels Effects In Video Games&lt;/a&gt;&lt;br /&gt;Friday, 7 August | 3:45 PM | Room 260-262 &lt;br /&gt;&lt;br /&gt;&lt;a 
href="http://artis.imag.fr/Membres/Cyril.Crassin/Siggraph09/Siggraph09Poster.jpg"&gt;&lt;img src="http://artis.imag.fr/Membres/Cyril.Crassin/Siggraph09/Siggraph09Poster.jpg" width="600" height="424"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Looks like a really neat talk,&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Mixing with existing triangle scenes.&lt;br /&gt;More efficient CUDA implementation of the pipeline.&lt;br /&gt;Faster octree traversal.&lt;br /&gt;Fully GPU side brick cache management.&lt;br /&gt;Efficient soft shadows and depth-of-field approximation.&lt;br /&gt;Instancing of millions of octree-based volumetric objects.&lt;br /&gt;Image space method supporting transparency with overlapping instances.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-4207585249362614015?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/4207585249362614015/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/giga-voxels-at-siggraph-2009.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4207585249362614015'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/4207585249362614015'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/giga-voxels-at-siggraph-2009.html' title='Giga Voxels at Siggraph 2009'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6280630909645617118</id><published>2009-07-06T11:28:00.003-05:00</published><updated>2009-07-06T12:41:23.987-05:00</updated><title type='text'>Alaska : Glacier Bay by Boat</title><content type='html'>The marine wilderness of Glacier Bay includes tidewater glaciers, snow-capped mountain ranges, ocean coastlines, deep fjords, and freshwater rivers and lakes. This diverse land and seascape hosts a mosaic of plant communities and a variety of marine and terrestrial wildlife and presents many opportunities for adventure. Here is the Glacier Bay Area viewed from google,&lt;br /&gt;&lt;br /&gt;&lt;iframe width="600" height="400" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com/maps?f=q&amp;amp;source=s_q&amp;amp;hl=en&amp;amp;geocode=&amp;amp;q=Glacier+Bay+National+Park+and+Preserve&amp;amp;sll=37.579413,-95.712891&amp;amp;sspn=58.928785,78.310547&amp;amp;ie=UTF8&amp;amp;t=h&amp;amp;ll=58.605472,-135.626221&amp;amp;spn=1.144649,3.295898&amp;amp;z=8&amp;amp;iwloc=A&amp;amp;output=embed"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;small&gt;&lt;a href="http://maps.google.com/maps?f=q&amp;amp;source=embed&amp;amp;hl=en&amp;amp;geocode=&amp;amp;q=Glacier+Bay+National+Park+and+Preserve&amp;amp;sll=37.579413,-95.712891&amp;amp;sspn=58.928785,78.310547&amp;amp;ie=UTF8&amp;amp;t=h&amp;amp;ll=58.605472,-135.626221&amp;amp;spn=1.144649,3.295898&amp;amp;z=8&amp;amp;iwloc=A" style="color:#0000FF;text-align:left"&gt;View Larger Map&lt;/a&gt;&lt;/small&gt;&lt;br /&gt;&lt;br /&gt;The park provides an 8 hour boat tour of the area which visits many of the glaciers in the bay. 
The tour starts from the park dock at Bartlett Cove (which can be seen by zooming in the map above East of Gustavus).&lt;br /&gt;&lt;br /&gt;Kathryn and I decided to take the tour after seeing photos of the tour where people were getting on and off the boat. Turns out this was misleading, those photos were of picking up and dropping off kayakers and not off boat excursions as part of the tour as I expected. Eight hours is a long time even on the large boat, and I spent part of the tour (most of the return) sleeping. But in their defense, they did provide a great lunch, and the views en route were spectacular. &lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2486/3682830570_c54e3ed836.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Steller's Sea Lions (above), Brown Bear (aka Grizzly below), and other animals were seen on the tour. At one point we watched a Brown Bear traverse a patch of mostly vertical rock en route to the beach where we ended up picking up the first batch of kayakers. Apparently a few other kayakers were picked up before from the same spot with a bear taking a nap on the beach.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2594/3682835370_4cc190d060.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3542/3682835456_41748faf32_o.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Glacier views are provided halfway through the tour along with lunch. Below is Margerie Glacier plus detail shots. The gray silt when dispersed yields the turquoise water color which proliferates in the Glacier Bay area. 
We did see some calving at Margerie, but didn't get any good shots of it.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2657/3682023757_397a4ebb93.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2676/3682839312_b68262db27.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2624/3682022975_cf9caccf6d.jpg"&gt;&lt;br /&gt;&lt;br /&gt;The Lamplugh Glacier currently sports an internal waterfall,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2624/3682847608_f6ca8762a9.jpg"&gt;&lt;br /&gt;&lt;br /&gt;A far better way to see the bay is by air, and that post will have to wait until Kathryn gets the photos finished...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6280630909645617118?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6280630909645617118/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-glacier-bay-by-boat.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6280630909645617118'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6280630909645617118'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-glacier-bay-by-boat.html' title='Alaska : Glacier Bay by Boat'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://farm3.static.flickr.com/2486/3682830570_c54e3ed836_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6079417459609113238</id><published>2009-07-06T10:15:00.004-05:00</published><updated>2009-07-06T11:35:27.782-05:00</updated><title type='text'>Alaska : Mendenhall Glacier</title><content type='html'>This 12 mile long and 1.5 mile wide glacier is located in Mendenhall Valley, about 12 miles from Juneau. &lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2673/3681126065_9bb83762d4.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Trails in the valley lead around the lake created by the receding ice mass. This lake is filled with very blue icebergs. In the glacial areas, ice takes on this blue color due to air bubbles trapped inside the ice getting squeezed out in the glacier.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2529/3682205962_704d9ba5c9.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Around the lake there are 2 primary trails leading towards the glacier, the shorter easy hike ends at a huge waterfall (picture below). A secondary trail loop off this waterfall hike was closed (probably due to the trail being destroyed by a landslide).&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2438/3682187036_e828a5f616.jpg"&gt;&lt;br /&gt;&lt;br /&gt;On the other end of the lake starting at the camp area is another longer 7 mile trail which leads to a view over the glacier. We would have liked to do the trail, but we had to catch our flight back to Chicago in a few hours, so we opted to cheat and take a helicopter tour which landed on the glacier itself.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Helicopter Tour&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Looking back towards the visitors center. 
The trail which follows the shore leading to the huge waterfall (which is out of view) is on the left side of this image.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3542/3682208316_20b7b00eba.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Detail of calved icebergs at the glacier's terminus,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2658/3681377379_6a9f031873.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Glacial detail viewed from the chopper was fantastic, &lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3600/3682175950_397b45a695.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Looking towards the landing spot,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2517/3682179012_54917b70d0.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Similar view after touching down on the glacial surface,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2600/3682191352_4172f9d39e.jpg"&gt;&lt;br /&gt;&lt;br /&gt;And yes, glacial melt water is quite refreshing, &lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3626/3682209206_ef1a12ca90.jpg"&gt;&lt;br /&gt;&lt;br /&gt;These melt water streams carve through the top of the glacier surface,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3585/3682199186_beb8c6cdd1.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Views can be quite stunning,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3577/3681381569_cfe98905a9.jpg"&gt;&lt;br /&gt;&lt;br /&gt;But eventually you have to return to the helicopter holding your hat on, to return the ice boot covers they provided you. For some reason I felt compelled to duck my head down, perhaps as a subconscious reaction that I could self terminate by jumping upward. 
Also note how they position someone to help, so stupid tourists don't deviate into the rear tail rotor.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2482/3681362787_225db0d2c1.jpg"&gt;&lt;br /&gt;&lt;br /&gt;The helicopter tour was the most expensive excursion we did. Something to do with turbine motor vs piston motor and amortizing the cost of the flying machine over the number of tourists... It was totally a last minute decision made along the lines of "we haven't done this before, and we probably are not going back to get a second chance to land on a glacier, might have enough time to do this before flying back to Chicago". Alaska's coastal weather is unpredictable, after a morning of near ground level clouds, and seeing the overcast cloud ceiling lifting, we took off for the airport looking for a building with lots of helicopters...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Google Map View of the Area&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;iframe width="600" height="600" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com/maps?f=q&amp;amp;source=s_q&amp;amp;hl=en&amp;amp;geocode=&amp;amp;q=Mendenhall+Valley&amp;amp;sll=58.389837,-134.57325&amp;amp;sspn=0.156735,0.434647&amp;amp;ie=UTF8&amp;amp;t=h&amp;amp;ll=58.405309,-134.560547&amp;amp;spn=0.215843,0.411987&amp;amp;z=11&amp;amp;output=embed"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;small&gt;&lt;a href="http://maps.google.com/maps?f=q&amp;amp;source=embed&amp;amp;hl=en&amp;amp;geocode=&amp;amp;q=Mendenhall+Valley&amp;amp;sll=58.389837,-134.57325&amp;amp;sspn=0.156735,0.434647&amp;amp;ie=UTF8&amp;amp;t=h&amp;amp;ll=58.405309,-134.560547&amp;amp;spn=0.215843,0.411987&amp;amp;z=11" style="color:#0000FF;text-align:left"&gt;View Larger Map&lt;/a&gt;&lt;/small&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6079417459609113238?l=farrarfocus.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6079417459609113238/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-mendenhall-glacier.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6079417459609113238'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6079417459609113238'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-mendenhall-glacier.html' title='Alaska : Mendenhall Glacier'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2673/3681126065_9bb83762d4_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-7432374973576899129</id><published>2009-07-06T09:55:00.002-05:00</published><updated>2009-07-06T10:15:20.792-05:00</updated><title type='text'>Alaska : Eagles</title><content type='html'>Gustavus has a lot of Bald Eagles which hang out around the pier at low tide. 
One evening, a bunch of them were fighting for what was likely fisherman scraps,&lt;br /&gt;&lt;br /&gt;&lt;object type="application/x-shockwave-flash" width="400" height="300" data="http://www.flickr.com/apps/video/stewart.swf?v=71377" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"&gt; &lt;param name="flashvars" value="intl_lang=en-us&amp;photo_secret=f3e43e42a9&amp;photo_id=3681870370"&gt;&lt;/param&gt; &lt;param name="movie" value="http://www.flickr.com/apps/video/stewart.swf?v=71377"&gt;&lt;/param&gt; &lt;param name="bgcolor" value="#000000"&gt;&lt;/param&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;embed type="application/x-shockwave-flash" src="http://www.flickr.com/apps/video/stewart.swf?v=71377" bgcolor="#000000" allowfullscreen="true" flashvars="intl_lang=en-us&amp;photo_secret=f3e43e42a9&amp;photo_id=3681870370" height="300" width="400"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;One eagle managed to steal some skin, and flew away to an isolated post on the pier to consume it. 
Food flying into the mouth, not out...&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2672/3681076455_951fc92d98.jpg"&gt;&lt;br /&gt;&lt;br /&gt;The eagle was close enough to get a good video without the camera's sorry excuse of digital zoom for lack of telephoto,&lt;br /&gt;&lt;br /&gt;&lt;object type="application/x-shockwave-flash" width="400" height="300" data="http://www.flickr.com/apps/video/stewart.swf?v=71377" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"&gt; &lt;param name="flashvars" value="intl_lang=en-us&amp;photo_secret=e95fb622b4&amp;photo_id=3681075443"&gt;&lt;/param&gt; &lt;param name="movie" value="http://www.flickr.com/apps/video/stewart.swf?v=71377"&gt;&lt;/param&gt; &lt;param name="bgcolor" value="#000000"&gt;&lt;/param&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;embed type="application/x-shockwave-flash" src="http://www.flickr.com/apps/video/stewart.swf?v=71377" bgcolor="#000000" allowfullscreen="true" flashvars="intl_lang=en-us&amp;photo_secret=e95fb622b4&amp;photo_id=3681075443" height="300" width="400"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-7432374973576899129?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/7432374973576899129/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-eagles.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7432374973576899129'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/7432374973576899129'/><link rel='alternate' type='text/html' 
href='http://farrarfocus.blogspot.com/2009/07/alaska-eagles.html' title='Alaska : Eagles'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2672/3681076455_951fc92d98_t.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6062434730671834884</id><published>2009-07-05T12:39:00.004-05:00</published><updated>2009-07-05T13:34:11.949-05:00</updated><title type='text'>Multicore IC for Realtime Ray Tracing</title><content type='html'>&lt;a href="http://techon.nikkeibp.co.jp/article/HONSHI/20090629/172373/"&gt;800TFLOPS Multicore IC for Realtime Ray Tracing&lt;/a&gt; describes a monster multi-core board dedicated to real-time ray tracing. 
Quick stats pulled from the diagram,&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Board Target&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;- 1000W.&lt;br /&gt;- 792 TFLOPs (only)!&lt;br /&gt;- 73 cores on chip x 9 chips.&lt;br /&gt;- 1920 x 1080 output resolution.&lt;br /&gt;- Images traced using 35 channel color.&lt;br /&gt;- Chips at 750MHz on 45nm process.&lt;br /&gt;- 130 million gates/chip at 17mm footprint.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Totals Across 9 Chips on Board&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;- 864 KB L2 I$.&lt;br /&gt;- 7056 KB L2 D$.&lt;br /&gt;- 4608 KB L1 I$.&lt;br /&gt;- 9216 KB L1 D$.&lt;br /&gt;- 954 KB registers.&lt;br /&gt;- 928 KB L0 memory.&lt;br /&gt;- 2592 GB/s bandwidth.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Each Chip&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;- 8 clusters of 9 cores.&lt;br /&gt;- 1 64-bit RISC CPU.&lt;br /&gt;- 96 KB L2 I$.&lt;br /&gt;- 784 KB L2 D$.&lt;br /&gt;- 512 KB L1 I$ (total over 8 clusters).&lt;br /&gt;- 1024 KB L1 D$ (total over 8 clusters).&lt;br /&gt;- 106 KB registers (total over 8 clusters).&lt;br /&gt;- 103 KB L0 memory (total over 8 clusters).&lt;br /&gt;- 288 GB/s bandwidth (total over 8 clusters).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Each Cluster&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;- 64 KB L1 I$.&lt;br /&gt;- 128 KB L1 D$.&lt;br /&gt;- 13.25 KB registers.&lt;br /&gt;- 12.9 KB L0 memory.&lt;br /&gt;- 36 GB/s via 3 interconnect buses each 12 GB/s and 128-bit. &lt;br /&gt;- 9 cores sharing registers (?).&lt;br /&gt;&lt;br /&gt;CORE 0,&lt;br /&gt;- Ray generation.&lt;br /&gt;- Reserved register file.&lt;br /&gt;- Register file : 8 bytes x 16 registers x 4 banks (512 B).&lt;br /&gt;- 256 B L0 memory.&lt;br /&gt;&lt;br /&gt;CORE 1,&lt;br /&gt;- Pixel Traverse.&lt;br /&gt;- 2-way SIMD.&lt;br /&gt;- Register file : (4x2) bytes x 16 registers x 4 banks (512 B).&lt;br /&gt;- ? 
B CAM (content-addressable memory) L0 memory.&lt;br /&gt;&lt;br /&gt;CORE 2,&lt;br /&gt;- Bounding box processing.&lt;br /&gt;- 2-way SIMD x 3.&lt;br /&gt;- (4x2) bytes x 16 registers x 8 banks (1 KB) &lt;br /&gt;- 4 KB L0 memory.&lt;br /&gt;&lt;br /&gt;CORE 3,&lt;br /&gt;- Intersection processing.&lt;br /&gt;- 8-way SIMD x 24.&lt;br /&gt;- (4x8) bytes x 16 registers x 16 banks (8 KB)&lt;br /&gt;- 5.4 KB L0 memory.&lt;br /&gt;&lt;br /&gt;CORE 4,&lt;br /&gt;- Other ray-tracing processing.&lt;br /&gt;- 4 bytes x 16 registers x 16 banks (256 B)&lt;br /&gt;- 256 B L0 memory.&lt;br /&gt;&lt;br /&gt;CORES 5, 6, AND 7, &lt;br /&gt;- "Light Processing", write processing.&lt;br /&gt;- 4 bytes x 16 registers x 16 banks (1 KB)&lt;br /&gt;- 1 KB L0 memory.&lt;br /&gt;&lt;br /&gt;CORE 8,&lt;br /&gt;- Not shown in the diagram?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6062434730671834884?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6062434730671834884/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/multicore-ic-for-realtime-ray-tracing.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6062434730671834884'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6062434730671834884'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/multicore-ic-for-realtime-ray-tracing.html' title='Multicore IC for Realtime Ray Tracing'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-261104447351109174</id><published>2009-07-05T12:17:00.002-05:00</published><updated>2009-07-05T12:25:53.209-05:00</updated><title type='text'>Odd Transparent Rendering Ideas</title><content type='html'>Transparent content tends to require a trade-off between overdraw and sharpness. The ideal solution in my mind is to be able to have sharp features (high resolution) with low amounts of overdraw (more opaque) intermixed with smooth features (low resolution) and high amounts of overdraw (more transparent).&lt;br /&gt;&lt;br /&gt;With 100% order-independent rendering, such as only additive transparency, one possible solution would be as follows,&lt;br /&gt;&lt;br /&gt;(1.) Have a set of accumulation buffers at reducing resolutions.&lt;br /&gt;(2.) Distribute draws to buffers based on sharpness and overdraw.&lt;br /&gt;(3.) Combine all buffers. &lt;br /&gt;&lt;br /&gt;Combining the buffers requires repeated passes of up-sample and merge in order of the smallest buffer to the largest buffer. Note it might be a good idea to &lt;a href="http://iquilezles.org/www/articles/texture/texture.htm"&gt;filter better&lt;/a&gt; in the up-sample passes to avoid the typical mip maximize artifacts. Also note these combine passes are going to be bandwidth bound, so it's a good idea to pair with some ALU bound work in the same kernel if possible!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;TANGENT ON BLEND MODES&lt;br /&gt;&lt;br /&gt;Typical blend modes used in order-independent rendering are ADD and MUL. ADD to lighten, and MUL to darken. Often using both requires 2 passes, and when mixed with multiple resolutions, artifacts are found such as background MUL particles (say smoke) drawing over foreground ADD particles (say fire). 
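&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A minimal Python sketch (not from any engine; the function names and sample values are hypothetical) of why ADD-only or MUL-only draws are order independent while a mix of the two is not, which is one reading of why using both tends to need 2 passes.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
from itertools import permutations


def blend_add(dst, src):
    """ADD blending: lighten the destination value."""
    return dst + src


def blend_mul(dst, src):
    """MUL blending: darken the destination value (src in 0..1)."""
    return dst * src


def composite(dst, draws):
    """Apply (mode, value) draws to one color channel, in the order given."""
    for mode, value in draws:
        dst = mode(dst, value)
    return dst


def distinct_results(draws, background=0.5):
    """Set of final values over every possible draw order."""
    return {composite(background, order) for order in permutations(draws)}


# Hypothetical single-channel draws, chosen to be exact in binary floating point.
add_only = [(blend_add, 0.125), (blend_add, 0.25)]
mul_only = [(blend_mul, 0.5), (blend_mul, 0.25)]
mixed = [(blend_add, 0.25), (blend_mul, 0.5)]

print(distinct_results(add_only))  # one value: draw order does not matter
print(distinct_results(mul_only))  # one value: draw order does not matter
print(distinct_results(mixed))     # two values: draw order matters
```
&lt;/pre&gt;&lt;br /&gt;With these values both orderings of the ADD-only and MUL-only draws land on the same result, while the mixed list gives a different result depending on which draw goes first.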
&lt;br /&gt;&lt;br /&gt;Another issue is gamma-correct (linear space) blending: are transparent surfaces combined in linear space or after the linear to gamma transform?&lt;br /&gt;&lt;br /&gt;In the case of gamma space (non-linear) blending, I've never liked ADD blending at all for additive transparency. Perhaps this is because of years of photography where I go to complex lengths to avoid over-exposure. HDR in games is often (IMO) ridiculously overdone (really, should 25-50% of the frame be white?). A visually better option (IMO) is Photoshop's SCREEN blend mode. SCREEN blending results in a soft approach to white, with blending limited at white. From a photography perspective, SCREEN works more like the way film responds to over-exposure. SCREEN is an inverted multiply, &lt;br /&gt;&lt;br /&gt;d = 1 - ( (1-a) * (1-b) )&lt;br /&gt;&lt;br /&gt;The inversion required by SCREEN can be factored out to doing MUL blending with an inverted framebuffer when attempting to port this over to DX or GL or GCM.&lt;br /&gt;&lt;br /&gt;In the case of linear blending, at least in Photoshop, I've found MUL for darkening unacceptable with both 8-bit and 16-bit modes. To test this in Photoshop, one must actually build a linear color profile (which is quite easy). The problem is the loss of precision (banding, etc) when the MUL results in colors near black (even in 16-bit mode). Perhaps with 16-bit float framebuffer outside of Photoshop this isn't a problem.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;BILLBOARDS AND SPHERICAL PARTICLES&lt;br /&gt;&lt;br /&gt;Particle billboards suffer from the problem of hard intersection with solid geometry, and in the case of sorted billboards, pop when sort order changes.&lt;br /&gt;&lt;br /&gt;Both of these problems have similar solutions: dynamically adjusting transparency across the particle to avoid the artifact. Spherical (or Z buffer aware) particles easily solve the first problem of intersecting with solid geometry. 
The second problem of pop on order change is more complex.&lt;br /&gt;&lt;br /&gt; &lt;br /&gt;TRANSFORMING ORDER DEPENDENT TO ORDER INDEPENDENT?&lt;br /&gt;&lt;br /&gt;&lt;i&gt;This is a wild completely untested idea I haven't had time to try. Note the application of this would be for a non-triangle renderer which doesn't have standard opaque triangle surfaces. Application of this to a typical game engine is likely not worth it (requires two alphas, one for weight and one for transparency, or other complicated things)!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The idea is to solve the following problems in one go,&lt;br /&gt;&lt;br /&gt;(1.) Removal of the order change pop on billboards.&lt;br /&gt;(2.) Ability to easily work in multiple resolutions for performance.&lt;br /&gt;(3.) Ability to do blend, lighten and darken.&lt;br /&gt;(4.) Ability to do all this in one "pass".&lt;br /&gt;&lt;br /&gt;The idea is similar in function to front first blending in which RGB accumulates color and A accumulates coverage. Except now RGB accumulates weighted color, and A accumulates weight. Actual RGB color is weighted RGB divided by weight A (assuming a FP16 buffer).&lt;br /&gt;&lt;br /&gt;However, drawing would be order-independent, billboards rendered by computing a weight factor of visibility across the billboard's surface (or perhaps for smaller billboards, constant weight factor across the billboard's surface). This visibility factor would take into account both the billboard's transparency and expected occlusion by other billboards and surfaces. RGB would be billboard color times weight. Also for lighten and darken, weight stored in A could be adjusted.&lt;br /&gt;&lt;br /&gt;Clearly this is adding another problem to solve (visibility estimation for weight factor). 
However I believe the weight factor could easily be computed at a lower effective frame rate and at a much lower resolution, possibly leading to better performance overall.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-261104447351109174?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/261104447351109174/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/odd-transparent-rendering-ideas.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/261104447351109174'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/261104447351109174'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/odd-transparent-rendering-ideas.html' title='Odd Transparent Rendering Ideas'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2247671703668029566</id><published>2009-07-03T19:57:00.012-05:00</published><updated>2009-07-05T12:15:55.107-05:00</updated><title type='text'>Alaska : Whale Watching</title><content type='html'>&lt;i&gt;Hopefully flickr won't mind the deep linking...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Alaska was incredible, not the typical vacation location, and IMO perfect for the independent traveler. 
My wife Kathryn and I spent over a week in the Glacier Bay area, and over a few blog posts I'm going to share a little of the experience.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The TAZ&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I'd highly suggest the Icy Straits afternoon whale watching excursion of Tod Seben's Cross Sound Express, also known as &lt;a href="http://www.taz.gustavus.com/"&gt;The TAZ&lt;/a&gt; in Gustavus. Easy to see how the Humpbacks were nearly hunted to extinction. They are simple to find if one knows where they feed: grab a set of binoculars and head towards the sprays. Being down-wind of the spray provides the opportunity to experience "whale's breath". It is amazing that something so foul smelling could travel so far without dispersing (more). &lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2466/3679452934_bf24339460.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Watching the Humpbacks is something else indeed, especially watching a 40 foot animal jump out of the water. Besides breaching, they spy-hop, tail-lob, slap, and wave. At one point in the afternoon, a pod of whales turned and closed the distance to the boat while we were stationary and watching (regulations specify a minimum distance to approach the whales, however they can choose to approach the boat on their own). Size becomes apparent when seeing something the size of the boat swim by.  &lt;br /&gt;&lt;br /&gt;Well worth it. At one point, Tod let Kathryn pilot the TAZ,&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2540/3679452430_815e5d903f.jpg"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Bubble Net Feeding Viewed via Air Taxi&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Towns and cities in the Glacier Bay area are connected via air and water only. Roads don't exist except locally. Our taxi pilot mentioned that he had seen a pod of Humpbacks from the air en route, and said he would attempt to find them again as we took the air taxi from Gustavus to Juneau. 
By some act of random chance, he did indeed find the pod, and they were bubble net feeding, which according to our pilot must have been rare to see from the air because he had never seen that before. We got a first hand look via a bunch of descending tight banked loops.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2450/3679489652_566835202c.jpg"&gt;&lt;br /&gt;&lt;br /&gt;Below is a shot of the pod after feeding. BTW, that wasn't telephoto, all we had was a small point and shoot, which apparently has some issues with white point computation given light angle to the water...&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3637/3678678969_bd9f5b739b.jpg"&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2247671703668029566?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2247671703668029566/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-whale-watching.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2247671703668029566'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2247671703668029566'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/alaska-whale-watching.html' title='Alaska : Whale Watching'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://farm3.static.flickr.com/2466/3679452934_bf24339460_t.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-5705353354163602164</id><published>2009-07-01T23:55:00.003-05:00</published><updated>2009-07-02T00:06:14.287-05:00</updated><title type='text'>iPhone 3GS GLBench</title><content type='html'>&lt;a href="http://www.glbenchmark.com/compare.jsp"&gt;The GLBenchmark site provides a way to compare the GL performance of all the iPhone models including the 3GS!&lt;/a&gt; Results are very interesting indeed showing 3GS with 10x the triangle performance of the prior models, and 3x the texture fill rate.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-5705353354163602164?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/5705353354163602164/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/iphone-3gs-glbench.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5705353354163602164'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/5705353354163602164'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/07/iphone-3gs-glbench.html' title='iPhone 3GS GLBench'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8643606630202589630</id><published>2009-06-30T18:13:00.015-05:00</published><updated>2009-07-01T18:25:26.206-05:00</updated><title type='text'>Yes 18 hour days but no bugs.</title><content type='html'>Back today from Alaska via overnight flight, will have a blog post on the trip once the photos are sorted through (and BTW I think Wisconsin has worse mosquitos). Playing catch-up with everything missed over the past 9 days.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lly316.blogspot.com/2008/11/microsoft-directx-11-screenshots.html"&gt;What looks (edit) NOT (edit, thanks Cyril) to be some DX11 realtime GI screenshots, one with indirect specular highlights.&lt;/a&gt; Found from the &lt;a href="http://www.legitreviews.com/article/1001/"&gt;AMD DX11 Whitepaper Article&lt;/a&gt;, but with no link to the actual white paper, no new news as to DX11 (give us something real like limits on Append/Consume buffers mapped per shader). I'm also still disappointed at the CS4.x spec in how it doesn't provide support for features in current shipping NVidia DX10 cards (referring to G80 support for shared memory write access and later model atomic operations). Perhaps this went in ATI's favor, or will ultimately help NVidia force DX11 upgrades?&lt;br /&gt;&lt;br /&gt;&lt;a href="http://forums.nvidia.com/index.php?showtopic=99797"&gt;Cuda 2.3 Beta is Available for Registered Developers&lt;/a&gt;. According to the forum post, it provides float2half and half2float, which I believe match the DX11 functionality. 
Also some interesting unanswered questions regarding __synchronous_start(int s) and __synchronous_end() which looked to be found in the intermediate nvcc output.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.shacknews.com/featuredarticle.x?id=1156&amp;page=3"&gt;Interesting stuff in the Shack's John Carmack Interview&lt;/a&gt;, &lt;i&gt;"Now I am very excited about what I can do from a hardware and graphics standpoint with the 3GS. With vertex fragment shaders and OpenGL 2.0, I'm pretty convinced that I can actually run the MegaTexture id Tech 5 content creation pipeline on there. And I'm not sure what game I want to do that with yet, but the combination of seeing people download 700mb files of Myst on there, and the new capabilities, I could do some mind-blowingly cool stuff on there."&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.battlefieldheroes.com"&gt;Battle Field Heroes&lt;/a&gt; is live.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8643606630202589630?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8643606630202589630/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/yes-18-hour-days-but-no-bugs.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8643606630202589630'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8643606630202589630'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/yes-18-hour-days-but-no-bugs.html' title='Yes 18 hour days but no bugs.'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3223045717580938595</id><published>2009-06-19T16:04:00.007-05:00</published><updated>2009-06-19T16:53:32.512-05:00</updated><title type='text'>18 Hour Days, Lots of Bugs, but No Computer?</title><content type='html'>My wife and I are going up to Alaska for next week. Packing light, not going to bring the pro camera this time (sticking to a real vacation), between the two of us, just two carry-on backpacks. Going to be awesome.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3223045717580938595?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3223045717580938595/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/20-hour-days-lots-of-bugs-but-no.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3223045717580938595'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3223045717580938595'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/20-hour-days-lots-of-bugs-but-no.html' title='18 Hour Days, Lots of Bugs, but No Computer?'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-531056901346333304</id><published>2009-06-17T23:55:00.002-05:00</published><updated>2009-06-18T00:05:56.368-05:00</updated><title type='text'>Requesting Siggraph 2009 Advice</title><content type='html'>I'm looking for any advice as to what to see and do this Siggraph 2009, since I've never been to a Siggraph before, and I've got to book everything this week (going on vacation next week before the early deadline ends). &lt;b&gt;Also looking to meet up with any of you who read this blog and want to talk about tech and stuff!&lt;/b&gt; Please either email my gmail address (farrarfocus) or post comments below.&lt;br /&gt;&lt;br /&gt;Currently planning on flying in 10am Wednesday and doing two day passes for the following, &lt;br /&gt;&lt;br /&gt;Wed 5th: 1:45pm-5:30pm - Efficient Substitutes for Subdivision Surfaces&lt;br /&gt;Thr 6th: 8:30am-12:15pm - Beyond Programmable Shading I&lt;br /&gt;Thr 6th: 1:45pm-5:30pm - Beyond Programmable Shading II&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-531056901346333304?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/531056901346333304/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/requesting-siggraph-2009-advice.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/531056901346333304'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/531056901346333304'/><link rel='alternate' type='text/html' 
href='http://farrarfocus.blogspot.com/2009/06/requesting-siggraph-2009-advice.html' title='Requesting Siggraph 2009 Advice'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6591320038810651579</id><published>2009-06-16T10:58:00.002-05:00</published><updated>2009-06-16T11:25:56.298-05:00</updated><title type='text'>Stream Compaction for Deferred Shading</title><content type='html'>&lt;a href="http://graphics.cs.uiuc.edu/~jch/papers/shadersorting.pdf"&gt;Stream Compaction for Deferred Shading&lt;/a&gt;&lt;br /&gt;Jared Hoberock, Victor Lu, Yuntao Jia, John C. Hart&lt;br /&gt;Proc. High Performance Graphics, Aug. 2009&lt;br /&gt;&lt;br /&gt;Paper addresses the issue of SIMD shader divergence when deferred shading on Geforce 8/9 series hardware using the following tests,&lt;br /&gt;&lt;br /&gt;UNORDERED IMPLICIT - All N shaders in one program, use possible divergent branching (at pixel level) to shade pixel. &lt;i&gt;In the case of a small number of shaders (or low run-time shaders) the overhead of stream compaction or sort was higher than divergent branching.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;UNORDERED EXPLICIT - All N shaders in separate programs, run one full pass for each shader (N passes). &lt;i&gt;Was always slower than the UNORDERED IMPLICIT path, even though max register usage (controls max warps running in parallel for latency hiding) would be smaller for explicit shaders. Overhead of fetching shader type just to mask threads was too high.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;COMPACTION IMPLICIT - Overhead proportional to the number of required shaders. 
Use stream compaction to group pixels by shader and use possibly divergent branching (at beginning and end of groups) to shade. They run N passes of a masked parallel scan (one per shader) to create offsets for stream compaction. Then one shading pass using possibly divergent branching (at the N shader boundaries in the compacted stream) to shade. The shading pass fetches pixel offset from the scan results, gathers the pixel data, shades, and scatters the results back.&lt;br /&gt;&lt;br /&gt;COMPACTION EXPLICIT - Same as above, but run a pass per shader when shading. &lt;i&gt;This case looks to always be faster than COMPACTION IMPLICIT because of better register usage when shading (better latency hiding).&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;SORT IMPLICIT - Overhead proportional to the log of the number of shaders. Instead of running N scans, radix sort by shader for the first pass. Shading is the same as COMPACTION IMPLICIT except using the radix sorted offsets instead of the scan offsets.&lt;br /&gt;&lt;br /&gt;SORT EXPLICIT - Same as above, but run a pass per shader when shading. &lt;i&gt;This case provided the best results overall when overhead of running the sort was smaller than the cost of divergent branching in the UNORDERED IMPLICIT case.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;LOCAL IMPLICIT - Do COMPACTION IMPLICIT in small tiles and only use shared memory. &lt;i&gt;They were only able to support a working size of 8 warps (256 pixels). I think they meant 8 warps total per block and one block per multiprocessor, which would not be enough to hide memory latency, and would stall the multiprocessor on sync threads. Given the small tile size, shading costs alone were always higher than the UNORDERED IMPLICIT case.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;About the Results&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The paper results are in milliseconds on 2 types of 9800 cards. They don't really provide enough information (no number of samples shaded). 
My guess, working backwards from 9800 GTX+ specs and average instructions per shader numbers in the paper, is about 0.2G (G=billion) shader invocations in the Car scene (for a 512x512 image estimated from the PDF, also assuming ALU bound, which probably isn't the case) for 337ms of shading time (note sort overhead was an extra 244ms). I believe an 8800 GTX can do a 1M (million) entry parallel scan in under 1ms according to a previous paper I read. So 244ms overhead on a 9800 GTX+ likely corroborates my estimation that they are working in numbers greater than 50M at a time?&lt;br /&gt;&lt;br /&gt;&lt;i&gt;There is much more to talk about with regard to this paper, deferred shading, and compute/DX11, but too busy now...&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6591320038810651579?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6591320038810651579/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/stream-compaction-for-deferred-shading.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6591320038810651579'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6591320038810651579'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/stream-compaction-for-deferred-shading.html' title='Stream Compaction for Deferred Shading'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3528636137792195812</id><published>2009-06-16T10:15:00.002-05:00</published><updated>2009-06-16T10:18:04.573-05:00</updated><title type='text'>Braid for the Mac</title><content type='html'>Old news which I never got, &lt;a href="http://braid-game.com/news/"&gt;Braid is Available for the Mac&lt;/a&gt;. Woohoo (and yes I'm sans PC or 360 at home).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3528636137792195812?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3528636137792195812/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/braid-for-mac.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3528636137792195812'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3528636137792195812'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/braid-for-mac.html' title='Braid for the Mac'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3219159456038017640</id><published>2009-06-11T16:03:00.002-05:00</published><updated>2009-06-11T16:05:48.133-05:00</updated><title type='text'>Factoring Out the Job Scheduler</title><content 
type='html'>&lt;i&gt;Last night I got another idea on how I might be able to both simplify my job scheduler (in fact remove it) and better handle a dynamic CPU load. All while keeping a very simple design with minimal sync points. Be warned, I have NOT had time to try this yet, it should work in theory...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Threading&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;(1.) The program entry starts N threads then goes to sleep forever. &lt;br /&gt;(2.) N threads to match the N hardware threads of the machine.&lt;br /&gt;(3.) N threads are locked down via thread affinity.&lt;br /&gt;(4.) Each of these N threads runs the exact same function.&lt;br /&gt;(5.) Each thread starts with function(X, N).&lt;br /&gt;(6.) Where X is the thread index ranging {0 to N-1}.&lt;br /&gt;&lt;br /&gt;The basis of parallelism here is running N instances of a program. However since the N instances can have completely divergent branching, the program can use X to mix and match task and data parallel computation anywhere in the program. Note all threads share the same memory.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Synchronization&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;For efficient computation only very coarse grain synchronization is supported between the N threads. Coarse as in you can count the number of sync points per screen refresh on your fingers. Examples below show lines for each thread and letters as follows,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;- -&gt; thread doing work&lt;br /&gt;x -&gt; thread blocking&lt;br /&gt;E -&gt; event&lt;br /&gt;S -&gt; sync point, thread blocks until all threads finished E&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;During normal program operation sync points don't ever block because enough work is provided in between by a good program architect. 
Note in this model threads have a lot of slop and don't ever actually fully sync up in time unless something goes really wrong.&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;--------E-------S-------&lt;br /&gt;-----------E------S-----&lt;br /&gt;------E--------S--------&lt;br /&gt;-------E------------S---&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;When something goes wrong, threads do block.&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;--------E-------S-------&lt;br /&gt;---------------E--S-----&lt;br /&gt;------E-----Sxxx--------&lt;br /&gt;---E-----Sxxxxxx--------&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;The program would be composed of a sequence of E and S points per frame.&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;-----E1----E2S1-----S2--&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;In the above example, after S1, data generated prior to E1 is free to be read-only shared between all threads. Likewise after S2, data generated prior to E2 (i.e. between E1 and E2) is now safe to share as read-only.&lt;br /&gt;&lt;br /&gt;Data outside sync points is considered "unordered" (just as UAVs are in DX11) and can be shared only with DX11 or CUDA like unordered atomic operations. Data which would otherwise require fine-grain synchronization to share is simply re-computed in each thread.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Dynamic Load Balancing&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This is where things get interesting. The goal case is where threads reach E at the same time (E is at the same point in code on all threads). There is enough slop in the system so that the E point in time can be a bit different between threads. To load balance one would want to dynamically adjust work distribution such that at the next E point the threads would be closer together in timing.&lt;br /&gt;&lt;br /&gt;To enable dynamic balancing, all threads have a secondary work distribution factor Y (the primary work distribution factor is the thread index X). This Y factor provides a segment {start,end} along the range of {0 to 1}. 
This Y segment can be used by the thread to choose the ranges of data parallel work to operate on.&lt;br /&gt;&lt;br /&gt;(1.) At E, each thread stores CPU clock cycle time.&lt;br /&gt;(2.) This requires CPU clock cycle counter bases to be re-synced periodically.&lt;br /&gt;(3.) Also at E, each thread records diff in cycle time to the previous E.&lt;br /&gt;(4.) At S, threads read all the E times for all the threads.&lt;br /&gt;(5.) Note the E times are read-only.&lt;br /&gt;(6.) Each thread uses the E times to compute a new future Y segment.&lt;br /&gt;(7.) This new future Y segment is stored to be used at the next sync point.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Example Forms of Task and Data Parallel Programming&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;To have one thread handle an OpenGL interface, the program would just dynamically branch on X==0 (X being the thread index). Note, a GL context is usually tied to one thread. With DX11 there is also the concept of the primary thread; secondary threads build deferred commands. Note it is ok to lock down a task to just one thread because the dynamic load balancing would redistribute the data parallel work via the Y segment (from above).&lt;br /&gt;&lt;br /&gt;To process M objects, each thread would compute an exclusive set of the M objects using the Y segment (something like Y.start * M to Y.end * M, with fixup to ensure no overlap).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Advantages&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;With this system, there isn't a worklet system or a scheduler, there is just the program running multiple instances of itself. Synchronization is flexible, and rarely used (very parallel friendly). 
Read-only data would likely be well shared on a shared L1 or L2 or L3 because the same program is running on all threads and the dynamic load balancing works to bring together the phase (the time) of the threads.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3219159456038017640?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3219159456038017640/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/factoring-out-job-scheduler.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3219159456038017640'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3219159456038017640'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/factoring-out-job-scheduler.html' title='Factoring Out the Job Scheduler'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-6426808389251587659</id><published>2009-06-11T11:20:00.001-05:00</published><updated>2009-06-11T11:20:55.061-05:00</updated><title type='text'>iPhone 3GS</title><content type='html'>&lt;a href="http://www.anandtech.com/gadgets/showdoc.aspx?i=3579&amp;p=1"&gt;AnandTech: The iPhone 3GS Hardware Exposed &amp; Analyzed&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The new iPhone 3GS looks to be awesome, according to AnandTech,&lt;br /&gt;&lt;br /&gt;- PowerVR SGX GPU&lt;br 
/&gt;- OpenGL ES 2 (shaders)&lt;br /&gt;- likely a 3.5x to 7x triangle throughput increase&lt;br /&gt;- ARM Cortex A8&lt;br /&gt;- 600 MHz&lt;br /&gt;- dual issue instructions&lt;br /&gt;- 32KB L1 I$, 32KB L1 D$, 256KB L2&lt;br /&gt;- Neon&lt;br /&gt;- likely a 2x processor performance increase&lt;br /&gt;&lt;br /&gt;Yeah really cool! As long as VBOs and other SGX GPU hardware features work in iPhone OS 3.0, developers are bound to get really happy.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-6426808389251587659?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/6426808389251587659/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/iphone-3gs.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6426808389251587659'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/6426808389251587659'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/iphone-3gs.html' title='iPhone 3GS'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8264049379652242015</id><published>2009-06-10T15:11:00.003-05:00</published><updated>2009-06-10T15:17:16.324-05:00</updated><title type='text'>The Last Guardian Images</title><content type='html'>&lt;a 
href="http://www.shacknews.com/screenshots.x?gallery=12218&amp;game_id=7624&amp;id=136533#img136533"&gt;The Shack has images of The Last Guardian.&lt;/a&gt; Looks very much like SotC to me in muted color palette choice.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8264049379652242015?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8264049379652242015/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/last-guardian-images.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8264049379652242015'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8264049379652242015'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/last-guardian-images.html' title='The Last Guardian Images'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2738955085786235470</id><published>2009-06-09T10:40:00.002-05:00</published><updated>2009-06-09T10:47:14.660-05:00</updated><title type='text'>Some Assembly Required</title><content type='html'>This morning's googling turned up a really great blog, &lt;a href="http://assemblyrequired.crashworks.org"&gt;Some Assembly Required&lt;/a&gt;, written by Elan Ruskin at Valve. 
Great topics + disassembly + timings = awesome!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2738955085786235470?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2738955085786235470/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/some-assembly-required.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2738955085786235470'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2738955085786235470'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/some-assembly-required.html' title='Some Assembly Required'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-1937145133559713334</id><published>2009-06-05T09:03:00.004-05:00</published><updated>2009-06-05T09:45:44.981-05:00</updated><title type='text'>DX11 Binning</title><content type='html'>BTW, I oversimplified my DX11 binning example in a previous post. In reality there are some important considerations for speed. 
For example, on Larrabee, since scatter time is a function of the number of cache lines touched, it might NOT be wise to bin into queues arranged in memory like this,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;int Queue4Pixel0[16], int Queue4Pixel1[16], int Queue4Pixel2[16], ...&lt;br /&gt;&lt;br /&gt;offset = (pixel &lt;&lt; 4) + bin&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;With this layout only 1 queue is on each cache line, so the scatter would take 16 cycles at best. Read back of the data later also might not be ideal in this format. The other option is this layout, which has better cache locality at the expense of more complex offset logic, &lt;br /&gt;&lt;br /&gt;&lt;tt&gt;int QueueBin0ForPixels0to16[16], int QueueBin1ForPixels0to16[16], ...&lt;br /&gt;&lt;br /&gt;offset = (pixel &amp; 15) + (((pixel &amp; (~15)) + bin) &lt;&lt; 4);&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;So for each grouping of 16 pixels, each queue bin is on the same cache line. In the completely coherent case this would take a single clock cycle to scatter to, and performance would degrade as a function of how many different queue lengths there are in the groupings of 16 pixels.&lt;br /&gt;&lt;br /&gt;The same logic applies to global scatter in CUDA (NVidia) since performance is a function of the number of segments touched. Similar logic could work for shared memory (or shared register) scatter but there are other options in CUDA since performance is a function of bank conflicts (and not "cache lines" or "segments").&lt;br /&gt;&lt;br /&gt;Clearly this assumes that on ATI and NVidia DX11 hardware, texture UAVs are built from CUDA/OpenCL/DX11CS global memory accesses. 
This assumption could be dead wrong, and if another functional unit is involved (like UAV enabled ROPs) then I would need to re-explore this...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;CS vs PS&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Oops, another important consideration I left out is the assumption that the binning is happening from the CS stage (so block arrangement ensures vector alignment and coalescing).&lt;br /&gt;&lt;br /&gt;The PS stage poses some complex problems in that pixel grouping in the SIMD units likely isn't always coalesce/vector friendly with the address calculations above. A full SIMD width number of pixels (a full coarse raster tile) is likely to get packed in a very vector cache line friendly way on Larrabee, but you'd have to know if the output addressing was Morton order (bit interleaved or not) and adjust the offset calculation. Grouping of 2x2 pixel quads (or just pixels) into SIMD vector groups for small triangles probably breaks down the binning into something vector scatter unfriendly on Larrabee (extra different cache lines). 
Banked memory could lessen this limitation (so who knows what NVidia's GT300 performance will be).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-1937145133559713334?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/1937145133559713334/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/dx11-binning.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1937145133559713334'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/1937145133559713334'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/dx11-binning.html' title='DX11 Binning'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-8697797184170367005</id><published>2009-06-05T01:44:00.002-05:00</published><updated>2009-06-05T02:00:42.097-05:00</updated><title type='text'>Thread Scheduling Part 3</title><content type='html'>&lt;b&gt;My Personal Views on Solutions to This Mess&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;*** Personal views for the independent developer. Clearly things are different with the constraints of a huge legacy code base, etc...&lt;br /&gt;&lt;br /&gt;&lt;i&gt;First a little background: back when I started programming, it was practical to write your own operating systems and drivers (I did this in my teens). 
I would boot DOS, then switch into protected mode with the DOS interrupt table and low memory intact so if necessary I could switch into virtual x86 mode and use DOS for things I didn't feel like programming myself (like access to the FAT file system on the main hard drive). Everything else, sound, video, etc was done accessing the hardware directly. Actually things were much EASIER back then (IMO). Since then, the industry has added layer upon layer to abstract out the hardware...&lt;br /&gt;&lt;br /&gt;WHAT DO I MEAN BY EASIER? - Back in the early days you got control over everything. Computers were single processor, no virtual memory, no threading. Applications (at least the stuff I did) would work with non-blocking asynchronous IO requests and coarse polling with a small set of interrupt service routines to handle hardware interrupts. It was rather easy to poll with one bitmask read to check for completion of any IO request, then cooperatively "multi-task" in the application. Things were actually more interactive and responsive back then as well. Polling was cheap (read from memory, no kernel call), task switch was cheap because it was cooperative, no state to save and restore. Programs were tiny, systems were tiny. Check out &lt;a href="http://www.sics.se/contiki/"&gt;Contiki&lt;/a&gt; for a modern example!&lt;br /&gt;&lt;br /&gt;I hold no illusions that somehow we could go back to the way things were before, however there are a huge number of lessons from back then which should be applied now.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;What I am Doing Circa 2009&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;First I have one thread per hyperthread in the system (or per processor in a non-hyperthreaded system). These threads are locked down via setting thread affinity. These threads are my "compute threads" which don't ever do blocking IO. 
Whenever possible I use non-blocking IO interfaces that are as low-level as possible (this covers networking and raw file IO on most systems). &lt;br /&gt;&lt;br /&gt;For all the other cases, meaning when the operating system or API doesn't provide a 100% non-blocking interface, I resort to blocking IO threads, to which I queue requests via non-locking queues. Note because scheduling granularity is 1ms or so at best, queues get groups of requests at one time (to lower atomic operation overhead and the overhead of the kernel call to wake an IO thread which was blocking on an empty queue). Blocking IO threads run at a higher priority so they are guaranteed to preempt a "compute thread" whenever they get marked as runnable. Blocking IO threads are designed to have very short runtime and thus a minor effect on the "compute thread" which they preempted.&lt;br /&gt;&lt;br /&gt;My job scheduler is quite different from what others do (it is modeled directly on how GPUs process draw calls and what is now the OpenCL model). I break my program down into an array/list of small jobs (like draw calls). I handle different sections of the program by turning on/off jobs in this list.&lt;br /&gt;&lt;br /&gt;Each job has an index to a prior dependent job (which controls when the job can be scheduled). I keep track of job runtime (free profiling) and use this to pre-schedule all jobs into one queue per "compute thread" a frame in advance. So "compute threads" simply pull jobs in order off their queue and spin in the unlikely case that the job's dependency hasn't finished execution. &lt;br /&gt;&lt;br /&gt;It is up to the program designer (ie me) to ensure that the program pipeline doesn't ever need to stall and thus has enough non-dependent jobs to handle the variability of the preemption by the blocking IO threads. &lt;br /&gt;&lt;br /&gt;Note that with this system I don't ever need any other thread-2-thread synchronization other than atomic operations. 
All that "work" and "mess" is factored out into proper program "pipeline" design!&lt;br /&gt;&lt;br /&gt;To handle scalability to machines of varying number of processors, I have two types of jobs: (BATCH) This job is attempted to be scheduled on all "compute threads" at the same time. Each job entry point gets the "compute thread" index and the number of "compute threads". It is then up to the job to carve up the data and process its group. Batches are for data parallel work. (TASK) This job gets scheduled on only one processor. Tasks are for task parallel work.&lt;br /&gt;&lt;br /&gt;For fast development cycles, I have all my code in a single dynamic (or shared on unix platforms) library, so I can re-compile, unload the old library, and load the new library to instantly test code changes without needing to restart the program. This is my re-attachable code model. BTW, I don't use the debugger ever at home, I don't need to!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-8697797184170367005?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/8697797184170367005/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/thread-scheduling-part-3.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8697797184170367005'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/8697797184170367005'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/thread-scheduling-part-3.html' title='Thread Scheduling Part 3'/><author><name>Timothy 
Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-2445170607110597017</id><published>2009-06-04T23:31:00.007-05:00</published><updated>2009-06-05T01:43:48.137-05:00</updated><title type='text'>Thread Scheduling Part 2</title><content type='html'>&lt;i&gt;Part 2 is all about the IO side of Thread Scheduling, CPU IO in a parallel world.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Review, Blocking vs Non-Blocking&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;BLOCKING - Thread which issues the IO call is put to sleep until the operation is "finished". At some point in the IO call, a kernel call is made (to service the IO request) which results in a run-level task switch to kernel mode. In kernel mode, the kernel queues the IO request, sleeps the calling thread and does a task switch to another runnable thread. At some point later after the kernel services the IO request, the original thread (which issued the IO) is again scheduled to run (latency can be a function of IO latency plus how often the blocking thread gets rescheduled to run).&lt;br /&gt;&lt;br /&gt;LIKELY BLOCKING, POSSIBLY NON-BLOCKING - In the case of reads, sometimes the data is cached (perhaps disk cache) and thus the IO call can be serviced right away. This can also be the case for writes when the kernel call copies the write request to a queue (or to the page cache) and immediately returns (note when queue is full or page isn't in page cache, kernel call can become a blocking call).&lt;br /&gt;&lt;br /&gt;MOST LIKELY NON-BLOCKING, POSSIBLY BLOCKING - This is the troubling case where an interface is almost completely non-blocking, but has some caveats. 
The caveat is that the interface doesn't return an error code indicating that the call would block. Often this is the case when buffer or queue space is full. The v-sync section in the previous post is a good example of this!&lt;br /&gt;&lt;br /&gt;NON-BLOCKING - A correct, fully non-blocking API: if a call would block, the call returns with an error code. Non-blocking APIs might provide a polling function to return the status of pending operations. Non-blocking APIs might also provide a callback interface. In this case a user supplied function is called on pending IO. A callback interface can be lower latency at the expense of losing control over job synchronization.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Disk Access Re-ordering To Reduce Seek Time&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;All operating systems which have good file IO performance on rotating disc based storage do this (else performance would suck worse than it does now). The operating system queues up the disk accesses required to service the file IO (virtual memory IO, etc) from all processes and threads, and orders the accesses to reduce seek time.&lt;br /&gt;&lt;br /&gt;This should be kept in mind when thinking about application file IO. It is often smart to queue up file IO requests from many files at once, instead of just following the read-&gt;process-&gt;read-&gt;process model. Likewise, having just one async thread servicing file requests using a blocking interface might not yield the best throughput depending on access patterns.  &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Read Ahead&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The operating system provides a read-ahead mechanism which attempts to predict which pages will be needed in the future (based on past access patterns) and effectively adds future requests to current requests when reading from the device. 
In some cases (Linux) it can detect various patterns like sequential + stride, etc.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Why Your Desktop User Experience Is Often a Waiting Game&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;(1.) Same thread processing the application "GUI" event queue does file IO. Epic Fail!&lt;br /&gt;&lt;br /&gt;(2.) Application is creating a huge data structure, it writes into memory which was just allocated, but hasn't yet been mapped to physical pages by the OS. System is under heavy load, and not enough non-dirty pages are available to discard. Operating system needs to write out dirty pages to disc to provide pages for the application to write to. Application blocks until disc IO is finished on the dirty pages. Epic Fail!&lt;br /&gt;&lt;br /&gt;(3.) Application suffers from code bloat. App could even be tiny but suffer from a bloated hierarchy of dynamic link libraries. Non-accessed code likely isn't resident (or if loaded, was paged out to provide space for something else). Example, user clicks on menu and opens up a seldom used dialog, paged-out code is accessed, and App blocks on file IO. Epic Fail!&lt;br /&gt;&lt;br /&gt;(4.) Application accesses a huge number of small files, all with long path names (think bloated file structure with huge number of files per directory). Seek nightmare for two reasons, first just accessing the (possibly uncached) directory structure to find the files, and second accessing the (likely uncached) files. Epic Fail!&lt;br /&gt;&lt;br /&gt;Solving the first case is just good application design. Solving the third case is expert application design. One has to ensure all GUI or user-input processing is in a small core amount of common code (and data) which will always be resident (often accessed likely equals always resident). Possibly blocking functions (ie ones which might access code which is seldom called, ie which is paged out) need to be run in an asynchronous thread. 
This way the app is always responsive!&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Point here is anything which might result (directly or indirectly) in file IO must be on a separate thread from whatever thread or thread(s) interact with the user.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Flavors of File IO&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;CACHED STREAM FILE IO - Think the standard C library interface: open(), close(), read(), write(), etc. Reads and writes end up as a mix of blocking and non-blocking kernel calls. Nothing new here. I avoid this stuff. &lt;br /&gt;&lt;br /&gt;MEMORY MAPPED FILE IO - The case where the operating system provides file IO directly using the virtual memory system. All the mapping kernel call has to do is to set up the page table entries as invalid and then set up the OS page mapping to map to the file. At this point it can return (however it might prefetch based on mapping hints). When the app then attempts to first access memory in the mapped range, a page fault will happen, the kernel will then queue the disk IO, and sleep the thread (assuming the access isn't beyond the end of the file, also assuming the disk page isn't cached in memory). Depending on access patterns, the OS might prefetch N extra pages on fault (read-ahead). Nothing says that the entire file is loaded. &lt;br /&gt;&lt;br /&gt;&lt;i&gt;A scheduling pitfall of memory mapped IO is possible lengthy page faults even after the first access. So if using memory mapped IO in a blocking service thread, that thread would have to manually touch (read an integer from) all the pages to ensure they are all actually loaded (skipping pages isn't an option because the read-ahead might detect that and then only service the skipped pages). This way when the async service thread marks the IO as complete, the program can assume a compute thread will not block. 
The problem with a blocking async thread touching pages to ensure they are loaded is that each read will be a full (wasted) cache line miss out to main memory (if each miss is 1000 cycles, on a 2 GHz machine with 4KB pages, 1 ms of miss time is only good for at best ~7.8 MB of touched blocks, costing you 1/16 of your 60Hz frame). So totally NOT practical!&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Modified pages are automatically marked by the CPU, so on unmap the OS can write out only the modified pages, for free. Also in theory, memory mapped IO could provide the operating system a way to avoid an extra copy on read and write from the file system. This is assuming the OS does direct file DMA to/from the pages. Avoiding the copy would be avoiding the cache pollution and stalls caused by huge CPU to CPU copies. Avoiding the extra copy sitting around in the disk cache also saves memory. &lt;br /&gt;&lt;br /&gt;Memory mapped file IO is available on Windows/Unix/Mac/Etc, but not on consoles and other embedded systems!&lt;br /&gt;&lt;br /&gt;RAW DIRECT ASYNCHRONOUS FILE IO - Given the problems with cached or memory mapped file IO, for databases and other critical high performance applications, operating systems often provide an interface for raw file IO. Interfaces for raw file IO are also commonly non-blocking and asynchronous (for obvious reasons) and likely batch request friendly. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Windows Asynchronous Disk I/O&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Quick refresher, I'm not personally interested in things like accessing a huge number of random files. IMO that is just broken by design. What I do want is background IO with simple polling for completion (no heavy weight kernel call), and no duplication of memory (ie copy of the file block in the file cache). 
So I'm not going to cover things like I/O completion ports...&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Microsoft has a &lt;a href="http://support.microsoft.com/kb/156932"&gt;great article&lt;/a&gt; on this which should be read first. Windows AIO has one key design flaw in that the operating system can choose to go synchronous for an operation without giving the application a choice. This IMO is an Epic Fail case for any asynchronous interface.&lt;br /&gt;&lt;br /&gt;Cases of going synchronous are: NTFS compression, NTFS encryption, extending a file's length, and a large number of cached file IO requests. FILE_FLAG_NO_BUFFERING (raw direct un-cached IO) best ensures an actual asynchronous call. However from their example (of no-buffering async calls), only 500 requests got queued in 0.22 seconds, which works out to an average kernel call time of about 0.44ms. Yikes, that is either certainly not asynchronous, or just spin locking, or the thread is getting heavily preempted!!!&lt;br /&gt;&lt;br /&gt;Windows does provide fail cases with ERROR_INVALID_USER_BUFFER or ERROR_NOT_ENOUGH_MEMORY whenever there are too many outstanding asynchronous I/O requests.&lt;br /&gt;&lt;br /&gt;Windows also does provide a HasOverlappedIoCompleted() macro to check for AIO completion without using a kernel call.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;MacOSX Asynchronous Disk I/O&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;POSIX asynchronous IO wasn't supported until MacOSX 10.4. POSIX AIO provides a true non-blocking interface: if a request cannot be enqueued, then the functions fail. POSIX AIO requires an aio_error() "system call" to check on the status of an async operation. I believe that POSIX AIO on OSX is a wrapper for kqueue() so polling does indeed require a system call (yuck). According to some internet sources a process is limited to 16 AIO requests at a time on OSX. Probably better to just use kqueue() directly on OSX. 
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Linux Asynchronous Disk I/O&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Included in 2.6 kernel (patch for 2.4). Works on files or block devices opened with O_DIRECT or raw devices. Doesn't work on sockets or pipes. One option is to directly interface with the io_*() system calls through the libaio userspace wrapper library. Basically,&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;#include &lt; libaio.h &gt;&lt;br /&gt;&lt;br /&gt;// setup context (libaio's handle type is io_context_t)&lt;br /&gt;io_context_t ctx;&lt;br /&gt;bzero(&amp;ctx, sizeof(io_context_t));&lt;br /&gt;io_setup(max_events, &amp;ctx);&lt;br /&gt;&lt;br /&gt;// submit events&lt;br /&gt;struct iocb tab[events];&lt;br /&gt;struct iocb *ptrs[events]; // io_submit() takes an array of pointers, so set ptrs[i] = tab+i&lt;br /&gt;io_prep_pread(tab+0, fd, buf, bytes, offset);&lt;br /&gt;io_prep_pwrite(tab+1, fd, buf, bytes, offset);&lt;br /&gt;...&lt;br /&gt;io_submit(ctx, events, ptrs);&lt;br /&gt;&lt;br /&gt;// poll on completion using syscall&lt;br /&gt;// will want to wrap this in something which disables signals&lt;br /&gt;struct io_event tab2[events];&lt;br /&gt;num_events_or_error = io_getevents(ctx, 0, events, tab2, 0);&lt;br /&gt;if(tab2[0].obj == tab+0) // then that event is finished&lt;br /&gt;// tab2[0].res2 == 0 then ok, else error&lt;br /&gt;// tab2[0].res == amount read/written&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;Links: &lt;a href="http://www.kernel.org/doc/man-pages/online/pages/man2/syscalls.2.html"&gt;Linux System Calls&lt;/a&gt;, &lt;a href="http://www.makelinux.net/kernel_map"&gt;Neat Interactive Map of Linux Kernel&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The other option would be to use &lt;a href="http://paiol.cvs.sourceforge.net/viewvc/paiol/libposix-aio/"&gt;libposix-aio&lt;/a&gt; to get the POSIX AIO interface (which uses io syscalls). 
The link above is to the source which is a good reference to the syscalls (for example they wrap io_getevents() in something which masks interrupts, stuff like that likely learned the hard way through trial and error!).&lt;br /&gt;&lt;br /&gt;&lt;i&gt;BTW, Linux too fails on the most basic level of simply directly updating the iocb with the completion status so a kernel call isn't needed to poll. Arg.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Non-Blocking Network Interface&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I personally don't use TCP, and only do UDP. Via UDP, the problem of the "connection" interface to the operating system goes away, and a single socket can gather all incoming packets from all clients and/or send packets to all clients (recvfrom()/sendto()). A quick ioctl(), or ioctlsocket() (with Windows WSA emulation of BSD sockets), can be used to set non-blocking operation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Audio&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Going to leave this for another blog post...&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-2445170607110597017?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/2445170607110597017/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/thread-scheduling-part-2.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2445170607110597017'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/2445170607110597017'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/thread-scheduling-part-2.html' title='Thread Scheduling Part 
2'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1089702811248465710.post-3338294718964524870</id><published>2009-06-04T21:56:00.002-05:00</published><updated>2009-06-04T23:06:44.269-05:00</updated><title type='text'>More DX11 Thoughts</title><content type='html'>Was going through &lt;a href="http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html"&gt;Bill Bilodeau's Presentation&lt;/a&gt; off the &lt;a href="http://repi.blogspot.com/"&gt;repiloque blog&lt;/a&gt;, and realized I had missed some important info from slide 20 saying that the &lt;b&gt;"3 times faster and 1/100th the size"&lt;/b&gt; is comparing ATI DX9 (actually DX10, with vertex texture fetch) tessellation (requiring at least one pass to export per edge tessellation factors, and a second to tessellate) to running the high poly mesh through the standard DX9 pipeline, also that "DX11 Tessellator algorithms can &lt;b&gt;usually&lt;/b&gt; be done in one pass". Guessing this "usually" is suggesting that hardware might still optionally spill intermediate results to main memory under a few rare conditions?&lt;br /&gt;&lt;br /&gt;Given that tessellation is a data amplification process and that pixel shading is (for the most part) also a data amplification process, it isn't too hard to understand that the hardware could likely keep intermediate data all on chip. 
GS is the wild card there, and I don't plan on using GS at all anyway.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;UAV Limits&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;- Pixel Shader limited to 8 RTVs+UAVs total&lt;br /&gt;- Compute Shader limited to 8 UAVs&lt;br /&gt;&lt;br /&gt;So the limit on the union of Render Target Views and Unordered Access Views is the same as the limit on Render Targets, which says to me that at least one DX11 vendor will be using the same interface for both RTVs and UAVs. Also, UAV texture access is still unordered, so I'm thinking any deep buffer magic (k-buffer, etc.) is going to require atomic operations...&lt;br /&gt;&lt;br /&gt;Speaking of a combined RTV+UAV ROP unit: does the new DX11 path support better fine-granularity (32-bit) scattering (assuming the scatter has good data locality)? Nick Thibieroz's Shader Model 5.0 and Compute Shader slides describe &lt;i&gt;"Order-Independent Transparency"&lt;/i&gt; and &lt;i&gt;"Data Binning Operations"&lt;/i&gt; as applications of UAVs. Is this really going to be fast enough to make binning pixels in a single pass useful? With Larrabee the answer is YES (I think); what about ATI and NVidia? Currently, with UAVs emulated via CUDA on a DX10 GT200, the answer is a clear NO.&lt;br /&gt;&lt;br /&gt;Binning literally means something like this (in C-like code, not HLSL):&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;bin = atomic_add(queue_head + pixel_offset, 1);&lt;br /&gt;if(bin &lt; max_bin) { queue[pixel_offset * max_bin + bin] = value; }&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;Note that the first line is a fully coalesced, vector-friendly GLOBAL atomic add (this is an unordered view). In theory this could be fast (coalesced) but latency-unfriendly because of the needed return value. The scatter on the second line requires a fast 32-bit scatter (value is a 32-bit pixel) to a "cache" or ROP write-combining buffer to be bandwidth-friendly. 
If binning of 32-bit values with good locality gets fast, then an amazing number of algorithms become possible on DX11. I've been waiting for this ever since the CUDA PTX .surf space was documented but never implemented in hardware...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Append / Consume&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;No limits posted yet that I can find (I haven't tried/installed the DX11 software reference), but ATI presentations seem to show that only Raw or Structured Buffers can be mapped as Append or Consume in a shader, and seemingly NOT as both Append and Consume at the same time? I'm also guessing Append output or Consume input goes to/from main memory, since it seems most likely that the Append to Consume boundary is at the draw call. I'm also thinking the limit on the number of Append/Consume buffers mapped in a shader will be small.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1089702811248465710-3338294718964524870?l=farrarfocus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://farrarfocus.blogspot.com/feeds/3338294718964524870/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/more-dx11-thoughts.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3338294718964524870'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1089702811248465710/posts/default/3338294718964524870'/><link rel='alternate' type='text/html' href='http://farrarfocus.blogspot.com/2009/06/more-dx11-thoughts.html' title='More DX11 Thoughts'/><author><name>Timothy Farrar</name><uri>http://www.blogger.com/profile/05680970497491371987</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
