20090930

Fermi

NVIDIA Fermi Architecture Page

Download the white paper for some architectural details.


Personal View

Yeah the architecture is awesome! For HPC, double precision fused multiply add at 1/2 the single precision rate, ECC memory support, and 40-bit address space support! For compute, configurable shared memory/L1 split, function and data pointer support via a unified address space, faster global atomics, unified L2 and more...

20090925

Last Night in Wisconsin

Yeah, last night in Wisconsin. Could talk about how I've actually physically moved out of Wisconsin and back to my Illinois Condo the weekend before September 11th when my lease was up, and how that has resulted in not having the ability to do any interesting CUDA and GPU computing programming for the past few weeks, but likely that would be too boring!


Prior Dormant Hobby

Had a 1969 911E which I started to build into a track car years ago and never finished. Sold off the car, trailer, and related stuff this Wednesday. Since I had this huge stalled project (full tear down and rebuild of the entire car), the racing bug was effectively squashed; the wife would have killed me if I bought another car without selling the old project. Actually I ended up doing just that! Made matters worse that the 911E body and trailer were in her parents' barn. Also bad that I took her WRX to the track a few times and destroyed the rotors, which I didn't fix for a year, which ultimately resulted in her taking the car in to get the brakes done locally. Kathryn, you really are the best wife ever! Yes, she reads this blog once in a while ... I'll get an email saying, "I actually understood that post" ;)

With the project car gone, and a very track friendly daily driver to replace it, the racing bug is biting back hard.



Found that video via this rx7club forum post while looking for someone who is using a G-Force Transmission dog box in a road racing track car.

Need to do the brakes, brake lines, harness, and more first, but a G-Force dog box conversion of the Tremec T-56 in my car is on the list long before adding more HP. Yes it is still going to be my daily driver, no the sound doesn't bother me, and yes one of my old daily drivers didn't have heat, air conditioning, or an interior (so safe to classify me as crazy with respect to what I drive).

Looks like Thunderhill Raceway is less than a 3 hour drive from where I will be living...

20090923

R800

R800 info is out, architecture looks similar to R700. Hard to get a clear answer on whether texture filtering has indeed moved completely to the SIMD units (EDIT: perhaps not, see new comments/edits). Also have not seen any definitive answers with regard to triangle setup performance or append/consume performance. HD 5870 specs appear to be as follows,

EDIT: hardware.fr tests show same triangle setup rate as R700. No parallel triangle setup, but perhaps double samples rasterized per triangle per clock. Very interesting indeed!

EDIT: ATI's Dave Baumann on B3D post: "Texture interpolators have been removed from the design and is done on the shader core. In general we are seeing this as a performance improvement - its also the reason why one of the Vantage feature test gets a disproportionate increase over the previous gen."

OVERALL
- 2.7 Tflops single precision.
- 544 Gflops double precision.
- 153.6 GB/sec bandwidth.
- 20 SIMD cores.

TEXTURE
- 272 Gtex/sec 32-bit pixel unfiltered samples.
- 68 Gtex/sec bilinear filtering.
- 1 TB/sec L1 texture cache.
- 16KB TEX L1 per SIMD core (320KB total).
- Texture units able to read compressed AA color buffers.
- Texture filtering done in SIMD units (according to AnandTech)???

COMPUTE
- 8KB extra compute L1 per SIMD core (160KB total).
- 32KB local data store (640KB total).
- 64KB global data store.

MC/RBE/ROP
- 435 GB/sec L2 cache bandwidth.
- 128KB L2 cache per memory controller (512KB total).
- 4 64-bit memory controllers (MCs).
- Render Back-Ends (RBEs) can process 32 pixel/clk.
- RBEs look to be divided across the 4 MCs.
- Fast color clears.

SIMD
- EDIT: vertex attribute interpolation done in SIMD units.
- EDIT: Support for full speed 24-bit integer math.
- Dedicated Sum of Absolute Differences instruction.
- Faster Dot Product instruction.
- Required support for DX11 instructions.
- Full speed denormals?
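
A rough sanity check on the peak numbers above. The clock and unit counts here are my assumptions and are not stated in this post: 850 MHz core clock, 16 lanes x 5 ALUs per SIMD core, 4 texture units per SIMD core, double precision at 1/5 the single precision rate, and 4.8 Gbps per pin GDDR5. Under those assumptions the listed figures fall out roughly as follows:

```python
# Napkin check of the HD 5870 peak numbers listed above.
# ASSUMPTIONS (not from the post): 850 MHz core clock, 16 lanes x 5 ALUs per
# SIMD core, 4 texture units per SIMD core, DP at 1/5 the SP rate, and
# 4.8 Gbps per pin GDDR5.
CORE_MHZ = 850
SIMD_CORES = 20             # from the list above
ALUS_PER_SIMD = 16 * 5      # assumed
TEX_UNITS_PER_SIMD = 4      # assumed
MEM_BUS_BITS = 4 * 64       # 4 x 64-bit memory controllers, from the list
MEM_MBPS_PER_PIN = 4800     # assumed

alus = SIMD_CORES * ALUS_PER_SIMD
sp_gflops = alus * 2 * CORE_MHZ / 1000    # multiply-add counted as 2 flops
dp_gflops = sp_gflops / 5
mem_gb_s = MEM_BUS_BITS * MEM_MBPS_PER_PIN / 8 / 1000
bilinear_gtex = SIMD_CORES * TEX_UNITS_PER_SIMD * CORE_MHZ / 1000

# Per-clock L2 width implied by the listed 435 GB/sec at the assumed clock.
l2_bytes_per_clk = 435e9 / (CORE_MHZ * 1e6)

# Cache totals use only numbers from the list above.
tex_l1_total_kb = 16 * SIMD_CORES
lds_total_kb = 32 * SIMD_CORES
l2_total_kb = 128 * 4

print(f"SP {sp_gflops:.0f} Gflops, DP {dp_gflops:.0f} Gflops")
print(f"Memory {mem_gb_s:.1f} GB/sec, bilinear {bilinear_gtex:.0f} Gtex/sec")
print(f"L2 about {l2_bytes_per_clk:.0f} bytes/clk")
print(f"TEX L1 {tex_l1_total_kb} KB, LDS {lds_total_kb} KB, L2 {l2_total_kb} KB")
```

If the real clock or unit counts differ, only the constants at the top need to change.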

20090921

Bad Industry Humor: To 12 MPix and Beyond!

Second post in the awfully-bad-industry-humor or would-have-been-better-off-sleeping category,

"At first we thought it would be a great idea, the new ATI GPU was to have something like 10x the bandwidth and 20x the compute power of the Xbox 360. So what was working at 720p on the 360 should work about the same on a six 1080p display array off the new GPU. Napkin math said peak capacity amortized to around 8 Kflops/pixel and 800 bytes/pixel bandwidth at 720p/30fps on the 360 and about the same targets on the 12 MPix (multi-monitor) output of the new GPU.

The project would be simple ... (just use nearly the same engine, add tessellation and up the texture quality) ... or so we thought.

Early on, decided to go with virtual texturing. With deferred shading, needed about 380 MB alone for the G-buffer (2xMSAA 16B/pixel). Another minimum of 1 GB would be needed for four layers of a compressed 16Kx16K virtual texture. This virtual texture would provide 21 four layer texels per screen pixel (a great ratio).

That is when the problems started.

Artists used to work with 2Kx2K source stamp textures at maximum size. Now they were required to work at 8Kx8K just to be safe. This required an upgrade of all the artists' machines to 64-bit Photoshop CS4 to support enough address space so they could have more than 16 layers active in Photoshop without swapping to disk (we use 16-bit per channel source images). Had to get rid of all the Mac machines because 64-bit wasn't supported on the Mac with Photoshop.

Old source material just would not cut it with the new 64 Mpix texture resolution requirement. We had to purchase a few 40 Mpix Phase One digital camera backs and special medium format camera gear just to gather good source photos (at a cost of about $30-40K US per each group of camera/back/lenses/etc). Also had to hire a special photographer to help train the artists on the new gear (teach details like how to choose the proper aperture for sharp photos, etc).

Furthermore our publisher would only allow us to use four dual-layer Blu-ray discs, which is a measly 200 GB of data. The art department was outraged: what is the point of having 1 to 2 TB desktop drives when the game can only use 200 GB of space? Had something to do with the publisher saying that 12x speed BR drives would require a little over a one-hour install to the HDD, and that was the upper limit the average user would stand when every 16 to 20 minutes they had to put in another BR disc to complete the install.

And don't get me started on the required upgrade to our in-house GI farm, storage solution, and network ... let's just say re-baking the lighting and virtual texture was almost measured in weeks not hours.

In the end the project got canned and we ran out of money, going for 10x the resolution and quality on 2x the budget of a console title just was not possible yet. We never should have bet on technology that was going to only sustain or reduce the bandwidth (or compute) to pixel ratio ... or worse yet, require hanging an array of monitors from the ceiling.

Later we heard of another post DX11 launch title going the opposite direction, doing something VGA retro, real-time photon mapping at 320x200 at 30Hz with motion blur and a real camera lens model. The graphics engine was called something like The Turing Engine because most viewers could not tell the difference between the real-time 320x200 rendering and a VHS tape of similar real life source video. Something like 2 Mflops/pixel and 100 KB/pixel of bandwidth at 30 fps, damn, why didn't I think of that!"


If for a second you took any of that seriously, you really need to fill up on coffee before hitting the early morning blog run!
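
For anyone who does want to check the napkin math in the story above, here is a quick sketch. The Xbox 360 peaks (about 240 Gflops and 22.4 GB/sec), the 1 byte per texel per compressed layer, and the 4.5 MB/sec 1x Blu-ray rate are my assumptions and are not stated in the post; the new GPU peaks (2.7 Tflops, 153.6 GB/sec) come from the R800 post.

```python
# Napkin math behind the joke above.
# ASSUMPTIONS (not from the post): Xbox 360 peak of ~240 Gflops and
# 22.4 GB/sec, 1 byte per texel per compressed layer, 1x Blu-ray = 4.5 MB/sec.
# GPU peaks (2.7 Tflops, 153.6 GB/sec) are taken from the R800 post.

def per_pixel(rate_per_sec, width, height, fps):
    """Peak capacity amortized over every pixel displayed per second."""
    return rate_per_sec / (width * height * fps)

# Xbox 360 at 720p/30fps.
x360_flops = per_pixel(240e9, 1280, 720, 30)
x360_bytes = per_pixel(22.4e9, 1280, 720, 30)

# Six 1080p displays: G-buffer at 2xMSAA with 16 bytes per sample.
wall_pixels = 6 * 1920 * 1080
gbuffer_mib = wall_pixels * 2 * 16 / 2**20

# Four layers of a 16Kx16K virtual texture.
vt_texels = 16384 * 16384
vt_gib = vt_texels * 4 * 1 / 2**30
texels_per_pixel = vt_texels / wall_pixels

# Four 50 GB discs read at 12x.
install_min = 4 * 50e9 / (12 * 4.5e6) / 60

# The 320x200 retro title on the same GPU at 30 fps.
retro_flops = per_pixel(2.7e12, 320, 200, 30)
retro_bytes = per_pixel(153.6e9, 320, 200, 30)

print(f"360: {x360_flops:.0f} flops/pixel, {x360_bytes:.0f} bytes/pixel")
print(f"G-buffer {gbuffer_mib:.0f} MiB, virtual texture {vt_gib:.0f} GiB")
print(f"{texels_per_pixel:.1f} texels per screen pixel")
print(f"Install {install_min:.0f} minutes")
print(f"Retro: {retro_flops:.0f} flops/pixel, {retro_bytes:.0f} bytes/pixel")
```

Under these assumptions the numbers land close to the ones quoted in the story, with the retro figures a bit under the "something like" values.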

20090918

CEDEC 2009: Anti-Downsized Buffer Artifacts

"The page you requested was too large to translate."
Anyone have a good summary?

20090917

Food-For-Thought Supermarket Aisle

This post is to keep the twitter peanut gallery happy ;) Thanks Ivan-Assen Ivanov for the inspiration, sure was a lot more fun than doing something productive like continuing my search for a place to live when I move...


Starting With Some Simple Examples

Q: If you are playing chess with a distant opponent via snail mail, and want to play the game faster than the US postal service can deliver messages, how do you do it?

A: You send not only your next move, but also your response to all of your opponent's possible next moves, under agreement neither player will cheat.

In the above case, latency is the bottleneck, compute time is abused with redundant work to solve the latency problem.
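
A toy sketch of the mail trick, under heavy assumptions: the 3-move "game" and `policy()` below are hypothetical stand-ins for chess and for each player's thinking. The point is only to count mail deliveries when every letter also carries a reply to each possible opponent move.

```python
# Toy model of play-by-mail with one level of speculation.
# ASSUMPTIONS: the 3-move "game" and policy() are hypothetical stand-ins for
# chess and for each player's thinking; both are deterministic for simplicity.
MOVES = (0, 1, 2)

def policy(history):
    """Stand-in for a player choosing a move from the game so far."""
    return (3 * sum(history) + len(history)) % len(MOVES)

def play_naive(n_plies):
    """One move per letter: every ply costs one mail delivery."""
    history = []
    for _ in range(n_plies):
        history.append(policy(history))
    return history, n_plies

def play_speculative(n_plies):
    """Each letter carries the sender's moves plus a reply to every
    possible opponent move, so the receiver can advance several plies."""
    history = [policy([])]
    reply_table = {m: policy(history + [m]) for m in MOVES}
    deliveries = 1
    while len(history) < n_plies:
        # Receiver answers the last move in the letter.
        own = policy(history)
        history.append(own)
        # Sender's precomputed response is looked up, no mail needed.
        if len(history) < n_plies:
            history.append(reply_table[own])
        # Receiver can already answer that response as well.
        if len(history) < n_plies:
            history.append(policy(history))
        # Receiver mails its moves and a fresh table (the redundant work).
        reply_table = {m: policy(history + [m]) for m in MOVES}
        deliveries += 1
    return history, deliveries

naive_game, naive_mail = play_naive(40)
spec_game, spec_mail = play_speculative(40)
print(naive_mail, spec_mail, naive_game == spec_game)
```

Both versions play the identical game; the speculative one pays for fewer deliveries with one extra `policy()` call per possible opponent move in every letter.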

Q: A pre-gunpowder age general wants to communicate real-time updates of highly complex battlefield commands to distant field officers but only has a limited set of flags, how is this done?

A: General and field officers pre-agree upon a complex battle plan option (or meaning) per flag.

Bandwidth is the bottleneck, solved by taking advantage of local memory using the ultimate form of compression.
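
The flag trick is a shared codebook. A minimal sketch, where the plans in `CODEBOOK` are made up for illustration and both sides are assumed to hold an identical copy in local memory:

```python
import math

# Hypothetical pre-agreed battle plans; general and officers memorize the
# same list ahead of time, so only an index has to cross the battlefield.
CODEBOOK = (
    "Hold the ridge until the cavalry arrives",
    "Feign retreat, then flank from the east",
    "Advance the center, archers cover the river",
    "Fall back to the fort and regroup",
)

def encode(plan):
    """General's side: pick the flag that stands for this plan."""
    return CODEBOOK.index(plan)

def decode(flag):
    """Officer's side: expand the flag back into the full plan."""
    return CODEBOOK[flag]

# Bits on the wire per message versus spelling the plan out as text.
bits_per_flag = math.ceil(math.log2(len(CODEBOOK)))
bits_as_text = [len(plan.encode("utf-8")) * 8 for plan in CODEBOOK]

flag = encode("Feign retreat, then flank from the east")
print(flag, decode(flag))
print(bits_per_flag, "bits per flag vs", min(bits_as_text), "or more as text")
```

The compression ratio is only as good as the agreement made in advance: a plan that is not in the codebook cannot be sent at all.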

Q: A manager has 12 phone meetings in a given very busy day, how is this handled efficiently?

A: A secretary pre-schedules each phone meeting.

This is a rough example of the problem of task preemptive scheduling (interrupted phone calls) when there are dependencies (communication) between nodes.


Humans Are Natural Parallel Programmers

Ironically, humans as a group are great at solving problems with distributed memory, as individuals have only a subset of the knowledge stored in the continuation of the species. Many of us, myself included, couldn't possibly remember everything we need to know on a daily basis. In fact, often we get better at remembering where we can find information (indexes or addresses) than we are at remembering the information itself. Humans organize into hierarchies to solve problems, as well as pipeline processes, think about manufacturing (thanks Mike Acton), supply chain management, etc.

The point I am getting at here, is that going parallel is natural and is easy to think about when grounding the problem in terms of something real.


Human Species Machine

How about something more crazy...

Massively parallel programming gets rather interesting when thinking about distributed memory, non-coherent redundancy, the problem of node failure, message passing, latency, and compression, all together.

One fantastic quality of humans is that we all have different memories of similar events or ideas, a kind of distributed non-coherent redundancy. The memory of an event (or idea) is physically distributed across many nodes, with great overlap in the "important" data and less overlap in details which are more important to individual nodes.

People live and die (node failure) and pass on ideas through ripples left in the fabric of this non-coherent distributed memory. Human language and culture provide a huge distributed context against which to compress data for message passing. A single phrase can draw upon this giant context for meaning.


Stability Without Coherent Memory

Simple patterns can serve to generate intrinsic randomness (divergence), or can serve to maintain and safely transmit information (convergence). Digital computation is a great example of this (given the underlying "analog" flow of electrons).

With the human species machine, constant message passing serves to shape and change the distributed memory. Ideas and information become stable when replicated across many nodes. The more parallel the machine, and the more non-coherent, distributed, and redundant the memory gets, the less a single node failure matters. In fact, given the amount of message passing, the information is constantly in flux at the scale of individual nodes, but as a global whole, a stable idea is highly coherent collectively.

BTW, if you are still reading this post, I would highly suggest reading Stephen Wolfram's New Kind of Science (at least twice)!

20090916

NVIDIA and Moving

Moving out to Silicon Valley soon to start a new job at NVIDIA!

Clearly the blog will be changing, should be obvious what I will NOT be writing about. Also more excited about what I will be working on at my new job than anything I would have worked on at home. Future "at home / free time" focus will likely be shifting towards actual technical papers or presentations or demos.


Soon Farewell to Wisconsin

Had a great time working with many of the people at Humanhead: Brian, Jimmy, Ash, Giovanni, Michael, Norm, and more.

Also have a lot of great memories of Madison and Wisconsin: Lao Laan Xang (favorite local restaurant), Ford's Gym, Devils Lake (hiking, climbing), Blue Mound, Farmer's Market on the Square, EAA Airshow, Apostle Islands, and Road America (race track).

Did not even scratch the surface of everything I could have done when in Wisconsin. For instance really did not take advantage of the good proximity to two great road courses: Road America and Blackhawk Farms. Last Road America visit warped the rotors on the WRX (which is ironically my wife's daily driver). Did not get around to either replacing the rotors (done) or getting myself a track friendly car (also done) until it was too late (too busy, and now moving). Hopefully will rectify this problem at Infineon Raceway or Laguna Seca after I move.

20090913

Selling My Truck

In case anyone in the Chicago area is interested (the truck is garaged in Westmont IL),



Kelley Blue Book Value: $5210
Selling For: $5000 or best offer

Contact: Timothy Farrar
Phone: 630-430-6656
Gmail: farrarfocus

Had this truck custom made from the factory with special options like the close-ratio New Venture Gear transmission with single shift rail rated to 300 ft lbf of torque, and the Steel wheels (actually durable compared to crappy Aluminum). Great for towing, great for weekend trips, fun to drive. Was not planning on ever selling this truck, but moving across the country in two weeks and I have too many vehicles (truck, project car, and trailer need to go).


Chevy S-10 2WD Extended Cab

Original Owner
4.3 L V6
Heavy duty 5 speed manual transmission option
Limited slip differential
Bed liner and topper
Hitch
98K highway miles
New tires
Heavy duty suspension option


Specific Options

G80 : AXLE POSITRACTION, LIMITED SLIP
GU4 : AXLE REAR, 3.08 RATIO
K60 : GENERATOR, 100 AMP
L35 : ENGINE, GAS, 6 CYL, 4.3L, CPI, V6, 90 DEG
M50 : TRANSMISSION, MAN 5 SPD, NVG
PH1 : WHEEL, 15 X 7, STEEL
QCE : TIRE ALL, P205/75R15/N BL R/PE ST TL ALS
Z85 : CHASSIS PACKAGE, HIGH PAYLOAD FIRM RIDE



Note to Self

S10 Forum Ad

20090909

LLVM - Block Implementation Specification

http://clang.llvm.org/docs/BlockImplementation.txt
http://clang.llvm.org/docs/BlockLanguageSpec.txt

Interesting read for anyone looking to get a better idea of GCD low level details. Have not yet had the time to look at GCD in enough detail myself. However, it is very exciting to see Apple take a step in the right direction and both (1) make the OS smaller and faster, and (2) start to address issues of parallel programming.

L1 Misses

Dark Shikari has a really great blog post talking about L1 misses. Quoting from that post,

"And yet in many cases–such as in x264–much more time is wasted on L1 misses than L2 misses.

The AMD processor documentation says that the L2->L1 prefetcher is not strided, and tests on Intel chips suggest the same. This means that if we are performing, for example, an access of a block of image data that is in L2 but not L1 cache, every single line of data will cause an L1 cache miss. The benchmarks seem to agree; the first chroma motion compensation during qpel in x264 takes more than twice as long as the others!"


This is really a fantastic example of some of the less talked about limitations of using CPU caches and processors optimized for low latency serial computation. Will be interesting to see if this becomes more or less of a problem as CPU style architectures have ever larger vector units and end up using L1 more as a virtual register file.
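
To make the quoted point concrete, a small sketch that counts how many cache lines a 2D block read touches. The 64-byte line size and the 1920-byte row stride are my assumptions for illustration, not numbers from the x264 post.

```python
# Count distinct cache lines touched when reading a 2D block of bytes out of
# a larger image. ASSUMPTIONS (not from the quoted post): 64-byte cache
# lines and a 1920-byte row stride.

def cache_lines_touched(base, width, height, stride, line_bytes=64):
    """Return how many distinct cache lines a width x height block covers."""
    lines = set()
    for y in range(height):
        first = (base + y * stride) // line_bytes
        last = (base + y * stride + width - 1) // line_bytes
        lines.update(range(first, last + 1))
    return len(lines)

# An 8x8 block inside a 1920-byte-wide image: every row is its own line.
in_image = cache_lines_touched(base=0, width=8, height=8, stride=1920)

# The same 64 bytes packed contiguously fit in a single line.
packed = cache_lines_touched(base=0, width=8, height=8, stride=8)

print(in_image, packed)
```

With no strided prefetch from L2 to L1, each of those separate lines is a potential L1 miss the first time the block is read, which is the effect the quote describes.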

Erlang

Never really looked at Erlang, so just watched Erlang : Software for a Concurrent World to get a general overview. Lots of interesting bits in that presentation. One thing I think might be a possible weakness of Erlang is the idea of "location transparency". Seems as if when going parallel with message passing, computation location and data location become critical factors in algorithm design (perhaps less critical when nodes are interconnected by high-radix routers). That presentation referenced Tilera Tile64 hardware, which I had not seen before. Interesting how the above Erlang presentation talks about how shared memory doesn't scale, and yet the Tilera hardware dedicates area to a giant shared L2. That last link from Ars references The Raw Project, which I also had never seen ... starting to feel like I have been living in a vacuum! Anyway, for those interested in what others are doing with parallel systems and programming, lots of interesting reading is only a google away from many of the above topics.

20090903

Early September Link Dump

Still no time for anything but links and videos...

GDC 2009 Europe and Canada
HPG 2009
Micro Rendering
Grand Central Dispatch

Krakatoa particle renderings, links to these videos were posted on Twitter.