NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10

by Anand Lal Shimpi & Derek Wilson on November 8, 2006 6:01 PM EST

Branching

To talk generally about SPs and their capabilities, everything to be processed (vertices, primitives, pixel components, etc.) is referred to as a thread. This way we can look at each SP as handling its own thread no matter what type of data is being processed. G80 is able to sustain "thousands" of threads at a time, but the actual number of threads that can be active at any given time has not been disclosed. While all SPs can handle any type of thread, SPs that share resources must be running the same type of thread at any given time. In this way, each block of 16 SPs can be running one type of shader program on 16 threads. This tells us something about branch granularity as well. For vertex shaders, branch granularity is 16 vertices. For pixel shaders, branch granularity is 32 pixels (arranged as a pair of 4x4 pixel blocks).

Branch granularity defines how many threads must follow the same path through a shader program. When a group of 32 pixel threads all take the same branch, we don't have a problem. If even one thread must take a different path from the others, both paths following the branch must be evaluated for all 32 threads. The branch condition then determines which result each individual thread keeps and which it discards. It's easy to see that the optimum granularity is 1 thread, as no unnecessary work would be done. The way resources are allocated and the way instructions are run on grouped SPs currently doesn't allow for more fine-grained branching. Here's a chart that addresses branch granularity:

GPU	Branch Granularity
NVIDIA NV4x	~1K pixels
NVIDIA G70	~256 pixels
ATI R580	48 pixels
NVIDIA G80	16 vertices, 32 pixels

Clearly G80 has the advantage here, as it's less likely that smaller groups of pixels will take different directions through a branch. This gives programmers the ability to more easily integrate branching into their code without getting a massive performance hit. If programmers are able to incorporate more branches, shader code can become more general purpose and we will see many more effects make their way into games. Now that G80 has caught up to ATI in terms of potential branch performance, we hope developers will take the reality of more complex code seriously.
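
To make the cost of coarse branch granularity concrete, here is a minimal sketch in Python. It is a toy model, not a description of how any of these GPUs schedule work: the function name `divergence_cost`, the per-path costs (10 and 2 work units), and the workload pattern are all assumptions chosen for illustration. The only thing taken from the article is the rule that a group whose threads disagree must evaluate both paths, plus the group sizes from the chart (with "~1K" and "~256" treated as 1024 and 256).

```python
def divergence_cost(takes_branch, granularity, cost_taken=10, cost_not_taken=2):
    """Return total work units for a list of per-thread branch outcomes.

    Threads are processed in consecutive groups of `granularity`.
    A group whose threads all agree pays for one path only; a group
    with any disagreement pays for both paths on every thread.
    The cost values are hypothetical.
    """
    total = 0
    for start in range(0, len(takes_branch), granularity):
        group = takes_branch[start:start + granularity]
        per_thread = 0
        if any(group):        # at least one thread needs the taken path
            per_thread += cost_taken
        if not all(group):    # at least one thread needs the other path
            per_thread += cost_not_taken
        total += per_thread * len(group)
    return total


if __name__ == "__main__":
    # Hypothetical workload: 1024 pixels whose branch outcome flips every 32 pixels.
    outcomes = [(i // 32) % 2 == 0 for i in range(1024)]
    for name, granularity in [("NV4x", 1024), ("G70", 256), ("R580", 48),
                              ("G80", 32), ("ideal", 1)]:
        print(name, granularity, divergence_cost(outcomes, granularity))
```

In this model, any group size that lines up with the 32-pixel runs costs the same as the ideal single-thread granularity, while larger groups pay for both paths on every pixel. Real shader workloads will not line up this neatly, so treat the numbers as a way to see the trend rather than as a performance prediction.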

Early-Z, Memory Interface

NVIDIA has added Early-Z hardware to G80, placed after its existing Z-Cull hardware, which removes regions of pixels that are completely occluded by other geometry. Early-Z is a more fine-grained occlusion culling method that looks at the calculated Z value of a fragment before it hits the pixel pipeline. Z-Cull doesn't look at per-fragment Z values, but instead uses a Z value based on geometry. While Z-Cull can get rid of large blocks of data, it has trouble with surfaces that are only partially occluded and with intersecting surfaces. Looking at individual depth values per pixel can keep unnecessary fragments from heading down the pipeline only to be thrown out when the ROPs get to them.
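
The difference between the two culling stages can be sketched in a few lines of Python. This is only an illustration of the coarse versus per-fragment idea described above: NVIDIA has not published how either stage is implemented, so the function names, the 2x2 block size, and the convention that a smaller Z value is closer to the viewer are all assumptions.

```python
def z_cull_rejects(block_depths, primitive_min_z):
    """Coarse test (assumed): reject the whole block only if the nearest
    point of the incoming geometry is behind every depth stored in it."""
    farthest_stored = max(max(row) for row in block_depths)
    return primitive_min_z >= farthest_stored


def early_z_survivors(block_depths, fragment_depths):
    """Fine test (assumed): keep only fragments closer than the depth
    already stored at the same position. Returns (x, y) coordinates."""
    return [(x, y)
            for y, row in enumerate(fragment_depths)
            for x, z in enumerate(row)
            if z < block_depths[y][x]]


if __name__ == "__main__":
    # Top row already covered by near geometry (0.2), bottom row by far geometry (0.9).
    stored = [[0.2, 0.2],
              [0.9, 0.9]]
    # Incoming surface sits at depth 0.5 everywhere: partially occluded.
    incoming = [[0.5, 0.5],
                [0.5, 0.5]]
    print(z_cull_rejects(stored, 0.5))          # coarse test cannot drop the block
    print(early_z_survivors(stored, incoming))  # fine test drops the hidden fragments
```

In this partially occluded case the coarse test has to pass the whole block along, and only the per-fragment test stops the two hidden fragments from being shaded.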

The memory interface has been dramatically redesigned to support the access patterns of all of G80's independent stream processors. Given the theme of increasing granularity within G80, it's no surprise that we are now seeing 5 and 6 channels of GDDR rather than the 2 or 4 channels we have been used to for the past few years. The 8800 GTX will have a 384-bit bus (6 x 64-bit channels), while the 8800 GTS will have a 320-bit connection to DRAM (5 x 64-bit channels). We would love to delve further into the details of G80's new memory interface, but NVIDIA isn't discussing this aspect of its hardware.
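
The bus widths quoted above follow directly from the channel counts. A small Python sketch of the arithmetic (the function and constant names are ours, not NVIDIA's):

```python
CHANNEL_WIDTH_BITS = 64  # per-channel width quoted in the article


def bus_width_bits(channels, channel_width_bits=CHANNEL_WIDTH_BITS):
    """Total memory bus width for a given number of independent channels."""
    return channels * channel_width_bits


if __name__ == "__main__":
    for card, channels in [("8800 GTX", 6), ("8800 GTS", 5)]:
        width = bus_width_bits(channels)
        # Bytes moved per transfer across the full bus.
        print(card, width, "bits,", width // 8, "bytes per transfer")
```

Turning bus width into bandwidth would also require the memory data rate, which this section does not give, so the sketch stops at width.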

111 Comments

aweigh - Friday, November 10, 2006 - link
You can just use the program DX Tweaker to enable Triple Buffering in any D3D game and use your VSYNC with negligible performance impact. So you can play with your VSYNC, a high-res and AA as well. :)
aweigh - Friday, November 10, 2006 - link
I'm gonna buy an 88 specifically to use 4x4 SuperSampling in games. Why bother with MSAA with a card like that?
DerekWilson - Friday, November 10, 2006 - link
Supersampling can make textures blurry -- especially very detailed textures.

And the impact will be much greater with the use of longer more detailed pixel shaders (as the shaders must be evaluated at every sub-pixel in supersample).

I think transparency / adaptive AA are enough.

On your previous comment, I don't think we're to the point where we can hit triple buffering, vsync, high levels of AA AND high resolution (2560x1600) without some input lag (triple buffering plus vsync with framerates less than your refresh rate can cause problems).

If you're talking about enabling all these options on a lower resolution LCD panel, then I can definitely see that as a good use of the hardware. And it might be interesting to look at more numbers with these types of options enabled.

Thanks for the suggestion.
aweigh - Saturday, November 11, 2006 - link
I never knew that about SuperSampling. Is it something similar to Quincunx blurring? And would using a negative LOD via RivaTuner/nHancer counteract the effect?

How about NVIDIA's Digital Sharpness setting in Color Correction? I've found a smidge of sharpening can do wonders to improve overall clarity.

By the way, when you said Adaptive AA, were you referring to ATI cards?
Unam - Friday, November 10, 2006 - link
Derek,

Saw your comment regarding the rationale for the test resolution, while I understand your reasoning now, it still begs the question how many of your readers have 30" LCD flat panels?
DerekWilson - Friday, November 10, 2006 - link
There might not be many out there right now, but it's still the right test platform for G80. We did test down to 1600x1200, so people do have information if they need it.

But it speaks to who should own an 8800 GTX right now. It doesn't make sense to spend that much money on a part if you aren't going to get anything out of it with your 1280x1024 panel.

Owners of a 2560x1600 panel will want an 8800 GTX. Owners of an 8800 GTX will want a 2560x1600 panel. Smooth framerates with the ability to enable 4xAA in every game that allowed it is reason enough. People without a 2560x1600 panel should probably wait to buy the card until prices come down on the 8800 GTX or until games are able to push it harder.
Unam - Tuesday, November 14, 2006 - link
Derek,

A follow up to testing resolutions, the FPS numbers we see in your articles, are they maximum, minimum or average?
Unam - Friday, November 10, 2006 - link
Who the heck runs 2560x1600? At 4XAA? Come on guys, real world benchmarks please!
DerekWilson - Friday, November 10, 2006 - link
we did:

1600x1200, 1920x1440, and even 1280x1024 in Oblivion
dragonsqrrl - Thursday, August 25, 2011 - link
....lol, owned.
