Digging deeper into the shader core

Many of the same patterns that lead designers of current hardware to their conclusions are still true today. For instance, pixels next to each other on the screen still tend to follow a very similar path through the hardware. This means that it still makes sense to process pixels in quads. As for changes, as hardware becomes more programmable, we are seeing a higher percentage of scalar data being used. In spite of the fact that much of the work done by graphics hardware is vector based, it becomes easier to schedule code if we are working with a bunch of parallel, independent, scalar processors. It is also more efficient to build separate units for texture addressing and filtering, and ATI has done this for quite some time now.

NVIDIA has finally decoupled the texture units from their shader hardware, enabling math and texturing to happen at the same time with no scheduling issues. They have also decided to implement their math hardware as a collection of scalar processors that can be used together to perform vector operations. NVIDIA calls the scalar processors Stream Processors (SPs), and they handle all the math performed in the shader core of G80.

It isn't surprising to see that NVIDIA's implementation of a unified shader is based on taking a pixel shader quad pipeline, and breaking up the vector units into 4 scalar units. Now, rather than 4 pixel quads, we see 16 SPs per "quad" or block of stream processors. Each block of 16 SPs shares 4 texture address units, 8 texture filter units, and an L1 cache.

G70 Pixel Shader Quad


G80 Stream Processor Block


The fact that these SPs are now independent and scalar gives NVIDIA the ability to keep more of them busy more of the time. This is very important as programmers start to write longer more complex shaders. Even while working with vectors, programmers need to use scalar values all the time to manipulate and evaluate data.

Each Stream Processor is able to complete one MAD and one MUL per clock cycle. While this is based on maximum throughput, we can reasonably expect to achieve this even though the hardware is pipelined. In spite of the 4 or 5 cycles (depending on precision) latency of a MUL in Conroe, SSE is now capable of one MUL per cycle throughput (as long as there are no stalls in the pipeline). Latency of operations in G80 could be even longer and sustain high throughput, as most of the time we are working with code that isn't riddled with dependencies.

The fact that each SP is capable of IEEE 754 single precision and can sustain high throughput for MAD and MUL operations while running any type of shader code makes this hardware very powerful and more general purpose than ever.

As a thread exits the SP, G80 is capable of writing the output of the shader to memory. The fact that SPs can do this at any time (except after pixel shaders) goes beyond the DX10 spec of just allowing for stream output after the Geometry Shader. On previous hardware, data would have to go through every stage of the pipeline until a value was finally written out to the frame buffer. Now, we can write data out at the end of anything but a pixel shader (as pixel shaders must send their output straight over to the ROPs for processing). This will be a great benefit to GPGPU (general purpose computing on graphics processing units).

G80: A Mile High Overview Branching, Early Z and Memory Interface
POST A COMMENT

111 Comments

View All Comments

  • aweigh - Friday, November 10, 2006 - link

    You can just use the program DX Tweaker to enable Triple Buffering in any D3D game and use your VSYNC with negligable performance impact. So you can play with your VSYNC, a high-res and AA as well. :) Reply
  • aweigh - Friday, November 10, 2006 - link

    I'm gonna buy an 88 specifically to use 4x4 SuperSampling in games. Why bother with MSAA with a card like that? Reply
  • DerekWilson - Friday, November 10, 2006 - link

    Supersampling can make textures blurry -- especially very detailed textures.

    And the impact will be much greater with the use of longer more detailed pixel shaders (as the shaders must be evaluated at every sub-pixel in supersample).

    I think transparency / adaptive AA are enough.

    On your previous comment, I don't think we're to the point where we can hit triple buffering, vsync, high levels of AA AND high resolution (2560x1600) without some input lag (triple buffering plus vsync with framerates less than your refresh rate can cause problems).

    If you're talking about enabling all these options on a lower resolution lcd panel, then I can definitely see that as a good use of the hardware. And it might be interesting to look at more numbers with these type of options enabled.

    Thanks for the suggestion.
    Reply
  • aweigh - Saturday, November 11, 2006 - link

    I never knew that about SuperSampling. Is it something similar to Quincux blurring? And would using a negative LOD via RivaTuner/nHancer counteract the effect?

    How about NVIDIA's Digital Sharpness setting in Color Correction? I've found a smidge of sharpening can do wonders to improve overall clarity.

    By the way, when you said Adaptive AA, were you referring to ATI cards?
    Reply
  • Unam - Friday, November 10, 2006 - link

    Derek,

    Saw your comment regarding the rationale for the test resolution, while I understand your reasoning now, it still begs the question how many of your readers have 30" LCD flat panels?
    Reply
  • DerekWilson - Friday, November 10, 2006 - link

    There might not be many out there right now, but it's still the right test platform for G80. We did test down to 1600x1200, so people do have information if they need it.

    But it speaks to who should own an 8800 GTX right now. It doesn't make sense to spend that much money on a part if you aren't going to get anything out of it with your 1280x1024 panel.

    Owners of a 2560x1600 panel will want an 8800 GTX. Owners of an 8800 GTX will want a 2560x1600 panel. Smooth framerates with the ability to enable 4xAA in every game that allowed it is reason enough. People without a 2560x1600 panel should probably wait until prices come down on the 8800 GTX or until games that are able to push the 8800 GTX harder to buy the card.
    Reply
  • Unam - Tuesday, November 14, 2006 - link

    Derek,

    A follow up to testing resolutions, the FPS numbers we see in your articles, are they maximum, minimum or average?
    Reply
  • Unam - Friday, November 10, 2006 - link

    Who the heck runs 2560x1600? At 4XAA? Come on guys, real world benchmarks please! Reply
  • DerekWilson - Friday, November 10, 2006 - link

    we did:

    1600x1200, 1920x1440, and even 1280x1024 in Oblivion
    Reply
  • dragonsqrrl - Thursday, August 25, 2011 - link

    ....lol, owned. Reply

Log in

Don't have an account? Sign up now