Digging deeper into the shader core

Many of the same patterns that lead designers of current hardware to their conclusions are still true today. For instance, pixels next to each other on the screen still tend to follow a very similar path through the hardware. This means that it still makes sense to process pixels in quads. As for changes, as hardware becomes more programmable, we are seeing a higher percentage of scalar data being used. In spite of the fact that much of the work done by graphics hardware is vector based, it becomes easier to schedule code if we are working with a bunch of parallel, independent, scalar processors. It is also more efficient to build separate units for texture addressing and filtering, and ATI has done this for quite some time now.

NVIDIA has finally decoupled the texture units from their shader hardware, enabling math and texturing to happen at the same time with no scheduling issues. They have also decided to implement their math hardware as a collection of scalar processors that can be used together to perform vector operations. NVIDIA calls the scalar processors Stream Processors (SPs), and they handle all the math performed in the shader core of G80.

It isn't surprising to see that NVIDIA's implementation of a unified shader is based on taking a pixel shader quad pipeline, and breaking up the vector units into 4 scalar units. Now, rather than 4 pixel quads, we see 16 SPs per "quad" or block of stream processors. Each block of 16 SPs shares 4 texture address units, 8 texture filter units, and an L1 cache.

G70 Pixel Shader Quad

G80 Stream Processor Block

The fact that these SPs are now independent and scalar gives NVIDIA the ability to keep more of them busy more of the time. This is very important as programmers start to write longer more complex shaders. Even while working with vectors, programmers need to use scalar values all the time to manipulate and evaluate data.

Each Stream Processor is able to complete one MAD and one MUL per clock cycle. While this is based on maximum throughput, we can reasonably expect to achieve this even though the hardware is pipelined. In spite of the 4 or 5 cycles (depending on precision) latency of a MUL in Conroe, SSE is now capable of one MUL per cycle throughput (as long as there are no stalls in the pipeline). Latency of operations in G80 could be even longer and sustain high throughput, as most of the time we are working with code that isn't riddled with dependencies.

The fact that each SP is capable of IEEE 754 single precision and can sustain high throughput for MAD and MUL operations while running any type of shader code makes this hardware very powerful and more general purpose than ever.

As a thread exits the SP, G80 is capable of writing the output of the shader to memory. The fact that SPs can do this at any time (except after pixel shaders) goes beyond the DX10 spec of just allowing for stream output after the Geometry Shader. On previous hardware, data would have to go through every stage of the pipeline until a value was finally written out to the frame buffer. Now, we can write data out at the end of anything but a pixel shader (as pixel shaders must send their output straight over to the ROPs for processing). This will be a great benefit to GPGPU (general purpose computing on graphics processing units).

G80: A Mile High Overview Branching, Early Z and Memory Interface


View All Comments

  • haris - Thursday, November 09, 2006 - link

    You must have missed the article they published the very next day http://www.theinquirer.net/default.aspx?article=35...">here. saying they goofed. Reply
  • Araemo - Thursday, November 09, 2006 - link

    Yes I did - thanks.

    I wish they would have updated the original post to note the mistake, as it is still easily accessible via google. ;) (And the 'we goofed' post is only shown when you drill down for more results)
  • Araemo - Thursday, November 09, 2006 - link

    In all the AA comparison photos of the power lines, with the dome in the background - why does the dome look washed out in the G80 images? Is that a driver glitch? I'm only on page 12, so if you explain it after that.. well, I'll get it eventually.. ;) But is that just a driver glitch, or is it an IQ problem with the G80 implementation of AA? Reply
  • bobsmith1492 - Thursday, November 09, 2006 - link

    Gamma-correcting AA sucks. Reply
  • Araemo - Thursday, November 09, 2006 - link

    That glitch still exists whether or not gamma-correcting AA is enabled or disabled, so that isn't it. Reply
  • iwodo - Thursday, November 09, 2006 - link

    I want to know if these power hungry monster have any power saving features?
    I mean what happen if i am using Windows only most of the time? Afterall CPU have much better power management when they are idle or doing little work. Will i have to pay extra electricity bill simply becoz i am a cascual gamer with a power - hungry/ ful GPU ?

    Another question pop up my mind was with CUDA would it now be possible for thrid party to program a H.264 Decoder running on GPU? Sounds good to me:D
  • DerekWilson - Thursday, November 09, 2006 - link

    oh man ... I can't believe I didn't think about that ... video decoder would be very cool. Reply
  • Pirks - Friday, November 10, 2006 - link

    decoder is not interesting, but the mpeg4 asp/avc ENCODER on the G80 GPU... man I can't imagine AVC or ASP encoding IN REAL TIME... wow, just wooowww
    I'm holding my breath here
  • Igi - Thursday, November 09, 2006 - link

    Great article. The only thing I would like to see in a follow up article is performance comparison in CAD/CAM applications (Solidworks, ProEngineer,...).

    BTW, how noisy are new cards in comparison to 7900GTX and others (in idle and under load)?
  • JarredWalton - Thursday, November 09, 2006 - link

    I thought it was stated somewhere that they are as loud (or quiet if you prefer) as the 7900 GTX. So really not bad at all, considering the performance offered. Reply

Log in

Don't have an account? Sign up now