Digging deeper into the shader core

Many of the same patterns that led the designers of current hardware to their conclusions still hold. For instance, pixels next to each other on the screen still tend to follow a very similar path through the hardware, so it still makes sense to process pixels in quads. As for changes: as hardware becomes more programmable, we are seeing a higher percentage of scalar data being used. Even though much of the work done by graphics hardware is vector based, code becomes easier to schedule when we are working with a collection of parallel, independent, scalar processors. It is also more efficient to build separate units for texture addressing and filtering, and ATI has done this for quite some time now.

NVIDIA has finally decoupled the texture units from their shader hardware, enabling math and texturing to happen at the same time with no scheduling issues. They have also decided to implement their math hardware as a collection of scalar processors that can be used together to perform vector operations. NVIDIA calls the scalar processors Stream Processors (SPs), and they handle all the math performed in the shader core of G80.
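The benefit of decoupling can be pictured with a deliberately coarse model. The sketch below assumes that coupled hardware runs math and texture work back to back, while decoupled hardware overlaps them perfectly; the function names and the cycle counts are hypothetical and are not taken from NVIDIA documentation or from measurements.

```python
def coupled_cycles(math_cycles, texture_cycles):
    """Texturing blocks math: the two kinds of work run back to back."""
    return math_cycles + texture_cycles


def decoupled_cycles(math_cycles, texture_cycles):
    """Independent units overlap fully: the longer stream sets the time."""
    return max(math_cycles, texture_cycles)


# Hypothetical shader workload, not measured on any real GPU.
math_cycles, texture_cycles = 120, 80
print(coupled_cycles(math_cycles, texture_cycles))
print(decoupled_cycles(math_cycles, texture_cycles))
```

Real hardware falls somewhere between these two extremes, since texture fetches have their own latency and the shader may depend on their results.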

It isn't surprising to see that NVIDIA's implementation of a unified shader is based on taking a pixel shader quad pipeline and breaking up each vector unit into 4 scalar units. Now, rather than a quad of 4 vector pixel pipelines, we see 16 SPs per "quad" or block of stream processors. Each block of 16 SPs shares 4 texture address units, 8 texture filter units, and an L1 cache.

G70 Pixel Shader Quad


G80 Stream Processor Block


The fact that these SPs are now independent and scalar gives NVIDIA the ability to keep more of them busy more of the time. This is very important as programmers start to write longer, more complex shaders. Even while working with vectors, programmers need to use scalar values all the time to manipulate and evaluate data.
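To see why scalar units are easier to keep busy, consider an idealized comparison. The sketch below assumes that a 4-wide vector unit spends a full cycle on every instruction regardless of how many components it uses, and that independent scalar units can pack component operations freely. It ignores dependencies and any co-issue tricks real vector hardware may use, and the instruction mix is made up for illustration.

```python
from math import ceil


def vec4_cycles(op_widths):
    """One 4-wide vector unit: each instruction occupies the unit for a cycle."""
    return len(op_widths)


def scalar_cycles(op_widths, num_sps=4):
    """num_sps scalar units: component operations are packed across all units."""
    return ceil(sum(op_widths) / num_sps)


def utilization(op_widths, cycles, lanes=4):
    """Fraction of lane-cycles that did useful work."""
    return sum(op_widths) / (cycles * lanes)


# Hypothetical shader fragment: number of components each instruction touches.
ops = [4, 1, 3, 1, 1, 2]

v = vec4_cycles(ops)
s = scalar_cycles(ops)
print(v, utilization(ops, v))
print(s, utilization(ops, s))
```

With this mix, half of the vector unit's lanes sit idle, while the scalar arrangement finishes the same work in half the cycles.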

Each Stream Processor is able to complete one MAD and one MUL per clock cycle. While this is a maximum throughput figure, we can reasonably expect to achieve it even though each operation takes several cycles to move through the pipeline. As a comparison, in spite of the 4 or 5 cycle latency (depending on precision) of a MUL in Conroe, SSE is capable of one MUL per cycle throughput as long as there are no stalls in the pipeline. Latency of operations in G80 could be even longer and the hardware could still sustain high throughput, as most of the time we are working with code that isn't riddled with dependencies.
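The difference between latency and throughput can be made concrete with a textbook pipeline model. This is a sketch, assuming a fully pipelined unit that can issue one operation per cycle; the function names are hypothetical, and the latency of 4 cycles is simply the lower Conroe figure quoted above, not a G80 number.

```python
def cycles_independent(n_ops, latency):
    """Independent ops: one issues every cycle, the last finishes after `latency`."""
    return latency + n_ops - 1 if n_ops else 0


def cycles_dependent_chain(n_ops, latency):
    """Each op needs the previous result, so nothing overlaps."""
    return n_ops * latency


def throughput(n_ops, cycles):
    """Average operations completed per cycle."""
    return n_ops / cycles


n, latency = 1000, 4
print(throughput(n, cycles_independent(n, latency)))
print(throughput(n, cycles_dependent_chain(n, latency)))
```

With many independent operations the average approaches one per cycle regardless of latency, which is the situation described for shader code that isn't riddled with dependencies.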

The fact that each SP is capable of IEEE 754 single precision and can sustain high throughput for MAD and MUL operations while running any type of shader code makes this hardware very powerful and more general purpose than ever.
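As a rough piece of arithmetic, the per-clock peak follows from the MAD and MUL rates given above. The sketch below assumes the usual convention of counting a MAD as two floating point operations and a MUL as one; the clock speed is a placeholder parameter, not a figure from this page.

```python
FLOPS_PER_MAD = 2   # one multiply plus one add (counting convention)
FLOPS_PER_MUL = 1
SPS_PER_BLOCK = 16  # from the block description in the text


def peak_flops_per_clock(num_sps):
    """Peak single precision operations per clock: one MAD and one MUL per SP."""
    return num_sps * (FLOPS_PER_MAD + FLOPS_PER_MUL)


def peak_gflops(num_sps, clock_ghz):
    """Peak GFLOPS for a given shader clock in GHz (clock_ghz is a placeholder)."""
    return peak_flops_per_clock(num_sps) * clock_ghz


print(peak_flops_per_clock(SPS_PER_BLOCK))
print(peak_gflops(SPS_PER_BLOCK, 1.0))
```

These are theoretical peaks; sustained rates depend on how often the MUL can actually be issued alongside the MAD.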

As a thread exits the SP, G80 is capable of writing the output of the shader to memory. The fact that SPs can do this at any time (except after pixel shaders) goes beyond the DX10 spec of just allowing for stream output after the Geometry Shader. On previous hardware, data would have to go through every stage of the pipeline until a value was finally written out to the frame buffer. Now, we can write data out at the end of anything but a pixel shader (as pixel shaders must send their output straight over to the ROPs for processing). This will be a great benefit to GPGPU (general purpose computing on graphics processing units).


111 Comments


  • DerekWilson - Thursday, November 09, 2006

    i'm sure there was a lot buried in there ... sorry if it wasn't easy to find.

    8800 gtx and gts are both no louder than 7900 gtx. 1950 xtx still takes the cake for loudest graphics card around by a long shot -- especially after it heats up in a game.
  • crystal clear - Thursday, November 09, 2006

    My comments in Daily Tech on this subject-

    More "G80" Derivatives in February
    RE: More info would be nice
    By crystal clear on 11/8/2006 8:03:43 AM, Rating: 2

    If you link Vista, the Santa Rosa platform and the Core 2 Duo (Merom) CPU lineup (T7300, 7500, 7700 models), then you need a matching graphics card to complete the link.

    So a G80 for laptops/notebooks?

    The pairing of Intel's Santa Rosa platform with Vista in 2Q 07 is the next big thing for the first-tier notebook manufacturers, and all they need is a matching G80 for this setup.

    Unquote-
    Nvidia currently caters to desktop requirements with the new G80 releases; I wonder how the notebook/server versions will be, with Vista of course.
  • yyrkoon - Thursday, November 09, 2006

    Virtual memory is probably a good thing for most cases, but in the graphics arena, this *could* potentially make for sloppy/bad coding practices. Knowing a lot of game devs (some of whom actually work for well-known companies), I've heard them from time to time complain about maxing out a 16x PCI-E pipe. What I'm trying to say here is that while it would be a good thing never to run out of texture memory, system memory, and definitely the swap disk, cannot hold a candle to the memory bandwidth that most video cards are capable of. The end result is that you definitely *will* get a performance hit. On top of that, we already know the memory bandwidth capabilities of modern PCs; suffice it to say, the most we'll see from current systems is what? 12-13 GB/s? Even a 7800GS can do roughly 35 GB/s on card. A 7600GT? 22 GB/s?

    Still, I think DirectX 10 is a very good thing, and as I didn't read the whole article, perhaps I missed a little? The reason being, I've been reading about DirectX 10 since April, and a friend of mine was privy to some of this information after an interview with ATI.

    http://www.gamedev.net/reference/programming/featu...
  • saratoga - Thursday, November 09, 2006

    I don't know how they threading really works, but its quite possible VM support is required in order to allow multiple threads to run without stepping all over each other.
  • saratoga - Thursday, November 09, 2006

    Sorry, should read "I don't know how THEIR threading works"
  • falc0ne - Thursday, November 09, 2006

    I don't know what is the problem but I'm really unable to see the images within the latest articles from Anand...Can anyone give me a suggestion? What might be the cause of that?
    The thing is I'm really, really interested in these articles and I need to see those images. Thanks
  • yyrkoon - Thursday, November 09, 2006

    Oh, er, then in the options tab of Firefox (tools->options->content), check the "load images" check box ;)
  • falc0ne - Thursday, November 09, 2006

    well...it would've been simple but I'm afraid it's not that...It might be the AdBlock extension for Firefox, other than that I have nooo ideeea...Well I will use the IE tab option instead and load the pages using IE 7. Thanks anyway:)
  • yyrkoon - Thursday, November 09, 2006

    Checked the exceptions list? I know that Firefox makes it really simple to block images from a site (to the point of being too easy).
  • JarredWalton - Thursday, November 09, 2006

    If you've got AdBlock on Firefox, press Ctrl+Shift+A and you can see what it's blocking. If it blocks the images.anandtech.com stuff, you can then see which RegEx isn't working right and edit that.
