The Pixel Shader Engine

On par with what we have seen from NVIDIA, ATI's top-of-the-line card offers a GPU with a 16x1 pixel pipeline architecture, meaning it can render up to 16 single-textured pixels in parallel per clock. As previously alluded to, R420 divides its pixel pipes into groups of four called quads. The new line of ATI GPUs will offer anywhere from one to four quad pipelines. The R3xx architecture offers an 8x1 pixel pipeline layout (grouped into two quad pipelines), delivering half of R420's pixel processing power per clock. For both R420 and R3xx, certain resources are shared among the individual pixel pipelines in each quad. Sharing local memory among quad members makes a lot of sense, as pixels near each other on the screen should exhibit high locality of reference, especially for texture data. At this level of abstraction, things are essentially the same as in NV40's architecture.
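To put rough numbers on the pipeline counts above, here is a back-of-the-envelope fill-rate sketch. The clock speeds are our own illustrative assumptions (approximately the X800 XT and Radeon 9800 XT shipping clocks), not part of ATI's architectural claims:

```python
# Peak single-textured fill rate for the quad-pipeline layouts described
# above. Clock speeds are illustrative assumptions, not official specs.

def peak_fill_rate(pixel_pipes: int, clock_hz: float) -> float:
    """Pixels per second, assuming one single-textured pixel per pipe per clock."""
    return pixel_pipes * clock_hz

r420_rate = peak_fill_rate(16, 520e6)  # four quads of four pipes each
r3xx_rate = peak_fill_rate(8, 412e6)   # two quads

# R420 has twice the pipes plus a clock advantage over R3xx:
print(r420_rate / r3xx_rate)
```

This is peak theoretical throughput only; real fill rate depends on memory bandwidth and shader length, which this sketch deliberately ignores.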

Of course, it isn't enough to just look at how many pixel pipelines are available: we must also discover how much work each pipeline can get done. As we saw in our final analysis of what went wrong with NV3x, the internals of a shader unit can have a very large impact on the GPU's ability to schedule and execute shader code quickly and efficiently.

At our first introduction, the inside of R420's pixel pipeline was presented as a collection of two vector units, two scalar units, and one texture unit that can all work in parallel. We've seen the two-math-and-one-texture layout of NV40's pixel pipeline, but does this mean that R420 will be able to completely blow NV40 out of the water? In short, no: it's all about what kind of work these different units can do.

Lifting up the hood, we see that ATI has taken a different approach than NVIDIA to presenting its architecture. ATI's presentation of two vector units (which are 3-wide at 72 bits), two scalar units (24 bits each), and a texture unit may be more reflective of its implementation than what NVIDIA has shown (though we really can't know this without many more low-level details). NVIDIA's hardware isn't quite as straightforward as it may end up looking to software. In fact, we could view the shader units in NV40's pixel pipeline the same way as ATI's hardware (except that the texture unit shares some functionality with one of the math units). We could also view the NV40 architecture as four 2-wide vector units or two 4-wide vector units (though this is still an oversimplification, as there are special cases NVIDIA's compiler can exploit that allow more work to be done in parallel). If ATI had decided to present its architecture the same way as NVIDIA, we would have seen two shader math units and one completely independent texture unit.
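The widths ATI quotes decompose cleanly, which is worth checking in a couple of lines (the breakdown into fp24 channels is our reading of the numbers above, not an ATI statement):

```python
# A "3-wide, 72-bit" vector unit is three 24-bit (fp24) channels side by
# side, and the per-pipe unit count matches the "up to 5 ops per clock"
# figure from ATI's block diagram.

FP24_BITS = 24        # ATI's internal floating-point precision
vector_width = 3      # xyz / RGB channels handled together
vector_unit_bits = vector_width * FP24_BITS

units_per_pipe = {"vector": 2, "scalar": 2, "texture": 1}

print(vector_unit_bits)                 # 72
print(sum(units_per_pipe.values()))     # 5
```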

In order to gain better understanding, here is a diagram of the parallelism and functionality of the shader units within the pixel pipelines of R420 and NV40:


ATI essentially has three large blocks that can push up to 5 operations per clock cycle.


NV40 can be seen as two blocks of a more amorphous configuration (though there are special cases that allow some of these parts to work at the same time within each block).

Interestingly enough, there haven't been any changes to the block diagram of a pixel pipeline at this level of detail from R3xx to R420.

The big difference between the pixel pipe architectures, and what gives the R420 GPU a possible upper hand in performance over NV40, is that texture operations can be done entirely in parallel with the other math units. When NV40 needs to execute a texture operation, it loses much of its math processing power (the texturing unit cannot operate totally independently of the first shader unit in the NV40 pixel pipeline). This is also a feature of R3xx that carried over to R420.

Understanding what this all means in terms of shader performance depends on the kind of code developers end up writing. We wanted to dedicate some time to hand-mapping some shader code onto both architectures' pixel pipelines in order to explain how each GPU handles different situations. Trial and error has led us to the conclusion that video card drivers have their work cut out for them when trying to optimize code, especially for NV40. There are multiple special cases that allow NVIDIA's architecture to schedule instructions during texturing operations on the shared math/texture unit, and some of the "or" cases from our previous diagram of parallelism can be massaged into "and" cases when the right instructions are involved. This also indicates that performance gains from compiler optimizations could be in NV40's future.

Generally, when running code that mixes math and texturing (with a little more math than texturing), ATI will lead in performance. This case is probably the most indicative of real code.
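Our hand-mapping exercise can be approximated with a toy cycle-count model. The issue rules below are deliberate simplifications of our own (they ignore instruction dependencies, the scalar units, and NV40's dual-issue special cases), but they capture why the independent texture unit matters:

```python
# Toy per-pipeline cycle counts for a shader with n_tex texture fetches
# and n_math ALU vector ops. Issue rules are simplified assumptions,
# not the real hardware schedulers.

def r420_cycles(n_tex: int, n_math: int) -> int:
    # Texture unit is fully independent of the two math units, so
    # texturing and math proceed completely in parallel.
    return max(n_tex, -(-n_math // 2))  # ceiling division on the math side

def nv40_cycles(n_tex: int, n_math: int) -> int:
    # One math unit is shared with the texture unit: a clock that
    # issues a texture op can issue only one math op.
    cycles = 0
    while n_tex > 0 or n_math > 0:
        if n_tex > 0:
            n_tex -= 1
            n_math = max(0, n_math - 1)  # shared unit is busy texturing
        else:
            n_math = max(0, n_math - 2)  # both math units available
        cycles += 1
    return cycles

# Slightly more math than texturing, as in typical shader code:
print(r420_cycles(4, 8), nv40_cycles(4, 8))  # 4 vs 6 cycles
```

With pure math and no texturing, the two models tie; the gap opens up exactly when texture fetches are interleaved with math, which matches the behavior described above.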

The real enhancements to the R420 pixel pipeline are deep within the engine. ATI hasn't disclosed to us the number of internal registers its architectures have, how many pixels each GPU can keep in flight at any given time, or even cache hit/miss latencies. We do know that, in addition to the extra registers (32 constant and 32 temp registers, up from 12 temps) and longer shaders (somewhere between 512 and 1536 instructions, depending on what's being done) available to developers on R420, the number of internal registers has increased and the maximum number of pixels in flight has increased. These facts are really important in understanding performance. The fundamental layouts of the pixel pipelines in R420 and NV40 are not that different, but the underlying hardware is where the power comes from. In this case, the number of internal pipeline stages in each pixel pipeline and the ability of the hardware to hide the latency of a texture fetch are of the utmost importance.
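The developer-visible limits quoted above can be expressed as a quick sanity check of the kind a shader tool might run. The helper below is hypothetical (there is no such driver API), and we use the conservative 512-slot figure since the 512-1536 range depends on the instruction mix:

```python
# Developer-visible R420 pixel shader limits quoted above, as a
# hypothetical validation helper. The instruction ceiling can stretch
# toward 1536 depending on the mix, so 512 is the conservative bound.

R420_LIMITS = {
    "temp_registers": 32,        # up from 12 on R3xx
    "constant_registers": 32,
    "min_instruction_slots": 512,
}

def shader_fits(temps_used: int, instructions: int) -> bool:
    """True if a compiled shader fits within the conservative R420 limits."""
    return (temps_used <= R420_LIMITS["temp_registers"]
            and instructions <= R420_LIMITS["min_instruction_slots"])

print(shader_fits(temps_used=16, instructions=300))  # True
print(shader_fits(temps_used=40, instructions=300))  # False: too many temps
```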

The bottom line is that R420 has the potential to execute more PS 2.0 instructions per clock than NV40 in the pixel pipeline because of the way it handles texturing. Even though NVIDIA's scheduler can help to allow more math to be done in parallel with texturing, NV40's texture and math parallelism only approaches that of ATI. Combine that with the fact that R420 runs at a higher clock speed than NV40, and even more pixel shader work can get done in the same amount of time on R420 (which translates into the possibility of frames being rendered faster under the right conditions).
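The clock speed component of that argument is simple arithmetic. Assuming the shipping clocks of the X800 XT and 6800 Ultra (520 MHz and 400 MHz), and holding per-clock work equal purely for illustration:

```python
# If both parts retired the same number of shader ops per clock, clock
# speed alone would set the throughput gap. Clocks assumed here are the
# shipping X800 XT and 6800 Ultra frequencies.

r420_clock_mhz = 520
nv40_clock_mhz = 400
clock_ratio = r420_clock_mhz / nv40_clock_mhz

print(f"{(clock_ratio - 1) * 100:.0f}% more clocks per second")  # 30%
```

In practice the per-clock work is not equal, which is why the texturing parallelism discussed above compounds rather than merely duplicates this clock advantage.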

Of course, when working with fp32 data, NV40 is doing 25% more "work" per operation, and it's likely that the support for fp32 from the front of the shader pipeline to the back contributes greatly to the gap in transistor count (as well as in the performance numbers). When fp16 is enabled on NV40, internal register pressure is decreased, and less work is being done than in fp32 mode. This results in improved performance for NV40, but questions abound as to real-world image quality from NVIDIA's compiler and precision-optimized shaders (we are currently exploring this issue and will be following up with a full image quality analysis of the now-current generation of hardware).

As an extension of the fp32 vs. fp24 vs. fp16 debate, NV40's support of Shader Model 3.0 puts it at a slight performance disadvantage. By supporting fp32 all the way through the shader pipeline, flow control, fp16 output to the framebuffer, and all the other bells and whistles that have come along for the ride, NV40 adds complexity to the hardware and size to the die. The downside for R420 is that it now lags behind on the feature-set front. As we pointed out earlier, the only really new features of the R420 pixel shaders are: higher instruction count shader programs, 32 temporary registers, and a polygon facing register (which can help enable two-sided lighting).

To round out the enhancements to R420's pixel pipeline, ATI's F-Buffer has been tweaked. The F-Buffer is what ATI calls the memory that stores pixels that have come out of the pixel shader but still require one or more additional passes through the pixel shader pipeline to finish being processed. Since the F-Buffer can require anywhere from no memory at all to enough memory to handle every pixel coming down the pipeline, ATI has built "improved" memory management hardware into the GPU rather than relegating this task to the driver.
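A rough software analogy for the multipass flow the F-Buffer supports: pixels needing further passes are queued and re-fed to the shader rather than re-rasterized from scratch. The names and structure below are our own illustration, not ATI's implementation:

```python
from collections import deque

def run_multipass(pixels, passes_needed):
    """Push pixels through a shader repeatedly; pixels needing further
    passes go into an F-Buffer-like queue instead of being re-rasterized."""
    fbuffer = deque((p, passes_needed) for p in pixels)
    finished = []
    while fbuffer:
        pixel, remaining = fbuffer.popleft()
        remaining -= 1                            # one trip through the shader
        if remaining == 0:
            finished.append(pixel)                # fully shaded
        else:
            fbuffer.append((pixel, remaining))    # back into the F-Buffer
    return finished

print(run_multipass(["a", "b"], passes_needed=3))  # ['a', 'b']
```

The hardware memory-management improvement ATI describes would correspond here to sizing and spilling the queue automatically, since its footprint can range from nothing to every pixel in flight.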

95 Comments

  • ZobarStyl - Tuesday, May 4, 2004 - link

    Jibbo, I thought that the dynamic branching capability in PS3.0 could make rendering a scene faster because it skips rendering unnecessary pixels, and thus could offer an increase in performance, albeit a small one. In an interview, one of the developers of Far Cry said that there weren't many things PS3.0 can do that 2.0 can't, but that 3.0 can do things in a single pass that a 2.0 shader would have to do in multiple passes. The way he described it, the really pretty effects can come later, but a streamlined (read: slightly faster) shader could very well improve NV40 scores as is. This seems kind of analogous to the whole 64-bit processor ordeal going on: Intel says you don't need it, but then most articles show higher scores from A64 chips when they run a 64-bit OS, so basically if you streamline it you can run a little bit faster than in less efficient 32-bit mode.

    In the end, it'll still be bitter fanboys fighting it out and buying whatever product their respective corporation feeds them, despite features or speeds or price or whatever. Personally, like I said before, I'll wait and see who really ends up earning my dollar.

    Anyway, thanks for keeping me on my toes though, jib...I can't get lazy now... =)
  • Barkuti - Tuesday, May 4, 2004 - link

    From my point of view, the 6800U is the superior high-end hardware. Folks, you don't need to be that intelligent to understand that if ATI needs 520 MHz to "beat" nVidia's 400 MHz chip, it will need to overclock proportionally to keep the same level of performance; that means it will need a good bunch of extra MHz just to stay on par on the overclocking front.

    I think the final revision of the 6800U will manage overclocks of around 500 MHz (probably more if they deliberately set the initial clock low waiting for ATI), so ATI's hardware may need around 650 MHz, which I doubt it'll manage. As for the power requirements, sure, ATI is the winner, but nVidia's card can be fed with more standard PSUs than they claim; I just think they played it safe.
    Oh, sure, power may be a limiting factor when oc'ing the 6800U, but the reality is that people who buy this kind of hardware already have top-end computer components (including the PSU), so no worries here either.

    And finally, I think PS 3.0 will make some additional difference. With the possibility of somewhat enhanced shader performance and the superior displacement mapping effect, it may give it the edge in at least a handful of games. We'll see.

    "Just my 2 cents"
    Cheers
  • Staples - Tuesday, May 4, 2004 - link

    Everyone be sure to check out Tom's review. Looks like the X800 did better here than it did against the 6800. I have seen other reviews and the X800 doesn't really seem as fast in comparison as it does here.

    Anyway, it is a lot faster than I thought. The 6800 was impressive, but it seems the reason it does really well in some games and not so great in others is that some games have NVIDIA-specific code that the 6800 takes advantage of very well.
  • UlricT - Tuesday, May 4, 2004 - link

    wtf? the GT is outperforming the Ultra in F1 Challenge?
  • jibbo - Tuesday, May 4, 2004 - link

    Agree with you all the way on the fanboys, ZobarStyl.

    Just wanted to point out that PS3.0 is not "faster" - it's simply an API. It allows longer and more complex shaders, so if anything it's likely to be "slower." I'm guessing that designers who use PS3.0 heavily will see serious fill-rate problems on the 6800. These shaders will have potentially 65k+ instructions with dynamic branching, a minimum of 4 render targets, a 32-bit FP minimum color format, etc. - I seriously doubt any hardcore 3.0 shader programs will run faster than existing 2.0 shaders.

    Clearly a developer can have much nicer quality and exotic effects if he/she exploits these, but how many gamers will have a PS3.0 card that will run these extremely complex shaders at high resolutions and AA/AF without crawling to single-digit fps? It's my guess that it will be *at least* a year until games show serious quality differentiation between PS2.0 and PS3.0. But I have been wrong in the past...
  • T8000 - Tuesday, May 4, 2004 - link

    I think it is strange that the tested X800XT is clocked at 520 MHz, while the 6800U, which is manufactured by the same Taiwanese company and also has 16 pipelines, is set at 400 MHz.

    This suggests a lot of headroom on the 6800U or a large overclock on the X800XT.

    Also note that the 6800U scored much better on tomshardware.com (HALO: 65 FPS @ 1600x1200), but that could also be explained by their use of a 3.2 GHz P4 instead of a 2.2 GHz A64.
  • ZobarStyl - Tuesday, May 4, 2004 - link

    I love seeing these fanboys announce each product as the best thing ever (same thing happened with the Prescott: Intel fanboys called it the end of AMD, and the AMD guys laughed and called it a flamethrower) without actually reading the benches. NV won some, ATi won some. Most of the time it was tiny margins either way. Fanboys aside, this is gonna be a driver war, nothing more. The biggest margin was on Far Cry, and I'm personally waiting on the faster PS3.0 path to see what that bench really is. This is a great card, but price drops and driver updates will eventually show us the real victor.
  • jibbo - Tuesday, May 4, 2004 - link

    If I had to guess, DX10 and Longhorn will coincide with the release of new hardware from everyone.
  • Akaz1976 - Tuesday, May 4, 2004 - link

    Just thought of something. If I am reading the AT review right, ATi has now milked the original Radeon 9700 architecture for nearly 2 years (which sure says a lot of good things about the ArtX design team).

    Anyone know when the true next gen chip can be expected?

    Akaz
  • Ilmater - Tuesday, May 4, 2004 - link

    ---------------------------------------
    Hearing about the 6850 and the other Emergency-Extreme-Whatever 6800 variants that are floating about irritates me greatly. Nvidia, you are losing your way!

    Instead of spending all that time, effort and $$ just to try to take the "speed champ" title, make your shit that much cheaper instead! If your 6800 Ultra was $425 instead of $500, that would give you a hell of a lot more market share and $$ than a stupid Emergency Edition of your top end cards... We laugh at Intel for doing it, and now you're doing it too, come fricking on...
    --------------------------------------------
    This is ridiculous!! What do you think the XT Platinum Edition from ATI is? The only difference is that nVidia released first, so it's more obvious when they do it than when ATI does. I'm not really a fanboy of either, but you shouldn't dog nVidia for something that everyone does.

    Plus, if nVidia dropped their prices, ATI would do the same thing. Then nVidia would be right back where it was before, but they wouldn't be making any money on the cards.
