The Pixel Shader Engine

On par with what we have seen from NVIDIA, ATI's top-of-the-line card offers a GPU with a 16x1 pixel pipeline architecture, meaning it can render up to 16 single-textured pixels in parallel per clock. As previously alluded to, R420 divides its pixel pipes into groups of four called quads, and the new line of ATI GPUs will offer anywhere from one to four quad pipelines. The R3xx architecture offers an 8x1 pixel pipeline layout (grouped into two quad pipelines), delivering half of R420's pixel processing power per clock. For both R420 and R3xx, certain resources are shared between the individual pixel pipelines within each quad. It makes a lot of sense to share local memory among quad members, as pixels near each other on the screen should have data (especially texture data) with a high locality of reference. At this level of abstraction, things are essentially the same as NV40's architecture.
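To put the 16x1 arrangement in concrete terms, here is a minimal fill rate sketch, assuming the four-pipelines-per-quad grouping described above; the clock speeds are placeholder values used only for illustration, not vendor specifications.

```python
# Rough pixel fill rate model: each quad contains four pixel pipelines, and a
# 16x1 design writes one single-textured pixel per pipeline per clock.
def peak_fill_rate(quads, core_clock_hz):
    pixel_pipes = quads * 4                  # four pipelines per quad
    return pixel_pipes * core_clock_hz       # single-textured pixels per second

# Placeholder clock speeds purely for illustration (not official specifications).
print(peak_fill_rate(quads=4, core_clock_hz=500e6))  # four quads: 8.0e9 pixels/s
print(peak_fill_rate(quads=2, core_clock_hz=400e6))  # two quads (R3xx-style): 3.2e9 pixels/s
```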

Of course, it isn't enough to just look at how many pixel pipelines are available: we must also discover how much work each pipeline is able to get done. As we saw in our final analysis of what went wrong with NV3x, the internals of a shader unit can have a very large impact on the GPU's ability to schedule and execute shader code quickly and efficiently.

When it was first introduced to us, the inside of R420's pixel pipeline was presented as a collection of two vector units, two scalar units, and one texture unit that can all work in parallel. We've seen the two-math-and-one-texture layout of NV40's pixel pipeline, but does this mean that R420 will be able to completely blow NV40 out of the water? In short, no: it's all about what kind of work these different units can do.

Lifting up the hood, we see that ATI has taken a different approach to presenting its architecture than NVIDIA. ATI's presentation of two vector units (which are 3-wide at 72 bits), two scalar units (24 bits each), and a texture unit may be more reflective of their implementation than what NVIDIA has shown (though we really can't know this without many more low-level details). NVIDIA's hardware isn't quite as straightforward as it may end up looking to software. The fact is that we could look at the shader units in NV40's pixel pipeline in the same way as ATI's hardware (with the exception that the texture unit shares some functionality with one of the math units). We could also look at the NV40 architecture as four 2-wide vector units or two 4-wide vector units (though this is still an oversimplification, as there are special cases NVIDIA's compiler can exploit that allow more work to be done in parallel). If ATI had decided to present its architecture in the same way as NVIDIA, we would have seen two shader math units and one completely independent texture unit.

In order to gain a better understanding, here is a diagram of the parallelism and functionality of the shader units within the pixel pipelines of R420 and NV40:


ATI has essentially three large blocks that can push up to 5 operations per clock cycle


NV40 can be seen as two blocks of a more amorphous configuration (but there are special cases that allow some of these parts to work at the same time within each block).

Interestingly enough, there haven't been any changes to the block diagram of a pixel pipeline at this level of detail from R3xx to R420.

The big difference in the pixel pipe architectures that gives the R420 GPU a possible upper hand in performance over NV40 is that texture operations can be done entirely in parallel with the other math units. When NV40 needs to execute a texture operation, it loses much of its math processing power (the texturing unit cannot operate totally independently of the first shader unit in the NV40 pixel pipeline). This is also a feature of R3xx that carried over to R420.

Understanding what this all means in terms of shader performance depends on the kind of code developers end up writing. We wanted to dedicate some time to hand-mapping some shader code to both architectures' pixel pipelines in order to explain how each GPU handles different situations. Trial and error have led us to the conclusion that video card drivers have their work cut out for them when trying to optimize code, especially for NV40. There are multiple special cases that allow NVIDIA's architecture to schedule instructions during texturing operations on the shared math/texture unit, and some of the "OR" cases from our previous diagram of parallelism can be massaged into "AND" cases when the right instructions are involved. This also indicates that performance gains from compiler optimizations could be in NV40's future.
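To make the scheduling difference concrete, here is a toy per-pipeline issue model, a rough sketch under the assumption of two math slots plus one texture fetch per clock; it is not either vendor's documented scheduler, and it deliberately ignores the NV40 co-issue special cases mentioned above.

```python
import math

# Toy per-pipeline issue model (a simplification for illustration, not either
# vendor's documented scheduler): assume each pipeline can issue up to two
# math operations per clock plus one texture fetch.
def r420_cycles(math_ops, texture_ops):
    # Texture unit runs fully in parallel with the math units.
    return max(math.ceil(math_ops / 2), texture_ops)

def nv40_cycles(math_ops, texture_ops):
    # A texture fetch occupies the shared math/texture unit, so a texturing
    # clock leaves only one math slot free (ignoring the special co-issue
    # cases NVIDIA's compiler can exploit).
    math_left = max(0, math_ops - texture_ops)   # one math op hides under each fetch
    return texture_ops + math.ceil(math_left / 2)

# A shader with a little more math than texturing:
print(r420_cycles(6, 2))   # 3 cycles in this toy model
print(nv40_cycles(6, 2))   # 4 cycles in this toy model
```

With six math operations and two texture fetches, the toy model finishes in three cycles on the R420-style pipeline versus four on the NV40-style pipeline, which lines up with the general trend described next.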

Generally, when running code with mixed math and texturing (with a little more math than texturing), ATI will lead in performance. This case is probably the most indicative of real code.

The real enhancements to the R420 pixel pipeline are deep within the engine. ATI hasn't disclosed to us the number of internal registers its architectures have, how many pixels each GPU can maintain in flight at any given time, or even cache hit/miss latencies. We do know that, in addition to the extra registers (32 constant and 32 temp registers, up from 12) and longer shaders (somewhere between 512 and 1536 instructions, depending on what's being done) available to developers on R420, the number of internal registers has increased and the maximum number of pixels in flight has increased. These facts are really important in understanding performance. The fundamental layout of the pixel pipelines in R420 and NV40 is not that different; the underlying hardware is where the power comes from. In this case, the number of internal pipeline stages in each pixel pipeline and the ability of the hardware to hide the latency of a texture fetch are of the utmost importance.
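As a rough illustration of why pixels in flight matter for latency hiding, the sketch below applies Little's law; the 200-cycle fetch latency is a hypothetical placeholder, since neither vendor has disclosed such figures.

```python
# Back-of-the-envelope latency hiding estimate (Little's law): keeping every
# pixel pipeline busy while some pixels wait on a texture fetch requires
# roughly throughput x latency pixels to be tracked in flight.
def pixels_in_flight_needed(pixel_pipes, fetch_latency_cycles):
    return pixel_pipes * fetch_latency_cycles

# The 200-cycle miss latency is a hypothetical placeholder; neither vendor
# has disclosed this figure.
print(pixels_in_flight_needed(pixel_pipes=16, fetch_latency_cycles=200))  # 3200
```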

The bottom line is that R420 has the potential to execute more PS 2.0 instructions per clock than NV40 in the pixel pipeline because of the way it handles texturing. Even though NVIDIA's scheduler can help allow more math to be done in parallel with texturing, NV40's texture and math parallelism only approaches that of ATI. Combine that with the fact that R420 runs at a higher clock speed than NV40, and even more pixel shader work can get done in the same amount of time on R420 (which translates into the possibility of frames being rendered faster under the right conditions).
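To see how the clock speed difference compounds the per-clock picture, here is a small back-of-the-envelope calculation reusing the toy cycle counts from the earlier sketch; the clock figures are approximate launch clocks used only to show the scaling, and the result applies only to that contrived instruction mix.

```python
# Scaling the toy cycle counts from the earlier sketch by core clock. The
# clock figures are approximate launch clocks and the instruction mix is
# contrived, so this shows the direction of the effect, not a real benchmark.
r420_rate = 520e6 / 3      # pixel-shader results per pipeline per second (toy model)
nv40_rate = 400e6 / 4
print(round(r420_rate / nv40_rate, 2))   # ~1.73x for this particular mix
```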

Of course, when working with fp32 data, NV40 is doing 25% more "work" per operation, and it's likely that support for fp32 from the front of the shader pipeline to the back contributes greatly to the gap in transistor count (as well as to the performance numbers). When fp16 is enabled on NV40, internal register pressure is decreased and less work is being done than in fp32 mode. This results in improved performance for NV40, but questions abound as to the real-world image quality of NVIDIA's compiler and precision-optimized shaders (we are currently exploring this issue and will follow up with a full image quality analysis of the now-current generation of hardware).
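The register pressure point can be illustrated with a simple sketch; the per-pixel register budget below is a made-up figure for illustration only, since the actual sizes are undisclosed.

```python
# Simple register pressure illustration: a 4-component temporary costs
# 4 x 32 = 128 bits at fp32 but only 64 bits at fp16, so half precision
# roughly doubles the number of live temporaries that fit in a fixed budget.
REGISTER_FILE_BITS = 128 * 32   # made-up per-pixel budget, for illustration only

def max_live_temporaries(bits_per_component):
    return REGISTER_FILE_BITS // (4 * bits_per_component)

print(max_live_temporaries(32))   # fp32: 32 four-wide temporaries
print(max_live_temporaries(16))   # fp16: 64 four-wide temporaries
```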

As an extension of the fp32 vs. fp24 vs. fp16 debate, NV40's support of Shader Model 3.0 puts it at a slight performance disadvantage. By supporting fp32 all the way through the shader pipeline, flow control, fp16 to the framebuffer, and all the other bells and whistles that have come along for the ride, NV40 adds complexity to the hardware and size to the die. The downside for R420 is that it now lags behind on the feature set front. As we pointed out earlier, the only really new features of the R420 pixel shaders are higher instruction count shader programs, 32 temporary registers, and a polygon facing register (which can help enable two-sided lighting).

To round out the enhancements to R420's pixel pipeline, ATI's F-Buffer has been tweaked. The F-Buffer is what ATI calls the memory that stores pixels that have come out of the pixel shader but still require one or more additional passes through the pixel shader pipeline in order to finish being processed. Since the F-Buffer can require anywhere from no memory at all to enough memory to hold every pixel coming down the pipeline, ATI has built "improved" memory management hardware into the GPU rather than relegating this task to the driver.
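For a conceptual picture of what an F-Buffer-style mechanism does, here is a minimal multipass sketch; it models only the queue-and-requeue behavior described above, not ATI's actual hardware or its memory management.

```python
from collections import deque

# Conceptual sketch of an F-Buffer-style multipass loop (not ATI's actual
# hardware design): fragments that still need another pass through the pixel
# shader are queued and fed back in, so the buffer may hold anywhere from
# zero entries up to one per in-flight fragment.
def run_multipass(fragments, shade_pass):
    f_buffer = deque(fragments)
    finished = []
    while f_buffer:
        fragment = f_buffer.popleft()
        result, needs_another_pass = shade_pass(fragment)
        if needs_another_pass:
            f_buffer.append(result)    # back into the F-Buffer for another pass
        else:
            finished.append(result)
    return finished

# Hypothetical "shader": each fragment carries a count of remaining passes.
print(run_multipass([("px0", 2), ("px1", 1)],
                    lambda f: ((f[0], f[1] - 1), f[1] - 1 > 0)))
```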

95 Comments

  • NullSubroutine - Thursday, May 6, 2004 - link

    Trog I agree with you for the most part, but there are some people who can use upgrades. I myself have bought expensive video cards in the past. I got the GeForce3 right when it came out (in a top-of-the-line Alienware system for 1400 bucks), and it lasted me for 2-3 years. Now if someone spends 400-500 bucks on a video card that lasts them that long (2-3 years), it's no different than if someone buys a 200 buck video card every year. I am one of those people who likes to buy new components when computing speed doubles, and if I have the money I'll get what I can that will last me the longest. If I can't afford top of the line, I'll get something that will get me by (the 9500 Pro was the last card I bought, for 170 over a year ago).

    However, I do agree with you that people who upgrade to the best every generation are being silly.
  • TrogdorJW - Thursday, May 6, 2004 - link

    I'm sorry, but I simply have to laugh at anyone going on and on about how they're going to run out and buy the latest graphics cards from ATI or Nvidia right now. $400 to $500 for a graphics card is simply too much (and it's too much for a CPU as well). Besides, unless you have some dementia that requires you to run all games at 1600x1200 with 4xAA and 8xAF, there's very little need for either the 6800 Ultra or the X800 XT right now. Relax, take a deep breath, save some money, and forget about the pissing contest.

    So, is it just me, or is there an inverse relationship between a person's cost of computer hardware and their actual knowledge of computers? I have a coworker that is always spending money on upgrading his PC, and he really has no idea what he's doing. He went from an Athlon XP 2800+ (OC'ed to 2.4 GHz) to a P4 2.8 OC'ed to 3.7 GHz. He also went from a 9800 Pro 256 to a 9800 XT. In the past, he also had a GeForce FX 5900 Ultra. He tries to overclock all of his systems, they sound like a jet engine, and none of them are actually fully stable. In the last year, he has spent roughly $5000 on computer parts (although he has sold off some of the "old" parts like the 5900 Ultra). Performance of his system has probably improved by about 25% over the course of the year.

    Sorry for the rant, but behavior like that from *anybody* is just plain stupid. He's gone from 120 FPS in some games up to 150 FPS. Anyone here actually think he can tell the difference? I suppose it goes without saying that he's constantly crowing about his 3DMark scores. Now he's all hot to go out and buy the X800 XT cards, and he's been asking me when they'll be in stores. Like I care. They're nice cards, I'm sure, but why buy them before you actually have a game that needs the added performance?

    His current games du jour? Battlefield 1942 and Battlefield Vietnam. Yeah... those really need a high performance DX9 card. The 80+ FPS of his 9800 XT just isn't cutting it.

    So, if you read my description of this guy and think I'm way off base, go get your head examined. Save your money, because some day down the road you will be glad that you didn't spend everything you earned on computer parts. Enjoy life, sure, but having a faster car, faster computer, bigger house, etc. than someone else is worth pretty much jack and shit when it all comes down to it.

    /Rant. :D
  • a2y - Thursday, May 6, 2004 - link

    If a new card is going to come out every few weeks, then how do you guys choose which one to buy?

    ATI has the trade-up section for old cards; is that any good?
  • gxshockwav - Thursday, May 6, 2004 - link

    Um...what happened to the posting of new Ge6 6850 benchmark numbers?
  • NullSubroutine - Thursday, May 6, 2004 - link

    Trog, it's good to hear you were being nice, but I wasn't bashing THG, I love that site (besides this one) and I get a lot of my tech info from there.

    What I normally do though is take benchmarks from different sites, then put them in Excel, make a little graph, and see the percentage point differences between the tests. If you plan on buying a new vid card, it's important to find out if the Nvidia or ATi card is faster on your type of system.

    And from what I found, the AMD system from Atech performed better with Nvidia, and the Intel system performed better with ATi from THG (for Farcry and Unreal2004, the only ones to be somewhat similar tests).

    #61 How much money did ATi spend when developing the R3xx line? I would venture to say a decent amount... sometimes companies invest more money in a design, then refine it several times (at less cost) before starting from scratch again. ATi and Nvidia have done this for quite awhile. Also, from what I've heard, the R3xx had the possibility of 16 pipes to begin with... is this true, anyone?

    Texture memory above 256 doesn't really matter now b/c of the insane bandwidth 8x AGP has to offer, however one might see that 512 may come in handy after Doom 3 comes out, since they use shitloads of high res textures instead of high polygons for a lot of detail. I don't see 512 coming out for a little while, especially with RAM prices.
  • deathwalker - Thursday, May 6, 2004 - link

    Well... once again, someone is lying through their teeth. What happened to the $399 entry price of the Pro model? The cheapest price on Pricewatch is $478. Someone trying to cash in on the new-buyer hysteria? I am impressed, though, with ATI's ability to step up to the plate and steal Nvidia's thunder.
  • a2y - Thursday, May 6, 2004 - link

    OMG OMG!! I almost went out to buy and build a new system with the latest specs and graphics card, and was going for the nVidia 6800 Ultra! Until just now, when I decided to check for any news from ATI and discovered their new card!

    Man, if ATI and nVidia are going to bring out a new card every 2-3 weeks, then I'll never be able to build this system!!!

    Being a (pre-)fan of Half-Life 2, I guess I'm going to wait until it's released to buy a graphics card (meaning when we all die and go to hell).
  • remy - Wednesday, May 5, 2004 - link

    For the OpenGL vs D3D performance argument don't forget to take a look at Homeworld2 as it is an OpenGL game. ATI's hardware certainly seems to have come a long way since the 9700 Pro in that game!
  • TrogdorJW - Wednesday, May 5, 2004 - link

    NullSubroutine - It was meant as nice sarcasm, more or less. No offense intended. (I was also trying to head off this thread becoming a "THG sucks blah blah blah" tangent, as many in the past have done when someone mentions their reviews.)

    My basic point (without doing a ton of research) is that pretty much every hardware site has its own demos that it uses for benchmarking. Given that the performance difference between the ATI and Nvidia cards was relatively constant (I think), it's generally safe to assume that the levels, setup, bots, etc. are not the same when you see differing scores. Now if you see two places using the same demo and the same system setup, and there's a big difference, then you can worry. I usually don't bother comparing benchmark numbers from two different sites since they are almost never the same configuration.
