Of Shader Details ...


One of the complaints with the NV3x architecture was its less than desirable shader performance. Code had to be well optimized for the architecture, and even then the improvement made to NVIDIA's shader compiler is the only reasons NV3x can compete with ATI's offerings.

There were a handful of little things that added up to hurt shader performance on NV3x, and it seems that NVIDIA has learned a great deal from its past. One of the main things that hurt NVIDIA's performance was that the front end of the shader pipe had a texture unit and a math unit, and instruction order made a huge difference. To fix this problem, NVIDIA added an extra math unit to the front of the vertex pipelines so that math and texture instructions no longer need to be interleaved as precisely as they had to be in NV3x. The added benefit is that twice the math throughput in NV40 means the performance of math intensive shaders approach a 2x gain per clock over NV3x (the ability to execute 2 instructions per clock per shader is called dual issue). Vertex units can still issue a texture command with a math command rather than two math commands. This flexibility and added power make it even easier to target with a compiler.

And then there's always register pressure. As anyone who has ever programmed on in x86 assembly will know, having a shortage of usable registers (storage slots) available to use makes it difficult to program efficiently. The specifications for shader model 3.0 bumps the number of temporary registers up to 32 from 13 in the vertex shader while still requiring at least 256 constant registers. In PS3.0, there are still 10 interpolated registers and 32 temp registers, but now there are 224 constant registers (up from 32). What this all adds up to mean is that developers can work more efficiently and work on large sets of data. This ends up being good for extending both the performance and the potential of shader programs.

There are 50% more vertex shader units bringing the total to 6, and there are 4 times as many pixel pipelines (16 units) in NV40. The chip was already large, so its not surprising that NVIDIA only doubled the number of texture units from 8 to 16 making this architecture 16x1 (whereas NV3x was 4x2). The architecture can handle 8x2 rendering for multitexture situations by using all 16 pixel shader units. In effect, the pixel shader throughput for multitextured situations is doubled, while single textured pixel throughput is quadrupled. Of course, this doesn't mean performance is always doubled or quadrupled, just that that's the upper bound on the theoretical maximum pixels per clock.

As if all this weren't enough, all the pixel pipes are dual issue (as with the vertex shader units) and coissue capable. DirectX 9 co-issue is the ability to execute two operations on different components of the same pixel at the same time. This means that (under the right conditions), both math units in a pixel pipe can be active at once, and two instructions can be run on different component data on a pixel in each unit. This gives a max of 4 instructions per clock per pixel pipe. Of course, how often this gets used remains to be seen.

On the texturing side of the pixel pipelines, we can get upto 16x anisotropic filtering with trilinear filtering (128 tap). We will take a look at anisotropic filtering in more depth a little later.

Theoretical maximums aside, all this adds up to a lot of extra power beyond what NV3x offered. The design is cleaner and more refined, and allows for much more flexibility and scalability. Since we "only" have 16 texture units coming out of the pipe, on older games it will be hard to get more than 2x performance per clock with NV40, but for newer games with single textured and pixel shaded rendering, we could see anywhere from 4x to 8x performance gain per clock cycle when compared to NV3x. Of course, NV38 is clocked about 18.8% faster than NV40. And performance isn't made by shaders alone. Filtering, texturing, antialising, and lots of other issues come into play. The only way we will be able to say how much faster NV40 is than NV38 will be (you guessed it) game performance tests. Don't worry, we'll get there. But first we need to check out the rest of the pipeline.
Power Requirements … And the Pipeline
Comments Locked

77 Comments

View All Comments

  • Regs - Wednesday, April 14, 2004 - link

    Wow, very impressive. Yet very costly. I'm very displeased with the power requirments however. I'm also hoping newer drivers will boost performance even more in games like Far cry. I was hoping to see at least 60 FPS @ 1280x1024 w/ 4x/8x. Even though it's not really needed for such a game and might be over kill, however It would of knocked me off my feet enough where I could over look the PSU requirement. But ripping my system apart yet again for just a video card seems unreasonable for the asking price of 400-500 dollars.
  • Verdant - Wednesday, April 14, 2004 - link

    i don't think the power issue is as big as some make it out to be, some review sites used a 350 W psu, and two connectors on the same lead and had no problems under load
  • dragonballgtz - Wednesday, April 14, 2004 - link

    I can't wait till December when I build me a new computer and use this card. But maybe by then the PCI-E version.
  • DerekWilson - Wednesday, April 14, 2004 - link

    #11 you are correct ... i seem to have lost an image somewhere ... i'll try to get that back up. sorry about that.
  • RyanVM - Wednesday, April 14, 2004 - link

    Just so you guys know, Damage (Tech Report) actually used a watt meter to determine the power consumption of the 6800. Turns out it's not much higher than a 5950.

    Also, it makes me cry that my poor 9700Pro is getting more than doubled up in a lot of the benchmarks :(
  • CrystalBay - Wednesday, April 14, 2004 - link

    Hi Derek, What kind of voltage fluctuations were you seeing... just kinda curious about the PSU...
  • PrinceGaz - Wednesday, April 14, 2004 - link

    A couple of comments so far...

    page 6 "Again, the antialiasing done in this unit is rotated grid multisample" - nVidia used an ordered grid before, only ATI previously used the superior rotated grid.

    page 8 - both pictures are the same, I think the link for the 4xAA one needs changing :)

    Can't wait to get to the rest :)
  • ZobarStyl - Wednesday, April 14, 2004 - link

    dang ive got a 450W...sigh. That power consumption is really gonna kill the upgradability of this card (but then again the x800 is slated for double molex as well). I know it's a bit strange but I'd like to see which of these cards (top end ones) can provide the best dual-screen capability...any GPU worth its salt comes with dual screen capabilities and my dually config needs a new vid card and I dont even know where to look for that...

    and as for cost...these cards blow away 9800XT's and 5950's...it wont be 3-4 fps above the other that makes me pick between a x800 and a 6800...it will be the price. Jeez, what are they slated to hit the market at, 450?
  • Icewind - Wednesday, April 14, 2004 - link

    Upgrade my PSU? I think not Nvidia! Lets see what you got Ati
  • LoneWolf15 - Wednesday, April 14, 2004 - link

    It looks like NVidia has listened to its customer base. I'm particularly interested in the hardware MPEG 1/2/4 encoder/decoder.

    Even so, I don't run anything that comes close to maxing my Sapphire Radeon 9700, so I don't think I'll buy a new card any time soon. I bought that card as a "future-proof" card like this one is, and guess what? The two games I wanted to play with it have not been released yet (HL2 and Doom3 of course), and who knows when they will be? At the time, Carmack and the programmers for Valve screamed that this would be the card to get for these games. Now they're saying different things. I don't game enough any more to justify top-end cards; frankly, and All-In-Wonder 9600XT would probably be the best current card for me, replacing the 9700 and my TV Wonder PCI.

Log in

Don't have an account? Sign up now