Original Link: http://www.anandtech.com/show/1293
NVIDIA GeForce 6800 Ultra: The Next Step Forward
by Derek Wilson on April 14, 2004 8:42 AM EST
Today has been a long time in coming, especially for NVIDIA. It has been almost 2 years since they were on top of the industry, and they definitely want their position back. With the troubles the NV3x line of cards had, NVIDIA has really been pushing to get as much performance as possible out of their CineFX architecture.
In pushing for performance they have produced a massive GPU that can push quite a lot of pixels. In addition, we are seeing the introduction of a handful of DX 9.0c (shader model 3.0) features. Of course, as with everything, there are pros and cons. Do the ups outweigh the downs?
First, let's look at what new features we can expect to see from this generation of graphics cards, then we'll take a look at the cost.
What's new in DX 9.0c
This year the latest in the DirectX API is getting a bit of a face lift. The new feature in DirectX 9.0c is the inclusion of Pixel Shader and Vertex Shader 3.0. Rather than calling this DirectX 9.1, Microsoft opted to go for a more "incremental" looking update. This can end up being a little misleading because whereas the 'a' and 'b' revisions mostly extended and tweaked functionality, the 'c' revision adds abilities that are absent from its predecessors.
Pixel Shader 3.0 (PS3.0) allows shader programs of over 65,000 lines and includes dynamic flow control (branching). This revision also requires that compliant hardware offer 4 Multiple Render Targets (MRTs allow shaders to draw to more than one location in memory at a time), full 32-bit floating point precision, shader antialiasing, and a total of ten texture coordinate inputs per pixel.
The main advantage here is the ability for developers to write longer, more complex, shader programs that run more efficiently. The flow control will give developers the freedom to write more intuitive code without sacrificing efficiency. Branching allows a shader program the expanded ability to make decisions based on its current state and inputs. Rather than having to run multiple shaders that do different things on different groups of pixels, developers can have a single shader handle an entire object and take care of all its shading needs. Our example of choice will be shading a tree: one shader can handle rendering the dynamics of each leaf, smooth new branches near the top, rugged old bark on the trunk, and dirty roots protruding from the soil.
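To make the tree example a little more concrete, here is a rough Python analogy of a single branching shader. This is not real shader code; the height thresholds and material names are made up purely for illustration.

```python
# Python analogy of one branching "shader" covering the whole tree.
# The height thresholds and material names are hypothetical.

def shade_tree_pixel(height, is_leaf):
    """Pick a shading path per pixel using dynamic flow control."""
    if is_leaf:
        return "leaf"
    if height < 0.0:
        return "dirty_root"      # below the soil line
    elif height < 0.6:
        return "rugged_bark"     # old trunk
    else:
        return "smooth_branch"   # new growth near the top

# (height, is_leaf) pairs standing in for pixels on different tree parts.
pixels = [(-0.2, False), (0.3, False), (0.9, False), (0.8, True)]
materials = [shade_tree_pixel(h, leaf) for h, leaf in pixels]
print(materials)
```

Without branching, each of those four paths would typically be its own shader, and the object would have to be split up and drawn in separate groups.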
Vertex Shader 3.0 extends its flow control ability by adding if/then/else statements and including the ability to call subroutines in shader programs. The instruction limit on VS3.0 is also extended to over 65000. Vertex textures are also supported, allowing more dynamic manipulation of vertices. This will get even more exciting when we make our way into the next DirectX revision which will allow for dynamic creation of vertices (think very cool particle systems and hardware morphing of geometry).
One of the coolest things that VS3.0 offers is something called instancing. This functionality can remove a lot of the overhead created by including multiple objects based on the same 3D model (these objects are called instances). Currently, the geometry for every model in the scene needs to be set up and sent to the GPU for rendering, but in the future developers can create as many instances of one model as they want from one vertex stream. These instances can be translated and manipulated by the vertex shader in order to add "individuality" to each instance of the model. To continue with our previous example, a developer can create a whole forest of trees from the vertex stream of one model. This takes pressure off of the CPU and the bus (less data is processed and sent to the GPU).
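A back-of-the-envelope sketch shows why this matters for the forest example. The vertex size, per-instance record size, and model counts below are all assumptions chosen for illustration, not figures from NVIDIA or Microsoft.

```python
# Rough data-volume comparison for drawing many copies of one model.
# All sizes and counts below are hypothetical.

VERTEX_BYTES = 32        # assumed: position + normal + texture coordinate
TRANSFORM_BYTES = 64     # assumed: one 4x4 matrix of 32-bit floats

def bytes_without_instancing(vertices_per_model, copies):
    # Every copy's geometry is set up and sent separately.
    return vertices_per_model * VERTEX_BYTES * copies

def bytes_with_instancing(vertices_per_model, copies):
    # One shared vertex stream plus a small per-instance record.
    return vertices_per_model * VERTEX_BYTES + copies * TRANSFORM_BYTES

tree_vertices, forest_size = 5000, 1000
plain = bytes_without_instancing(tree_vertices, forest_size)
instanced = bytes_with_instancing(tree_vertices, forest_size)
print(f"without instancing: {plain} bytes")
print(f"with instancing:    {instanced} bytes")
```

The exact ratio depends entirely on the assumed sizes, but the shape of the result holds: the cost of the shared geometry is paid once, and each extra tree only adds its own small transform.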
Now that we've seen what developers are looking at with DirectX 9.0c, let's take a look at how NVIDIA plans to bring these features to the world.
NV40 Under the Microscope
The NV40 chip itself is massive. Weighing in at a hefty 222 Million transistors, NVIDIA's newest GPU has more than three times the number of transistors as Intel's Northwood P4, and about 33% more transistors than the Pentium 4 EE. This die is dropped onto a 40mm x 40mm flipchip BGA package.
NVIDIA doesn't publish their die size information, but we have been able to interpolate a little bit from the data we have available on their process information, and a very useful wafer shot.
As we can see, somewhere around 16 chips fit horizontally on the wafer, while they can squeeze in about 18 chips vertically. We know that NVIDIA uses a 130nm IBM process on 300mm wafers. We also know that the P4 EE is in the neighborhood of 250mm^2 in size. Doing the math indicates that the NV40 GPU is somewhere between 270mm^2 and 305mm^2. It is difficult to get a closer estimate because we don't know how much space is between each chip on the wafer (which also makes it hard to estimate waste per wafer).
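For readers who want to follow the arithmetic, here is a minimal sketch of the ceiling implied by the wafer shot. It assumes the chips span the full 300mm diameter with no gaps at all, so it can only overestimate; any spacing between dies or unused wafer edge pulls the real figure down toward the range quoted above.

```python
# Upper bound on NV40 die area from the chip counts read off the wafer shot.
WAFER_DIAMETER_MM = 300
CHIPS_ACROSS = 16   # approximate count along the horizontal diameter
CHIPS_DOWN = 18     # approximate count along the vertical diameter

die_width = WAFER_DIAMETER_MM / CHIPS_ACROSS    # mm
die_height = WAFER_DIAMETER_MM / CHIPS_DOWN     # mm
max_area = die_width * die_height               # mm^2, zero spacing assumed

print(f"{die_width:.2f} mm x {die_height:.2f} mm = {max_area:.1f} mm^2 at most")
```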
Since we don't have information on yields either, it's hard to say how well NVIDIA will be making out on this GPU. Increasing the transistor count and die size will lower yields, and cards based on NV40 will launch at the same retail price that NV38 cards did at their release.
Of course, even if they don't end up making as much money as they want off of this card, throwing down the gauntlet and pushing everything as hard as they can will be worth it. After the GeForce FX series of cards failed to measure up to the hype (and the competition), NVIDIA has needed something to reestablish their position as performance leader in the industry. This industry can be brutal, and falling short twice is well nigh a death sentence.
But all those transistors on such a big die must draw a lot of power, right? Just how much juice do we need to feed this beast ...
Current generation graphics cards are near the limit for how much current they are allowed to pull from one connection. So, of course, the solution is to add a second power connection to the card. That's right, the GeForce 6800 Ultra requires two independent connections to the power supply. The lines could probably be connected to a fan with no problem, but each line should really be free of any other connection.
Of course, this is a bit of an inconvenience for people who (like the writer of this article) have 4 or more drives connected to their PCs. Power connections are a limited resource in PCs, and this certainly doesn't help. Of course, it might just be worth it. We'll only make you wait a little longer to find out.
The card doesn't necessarily max out both lines (and we are looking into measuring the amperage the cards draw), but NVIDIA indicated (in the reviewers guide with which we were supplied) that we should use a 480W power supply in conjunction with the 6800 Ultra.
There are a couple factors at work here. First, obviously, the card needs a good amount of power. Second, power supplies generally partition the power they deliver. If you look on the side of a power supply, you'll see a list of voltage rails and amperages. The wattage ratings on a power supply usually indicate (for marketing purposes) the maximum wattage they could supply if the maximum current allowed was drawn on each line. It is not possible to draw all 350 watts of a 350 watt power supply across one connection (or even one rail). NVIDIA indicated that their card needs a stable 12 volt rail, but that generally power supplies offer a large portion of their 12 volt amperage to the motherboard (since the motherboard draws the most power in the system on all rails).
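A quick sketch of how a label adds up makes the point. The rail amperages here are invented for a hypothetical power supply marketed at roughly 350 watts; they are not taken from any real unit or from NVIDIA's guide.

```python
# Hypothetical label for a power supply sold as "350 W".
# (rail name, volts, max amps) -- all values invented for illustration.
rails = [
    ("+3.3V", 3.3, 20.0),
    ("+5V", 5.0, 25.0),
    ("+12V", 12.0, 13.0),
]

# Maximum wattage each rail can deliver on its own (P = V * I).
per_rail_watts = {name: volts * amps for name, volts, amps in rails}
label_total = sum(per_rail_watts.values())

for name, watts in per_rail_watts.items():
    print(f"{name:>5} rail: {watts:6.1f} W max")
print(f"Sum of rail maximums: {label_total:.1f} W")
print(f"12 V share of total:  {per_rail_watts['+12V'] / label_total:.0%}")
```

With these made-up numbers, less than half of the advertised wattage is available on the 12 volt rail, and that rail is shared with the motherboard and drives.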
Many people have been worried about heat generated by a card that requires two power connections. Just to be clear, we aren't drawing twice the power because we have twice the connection, nor are we generating twice as much heat. It's a larger chip, it draws more power, but it won't be clocked as high (with the 6800 Ultra version coming in at 400MHz as opposed to the 5950's 475MHz).
Customers who end up buying this card will most likely need to upgrade their power supply as well. Obviously this isn't an optimal solution, and it will turn some people off. But, to those who like the performance numbers, it may be worth the investment. And there are obviously rumors circulating the net about ATI's next generation solution as well, but we will have to wait and see how they tackle the power problem in a few weeks.
Of Shader Details ...
One of the complaints with the NV3x architecture was its less than desirable shader performance. Code had to be well optimized for the architecture, and even then the improvements made to NVIDIA's shader compiler are the only reason NV3x can compete with ATI's offerings.
There were a handful of little things that added up to hurt shader performance on NV3x, and it seems that NVIDIA has learned a great deal from its past. One of the main things that hurt NVIDIA's performance was that the front end of the shader pipe had a texture unit and a math unit, and instruction order made a huge difference. To fix this problem, NVIDIA added an extra math unit to the front of the vertex pipelines so that math and texture instructions no longer need to be interleaved as precisely as they had to be in NV3x. The added benefit is that twice the math throughput in NV40 means the performance of math intensive shaders approaches a 2x gain per clock over NV3x (the ability to execute 2 instructions per clock per shader is called dual issue). Vertex units can still issue a texture command with a math command rather than two math commands. This flexibility and added power make it even easier to target with a compiler.
And then there's always register pressure. As anyone who has ever programmed in x86 assembly will know, having a shortage of usable registers (storage slots) makes it difficult to program efficiently. The specifications for shader model 3.0 bump the number of temporary registers up to 32 from 13 in the vertex shader while still requiring at least 256 constant registers. In PS3.0, there are still 10 interpolated registers and 32 temp registers, but now there are 224 constant registers (up from 32). What this all adds up to is that developers can work more efficiently and work on large sets of data. This ends up being good for extending both the performance and the potential of shader programs.
There are 50% more vertex shader units bringing the total to 6, and there are 4 times as many pixel pipelines (16 units) in NV40. The chip was already large, so it's not surprising that NVIDIA only doubled the number of texture units from 8 to 16 making this architecture 16x1 (whereas NV3x was 4x2). The architecture can handle 8x2 rendering for multitexture situations by using all 16 pixel shader units. In effect, the pixel shader throughput for multitextured situations is doubled, while single textured pixel throughput is quadrupled. Of course, this doesn't mean performance is always doubled or quadrupled, just that that's the upper bound on the theoretical maximum pixels per clock.
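Folding in the core clocks quoted earlier (400MHz for the 6800 Ultra, 475MHz for the 5950), the per-clock ceilings translate into somewhat smaller per-second ceilings. The sketch below is simple arithmetic on the article's own figures; the helper name is ours, and these are theoretical peaks rather than measured results.

```python
# Theoretical peak throughput = units active per clock * core clock.
# Pipeline counts and clocks are the figures quoted in this article.

def peak_rate(per_clock, clock_mhz):
    """Return peak throughput in millions of operations per second."""
    return per_clock * clock_mhz

nv38 = {"clock": 475, "pixels": 4, "texels": 8}     # 4x2 architecture
nv40 = {"clock": 400, "pixels": 16, "texels": 16}   # 16x1 architecture

single_gain = (peak_rate(nv40["pixels"], nv40["clock"])
               / peak_rate(nv38["pixels"], nv38["clock"]))
multi_gain = (peak_rate(nv40["texels"], nv40["clock"])
              / peak_rate(nv38["texels"], nv38["clock"]))

print(f"single-textured peak: {single_gain:.2f}x NV38")
print(f"multitextured peak:   {multi_gain:.2f}x NV38")
```

So the 4x and 2x per-clock figures shrink to roughly 3.4x and 1.7x once NV38's higher clock is taken into account.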
As if all this weren't enough, all the pixel pipes are dual issue (as with the vertex shader units) and coissue capable. DirectX 9 co-issue is the ability to execute two operations on different components of the same pixel at the same time. This means that (under the right conditions), both math units in a pixel pipe can be active at once, and two instructions can be run on different component data on a pixel in each unit. This gives a max of 4 instructions per clock per pixel pipe. Of course, how often this gets used remains to be seen.
On the texturing side of the pixel pipelines, we can get up to 16x anisotropic filtering with trilinear filtering (128 tap). We will take a look at anisotropic filtering in more depth a little later.
Theoretical maximums aside, all this adds up to a lot of extra power beyond what NV3x offered. The design is cleaner and more refined, and allows for much more flexibility and scalability. Since we "only" have 16 texture units coming out of the pipe, on older games it will be hard to get more than 2x performance per clock with NV40, but for newer games with single textured and pixel shaded rendering, we could see anywhere from 4x to 8x performance gain per clock cycle when compared to NV3x. Of course, NV38 is clocked about 18.8% faster than NV40. And performance isn't made by shaders alone. Filtering, texturing, antialiasing, and lots of other issues come into play. The only way we will be able to say how much faster NV40 is than NV38 will be (you guessed it) game performance tests. Don't worry, we'll get there. But first we need to check out the rest of the pipeline.
... And the Pipeline
The end of the pipeline consists of the ROP pixel pipeline. These are the units that take care of antialiasing, as well as z and color compression and final drawing of a pixel. There are 16 of these units, and they are capable of either computing one color+z pixel, or calculating 2 z/stencil operations per clock. This means that 32 z or stencil operations (think shadowing), or 16 pixels can be drawn per clock cycle. Thus NVIDIA has dubbed this architecture a 16x1 / 32x0 architecture. On a side note, they have retroactively dubbed the NV3x a 4x2 / 8x0 architecture.
Again, the antialiasing done in this unit is rotated grid multisample, Multiple Render Targets are supported, and floating point blending can be done.
Actually, this time around, NVIDIA is supporting front to back fp16 all the way from the software to the framebuffer. This will assist in things like HDR rendering, as the fp16 (or fp32) data calculated in the pixel shaders no longer needs to be converted to 8bit integer color for display.
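To see why keeping fp16 all the way to the framebuffer helps HDR, consider what happens to intensities brighter than "full white". This is a minimal sketch using Python's standard half-precision support; the sample intensities are arbitrary, and the 8-bit path shown is a generic clamp-and-quantize, not a description of any specific hardware.

```python
import struct

def to_8bit(value):
    """Quantize a color value to an 8-bit integer, clamping to [0, 1]."""
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * 255)

def to_fp16(value):
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# Arbitrary sample intensities; anything above 1.0 is "brighter than white".
for intensity in (0.5, 1.0, 4.0, 100.0):
    print(f"{intensity:6.1f} -> 8-bit: {to_8bit(intensity):3d}   "
          f"fp16: {to_fp16(intensity)}")
```

In the 8-bit path a light source four times brighter than white is indistinguishable from white, while fp16 keeps the difference available for later blending or tone mapping.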
Of course, we always have to wait a while before we can see the good stuff that comes of all this technology. So what else does this card have to offer?
Programmable Encoding Anyone?
That's right, NV4x includes a dedicated programmable video processor. The video processor is made up of an address, scalar, vector, and branch unit. The vector unit is a 16 way SIMD (a single instruction can operate on 16 different pieces of data at once) vector unit.
We don't have anything to test this thing with right now, but there is a whole lot this thing can do, including inverse 3:2 pulldown (conversion from interlaced TV format to progressive format better suited to computer monitors), colorspace conversion, gamma correction, MPEG-2, MPEG-4, WMV9, and DivX decoding and encoding, scaling, frame rate conversion, and anything else you'd like it to do for you.
This is a very exciting feature to be included on the GPU. It essentially means that anyone with an NV4x chip including the video processor will be able to stream video all over the place, do very fast encoding, and offload a lot of work from the processor when it comes to video processing. Also, it could really help in multimedia and PVR style systems by lowering the necessary CPU power to something more affordable (that is, as long as this functionality is included across the board on NV4x chips).
This could actually really help even the playing field between Intel and AMD if it catches on ...
Anisotropic, Trilinear, and Antialiasing
There was a great deal of controversy last year over some of the "optimizations" NVIDIA included in some of their drivers. We have
NVIDIA's new driver defaults to the same adaptive anisotropic filtering and trilinear filtering optimizations they are currently using in the 50 series drivers, but users are now able to disable these features. Trilinear filtering optimizations can be turned off (doing full trilinear all the time), and a new "High Quality" rendering mode turns off adaptive anisotropic filtering. What this means is that if someone wants (or needs) to have accurate trilinear and anisotropic filtering they can. The disabling of trilinear optimizations is currently available in the 56.72
Unfortunately, it seems like NVIDIA will be switching to a method of calculating anisotropic filtering based on a weighted Manhattan distance calculation. We appreciated the fact that NVIDIA's previous implementation of anisotropic filtering employed a Euclidean distance calculation which is less sensitive to the orientation of a surface than a weighted Manhattan calculation.
This is how NVIDIA used to do Anisotropic filtering
This is Anisotropic under the 60.72 driver.
This is how ATI does Anisotropic Filtering.
The advantage is that NVIDIA now has a lower impact when enabling anisotropic filtering, and we will also be doing a more apples to apples comparison when it comes to anisotropic filtering (ATI also makes use of a weighted Manhattan scheme for distance calculations). In games where angled, textured, surfaces rotate around the z-axis (the axis that comes "out" of the monitor) in a 3d world, both ATI and NVIDIA will show the same fluctuations in anisotropic rendering quality. We would have liked to see ATI alter their implementation rather than NVIDIA, but there is something to be said for both companies doing the same thing.
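The orientation sensitivity is easy to demonstrate numerically. The article does not give the exact formula either vendor uses, so the weighted Manhattan form below (larger component plus half the smaller one) is an assumed, commonly seen approximation used purely for illustration.

```python
import math

def euclidean(dx, dy):
    """True length of the vector (dx, dy)."""
    return math.hypot(dx, dy)

def weighted_manhattan(dx, dy, minor_weight=0.5):
    # Assumed form: the larger component plus a weighted smaller one.
    # The 0.5 weight is illustrative, not NVIDIA's or ATI's actual value.
    big = max(abs(dx), abs(dy))
    small = min(abs(dx), abs(dy))
    return big + minor_weight * small

# Rotate a unit-length vector about the z-axis and compare both metrics.
for degrees in (0, 15, 30, 45, 60, 75, 90):
    angle = math.radians(degrees)
    dx, dy = math.cos(angle), math.sin(angle)
    print(f"{degrees:>2} deg  euclidean={euclidean(dx, dy):.3f}  "
          f"weighted_manhattan={weighted_manhattan(dx, dy):.3f}")
```

The Euclidean result stays at 1.000 for every angle, while the approximation drifts as the vector rotates. A filter that picks its level of anisotropy from the approximate distance will therefore change its behavior as a surface rotates, which is the fluctuation described above.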
We had a little time to play with the D3D AF Tester that we used in last year's image quality article. We can confirm that turning off the trilinear filtering optimizations results in full trilinear being performed all the time. Previously, neither ATI nor NVIDIA did this much trilinear filtering, but check out the screenshots.
Trilinear optimizations enabled.
Trilinear optimizations disabled.
When comparing "Quality" mode to "High Quality" mode we didn't observe any difference in the anisotropic rendering fidelity. Of course, this is still a beta driver, so everything might not be doing what it's supposed to be doing yet. We'll definitely keep on checking this as the driver matures. For now, take a look.
High Quality Mode.
On a very positive note, NVIDIA has finally adopted a rotated grid antialiasing scheme. Here we can take a glimpse at what the new method does for their rendering quality in Jedi Knight: Jedi Academy.
Jedi Knight without AA
Jedi Knight with 4x AA
It's nice to finally see such smooth near vertical and horizontal lines from a graphics company other than ATI. Of course, ATI has yet to throw its offering into the ring, and it is very possible that they've raised their own bar for filtering quality.
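The reason a rotated grid helps with near vertical and near horizontal edges can be sketched in a few lines. The sample positions and the rotation angle below are assumptions for illustration; the article does not list NVIDIA's actual sample pattern.

```python
import math

def ordered_grid():
    """4 samples on an axis-aligned 2x2 grid inside a unit pixel."""
    return [(x, y) for y in (0.25, 0.75) for x in (0.25, 0.75)]

def rotated_grid(angle_deg=26.6):
    """The same grid rotated about the pixel center (angle is assumed)."""
    a = math.radians(angle_deg)
    samples = []
    for x, y in ordered_grid():
        cx, cy = x - 0.5, y - 0.5
        samples.append((0.5 + cx * math.cos(a) - cy * math.sin(a),
                        0.5 + cx * math.sin(a) + cy * math.cos(a)))
    return samples

def distinct_offsets(samples, axis):
    """Count the unique sample positions along one axis (0 = x, 1 = y)."""
    return len({round(s[axis], 4) for s in samples})

for name, grid in (("ordered", ordered_grid()), ("rotated", rotated_grid())):
    print(f"{name}: {distinct_offsets(grid, 0)} distinct x offsets, "
          f"{distinct_offsets(grid, 1)} distinct y offsets")
```

A nearly vertical edge sweeping across the pixel only ever crosses two distinct x positions on the ordered grid, so coverage changes in two steps. On the rotated grid each of the four samples sits at its own x and y position, giving four coverage steps from the same number of samples.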
The Card and The Test
This is our NVIDIA GeForce 6800 Ultra engineering sample. No, it's not one slot, yes, it has 2 molex connectors, and generally it's actually not very loud.
The 16x1 GeForce 6800 Ultra will be clocked at 400/550 (core/mem) and priced at $499, while its 12x1 little brother the GeForce 6800 non-ultra will be priced at $299 (clock speeds to be determined).
Here's a quick rundown of the key features:
. Vertex Shaders
° Support for Microsoft DirectX 9.0 Vertex Shader 3.0
° Displacement mapping
° Vertex frequency stream divider
° 65000+ instruction length programs
. Pixel Shaders
° Support for DirectX 9.0 Pixel Shader 3.0
° Full pixel branching support
° Support for Multiple Render Targets (MRTs)
° 65000+ instruction length programs
. Next-Generation Texture Engine
° Up to 16 textures per rendering pass
° Support for 16-bit floating point format and 32-bit floating point format
° Support for non-power of two textures
° Support for sRGB texture format for gamma textures
° DirectX and S3TC texture compression
. Full 128-bit floating point precision through the entire rendering pipeline (64-bit max precision to framebuffer and display)
The chip is 222 Million transistors fabbed on a .13 micron process. Currently a 480W power supply and 2 free completely independent connections to the PSU are required. 8x Rotated grid multisample antialiasing, and 16x (128tap) anisotropic filtering are available.
Our test system:
AMD Athlon 64 3400+
1GB DDR RAM (OCZ Platinum at 2-2-3-6)
Seagate 120GB HD
PC Power & Cooling 510W ATX Power Supply
Other cards used in the tests:
NVIDIA GeForce FX 5950
ATI Radeon 9800 XT
ATI Radeon 9700 Pro
Aquamark 3 Performance
Aquamark is based on the game Aquanox, and has been widely used among the community to compare performance on DX 9.0 hardware. Even though the benchmark may be more popular than the game, this is still game code.
The GeForce 6800 Ultra is about 36% faster than the 9800 XT in this case, showing that it can handle the AquaNox shaders fairly well. That's not all that impressive for its first test, considering all the advancements made to the shaders. Slightly interesting is the fact that the CPU and GPU scores in the benchmark were nearly identical.
F1 Challenge '99-'02 Performance
60% faster without AA (67% with 4xAA/8xAF) than NV38 in F1 Challenge is definitely not shabby. This kind of performance gain is more like it, and we are seeing nearly 2x the performance of the venerable 9700 Pro.
Final Fantasy XI Performance
While most of the point of the game was unknown to us, since we just had a self-running benchmark demo, at the very least it looks really great. Especially interesting are the waterfall effects on the splash screen at the beginning. In this case, textures and blending of the landscapes won out over particle and special effects, but it doesn't mean they weren't nice.
We do see some performance gain here, but this benchmark is very CPU, GPU, and AGP bus limited, so good luck to all the brave souls who venture into the vast Final Fantasy XI Online world.
Halo Performance
The textures and lighting effects are what really stand out in this game. There are times when colors actually seem to come off the screen. It's obvious the developers paid great attention to detail when porting this game. Subtle sparks and other interesting particle effects, which are usually the highlight of a game, are thrust to the background in Halo because of the intricacy of the texture effects and shaders.
This is more like it: we are finally seeing that 2x performance gain Jen-Hsun hinted at a few weeks ago. Halo is one of the few PS2.0 intensive games that we have to benchmark, and it's good to see that we really do get a larger performance gain when more shaders are involved. Also, we can see that NV40 scales better with resolution than the other cards we tested, as it pulls further away when we move to 16x12.
Homeworld 2 Performance
As intricate as the gameplay is for this game, the beautiful backgrounds help immerse you in the complex gameplay. If you like space sims, you'll be delighted at the overall look of this game. The backgrounds alone are very artistic and shadows and camera flares from suns are dramatic. The ships look clean and well textured, and explosions are very nice too. This is a pretty stylish space-real-time-strategy-game.
This game seems to be CPU limited in the test we chose (with the exception of the 9700 Pro), but these differences could be due to differences in rendering paths (ATI cards use PS2.0 for shadowing, while NVIDIA cards use shadow volumes) or the fact that the NVIDIA drivers are still beta. But, yes, all the problems we've seen in previous tests with this game were fixed with the latest patch (antialiasing works correctly now on both cards).
EVE: The Second Genesis Performance
Eve has a very unique graphical style which is sleek and elegant. The space scenes and structures look great, but another nice graphical element is the use of in-game translucent windows which manage your character. Even though windows are on your screen displaying data, you can still see through to the action going on around you. Anand's take on it was that it looks just like the Linux desktop, but you can decide that for yourself.
We can see that the 50% performance gain NV40 exerts over NV38 puts it right on par with ATI's flagship card. This is the first time we've seen anything come close to the NVIDIA beast, but when 4xAA and 8xAF are enabled, we quickly see the scaling advantage of the GeForce 6800 Ultra rear its head.
Jedi Knight: Jedi Academy Performance
The overall graphics are good, but the nice particle effects and light saber effects make the game look very surreal. These, along with detailed and complex levels and textures, help make JK:JA look like you're really in the Star Wars universe.
Enabling AA and AF just seals the deal for the 6800 Ultra here. The card scales better on both resolution and added filtering. We are again seeing a near doubling of performance.
FarCry Performance
This game captures brilliantly the look of an island paradise. Most amazing is the water, with its rich color, reflections, translucence, and ripples that break very naturally against the pure white sand of the islands. Equally amazing is the detail in the shadows on your gun as you pass through the dense jungle foliage. The character models and structures are great, but set in such a rich environment they almost seem dull. Weapon effects, however, are very impressive without being over-the-top. The realistic explosions fit perfectly into the unique setting.
The 1.1 patch of this game makes note of the fact that PS3.0 is implemented on the NV40 path. We have (as of yet) been unable to determine exactly what function PS3.0 is serving. Maybe it's something useful like branching, or maybe it's marketing speak (technically fp32 is a PS3.0 requirement). We just won't know until we can get a hold of the developers.
We see the same scaling pattern here with the 6800 reaching a 60% performance improvement over the ATI Radeon 9800 XT at 1600x1200 with 4xAA and 8xAF.
Even more impressive is the fact that we were able to run the GeForce 6800 Ultra at playable frame rates at 2048x1536 with 16xAF (it didn't like enabling even 2xAA - this could either be a hardware or driver limitation, but we just won't know until the driver is more refined). The demo we used ran between 30 and 40 fps and averaged 34. The game was very playable and very beautiful. It seems like we need not only a new power supply to run the card, but also a new monitor to display the ultra high resolutions at which the card remains playable.
Neverwinter Nights: Shadow of the Undrentide Performance
The 3D Diablo style graphics are standard for a game of this type, but the effects from spells are really exquisite. There are times when the screen just lights up with almost blindingly brilliant fire effects, and the deep shadows seem to suck the gamer into the darkness. These effects are what make having a good graphics card important for this game.
We see some more good indications of scaling here as well. NWN does rely a lot on CPU power, but with 4xAA and 8xAF, the cards do separate themselves.
Warcraft III: The Frozen Throne Performance
The improvement that playing this game in a 3D environment makes is amazing. If you were a big fan of the original, you won't be too confused with the look of this one. You get the ease of use and benefits of seamless zoom and rotate as well as the impressive special particle effects. When you zoom in close, you can see how much attention to detail the devs paid to the textures and shadows, and when you are up close a powerful magic spell can look simply stunning.
We only performed one test here because high resolutions and 4xAA/8xAF were the best way to separate the cards' performance. The game also looked even better cranked all the way up. The 6800 carries away one of its more modest leads of just under 20%.
Wolfenstein: Enemy Territory Performance
This edition of the game is graphically superior in every way to its humble but addictive predecessor. The grittiness and overall feel of the graphics in this game pay tribute to the Wolfenstein legacy. It's very reminiscent of the many on (and off) line world war games that have been gaining popularity recently.
We can see the 6800 pull away from the rest of the cards as stress on the graphics system increases. If there's one message we've seen over and over from these tests, it has been that the GeForce 6800 Ultra scales very well with currently available games.
Unreal Tournament 2004 Performance
Much like Jedi Knight, the particle effects really stand out in this game, and there are plenty to go 'round when in a tight firefight. The textures are generally nice, but the beauty and sheer volume of the special effects are impressive. The variety of muzzle flashes and explosions from weapons is great, and when you multiply it several times in a game with lots of people it definitely keeps your eyes busy. Even though this is only a DX7/8 based title, Epic has done a solid job enhancing their flagship title with massive textures and content to keep it fresh.
At lower resolutions, we can see the CPU limitation come into play, but cranking up the settings once again shows how strong this card is.
X2: The Threat Performance
This is definitely a visually stunning game. Even though we haven't had a chance to play the game, it scores points anyway for looking so darn beautiful. It's a sci-fi game, so there are flybys of vast starfields and intricate space stations, as well as asteroid fields and other space anomalies. The particle effects, textures and intricacy of the models in the game are amazing and work together to create an elegant futuristic reality. And it does it all without touching DX9.
More of the same here. No contest with the other cards in the test. This benchmark takes a long time to run, so we were especially appreciative of the added power we saw in NV40.
Final Words
Similar to Anand's 9700 Pro introduction, the GeForce 6800 has set some pretty solid standards. We can now expect:
1) Very high performance in current and future games,
2) The ability to play at 2048x1536 in just about any game currently available or soon to be made available, and
3) The ability to play virtually any game at 1600x1200 with 4X AA and 16X anisotropic filtering enabled at smooth frame rates.
We were able to achieve very smooth frame rates under Halo at 2048x1536, and 34fps under FarCry at the same resolution. Unfortunately, the driver is currently not stable enough to do all the testing we wanted at this resolution, so we'll have to hold off on bringing a full set of benchmarks to the table until later.
Of course, we aren't crowning any kings yet, as ATI will soon be making its mark on this generation of GPUs. We will have to wait to find out what they can bring to the table, but it is definitely turning out to be an exciting battle. Even with the added power requirements, the kinds of performance gains we have seen are pretty substantial, and ATI will have a good fight on their hands.
Even though we have taken a cursory glance at anisotropic filtering and antialiasing, and we didn't notice any glaring problems while testing games, we will need to revisit the issue of image quality. We are planning on bringing out another image quality article after ATI releases their card. One thing is for sure: both sides need to make sure they are generating the highest quality images to avoid recurrences of last year's many controversies.
We are looking forward to the next month of battle, and we hope you are as excited as we are to see how this plays out.