Original Link: http://www.anandtech.com/show/1186

Our introduction to NV36 in the form of the GeForce FX 5700 Ultra has really been a different experience than we expected. We thought we would see gains over the 5600 similar to those the 5950 made over the 5900. We also didn't expect NVIDIA to drop the veil they've had on the technical aspects of their products.

From the first benchmark we ran, we knew this would turn out to be a very interesting turn of events. In going down to San Francisco for NVIDIA's Editor's Day event, we had planned on inquiring about just how they were able to extract the performance gains we will reveal in our benchmarks. We got more than we had bargained for when we arrived.

For the past few years, graphics companies haven't been very open about how they build their chips. The fast-paced six-month product cycle and highly competitive atmosphere (while good for consumers) haven't been very conducive to in-depth discussions of highly protected trade secrets. That's why we were very pleasantly surprised when we learned that NVIDIA would be dropping their guard and letting us in on the way NV35 (including NV36 and NV38) actually works. This also gives us insight into the entire NV3x line of GPUs, and, hopefully, gives us a glimpse into the near future of NVIDIA hardware as well.

Aside from divulging a good amount of technical information, NVIDIA had plenty of developers present (a response to ATI’s Shader Day, no doubt). For the purposes of this article, I would like to stick to the architectural aspects of the day rather than analyzing NVIDIA developer relations. It isn't a secret that NVIDIA spends a great deal of time, energy, and money on assisting game developers in achieving their graphical goals. But we believe that "the proof is in the pudding" so to speak. The important thing to us (and we hope to the general public) isn't which developers like and dislike working with an IHV, but the quality of the end product both parties produce. Truth be told, it is the developer's job to create software that works well on all popular platforms, and it's the IHV's job to make sure there is sufficient technical support available for developers to get their job done.

We should note that NVIDIA is launching both the NV36 (GeForce FX 5700 Ultra) and the NV38 (GeForce FX 5950 Ultra) today, but since we have already covered the 5950 in our previous roundups we will focus on the 5700 Ultra exclusively today.

First let us look at the card itself.

The GeForce FX 5700 Ultra

As we have mentioned, the GeForce FX 5700 Ultra is based on the NV36 GPU. The core speed of the GPU on the eVGA card we tested was 475MHz. With 128MBs of DDR2 RAM running at 450MHz (900 MHz effective data rate), there is plenty of bandwidth to be had from this solution. As far as cooling goes, we can take a look at a typical 5700 Ultra board layout to see what we can expect:

The heatsink fan combo is fairly low profile, and this card will fit into an AGP slot without disturbing the neighboring PCI slot. Of course, we recommend leaving that slot open anyway, but it's nice to have the option to use it if you need it. Though it's not visible in this image, there is a heatsink on the back as well.

As far as the GeForce FX 5700 non-ultra version, we expect the clocks to hover somewhere around 425 core, 275 (550 effective) memory. NVIDIA has informed us that they are leaving these timings up to the OEMs, so we may see some variation in the playing field.
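
To put the memory clocks quoted above into bandwidth terms, here is a minimal sketch of the usual peak-bandwidth arithmetic. The 128-bit bus width is an assumption on our part (it is not stated in this article), and the function name is purely illustrative.

```python
def peak_bandwidth_gb_s(effective_mhz, bus_width_bits=128):
    """Peak theoretical memory bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_width_bits defaults to 128, which is an assumption for this
    sketch and not a figure taken from the article.
    """
    transfers_per_second = effective_mhz * 1_000_000
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


# 5700 Ultra as tested: 900 MHz effective data rate.
ultra = peak_bandwidth_gb_s(900)
# Expected non-Ultra clocks: 550 MHz effective data rate.
non_ultra = peak_bandwidth_gb_s(550)
print(round(ultra, 1), round(non_ultra, 1))
```

Under that assumed bus width, the Ultra's memory would peak at roughly 14.4 GB/s and the non-Ultra at roughly 8.8 GB/s; real-world throughput will be lower.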

For testing our GeForce FX 5700 Ultra, we used the exact same setup as in our previous 9600XT review.

Now on to the architecture…


There was a great deal of talk about why architectural decisions were made, but we will concern ourselves more with what exists rather than why this path was chosen. Every architecture will have its advantages and disadvantages, but understanding what lies beneath is a necessary part of the equation for developers to create efficient code for any architecture.

The first thing of note is NVIDIA's confirmation that 3dcenter.de did a very good job of wading through the patents that cover the NV3x architecture. We will be going into the block diagram of the shader/texture core in this description, but we won't be able to take quite as technical a look at the architecture as 3dcenter. Right now, we are more interested in bringing you the scoop on how the NV36 gets its speed.

For our architecture coverage, we will jump right into the block diagram of the Shader/Texture core on NV35:

As we can see from this diagram, the architecture is very complex. The shader/texture core works by operating on "quads" at a time (in a SIMD manner). These quads enter the pipeline via the gatekeeper which handles managing which ones need to go through the pipe next. This includes quads that have come back for a second pass through the shader.

What happens in the center of this pipeline is dependent upon the shader code running or the texturing operations being done on the current set of quads. There are a few restrictions on what can be going on in here that go beyond simply the precision of the data. For instance, NV35 has a maximum of 32 registers (fewer if higher precision is used); the core texture unit is able to put (at most) two textures on a quad every clock cycle; the shader and combiners cannot all read the same register at the same time; and there are limits on the number of triangles and quads that can be in flight at a time. These things have made it necessary for developers to pay more attention to what they are doing with their code than just writing code that produces the desired mathematical result. Of course, NVIDIA is going to try to make this less of a task through their compiler technology (which we will get to in a second).

Let us examine why the 5700 Ultra is able to pull out the performance increases we will be exploring shortly. Looking in the combiner stage of the block diagram, we can see that we are able to either have two combiners per clock or complete two math operations per clock. This is the same as on NV31, with a very important exception: pre-NV35 architectures implement the combiner in fx12 (12 bit integer), whereas NV35, NV36, and NV38 all have combiners that operate in full fp32 precision mode. This allows two more floating point operations to be done per clock cycle and is a very large factor in the increase in performance we have seen when we step up from NV30 to NV35 and from NV31 to NV36. In the end, the 5700 Ultra is a reflection of the performance delta between NV30 and NV38 for the midrange cards.

If you want to take a deeper look at this technology, the previously mentioned 3dcenter article is a good place to start. From here, we will touch on NVIDIA's Unified Compiler technology and explain how NVIDIA plans on making code run as efficiently as possible on their hardware with less hand optimization.

Compilation Integration

In order to maximize performance, the NV3x pipeline needs to be as full as possible all the time. For this to happen, special care needs to be taken in how instructions are issued to the hardware. One aspect of this is that the architecture benefits from interleaved pairs of different types of instructions (for instance: issue two texture instructions, followed by two math instructions, followed by two texture instructions, etc). This is in contrast to ATI's hardware which prefers to see a large block of texture instructions followed by a large block of math instructions for optimal results.

To illustrate NVIDIA's sensitivity to instruction order, we can (most easily) offer the example of calculating a^2 * 2^b:

mul r0,a,a
exp r1,b
mul r0,r0,r1

-takes 2 cycles on NV35

exp r1,b
mul r0,a,a
mul r0,r0,r1

-takes 1 cycle on NV35
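
The difference between the two orderings can be reproduced with a toy model. This is only a sketch under our own assumptions: we pretend the pipeline is three stages visited in a fixed order, that only the first stage can execute exp, that instructions are mapped to stages strictly in program order, and that one trip through the pipeline corresponds to one "cycle" in the sense used above. The stage names and the `count_passes` helper are hypothetical and do not describe the real NV35 hardware.

```python
# Toy model only: stage names and capabilities are assumptions made for
# illustration, not a description of the real NV35 pipeline.
STAGES = [
    ("shader_core", {"mul", "exp"}),  # assumed: only this stage can do exp
    ("combiner_1", {"mul"}),
    ("combiner_2", {"mul"}),
]


def count_passes(opcodes):
    """Map instructions to stages in program order and count the passes.

    A new pass starts whenever no remaining stage in the current pass
    can execute the next instruction.
    """
    passes = 1
    next_stage = 0  # index of the first stage still free in this pass
    for op in opcodes:
        # Look for a free stage at or after next_stage that supports op.
        slot = next(
            (i for i in range(next_stage, len(STAGES)) if op in STAGES[i][1]),
            None,
        )
        if slot is None:
            # Nothing left in this pass can run op: go around again.
            passes += 1
            slot = next(i for i, (_, ops) in enumerate(STAGES) if op in ops)
        next_stage = slot + 1
    return passes


print(count_passes(["mul", "exp", "mul"]))  # 2
print(count_passes(["exp", "mul", "mul"]))  # 1
```

In this model the first ordering wastes the only exp-capable stage on a mul, so the exp has to wait for a second trip through the pipe.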

This is a trivial example, but it does the job of getting the point across. Obviously, there are real benefits to be had from doing simple standard compiler optimizations which don't affect the output of the code at all. What kind of optimizations are we talking about here? Allow us to elaborate.

Aside from instruction reordering to maximize the parallelism of the hardware, reordering can also help reduce register pressure if we minimize the live ranges of registers within independent data. Consider this:

mul r0,a,a
mul r1,b,b
st r0
st r1

If we reorder the instructions we can use only one register without affecting the outcome of the code:

mul r0,a,a
st r0
mul r0,b,b
st r0
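
The register saving in the two listings above can be checked mechanically with a small liveness pass. This is a minimal sketch: the tuple encoding of instructions and the `peak_live_registers` helper are our own illustrative choices, and names that do not start with "r" (a and b) are treated as inputs rather than registers.

```python
def peak_live_registers(program):
    """Return the largest number of registers holding a still-needed value.

    program is a list of (dest, sources) tuples; dest is None for a store.
    Only names starting with "r" count as registers; a and b are inputs.
    """
    live = set()
    peak = 0
    # Walk backwards: a register is live from its definition to its last use.
    for dest, sources in reversed(program):
        written = {dest} if dest else set()
        # Registers occupied right after this instruction executes.
        peak = max(peak, len(live | written))
        live.discard(dest)
        live.update(s for s in sources if s.startswith("r"))
    return peak


original = [("r0", ["a", "a"]), ("r1", ["b", "b"]), (None, ["r0"]), (None, ["r1"])]
reordered = [("r0", ["a", "a"]), (None, ["r0"]), ("r0", ["b", "b"]), (None, ["r0"])]

print(peak_live_registers(original))   # 2
print(peak_live_registers(reordered))  # 1
```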

Register allocation is a very hefty part of compiler optimization, but special care needs to be taken to do it correctly and quickly for this application. Commonly, a variety of graph coloring heuristics are available to compiler designers. It seems NVIDIA is using an interference graph style of register allocation, and is allocating registers per component, though we are unclear on what is meant by "component".
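
As a rough illustration of what interference-graph register allocation means in general, here is a textbook-style greedy coloring. It is not NVIDIA's allocator, whose details we do not know; the value names, live ranges, and helper functions below are all made up for the example.

```python
def build_interference(live_ranges):
    """Build an interference graph from live ranges.

    live_ranges maps a value name to (definition_index, last_use_index).
    """
    graph = {name: set() for name in live_ranges}
    names = list(live_ranges)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a_start, a_end = live_ranges[a]
            b_start, b_end = live_ranges[b]
            # Two values interfere when their live ranges overlap.
            if a_start < b_end and b_start < a_end:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def color(graph):
    """Greedy coloring: highest-degree nodes first, lowest free register."""
    assignment = {}
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        taken = {assignment[n] for n in graph[node] if n in assignment}
        reg = 0
        while reg in taken:
            reg += 1
        assignment[node] = reg
    return assignment


# Hypothetical live ranges for four temporaries.
ranges = {"t0": (0, 2), "t1": (1, 3), "t2": (3, 5), "t3": (4, 6)}
graph = build_interference(ranges)
registers = color(graph)
print(registers)
print("registers needed:", max(registers.values()) + 1)
```

Values that never overlap in time share a register, so the four temporaries here fit in two. Greedy coloring is only a heuristic, which fits the article's point that a real-time compiler has to approximate a hard problem quickly.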

Dead code elimination is a very common optimization; essentially, if the developer includes code that can never be executed, we can eliminate this code from the program. Such situations are often revealed when performing multiple optimizations on code, but it’s still a useful feature for the occasional time a developer falls asleep at the screen.
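
A minimal sketch of this idea, in the "code that can never be executed" sense used above: treat the program as labeled blocks with successor lists, find everything reachable from the entry point, and drop the rest. The block layout, labels, and the `remove_unreachable` helper are hypothetical and only meant to illustrate the technique.

```python
from collections import deque


def remove_unreachable(blocks, entry):
    """Drop blocks that control flow can never enter.

    blocks maps a block label to (instructions, successor_labels).
    """
    reachable = {entry}
    queue = deque([entry])
    while queue:
        label = queue.popleft()
        for succ in blocks[label][1]:
            if succ not in reachable:
                reachable.add(succ)
                queue.append(succ)
    # Keep the original order of the surviving blocks.
    return {label: body for label, body in blocks.items() if label in reachable}


shader = {
    "entry": (["mul r0,a,a"], ["exit"]),
    "debug": (["mov r0,c0"], ["exit"]),  # nothing ever jumps here
    "exit": (["st r0"], []),
}

print(list(remove_unreachable(shader, "entry")))  # ['entry', 'exit']
```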

There are a great many other optimizations that can be performed on code which have absolutely no effect on outcome. This is a very important aspect of computing, and only gets more complicated as computer technology gets more powerful. Intel's Itanium processors are prohibitive to hand coding, and no IA64 based processor would run code well unless the compiler that generated the code was able to specifically tailor that code to the parallel nature of the hardware. We are seeing the same type of thing here with NVIDIA's architecture.

Of course, NVIDIA has the added challenge of implementing a real-time compiler much like the Java JIT, or Transmeta's code morphing software. As such, there are other very interesting time saving things they need to do with their compiler in order to reduce the impact of trying to adequately approximate the solution to an NP-complete problem in an extremely small amount of time.

A shader cache is implemented to store previously compiled shaders; this means that shaders shouldn't have to be compiled more than once. Directed Acyclic Graphs (DAGs) of the code are used to fingerprint compiled shaders. There is also a stock set of common, precompiled shaders that can get dropped in when NVIDIA detects what a developer is trying to accomplish. NVIDIA will need to take special care to make sure that this feature remains a feature and doesn't break anything, but we see this as a good thing as long as no one feels the power of the dark side.
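
To make the caching idea concrete, here is a small sketch of a compile cache keyed by a fingerprint of the shader's expression DAG. Everything in it is an assumption for illustration: the nested-tuple DAG encoding, the use of SHA-256, and the `ShaderCache` class are ours, and NVIDIA has not told us how their fingerprinting actually works.

```python
import hashlib


def fingerprint(node):
    """Hash an expression DAG written as nested tuples.

    Operations are (op, child, ...) tuples and inputs are plain strings.
    Shared subexpressions simply appear more than once in this encoding.
    """
    if isinstance(node, str):
        text = "leaf:" + node
    else:
        op, *children = node
        text = op + "(" + ",".join(fingerprint(c) for c in children) + ")"
    return hashlib.sha256(text.encode()).hexdigest()


class ShaderCache:
    """Compile each distinct DAG once and reuse the result afterwards."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._store = {}
        self.misses = 0

    def get(self, dag):
        key = fingerprint(dag)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compile(dag)
        return self._store[key]


# a^2 * 2^b from the earlier example, expressed as a DAG.
dag = ("mul", ("mul", "a", "a"), ("exp", "b"))
cache = ShaderCache(lambda d: "compiled:" + fingerprint(d)[:8])
first = cache.get(dag)
second = cache.get(("mul", ("mul", "a", "a"), ("exp", "b")))
print(first == second, cache.misses)
```

One plausible advantage of fingerprinting the DAG rather than the raw instruction text is that two shaders that differ only in instruction order or temporary register names would still map to the same cache entry.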

Also, until the most recent couple of driver releases from NVIDIA, the real-time compiler didn't implement all of these important optimizations on shader code sent to the card by a game. The frame rate increases of beyond 50% with no image quality loss can be attributed to the enhancements of the real-time compiler NVIDIA has implemented. All of the performance we've previously seen has rested on how well NVIDIA and developers were able to hand code shaders and graphics subroutines.

Of course, writing "good code" (code that suits the hardware it’s written for) will help the compiler be more efficient as well. We certainly won't be seeing the end of NVIDIA sitting down at the table with developers to help them acclimate their code to NV3x hardware, but this Unified Compiler technology will definitely help us see better results from everyone's efforts.

Image Quality

We are currently working on an entire article devoted to the image quality of current generation graphics cards. Where there are important or overt visual anomalies, we will note them here. Other than that, our IQ judgments will be compiled into our coming article. We attempt to be as thorough as possible and delve into as many aspects of image quality as we can. Stay tuned, as it's stacking up to be very interesting.

Aquamark3 Performance

Right out of the gate, the 5700 Ultra shows a solid performance increase over the 5600 Ultra. In fact, the NV36 based card surpasses the 9600 Pro in performance and comes very close to the 9600 XT in frame rate.

C&C Generals: Zero Hour Performance no AA/AF

The 5700 Ultra shows a bit of a performance edge over the other two NVIDIA cards we tested here, but still falls short of every ATI card. Oddly, though, it looks to me like there is an issue with the different ways these cards handle frame timing. The ATI cards all have instantaneous maximum frame rates into the hundreds, while the 5700 Ultra only reaches 76. The 5600 and 4200 don't even make it over 60.

All of the cards have the same minimum frame rate at 15 frames per second.

C&C Generals: Zero Hour Performance 4xAA/8xAF

We see a similar trend with ATI cards coming out ahead of the NVIDIA solutions, but the 5700 Ultra does a good job of approaching the 9600 Pro in this benchmark. With the exception of the 4200 card, the min frame rates were again at 15.

EVE: The Second Genesis Performance no AA/AF

We have our first real lead in performance by the 5700 Ultra over everything else in its segment. We had higher instantaneous lows and highs with the 5700 Ultra here.

EVE: The Second Genesis Performance 4xAA/8xAF

The midrange ATI cards take back the performance lead from the 5700 Ultra here. For some reason, ATI has a much smaller performance hit for enabling 4xAA/8xAF in EVE.

F1 Challenge '99-'02 Performance no AA/AF

Even though the NVIDIA card overtakes the ATI card, we are still having the same visual quality issues we've been complaining about in past articles.

F1 Challenge '99-'02 Performance 4xAA/8xAF

At this point, 4xAA/8xAF causes the 5700 Ultra's performance to drop below that of ATI. The visual shakiness issue also magnifies itself when 4xAA/8xAF is enabled in this game. Unfortunately, I haven't seen this game being mentioned in the release notes.

Final Fantasy XI Performance

The 5700 sees some marginal improvement over the 9600XT here. Unfortunately, neither solution is anywhere near the 9700 Pro's level of performance.

GunMetal Performance

The clear leader of the midrange in this benchmark is the 5700 Ultra. The average, low and high fps are ahead of the other midrange cards, and the 5700 Ultra even tries to snuggle up to the 9700 Pro here.

Halo Performance

The 5700 Ultra barely squeaks ahead of the 9600 XT here, showing a very impressive gain over its younger brother the 5600 Ultra.

Homeworld 2 Performance no AA/AF


Unfortunately, we are still unable to get an RV3xx based card to run Homeworld 2 without crashing. We've tried quite a few workarounds, but so far we've come up empty. This is a known issue, and it is being looked into. The issue that remains on NVIDIA hardware has to do with the software telling the hardware not to do AA even if it's turned on in the driver. The developer and NVIDIA are working on this issue as well.

Jedi Knight: Jedi Academy Performance no AA/AF

NVIDIA comes out swinging in this benchmark. The NV36 seems to really like running Jedi Academy, as the card almost surpasses ATI's $500 card when AA and AF are left off.

Jedi Knight: Jedi Academy Performance 4xAA/8xAF

Turning on the filtering features helps the 9800 XT to put a little more distance between itself and the 5700, but the 9700 Pro only matches speed with the NV36 card.

Neverwinter Nights: Shadows of Undrentide Performance no AA/AF

Every time I look at this chart, I almost expect it to say lower is better somewhere with the Ti 4200 scoring so well. Of course, there's no contest in this game. As with Jedi Academy, Neverwinter is an OpenGL based game, and those tend to score well on NVIDIA hardware in general.

Neverwinter Nights: Shadows of Undrentide Performance 4xAA/8xAF

Again, when more powerful filtering is enabled, we see the high end cards come back up to the top of the pack, and the Ti 4200 falls back to its rightful place. It is still good to see that the 5700 Ultra can keep up with the 9700 Pro even with AA/AF enabled.

SimCity 4 Performance no AA/AF

We are hoping that we will find a better way to benchmark SimCity 4, as the way we are doing so at present has some issues. When we scroll to the edge of the world, ATI's cards shoot up to over 250fps, while NVIDIA's cards hit a more moderate maximum in the mid 70s. In looking at our data right now, we can see that the instantaneous low framerates for all the NVIDIA cards hovered around 64 fps, while ATI's cards dropped to the mid 50s. Due to this fact, we are counting this benchmark a draw.
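
To see why brief peaks at the edge of the map can distort an average, consider the following sketch. The sample values are hypothetical numbers we chose to echo the ranges described above; they are not our measured data.

```python
from statistics import mean, median

# Hypothetical instantaneous frame-rate samples, not measured data.
steady = [64, 66, 70, 72, 75, 74, 68, 65]     # capped in the mid 70s
spiky = [55, 57, 60, 58, 250, 255, 56, 59]    # brief spikes past 250


def summarize(samples):
    """Return (minimum, median, mean) for a list of fps samples."""
    return min(samples), median(samples), mean(samples)


print("steady:", summarize(steady))
print("spiky: ", summarize(spiky))
```

With these made-up samples, the spiky run posts a far higher mean (about 106 fps versus about 69 fps) even though its minimum and median are both lower. That is the kind of skew that makes a plain average a poor basis for picking a winner in this test.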

SimCity 4 Performance 4xAA/8xAF

When we turn on AA, the 9800 XT and the 9600 XT are able to maintain low framerates of about 58, while the NVIDIA cards only hit 38. We feel the high frame rates are skewing the data more than they should, but when AA/AF is turned on, ATI's cards definitely pull out in front.

Splinter Cell Performance

ATI's high end cards take the cake in this bench, while the 9600 XT is the performance leader for the midrange. The 5700 Ultra does a good job of keeping up and definitely takes this blow on the chin well.

Tomb Raider: Angel of Darkness

Unfortunately, none of the 128MB NVIDIA solutions we have tried can run TRAOD at 1024x768 under patch v49. Of course, we don't know whether this is a software issue, a driver issue, or a combination of the two. This obviously isn't something we expect from a "The way it's meant to be played" game, but we hope NVIDIA and EIDOS will be able to work together to solve this problem, even if we are unable to benchmark with the game as a result.

Tron 2.0 Performance no AA/AF

The 9600 cards manage to land a good blow on the 5700 Ultra in this benchmark. Without filtering on, NVIDIA's card is able to keep its head above water.

Tron 2.0 Performance 4xAA/8xAF

Tron is the first game on the list where NVIDIA clearly doesn't perform nearly as well as ATI.

Unreal Tournament 2003 Performance no AA/AF

Unreal Tournament 2003 Performance 4xAA/8xAF

The midrange fight for Unreal is led by the GeForce FX 5700 Ultra both with and without AA/AF.

Warcraft III: Frozen Throne Performance no AA/AF

Warcraft III: Frozen Throne Performance 4xAA/8xAF

ATI clearly leads this benchmark, both with and without filtering.

Wolfenstein: Enemy Territory Performance no AA/AF

Wolfenstein: Enemy Territory Performance 4xAA/8xAF

Again, aside from the two highest end cards, the 5700 Ultra takes this benchmark. This is also another OpenGL based game.

X2: The Threat Performance no AA/AF

X2: The Threat Performance 4xAA/8xAF

Under X2: The Threat, the 5700 again leads the pack of midrange cards, but this time in a DX9 game. Of course, we are still seeing the jerkiness issues we've mentioned before, but the problem appears in the release notes for 52.16, and NVIDIA cites the next driver release as having a fix in place. Of course, we will have to judge that for ourselves when the time comes.

Final Words

After testing the GeForce FX 5700 Ultra, we have been very pleasantly surprised by NVIDIA. We mentioned last week that the 5700's new architecture might help to close the gap. In fact, NVIDIA has flipped the tables on ATI in the midrange segment and takes the performance crown with a late round TKO. It was a hard fought battle with many ties, but in the games where the NV36 based card took the performance lead, it led with the style of a higher end card.

We are still recommending that people stay away from upgrading to a high end card until the game they are upgrading for is available. By that time, either new cards will have trickled out, or the prices will have fallen. We still don't have a way to predict what card will be best for you in the future. If you are dead set on getting a DX9 card, we recommend you look to the midrange cards.

Neither card can touch the 9700 Pro for price/performance right now. If the 9700 Pro is in your price range and you're looking for a better than midrange performer for a near midrange price, go ahead and pick one up.

The GeForce FX 5700 Ultra will be debuting at $199 after a mail in rebate. If $200 is your hard limit, and you need a midrange card right now, the 5700 Ultra is the way to go if you want solid frame rates.

If $200 is still a bit much, the Radeon 9600 Pro is a very healthy option; we have yet to see how the non-Ultra 5700 performs as it may also deserve some attention once it hits the streets.

What will also determine our recommendations in this segment is what clock speeds add-in card vendors actually ship the products at. We’ll be keeping an eye on that and updating our recommendations accordingly.

Of course, we still have more to come in the form of image quality analysis. Our findings in that arena will affect what we recommend just as much as pure speed. Stay tuned for more.
