Compilation Integration

To maximize performance, the NV3x pipeline needs to be kept as full as possible at all times. For this to happen, special care needs to be taken in how instructions are issued to the hardware. One aspect of this is that the architecture benefits from interleaved pairs of different types of instructions (for instance: two texture instructions, followed by two math instructions, followed by two texture instructions, and so on). This is in contrast to ATI's hardware, which prefers to see a large block of texture instructions followed by a large block of math instructions for optimal results.
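To make the pairing idea concrete, here is a minimal Python sketch of the interleaving pattern described above. The function name, the group size of two, and the instruction labels are our own assumptions for illustration; it also assumes every instruction is independent, which a real scheduler could not take for granted.

```python
def interleave_pairs(tex, math, group=2):
    """Alternate groups of texture and math instructions.

    Hypothetical helper: assumes no instruction depends on another,
    so any interleaving is legal.
    """
    out = []
    i = j = 0
    while i < len(tex) or j < len(math):
        out.extend(tex[i:i + group])   # next group of texture instructions
        i += group
        out.extend(math[j:j + group])  # next group of math instructions
        j += group
    return out


schedule = interleave_pairs(["t0", "t1", "t2", "t3"], ["m0", "m1", "m2", "m3"])
```

Passing a large `group` value degenerates into one block of texture instructions followed by one block of math instructions, which is the layout the text says ATI's hardware prefers.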

To illustrate NVIDIA's sensitivity to instruction order, the simplest example we can offer is calculating a^2 * 2^b:

mul r0,a,a
exp r1,b
mul r0,r0,r1

-takes 2 cycles on NV35

exp r1,b
mul r0,a,a
mul r0,r0,r1

-takes 1 cycle on NV35

This is a trivial example, but it gets the point across. Clearly, there are real benefits to be had from simple, standard compiler optimizations that don't affect the output of the code at all. What kind of optimizations are we talking about here? Allow us to elaborate.
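Why is the swap in the a^2 * 2^b example safe? The first two instructions neither read nor write each other's registers, so either order produces the same result. The sketch below checks this with a toy interpreter; the tuple encoding `(op, dst, srcs)` and the helper names are our own assumptions, and it says nothing about cycle counts, which only the hardware can tell us.

```python
def run(program, inputs):
    """Execute a straight-line program on a toy register file."""
    regs = dict(inputs)
    for op, dst, srcs in program:
        if op == "mul":
            regs[dst] = regs[srcs[0]] * regs[srcs[1]]
        elif op == "exp":  # treated as 2**x, matching the a^2 * 2^b example
            regs[dst] = 2.0 ** regs[srcs[0]]
        else:
            raise ValueError("unknown op: " + op)
    return regs


def independent(first, second):
    """True if two instructions can be swapped without changing the result."""
    _, d1, s1 = first
    _, d2, s2 = second
    return d1 != d2 and d1 not in s2 and d2 not in s1


original = [("mul", "r0", ["a", "a"]),
            ("exp", "r1", ["b"]),
            ("mul", "r0", ["r0", "r1"])]
reordered = [original[1], original[0], original[2]]
```

The final multiply reads both r0 and r1, so it cannot move ahead of either of the other two instructions; only the first pair is free to trade places.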

Aside from instruction reordering to maximize the parallelism of the hardware, reordering can also help reduce register pressure if we minimize the live ranges of registers within independent data. Consider this:

mul r0,a,a
mul r1,b,b
st r0
st r1

If we reorder the instructions we can use only one register without affecting the outcome of the code:

mul r0,a,a
st r0
mul r0,b,b
st r0
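Register pressure can be measured directly. The following sketch walks a straight-line program backwards and reports the largest number of temporary registers that hold a still-needed value at the same time. The `(op, dst, srcs)` encoding and the function name are assumptions made for illustration, with `st` modeled as an instruction that reads a register and writes none.

```python
def peak_live_registers(program):
    """Largest number of temporaries simultaneously live in straight-line code."""
    # Only registers written by the program count as temporaries;
    # inputs such as a and b are ignored.
    temps = {dst for _, dst, _ in program if dst is not None}
    live, peak = set(), 0
    for _, dst, srcs in reversed(program):
        live.discard(dst)                            # value is born here
        live.update(s for s in srcs if s in temps)   # value is needed here
        peak = max(peak, len(live))
    return peak


two_registers = [("mul", "r0", ["a", "a"]),
                 ("mul", "r1", ["b", "b"]),
                 ("st", None, ["r0"]),
                 ("st", None, ["r1"])]
one_register = [("mul", "r0", ["a", "a"]),
                ("st", None, ["r0"]),
                ("mul", "r0", ["b", "b"]),
                ("st", None, ["r0"])]
```

Run on the two listings above, the first ordering peaks at two live registers and the second at one.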

Register allocation is a very hefty part of compiler optimization, but special care needs to be taken to do it correctly and quickly for this application. Commonly, a variety of graph coloring heuristics are available to compiler designers. It seems NVIDIA is using an interference graph style of register allocation, and is allocating registers per component, though we are unclear on what is meant by "component".
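For readers unfamiliar with interference graphs, here is a minimal greedy coloring sketch. Each node is a virtual register, an edge joins two registers that are live at the same time, and each color stands for a physical register. This is a textbook heuristic under our own assumptions, not a description of NVIDIA's allocator; the names are hypothetical.

```python
def color_registers(interference):
    """Greedy graph coloring.

    interference maps each virtual register to the set of virtual
    registers that are live at the same time. Returns a mapping from
    virtual register to physical register number.
    """
    assignment = {}
    # Handle the most constrained registers first (a common heuristic).
    order = sorted(interference, key=lambda n: len(interference[n]), reverse=True)
    for node in order:
        taken = {assignment[n] for n in interference[node] if n in assignment}
        reg = 0
        while reg in taken:  # lowest physical register not used by a neighbor
            reg += 1
        assignment[node] = reg
    return assignment


graph = {"v0": {"v1"}, "v1": {"v0", "v2"}, "v2": {"v1"}}
allocation = color_registers(graph)
```

Here v0 and v2 never overlap, so they can share a physical register while v1 takes a second one.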

Dead code elimination is a very common optimization; essentially, if the developer includes code that can never be executed, we can eliminate this code from the program. Such situations are often revealed when performing multiple optimizations on code, but it’s still a useful feature for the occasional time a developer falls asleep at the screen.
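In straight-line shader code, the closest relative of never-executed code is an instruction whose result is never used. The sketch below removes such instructions with a single backward pass; we are assuming the `(op, dst, srcs)` encoding and treating `st` as the only instruction with a visible side effect, so this is an illustration rather than anything taken from a real driver.

```python
def eliminate_dead_code(program, outputs=()):
    """Drop instructions whose results are never read or stored."""
    live = set(outputs)  # registers that must survive the program
    kept = []
    for op, dst, srcs in reversed(program):
        if op == "st" or dst in live:  # stores always stay
            live.discard(dst)
            live.update(srcs)
            kept.append((op, dst, srcs))
    kept.reverse()
    return kept


shader = [("mul", "r0", ["a", "a"]),
          ("mul", "r1", ["b", "b"]),   # r1 is never used afterwards
          ("st", None, ["r0"])]
cleaned = eliminate_dead_code(shader)
```

Because the pass runs backwards, removing one unused instruction can expose others that only fed it, which is how such situations tend to surface after other optimizations have run.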

There are a great many other optimizations that can be performed on code which have absolutely no effect on outcome. This is a very important aspect of computing, and only gets more complicated as computer technology gets more powerful. Intel's Itanium processors are prohibitive to hand coding, and no IA64 based processor would run code well unless the compiler that generated the code was able to specifically tailor that code to the parallel nature of the hardware. We are seeing the same type of thing here with NVIDIA's architecture.

Of course, NVIDIA has the added challenge of implementing a real-time compiler much like the Java JIT, or Transmeta's code morphing software. As such, there are other very interesting time-saving things they need to do with their compiler in order to reduce the impact of trying to adequately approximate the solution to an NP-complete problem in an extremely small amount of time.

A shader cache is implemented to store previously compiled shaders; this means that shaders shouldn't have to be compiled more than once. Directed Acyclic Graphs (DAGs) of the code are used to fingerprint compiled shaders. There is also a stock set of common, precompiled shaders that can get dropped in when NVIDIA detects what a developer is trying to accomplish. NVIDIA will need to take special care to make sure that this feature remains a feature and doesn't break anything, but we see this as a good thing as long as no one feels the power of the dark side.
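We don't know how NVIDIA builds its fingerprints, but the general idea of keying a cache on a DAG can be sketched as follows. Each instruction is hashed from its opcode and the hashes of its operands, so two shaders that differ only in register names get the same key. The class, the function names, and the `(op, dst, srcs)` encoding are all assumptions for illustration.

```python
import hashlib


def fingerprint(program):
    """Structural hash of a program's dataflow DAG; register names don't matter."""
    node = {}     # register -> hash of the expression it currently holds
    stored = []   # hashes of the stored results, in order
    for op, dst, srcs in program:
        args = [node.get(s, "in:" + s) for s in srcs]
        text = op + "(" + ",".join(args) + ")"
        digest = hashlib.sha1(text.encode()).hexdigest()
        if dst is None:
            stored.append(digest)
        else:
            node[dst] = digest
    return hashlib.sha1("|".join(stored).encode()).hexdigest()


class ShaderCache:
    """Compile each distinct shader once, keyed by its fingerprint."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.entries = {}
        self.misses = 0

    def get(self, program):
        key = fingerprint(program)
        if key not in self.entries:
            self.misses += 1
            self.entries[key] = self.compile_fn(program)
        return self.entries[key]


first = [("mul", "r0", ["a", "a"]), ("st", None, ["r0"])]
renamed = [("mul", "r5", ["a", "a"]), ("st", None, ["r5"])]
different = [("mul", "r0", ["b", "b"]), ("st", None, ["r0"])]
cache = ShaderCache(compile_fn=tuple)  # stand-in for a real compiler
cache.get(first)
cache.get(renamed)
```

The same lookup could just as easily return one of the stock precompiled shaders when a known fingerprint turns up.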

Also, until the most recent couple of driver releases from NVIDIA, the real-time compiler didn't implement all of these important optimizations on shader code sent to the card by a game. The frame rate increases of more than 50% with no image quality loss can be attributed to the enhancements NVIDIA has made to the real-time compiler. All of the performance we've previously seen has rested on how well NVIDIA and developers were able to hand code shaders and graphics subroutines.

Of course, writing "good code" (code that suits the hardware it’s written for) will help the compiler be more efficient as well. We certainly won't be seeing the end of NVIDIA sitting down at the table with developers to help them acclimate their code to NV3x hardware, but this Unified Compiler technology will definitely help us see better results from everyone's efforts.


114 Comments

  • Anonymous User - Friday, October 24, 2003 - link

    these anonymous forums are always a hoot.
  • Anonymous User - Friday, October 24, 2003 - link

    Derek takes it in the pooper
  • Anonymous User - Friday, October 24, 2003 - link

    #62 making 60k a year is still below the threshhold of being able to spend money on whatever you want and not giving a f&5k....if you made 1mil a year I highly doubt you wouldn't drop the $500 on the best card without thinking twice. So don't call other's dumb for buying video cards...maybe that's how they want to spend their money....If you saved some trips to the "Blue Oyster" I'm sure you'd have a $500 card as well.
  • Anonymous User - Friday, October 24, 2003 - link

    The message is damn clear, nvidia is using DDR2 memory to fill in the performance gaps.. Nvidia shuckhs!
  • Anonymous User - Friday, October 24, 2003 - link

    doesnt anon mean something in french?

  • Live - Friday, October 24, 2003 - link

    Anon postings should be disabled. If people don't have the energy to register, the energy awarded to their post is likely to be the same minimal amount.
  • Anonymous User - Friday, October 24, 2003 - link

    #64, that makes perfect sense, just don't visit AnandTech. After all, it's not like you've just given them a page impression. lol

    Seriously, AnandTech will never lose readers or respect as long as they keep doing what they're doing. The critics here that break down every minute detail about what this review did "wrong" aren't gamers. If they were, they would realize that the IQ "differences" are so minuscule it's like trying to argue that nForce2 is incredibly faster than KT600, when the reality is that nForce2's attractiveness comes from its superior sound (APU), overclockability, and stability, most certainly not its “earth shattering” performance. nForce2’s better performance is simply a bonus to any half-intelligent hardware enthusiast, not its main selling point.
  • Anonymous User - Friday, October 24, 2003 - link

    watchu' talkin'bout willis?!
  • Anonymous User - Friday, October 24, 2003 - link

    Look, some of us see that these reviews seem to no longer reflect reality. What to do? Quit visiting the site, quit giving AT page impressions. Find reviews elsewhere; god knows there are enough other hardware sites to choose from.
  • Anonymous User - Friday, October 24, 2003 - link

    stop crying about the IQ. as #62 said "ESPECIALLY fps games where constant movement makes it almost impossible to notice the IQ differences". i would add - the difference between fx5950u and radeon 9800XT.

    i spent about 1/3 of the last 10 years playing games. i can call myself a GAMER. i want to play my games at at least 55-60 FPS and nothing else matters. i got radeon 9600pro. that's what i can afford. if fx5600u was faster i would've got it instead. brand doesn't matter if i got 60FPS at 1024x768.
