Wrapping Up the Architecture and Efficiency Discussion

Engineering is all about tradeoffs and balance. The choice to increase capability in one area may decrease capability in another. The addition of a feature may not be worth the cost of including it. In the worst case, as Intel found with NetBurst, an architecture may be inherently flawed, and starting over down an entirely different path may be the best solution.

We are at a point where there are quite a number of similarities between NVIDIA and AMD hardware. Both require keeping a huge number of threads in flight to hide memory and instruction latency, and both manage threads in large blocks that share context. Caching, coalescing of memory reads and writes, and resource allocation must all be carefully managed to keep the execution units fed. Both GT200 and RV770 handle branches via dynamic predication of the path a thread does not take (meaning that if a thread in a warp or wavefront branches differently from the others, all threads in that group must execute both code paths). Both share instruction and constant caches across hardware that is SIMD in nature, servicing multiple threads in one context in order to present hardware that fits the SPMD (single program, multiple data) programming model.
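To make the divergence penalty concrete, here is a minimal CUDA-style sketch (the kernel name and values are ours, purely illustrative). If the threads within a single 32-thread warp disagree on a branch, the hardware predicates both paths and every thread in that warp pays for both bodies:

    // Minimal sketch of branch divergence. When threads in one warp
    // take different paths, both the 'if' and 'else' bodies execute
    // for the whole warp, with results masked off per thread.
    __global__ void divergent(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (in[i] > 0.0f)            // threads taking this path...
            out[i] = in[i] * 2.0f;
        else                         // ...and threads taking this one
            out[i] = -in[i];         // serialize within a warp/wavefront
    }

A warp whose threads all agree on the branch pays for only one path; the cost appears only when a warp or wavefront mixes both directions.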

But the hearts of GT200 and RV770, the SPA (Streaming Processor Array) and the DPP (Data Parallel Processing) Array, respectively, are quite different. The explicitly scalar, one-operation-per-thread-at-a-time approach NVIDIA has taken is quite different from the 5-wide VLIW approach AMD has packed into its architecture. Both are SIMD in nature, but NVIDIA is more like S(operation)MD while AMD is S(VLIW)MD.
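A rough illustration of the difference (a hypothetical kernel, not vendor code): five independent operations within one thread map to five separate scalar instructions on GT200, issued one at a time per thread, while RV770's compiler can attempt to pack all five into a single 5-wide VLIW bundle:

    // Five operations with no dependencies between them. GT200 issues
    // them as five scalar instructions; RV770's compiler can pack
    // them into one VLIW bundle, filling all five slots at once.
    __global__ void independent_ops(float4 *out, const float4 *a, const float4 *b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float4 x = a[i], y = b[i];
        float4 r;
        r.x = x.x * y.x;        // op 1
        r.y = x.y + y.y;        // op 2   none of these five consume
        r.z = x.z - y.z;        // op 3   another's result, so all can
        r.w = x.w * y.w;        // op 4   execute in the same bundle
        float s = x.x - y.x;    // op 5
        out[i] = make_float4(r.x + s, r.y, r.z, r.w);  // depends on ops 1 and 5
    }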


AMD's RV770, all built up and pretty

Filling each architecture's execution units to capacity is a challenge, but looks to be more consistent on NVIDIA hardware; in cases where AMD hardware is used effectively (like BioShock), RV770 surpasses the GTX 280 not only in performance but in power efficiency as well. Area efficiency is completely owned by AMD, which means that their cost for performance delivered is lower than NVIDIA's (in terms of manufacturing; R&D is a whole other story), since smaller ICs are cheaper to produce.


NVIDIA's GT200, in all its daunting glory

While shader/kernel length isn't as important on GT200 (except that the ratio of FP, and especially multiply-add, operations to other code needs to be high to extract high levels of performance), longer programs make it easier for AMD's compiler to extract ILP. Both RV770 and GT200 must balance thread issue against resource usage, but RV770 can deliver higher performance in situations where ILP can be extracted from shader/kernel code, which can also help in situations where GT200 would not be able to hide latency well.
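As a sketch of the opposite, worst case for RV770 (again illustrative code, not a vendor example), a serially dependent chain leaves the compiler nothing to pack, so only one of the five VLIW slots per bundle can be filled, while GT200's scalar units lose nothing:

    // Hypothetical worst case for VLIW packing: each line consumes
    // the previous result, so there is no instruction-level
    // parallelism for RV770's compiler to extract.
    __global__ void dependent_chain(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = in[i];
        v = v * v + 1.0f;   // depends on the load above
        v = v * v + 1.0f;   // depends on the line above
        v = v * v + 1.0f;   // and so on down the chain
        out[i] = v;
    }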

We believe, based on information found on the CUDA forums and from some of our readers, that G80's SPs have about a 22-stage pipeline, and GT200 is likely also deeply pipelined; AMD has told us that their pipeline is significantly shorter than this, but wouldn't tell us how long it actually is. Regardless, a shorter pipeline and the ability to execute one wavefront over multiple scheduling cycles mean massive amounts of TLP aren't needed just to cover instruction latency. Yes, massive amounts of TLP are still needed to cover memory latency, but shader programs with lots of internal computation can also help RV770 do this.
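As a back-of-envelope sketch of what covering instruction latency takes on NVIDIA hardware (our assumptions: the ~22-stage figure above, and G80's pattern of issuing one 32-thread warp across eight SPs over four cycles):

    // Rough TLP arithmetic, host-side C. All numbers are estimates
    // from the text or common G80 descriptions, not vendor specs.
    #include <stdio.h>

    int main(void)
    {
        int pipeline_depth  = 22;  // rumored G80 SP pipeline depth
        int cycles_per_warp = 4;   // one 32-thread warp issues over
                                   // 4 cycles on an 8-SP SM
        int warp_size       = 32;
        // With fully dependent instruction streams, roughly
        // depth / issue-period independent warps keep the pipe full:
        int warps_needed = (pipeline_depth + cycles_per_warp - 1)
                           / cycles_per_warp;               // ~6 warps
        printf("~%d warps = ~%d threads per SM just for ALU latency\n",
               warps_needed, warps_needed * warp_size);     // ~192 threads
        return 0;
    }

That estimate covers only ALU latency; hiding memory latency on top of it is what pushes the required thread counts much higher.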

All of this adds up to the fact that, despite the advent of DX10 and the fact that both of these architectures are very good at executing large numbers of independent threads very quickly, getting the most out of GT200 and RV770 can require vastly different approaches. Long shaders can benefit RV770 because more ILP can be extracted from them, while the increased resource use of long shaders may mean fewer threads can be issued on GT200, lowering performance. Going the other direction has the opposite effect. Caches and resource availability/management differ, meaning tradeoffs and choices must be made in when and how data is fetched and used. Fixed-function resources differ as well, and optimizing the use of things like texture filtering, along with the impact of the different setup engines, can have a large (and architecture-dependent) effect on performance.
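A hedged sketch of that resource tradeoff on the NVIDIA side: the register file per SM is a fixed pool (16K 32-bit registers on GT200), so the more registers a long shader needs per thread, the fewer threads the hardware can keep in flight. In CUDA this tradeoff is exposed to the programmer through __launch_bounds__; the kernel below is illustrative, standing in for a long, register-hungry shader:

    // threads_in_flight ~ registers_per_SM / registers_per_thread
    //   16 regs/thread: 16384 / 16 = 1024 threads (GT200's per-SM cap)
    //   64 regs/thread: 16384 / 64 =  256 threads
    //
    // __launch_bounds__ asks the compiler to cap register use so a
    // target occupancy stays reachable, trading spills for threads.
    __global__ void __launch_bounds__(256, 4)   // <=256 threads/block,
    long_shader(float *out, const float *in)    // aim for 4 blocks/SM
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * in[i];   // stand-in for a long shader body
    }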

We still haven't gotten to the point where we can write simple shader code that just does what we want it to do and expect it to perform well everywhere. Right now it seems that typical usage models favor GT200, while relative performance can vary wildly on RV770 depending on how well the code fits the hardware. G80 (and thus NVIDIA's architecture) did have a lead in the industry for months before R600 hit the scene, and it wasn't until RV670 that AMD had a real competitor in the marketplace. This could be part of the reason we are seeing fewer titles benefit from the massive amount of compute available on AMD hardware. But with this launch, AMD has solidified their place in the market (as we will see, the 4800 series offers a lot of value), and it will be very interesting to see what happens going forward.

Comments

  • DerekWilson - Wednesday, June 25, 2008 - link

    it looks like the witcher hits an artificial 72fps barrier ... not sure why, as we are running 60hz displays. vsync is disabled, so our best guess is that it's a software issue.
  • JarredWalton - Wednesday, June 25, 2008 - link

    Again, try faster CPUs to verify whether you are game limited or if there is a different bottleneck. The Witcher has a lot of stuff going on graphically that might limit frame rates to 70-75 FPS without a 4GHz Core 2 Duo/Quad chip.
  • chizow - Wednesday, June 25, 2008 - link

    It seems like there's a lot of this going on at the high end, with GT200, multi-GPU and even RV770 chips hitting FPS caps. Are you guys using Vsync in some titles? I saw Assassin's Creed was frame capped; is there a way to remove the cap like there is with UE3.0 games? It just seems like a lot of the results are very flat as you move across resolutions, even at higher resolutions like 16x10 and 19x12.

    Another thing I noticed was that multi-GPU seems to avoid some of this frame capping but the single-GPUs all still hit a wall around the same FPS.

    Anyways, 4870 looks to be a great part, wondering if there will be a 1GB variant and if it will have any impact on performance.
  • DerekWilson - Wednesday, June 25, 2008 - link

    the only test i know where the multi-gpu cards get past a frame limit is oblivion.

    we always run with vsync disabled in games.

    we tend not to try forcing it off in the driver as, interestingly, that decreases performance in situations where it isn't needed.

    we do force it off where we can, but assassins creed is limiting the frame rate even in the absence of vsync.

    not sure about higher memory variants ... gddr5 is still pretty new, and density might not be high enough to hit that. The 4870 does have 16 memory chips on it for its 256-bit memory bus, so space might be an issue too ...
  • JarredWalton - Wednesday, June 25, 2008 - link

    Um, Derek... I think you're CPU/platform limited in Assassin's Creed (http://www.anandtech.com/video/showdoc.aspx?i=3320...). You'll certainly need something faster than 3.2GHz to get much above 63FPS in my experience. Try overclocking to 4.0GHz and see what happens.
  • weevil - Wednesday, June 25, 2008 - link

    I didn't see the heat or noise benchmarks?
  • gwynethgh - Wednesday, June 25, 2008 - link

    No info from AnandTech on heat or noise. The info on the 4870 is most needed, as most reviews indicate the 4850 with the single-slot design/cooler runs very hot. Does the two-slot design pay off in better cooling? Is it quiet?
  • DerekWilson - Wednesday, June 25, 2008 - link

    a quick, not really well controlled test shows the 4850 and 4870 to be on par in terms of heat ... but i can't really go more into it right now.

    the thing is quiet under normal operation but it spins up to a fairly decent level at about 84 degrees. at full speed (which can be heard when the system powers up or under ungodly load and ambient heat conditions) it sounds insanely loud.
  • legoman666 - Wednesday, June 25, 2008 - link

    I don't see the AA comparisons. There is no info on the heat or noise either.
  • DerekWilson - Wednesday, June 25, 2008 - link

    the aa comparison page had a problem with nested quotes in some cases, in combination with some google ads on firefox (though it worked in safari, ie and opera) ...

    this has been fixed ...

    for heat and noise our commentary is up, but we don't have any quantitative data here ... we just had so much else to pack into the review that we didn't quite get testing done here.
