Advancing Primitives: Dual Graphics Engines & New ROPs

AMD has clearly taken NVIDIA’s comments on geometry performance to heart. Along with issuing their manifesto with the 6800 series, they’ve also been working on their own improvements for their geometry performance. As a result AMD’s fixed function Graphics Engine block is seeing some major improvements for Cayman.

Prior to Cypress, AMD had 1 graphics engine, which contained 1 each of the fundamental blocks: the rasterizers/hierarchical-Z units, the geometry/vertex assemblers, and the tessellator. With Cypress AMD added a 2nd rasterizer and 2nd hierarchical-Z unit, allowing them to set up 32 pixels per clock as opposed to 16 pixels per clock. However while AMD doubled part of the graphics engine, they did not double the entirety of it, meaning their primitive throughput rate was still 1 primitive/clock, a typical throughput rate even at the time.


Cypress's Graphics Engine

In 2010 with the launch of Fermi, NVIDIA raised the bar on primitive performance, with rasterization moved to NVIDIA’s GPCs, NVIDIA could theoretically push out as many primitives/clock as they had GPCs, in the case of GF100/GF110 pushing this to 4 primitives/clock, a simply massive improvement in geometry performance for a single generation.

With Cayman AMD is catching up with NVIDIA by increasing their own primitive throughput rate, though not by as much as NVIDIA did with Fermi. For Cayman the rest of the graphics engine is being fully duplicated – Cayman will have 2 separate graphics engines, each containing one fundamental block, and each capable of pushing out 1 primitive/clock. Between the two of them AMD’s maximum primitive throughput rate will now be 2 primitives/clock; half as much as NVIDIA but twice that of Cypress.


Cayman's Dual Graphics Engines

As was the case for NVIDIA, splitting up rasterization and tessellation is not a straightforward and easy task. For AMD this meant teaching the graphics engine how to do tile-based load balancing so that the workload being spread among the graphics engines is being kept as balanced as possible. Furthermore AMD believes they have an edge on NVIDIA when it comes to design - AMD can scale the number of eraphics engines at will, whereas NVIDIA has to work within the logical confines of their GPC/SM/SP ratios. This tidbit would seem to be particularly important for future products, when AMD looks to scale beyond 2 graphics engines.

At the end of the day all of this tinking with the graphics engines is necessary in order for AMD to further improve their tessellation performance. AMD’s 7th generation tessellator improved their performance at lower tessellation factors where the tessellator was the bottleneck, but at higher tessellation factors the graphics engine itself is the bottleneck as the graphics engine gets swamped with more incoming primitives than it can set up in a single clock. By having two graphics engines and a 2-primitive/clock rasterization rate, AMD is shifting the burden back away from the graphics engine.

Just having two 7th generation-like tessellators goes a long way towards improving AMD’s tessellation performance. However all of that geometry can still lead to a bottleneck at times, which means it needs to be stored somewhere until it can be processed. As AMD has not changed any cache sizes for Cayman, there’s the same amount of cache for potentially thrice as much geometry, so in order to keep things flowing that geometry has to go somewhere. That somewhere is the GPU’s RAM, or as AMD likes to put it, their “off-chip buffer.” Compared to cache access RAM is slow and hence this isn’t necessarily a desirable action, but it’s much, much better than stalling the pipeline entirely while the rasterizers clear out the backlog.


Red = 6970. Yellow = 5870

Overall, clock for clock tessellation performance is anywhere between 1.5x and 3x that of Cypress. In situations where AMD’s already improved tessellation performance at lower tessellation factors plays a part, AMD approaches 3x performance; while at around a factor of 5 the performance drops to near 1.5x. Elsewhere performance is around 2x that of Cypress, representing the doubling of graphics engines.

Tessellation also plays a factor in AMD’s other major gaming-related improvement: ROP performance. As tessellation produces many mini triangles, these triangles begin to choke the ROPs when performing MSAA. Although tessellation isn’t the only reason, it certainly plays a factor in AMD’s reasoning for improving their ROPs to improve MSAA performance.

The 32 ROPs (the same as Cypress) have been tweaked to speed up processing of certain types of values. In the case of both signed and unsigned normalized INT16s, these operations are now 2x faster. Meanwhile FP32 operations are now 2x to 4x faster depending on the scenario. Finally, similar to shader read ops for compute purposes, ROP write ops for graphics purposes can be coalesced, improving performance by requiring fewer operations.

Cayman: The New Dawn of AMD GPU Computing Redefining TDP With PowerTune
Comments Locked

168 Comments

View All Comments

  • cyrusfox - Wednesday, December 15, 2010 - link

    You should totally be able to do a 4X1 display, 2 DP and 2 DVI, as long as one of those DP dells also has a DVI input. That would get rid of the need for your usb-vga adapter.
  • gimmeagdlaugh - Wednesday, December 15, 2010 - link

    Not sure why AMD 6970 has green bar,
    while NV 580 has red bar...?
  • medi01 - Wednesday, December 15, 2010 - link

    Also wondering. Did nVidia marketing guys called again?
  • Ryan Smith - Wednesday, December 15, 2010 - link

    I normally use green for new products. That's all there is to it.
  • JimmiG - Wednesday, December 15, 2010 - link

    Still don't like the idea of Powertune. Games with a high power load are the ones that fully utilize many parts of the GPU at the same time, while less power hungry games only utilize parts of it. So technically, the specifications are *wrong* as printed in the table on page one.

    The 6970 does *not* have 1536 stream processors at 880 MHz. Sure, it may have 1536 stream processors, and it may run at up to 880 MHz.. But not at the same time!

    So if you fully utilize all 1536 processors, maybe it's a 700 MHz GPU.. or to put it another way, if you want the GPU to run at 880 MHz, you may only utilize, say 1200 stream processors.
  • cyrusfox - Wednesday, December 15, 2010 - link

    I think Anand did a pretty good job of explaining at how it reasonably power throttles the card. Also as 3rd party board vendors will probably make work-arounds for people who abhor getting anything but the best performance(even at the cost of efficiency). I really don't think this is much of an issue, but a good development that is probably being driven by Fusion for Ontario, Zacate, and llano. Also only Metro 2033 triggered any reduction(850Mhz from 880Mhz). So your statement of a crippled GPU only holds for Furmark, nothing got handicapped to 700Mhz. Games are trying to efficiently use all the GPU has to offer, so I don't believe we will see many games at all trigger the use of powertune throttling.
  • JimmiG - Wednesday, December 15, 2010 - link

    Perhaps, but there's no telling what kind of load future DX11 games, combined with faster CPUs will put on the GPU. Programs like Furmark don't do anything unusual, they don't increase GPU clocks or voltages or anything like that - they just tell the GPU - "Draw this on the screen as fast as you can".

    It's the same dilemma overclockers face - Do I keep this higher overclock that causes the system to crash with stress tests but works fine with games and benchmarks? Or do I back down a few steps to guarantee 100% stability. IMO, no overclock is valid unless the system can last through the most rigorous stress tests without crashes, errors or thermal protection kicking in.

    Also, having a card that throttles with games available today tells me that it's running way to close to the thermal limit. Overclocking in this case would have to be defined as simply disabling the protection to make the GPU always work at the advertised speed.
    It's a lazy solution, what they should have done is go back to the drawing board until the GPU hits the desired performance target while staying within the thermal envelope. Prescott showed that you can't just keep adding stuff without any considerations for thermals or power usage.
  • AnnihilatorX - Wednesday, December 15, 2010 - link

    Didn't you see you can increase the throttle threshold by 20% in Catalyst Control Centre. This means 300W until it throttles, which in a sense disables the PowerTune.
  • Mr Perfect - Thursday, December 16, 2010 - link

    On page eight Ryan mentions that Metro 2033 DID get throttled to 700MHz. The 850MHz number was reached by averaging the amount of time Metro was at 880MHz with the time it ran at 700MHz.

    Which is a prime example of why I hate averages in reviews. If you have a significantly better "best case", you can get away with a particularly bad "worst case" and end up smelling like roses.
  • fausto412 - Wednesday, December 15, 2010 - link

    CPU's have been doing this for a while...and you are allowed to turn the feature off. AMD is giving you a range to go over.

    It will cut down on RMA's, Extend Reliability.

Log in

Don't have an account? Sign up now