Advancing Primitives: Dual Graphics Engines & New ROPs

AMD has clearly taken NVIDIA’s comments on geometry performance to heart. Along with issuing their manifesto with the 6800 series, they’ve also been working on their own improvements for their geometry performance. As a result AMD’s fixed function Graphics Engine block is seeing some major improvements for Cayman.

Prior to Cypress, AMD had 1 graphics engine, which contained 1 each of the fundamental blocks: the rasterizers/hierarchical-Z units, the geometry/vertex assemblers, and the tessellator. With Cypress AMD added a 2nd rasterizer and 2nd hierarchical-Z unit, allowing them to set up 32 pixels per clock as opposed to 16 pixels per clock. However while AMD doubled part of the graphics engine, they did not double the entirety of it, meaning their primitive throughput rate was still 1 primitive/clock, a typical throughput rate even at the time.


Cypress's Graphics Engine

In 2010 with the launch of Fermi, NVIDIA raised the bar on primitive performance. With rasterization moved to NVIDIA’s GPCs, NVIDIA could theoretically push out as many primitives/clock as they had GPCs; in the case of GF100/GF110 this meant 4 primitives/clock, a simply massive improvement in geometry performance for a single generation.

With Cayman AMD is catching up with NVIDIA by increasing their own primitive throughput rate, though not by as much as NVIDIA did with Fermi. For Cayman the rest of the graphics engine is being fully duplicated: Cayman will have 2 separate graphics engines, each containing one of each fundamental block, and each capable of pushing out 1 primitive/clock. Between the two of them AMD’s maximum primitive throughput rate will now be 2 primitives/clock; half as much as NVIDIA but twice that of Cypress.
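To make the comparison concrete, here is a minimal sketch of the peak-rate arithmetic. The per-clock figures are the ones quoted above; the function names and the clock value used in the usage note are illustrative assumptions, not measured data.

```python
# Peak primitives per clock, as quoted in the article text.
PRIMS_PER_CLOCK = {"Cypress": 1, "Cayman": 2, "GF100/GF110": 4}


def peak_primitive_rate(prims_per_clock: int, clock_mhz: float) -> float:
    """Theoretical peak rate in millions of primitives per second."""
    return prims_per_clock * clock_mhz


def relative_throughput(gpu: str, baseline: str) -> float:
    """Clock-for-clock primitive throughput of `gpu` relative to `baseline`."""
    return PRIMS_PER_CLOCK[gpu] / PRIMS_PER_CLOCK[baseline]
```

For example, `relative_throughput("Cayman", "Cypress")` gives the doubling described above, and `peak_primitive_rate(2, 800.0)` shows how a hypothetical 800MHz part at 2 primitives/clock would peak at 1600 million primitives per second. These are theoretical ceilings only; real throughput depends on the workload.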


Cayman's Dual Graphics Engines

As was the case for NVIDIA, splitting up rasterization and tessellation is not a straightforward and easy task. For AMD this meant teaching the graphics engine how to do tile-based load balancing so that the workload spread among the graphics engines is kept as balanced as possible. Furthermore AMD believes they have an edge on NVIDIA when it comes to design: AMD can scale the number of graphics engines at will, whereas NVIDIA has to work within the logical confines of their GPC/SM/SP ratios. This tidbit would seem to be particularly important for future products, when AMD looks to scale beyond 2 graphics engines.
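AMD has not detailed how the tile-based load balancing works, so the following is only a sketch of the general idea under stated assumptions: the screen is cut into fixed-size tiles and tiles are handed to engines in a diagonal interleave. The tile size, the `(tx + ty) % n_engines` mapping, and the function names are all hypothetical.

```python
from collections import Counter


def engine_for_tile(tx: int, ty: int, n_engines: int) -> int:
    """Assumed mapping: interleave tiles diagonally across engines."""
    return (tx + ty) % n_engines


def tiles_per_engine(width: int, height: int, tile_size: int, n_engines: int) -> dict:
    """Count how many screen tiles each engine owns under the assumed mapping."""
    tiles_x = -(-width // tile_size)   # ceiling division
    tiles_y = -(-height // tile_size)
    counts = Counter()
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            counts[engine_for_tile(tx, ty, n_engines)] += 1
    return dict(counts)
```

With two engines this mapping is a checkerboard, so neighboring tiles (which tend to have similar geometry density) land on different engines. The same function accepts a larger `n_engines`, which mirrors the scaling argument above, though whether AMD's hardware balances work this way is not stated.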

At the end of the day all of this tinkering with the graphics engines is necessary in order for AMD to further improve their tessellation performance. AMD’s 7th generation tessellator improved their performance at lower tessellation factors where the tessellator was the bottleneck, but at higher tessellation factors the graphics engine itself is the bottleneck as the graphics engine gets swamped with more incoming primitives than it can set up in a single clock. By having two graphics engines and a 2-primitive/clock rasterization rate, AMD is shifting the burden back away from the graphics engine.

Just having two 7th generation-like tessellators goes a long way towards improving AMD’s tessellation performance. However all of that geometry can still lead to a bottleneck at times, which means it needs to be stored somewhere until it can be processed. As AMD has not changed any cache sizes for Cayman, there’s the same amount of cache for potentially twice as much geometry, so in order to keep things flowing that geometry has to go somewhere. That somewhere is the GPU’s RAM, or as AMD likes to put it, their “off-chip buffer.” Compared to cache access, RAM is slow and hence this isn’t necessarily a desirable action, but it’s much, much better than stalling the pipeline entirely while the rasterizers clear out the backlog.
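The trade-off can be illustrated with a toy queue model. This is a sketch only: the capacities, the drain rate, and the refill policy are invented for illustration and do not describe Cayman's actual buffering.

```python
def simulate_spill(arrivals, drain_per_clock, on_chip_capacity):
    """Toy model: overflow geometry spills to an off-chip buffer instead of stalling.

    arrivals: primitives produced by the tessellators on each clock.
    Returns (on_chip, off_chip, peak_off_chip) after the last clock.
    """
    on_chip = 0
    off_chip = 0
    peak_off_chip = 0
    for incoming in arrivals:
        # Rasterizers drain what is already held on chip.
        on_chip -= min(on_chip, drain_per_clock)
        # Older spilled primitives are pulled back on chip first.
        refill = min(off_chip, on_chip_capacity - on_chip)
        off_chip -= refill
        on_chip += refill
        # New primitives fill the remaining space; the excess spills to RAM.
        accepted = min(incoming, on_chip_capacity - on_chip)
        on_chip += accepted
        off_chip += incoming - accepted
        peak_off_chip = max(peak_off_chip, off_chip)
    return on_chip, off_chip, peak_off_chip
```

In this model a burst such as `simulate_spill([6, 6, 0, 0, 0, 0], 2, 4)` never blocks the producer: the excess sits off chip and is drained once the burst ends. The model ignores the extra latency of the RAM round trip, which is the cost the text describes.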


Red = 6970. Yellow = 5870

Overall, clock for clock tessellation performance is anywhere between 1.5x and 3x that of Cypress. In situations where AMD’s already improved tessellation performance at lower tessellation factors plays a part, AMD approaches 3x performance; while at around a factor of 5 the performance drops to near 1.5x. Elsewhere performance is around 2x that of Cypress, representing the doubling of graphics engines.

Tessellation is also a factor in AMD’s other major gaming-related improvement: ROP performance. As tessellation produces many tiny triangles, these triangles begin to choke the ROPs when performing MSAA. Although tessellation isn’t the only reason, it certainly figures into AMD’s reasoning for improving their ROPs to improve MSAA performance.

The 32 ROPs (the same as Cypress) have been tweaked to speed up processing of certain types of values. In the case of both signed and unsigned normalized INT16s, these operations are now 2x faster. Meanwhile FP32 operations are now 2x to 4x faster depending on the scenario. Finally, similar to shader read ops for compute purposes, ROP write ops for graphics purposes can be coalesced, improving performance by requiring fewer operations.
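Write coalescing in general means merging several small writes to contiguous addresses into fewer, larger operations. The sketch below shows that idea only; the `(address, length)` representation, the `max_burst` limit, and the sorting step are assumptions made for illustration, not a description of how Cayman's ROPs do it.

```python
def coalesce_writes(writes, max_burst):
    """Merge contiguous (address, length) writes into bursts of at most max_burst bytes."""
    merged = []
    for addr, length in sorted(writes):
        if merged:
            last_addr, last_len = merged[-1]
            contiguous = last_addr + last_len == addr
            if contiguous and last_len + length <= max_burst:
                # Extend the previous burst instead of issuing a new operation.
                merged[-1] = (last_addr, last_len + length)
                continue
        merged.append((addr, length))
    return merged
```

Here four adjacent 4-byte writes plus one distant write, `[(0, 4), (4, 4), (8, 4), (12, 4), (32, 4)]`, collapse from five operations to two when `max_burst` is 16, which is the "fewer operations" benefit the paragraph refers to.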

168 Comments

  • mac2j - Wednesday, December 15, 2010 - link

    Um - if you have the money for a 580 ... pick up another $80-100 and get 2 x 6950 - you'll get nearly the best possible performance on the market at a similar cost.

    Also I agree that Nvidia will push the 580 price down as much as possible... the problem is that if you believe all of the admittedly "unofficial" breakdowns ... it costs Nvidia 1.5-2x as much to make a 580 as it costs AMD to make a 6970.

    So its hard to be sure how far Nvidia can push down the price on the 580 before it ceases to become profitable - my guess is they'll focus on making a 565 type card which has almost 570 performance but for a manufacturing cost closer to what a 460 runs them.
  • fausto412 - Wednesday, December 15, 2010 - link

    yeah. AMD let us down on this here product. We see what gtx580 is and what 6970 is...i would say if you planning to spend 500...the gtx580 is worth it.
  • truepurple - Wednesday, December 15, 2010 - link

    "support for color correction in linear space"

    What does that mean?
  • Ryan Smith - Wednesday, December 15, 2010 - link

    There are two common ways to represent color, linear and gamma.

    Linear: Used for rendering an image. More generally linear has a simple, fixed relationship between X and Y, such that if you drew the relationship it would be a straight line. A linear system is easy to work with because of the simple relationship.

    Gamma: Used for final display purposes. It's a non-linear colorspace that was originally used because CRTs are inherently non-linear devices. If you drew out the relationship, it would be a curved line. The 5000 series is unable to apply color correction in linear space and has to apply it in gamma space, which for the purposes of color correction is not as accurate.
  • IceDread - Wednesday, December 15, 2010 - link

    Yet again we do not get to see hd 5970 in crossfire despite it being a single card! Is this an nvidia site?

    Anyway, for those of you who do want to see those results, here is a link to a professional Swedish site!

    http://www.sweclockers.com/recension/13175-amd-rad...

    Maybe there is some google translation available or so if you want to understand more than the charts shows.
  • medi01 - Wednesday, December 15, 2010 - link

    Wow, 5970 in crossfire consumes less than 580 in SLI.
    http://www.sweclockers.com/recension/13175-amd-rad...
  • ggathagan - Wednesday, December 15, 2010 - link

    Absolutely!!!
    There's no way on God's green earth that Anandtech doesn't currently have a pair of 5970's on hand, so that MUST be the reason.
    I'll go talk to Anand and Ryan right now!!!!
    Oh, wait, they're on a conference call with Huang Jen-Hsun.....

    I'd like to note that I do not believe Anandtech ever did a test of two 5970's, so it's somewhat difficult to supply non-existent into any review.
    Ryan did a single card test in November 2009. That is the only review I've found of any 5970's on the site.
  • vectorm12 - Wednesday, December 15, 2010 - link

    I was not aware of the fact that the 32nm process had been canned completely and was still expecting the 6970 to blow the 580 out of the water.

    Although we can't possibly know and are unlikely to ever find out what cayman at 32nm would have performed like I suspect AMD had to give up a good chunk of performance to fit it on the 389mm^2 40nm die.

    This really makes my choice easy as I'll pickup another cheap 5870 and run my system in CF.
    I think I'll be able to live with the performance until the refreshed cayman/next gen GPUs are ready for prime time.

    Ryan: I'd really like to see what ighashgpu can do with the new 6970 cards though. Although you produce a few GPGPU charts I feel like none of them really represent the real "number-crunching" performance of the 6970/6950.

    Ivan has already posted his analysis in his blog and it seems like the change from VLIW5 to VLIW4 made a negligible impact at the most. However I'd really love to see ighashgpu included in future GPU tests to test new GPUs and architectures.

    Thanks for the site and keep up the work guys!
  • slagar - Wednesday, December 15, 2010 - link

    Gaming seems to be in the process of bursting its own bubble. Graphics of games isn't keeping up with the hardware (unless you count gaming on 6 monitors) because most developers are still targeting consoles with much older technology.
    Consoles won't upgrade for a few more years, and even then, I'm wondering how far we are from "the final console generation". Visual improvements in graphics are becoming quite incremental, so it's harder to "wow" consumers into buying your product, and the costs for developers is increasing, so it's becoming harder for developers to meet these standards. Tools will always improve and make things easier and more streamlined over time I suppose, but still... it's going to be an interesting decade ahead of us :)
  • darckhart - Wednesday, December 15, 2010 - link

    that's not entirely true. the hardware now allows not only insanely high resolutions, but it also lets those of us with more stringent IQ requirements (large custom texture mods, SSAA modes, etc) to run at acceptable framerates at high res in intense action spots.
