Advancing Primitives: Dual Graphics Engines & New ROPs

AMD has clearly taken NVIDIA’s comments on geometry performance to heart. Along with issuing their manifesto with the 6800 series, they’ve also been working on their own improvements for their geometry performance. As a result AMD’s fixed function Graphics Engine block is seeing some major improvements for Cayman.

Prior to Cypress, AMD had 1 graphics engine, which contained 1 each of the fundamental blocks: the rasterizers/hierarchical-Z units, the geometry/vertex assemblers, and the tessellator. With Cypress AMD added a 2nd rasterizer and 2nd hierarchical-Z unit, allowing them to set up 32 pixels per clock as opposed to 16 pixels per clock. However while AMD doubled part of the graphics engine, they did not double the entirety of it, meaning their primitive throughput rate was still 1 primitive/clock, a typical throughput rate even at the time.


Cypress's Graphics Engine

In 2010 with the launch of Fermi, NVIDIA raised the bar on primitive performance, with rasterization moved to NVIDIA’s GPCs, NVIDIA could theoretically push out as many primitives/clock as they had GPCs, in the case of GF100/GF110 pushing this to 4 primitives/clock, a simply massive improvement in geometry performance for a single generation.

With Cayman AMD is catching up with NVIDIA by increasing their own primitive throughput rate, though not by as much as NVIDIA did with Fermi. For Cayman the rest of the graphics engine is being fully duplicated – Cayman will have 2 separate graphics engines, each containing one fundamental block, and each capable of pushing out 1 primitive/clock. Between the two of them AMD’s maximum primitive throughput rate will now be 2 primitives/clock; half as much as NVIDIA but twice that of Cypress.


Cayman's Dual Graphics Engines

As was the case for NVIDIA, splitting up rasterization and tessellation is not a straightforward and easy task. For AMD this meant teaching the graphics engine how to do tile-based load balancing so that the workload being spread among the graphics engines is being kept as balanced as possible. Furthermore AMD believes they have an edge on NVIDIA when it comes to design - AMD can scale the number of eraphics engines at will, whereas NVIDIA has to work within the logical confines of their GPC/SM/SP ratios. This tidbit would seem to be particularly important for future products, when AMD looks to scale beyond 2 graphics engines.

At the end of the day all of this tinking with the graphics engines is necessary in order for AMD to further improve their tessellation performance. AMD’s 7th generation tessellator improved their performance at lower tessellation factors where the tessellator was the bottleneck, but at higher tessellation factors the graphics engine itself is the bottleneck as the graphics engine gets swamped with more incoming primitives than it can set up in a single clock. By having two graphics engines and a 2-primitive/clock rasterization rate, AMD is shifting the burden back away from the graphics engine.

Just having two 7th generation-like tessellators goes a long way towards improving AMD’s tessellation performance. However all of that geometry can still lead to a bottleneck at times, which means it needs to be stored somewhere until it can be processed. As AMD has not changed any cache sizes for Cayman, there’s the same amount of cache for potentially thrice as much geometry, so in order to keep things flowing that geometry has to go somewhere. That somewhere is the GPU’s RAM, or as AMD likes to put it, their “off-chip buffer.” Compared to cache access RAM is slow and hence this isn’t necessarily a desirable action, but it’s much, much better than stalling the pipeline entirely while the rasterizers clear out the backlog.


Red = 6970. Yellow = 5870

Overall, clock for clock tessellation performance is anywhere between 1.5x and 3x that of Cypress. In situations where AMD’s already improved tessellation performance at lower tessellation factors plays a part, AMD approaches 3x performance; while at around a factor of 5 the performance drops to near 1.5x. Elsewhere performance is around 2x that of Cypress, representing the doubling of graphics engines.

Tessellation also plays a factor in AMD’s other major gaming-related improvement: ROP performance. As tessellation produces many mini triangles, these triangles begin to choke the ROPs when performing MSAA. Although tessellation isn’t the only reason, it certainly plays a factor in AMD’s reasoning for improving their ROPs to improve MSAA performance.

The 32 ROPs (the same as Cypress) have been tweaked to speed up processing of certain types of values. In the case of both signed and unsigned normalized INT16s, these operations are now 2x faster. Meanwhile FP32 operations are now 2x to 4x faster depending on the scenario. Finally, similar to shader read ops for compute purposes, ROP write ops for graphics purposes can be coalesced, improving performance by requiring fewer operations.

Cayman: The New Dawn of AMD GPU Computing Redefining TDP With PowerTune
Comments Locked

168 Comments

View All Comments

  • MeanBruce - Wednesday, December 15, 2010 - link

    TechPowerUp.com shows the 6850 as 95percent or almost double the performance of the 4850 and 100percent more efficient than the 4850@1920x1200. I also am upgrading an old 4850, as far as the 6950 check their charts when they come up later today.
  • mapesdhs - Monday, December 20, 2010 - link


    Today I will have completed by benchmark pages comparing 4890, 8800GT and
    GTX 460 1GB (800 and 850 core speeds), in both single and CF/SLI, for a range
    of tests. You should be able to extrapolate between known 4850/4890 differences,
    the data I've accumulated, and known GTX 460 vs. 68xx/69xx differences (baring
    in mind I'm testing with 460s with much higher core clocks than the 675 reference
    speed used in this article). Email me at mapesdhs@yahoo.com and I'll send you
    the URL once the data is up. I'm testing with 3DMark06, Unigine (Heaven, Tropics
    and Sanctuary), X3TC, Stalker COP, Cinebench, Viewperf and PT Boats. Later
    I'll also test with Vantage, 3DMark11 and AvP.

    Ian.
  • ZoSo - Wednesday, December 15, 2010 - link

    Helluva 'Bang for the Buck' that's for sure! Currently I'm running a 5850, but I have been toying with the idea of SLI or CF. For a $300 difference, CF is the way to go at this point.
    I'm in no rush, I'm going to wait at least a month or two before I pull any triggers ;)
  • RaistlinZ - Wednesday, December 15, 2010 - link

    I'm a bit underwhelmed from a performance standpoint. I see nothing that will make me want to upgrade from my trusty 5870.

    I would like to see a 2x6950 vs 2x570 comparison though.
  • fausto412 - Wednesday, December 15, 2010 - link

    exactly my feelings.

    it's like thinking Miss Universe is about to screw you and then you find out it's her mom....who's probably still hot...but def not miss universe
  • Paladin1211 - Wednesday, December 15, 2010 - link

    CF scaling is truly amazing now, I'm glad that nVidia has something to catch up in terms of driver. Meanwhile, the ATI wrong refresh rate is not fixed, it stucks at 60hz where the monitor can do 75hz. "Refresh force", "refresh lock", "ATI refresh fix", disable /enable EDID, manually set monitor attributes in CCC, EDID hack... nothing works. Even the "HUGE" 10.12 driver can't get my friend's old Samsung SyncMaster 920NW to work at its native 1440x900@75hz, both in XP 32bit and win 7 64bit. My next monitor will be an 120hz for sure, and I don't want to risk and ruin my investment, AMD.
  • mapesdhs - Monday, December 20, 2010 - link


    I'm not sure if this will help fix the refresh issue (I do the following to fix max res
    limits), but try downloading the drivers for the monitor but modify the data file
    before installing them. Check to ensure it has the correct genuine max res and/or
    max refresh.

    I've been using various models of CRT which have the same Sony tube that can
    do 2048 x 1536, but every single vendor that sells models based on this tube has
    drivers that limited the max res to 1800x1440 by default, so I edit the file to enable
    2048 x 1536 and then it works fine, eg. HP P1130.

    Bit daft that drivers for a monitor do not by default allow one to exploit the monitor
    to its maximum potential.

    Anyway, good luck!!

    Ian.
  • techworm - Wednesday, December 15, 2010 - link

    future DX11 games will stress GPU and video RAM incrementally and it is then that 6970 will shine so it's obvious that 6970 is a better and more future proof purchase than GTX570 that will be frame buffer limited in near future games
  • Nickel020 - Wednesday, December 15, 2010 - link

    In the table about whether PowerTune affects an application or not there's a yes for 3DMark, and in the text you mention two applications saw throttling (with 3DMark it would be three). Is this an error?

    Also, you should maybe include that you're measuring the whole system power in the PowerTune tables, it might be confusing for people who don't read your reviews very often to see that the power draw you measured is way higher than the PowerTune level.

    Reading the rest now :)
  • stangflyer - Wednesday, December 15, 2010 - link

    Sold my 5970 waiting for 6990. With my 5970 playing games at 5040x1050 I would always have a 4th extended monitor hooked up to a tritton uve-150 usb to vga adapter. This would let me game while having the fourth monitor display my teamspeak, afterburner, and various other things.
    Question is this!! Can i use the new 6950/6970 and use triple monitor and also use a 4th screen extended at the same time? I have 3 matching dell native display port monitors and a fourth with vga/dvi. Can I use the 2 dp's and the 2 dvi's on the 6970 at the same time? I have been looking for this answer for hours and can't find it! Thanks for the help.

Log in

Don't have an account? Sign up now