Advancing Primitives: Dual Graphics Engines & New ROPs

AMD has clearly taken NVIDIA’s comments on geometry performance to heart. Along with issuing their manifesto with the 6800 series, they’ve also been working on improving their own geometry performance. As a result, AMD’s fixed-function Graphics Engine block is seeing some major improvements for Cayman.

Prior to Cypress, AMD had 1 graphics engine, which contained 1 each of the fundamental blocks: the rasterizer/hierarchical-Z unit, the geometry/vertex assembler, and the tessellator. With Cypress AMD added a 2nd rasterizer and 2nd hierarchical-Z unit, allowing them to set up 32 pixels per clock as opposed to 16 pixels per clock. However, while AMD doubled part of the graphics engine, they did not double the entirety of it, meaning their primitive throughput rate was still 1 primitive/clock, a typical throughput rate even at the time.


Cypress's Graphics Engine

In 2010, with the launch of Fermi, NVIDIA raised the bar on primitive performance. With rasterization moved into NVIDIA’s GPCs, NVIDIA could theoretically push out as many primitives/clock as they had GPCs; in the case of GF100/GF110 this meant 4 primitives/clock, a simply massive improvement in geometry performance for a single generation.

With Cayman AMD is catching up with NVIDIA by increasing their own primitive throughput rate, though not by as much as NVIDIA did with Fermi. For Cayman the rest of the graphics engine is being fully duplicated: Cayman will have 2 separate graphics engines, each containing one of each fundamental block, and each capable of pushing out 1 primitive/clock. Between the two of them AMD’s maximum primitive throughput rate will now be 2 primitives/clock: half of NVIDIA’s peak, but twice that of Cypress.
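To put those per-clock figures in absolute terms, here is a rough sketch that multiplies primitives/clock by core clock. The clock speeds (850 MHz for the Radeon HD 5870, 880 MHz for the HD 6970, 772 MHz for the GTX 580) are reference-card values assumed for illustration, not figures from this page, and the function name is hypothetical. These are theoretical peaks, not measured throughput.

```python
# Theoretical peak primitive rate = primitives per clock * core clock.
# Clock speeds below are assumed reference-card values, not figures from this page.

def peak_primitive_rate(prims_per_clock: int, core_clock_mhz: float) -> float:
    """Return the theoretical peak in millions of primitives per second."""
    return prims_per_clock * core_clock_mhz

# (primitives/clock, assumed core clock in MHz)
gpus = {
    "Cypress (HD 5870)": (1, 850.0),
    "Cayman (HD 6970)": (2, 880.0),
    "GF110 (GTX 580)": (4, 772.0),
}

for name, (ppc, mhz) in gpus.items():
    print(f"{name}: {peak_primitive_rate(ppc, mhz):.0f} Mprims/s")
```

Because the per-clock rate is multiplied by clock speed, a higher core clock narrows the gap in absolute terms, but it cannot close a 2x difference in primitives/clock on its own.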


Cayman's Dual Graphics Engines

As was the case for NVIDIA, splitting up rasterization and tessellation is not a straightforward and easy task. For AMD this meant teaching the graphics engine how to do tile-based load balancing so that the workload spread among the graphics engines is kept as balanced as possible. Furthermore, AMD believes they have an edge on NVIDIA when it comes to design: AMD can scale the number of graphics engines at will, whereas NVIDIA has to work within the logical confines of their GPC/SM/SP ratios. This tidbit would seem to be particularly important for future products, when AMD looks to scale beyond 2 graphics engines.
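AMD has not detailed how the tile-based load balancing works, so the following is only a hypothetical sketch of the general idea: split the screen into tiles and interleave them between engines so that geometry clustered in one region still ends up shared. The 32-pixel tile size, the checkerboard mapping, and all function names are assumptions made for illustration.

```python
# Hypothetical sketch of tile-based load balancing across graphics engines.
# Tile size and the checkerboard mapping are illustrative assumptions;
# the article does not describe AMD's actual scheme.
from collections import Counter

TILE = 32  # tile edge in pixels (assumed)

def engine_for_tile(tx: int, ty: int, num_engines: int = 2) -> int:
    """Interleave tiles so that neighbouring tiles go to different engines."""
    return (tx + ty) % num_engines

def tiles_for_bbox(x0: int, y0: int, x1: int, y1: int):
    """Yield (tx, ty) for every tile overlapped by an inclusive pixel bounding box."""
    for ty in range(y0 // TILE, y1 // TILE + 1):
        for tx in range(x0 // TILE, x1 // TILE + 1):
            yield tx, ty

def distribute(bboxes, num_engines: int = 2) -> Counter:
    """Count tile-sized work items per engine for a list of bounding boxes."""
    load = Counter({engine: 0 for engine in range(num_engines)})
    for bbox in bboxes:
        for tx, ty in tiles_for_bbox(*bbox):
            load[engine_for_tile(tx, ty, num_engines)] += 1
    return load

# Work concentrated in one 128x128 corner of the screen is still split
# between both engines because adjacent tiles alternate.
print(distribute([(0, 0, 127, 127)]))
```

The same mapping takes a `num_engines` parameter, which loosely mirrors AMD's claim that the number of graphics engines can be scaled; whether the real hardware balances work this way is not something the article states.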

At the end of the day all of this tinkering with the graphics engines is necessary in order for AMD to further improve their tessellation performance. AMD’s 7th generation tessellator improved their performance at lower tessellation factors where the tessellator was the bottleneck, but at higher tessellation factors the graphics engine itself is the bottleneck, as it gets swamped with more incoming primitives than it can set up in a single clock. By having two graphics engines and a 2-primitive/clock rasterization rate, AMD is shifting the burden back away from the graphics engine.

Just having two 7th generation-like tessellators goes a long way towards improving AMD’s tessellation performance. However, all of that geometry can still lead to a bottleneck at times, which means it needs to be stored somewhere until it can be processed. As AMD has not changed any cache sizes for Cayman, there’s the same amount of cache for potentially thrice as much geometry, so in order to keep things flowing that geometry has to go somewhere. That somewhere is the GPU’s RAM, or as AMD likes to put it, their “off-chip buffer.” Compared to cache access, RAM is slow, and hence this isn’t necessarily a desirable action, but it’s much, much better than stalling the pipeline entirely while the rasterizers clear out the backlog.
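The trade-off between spilling and stalling can be illustrated with a toy model. Everything here is an assumption made for the sake of the example: the burst sizes, drain rate, and cache capacity are made-up numbers, the function name is hypothetical, and the extra latency of RAM access is not modeled at all. The sketch only counts how many clocks the front end needs to hand off its geometry.

```python
# Toy model: spill excess geometry to an off-chip buffer vs. stalling the front end.
# All capacities and rates are made-up numbers; RAM latency is not modeled.

def front_end_clocks(bursts, drain_per_clock, cache_capacity, allow_spill):
    """Return the clocks needed for the front end to hand off every burst."""
    cached = 0    # primitives held in on-chip cache
    spilled = 0   # primitives parked in the off-chip buffer (RAM)
    clocks = 0
    pending = list(bursts)
    while pending:
        clocks += 1
        # Rasterizers drain the cache, then it is topped up from the spill buffer.
        cached -= min(cached, drain_per_clock)
        refill = min(spilled, cache_capacity - cached)
        cached += refill
        spilled -= refill

        # Accept as much of the current burst as fits on chip.
        take = min(pending[0], cache_capacity - cached)
        cached += take
        remaining = pending[0] - take
        if remaining and allow_spill:
            spilled += remaining   # overflow goes to RAM instead of stalling
            remaining = 0
        if remaining:
            pending[0] = remaining  # stall: retry the leftover next clock
        else:
            pending.pop(0)
    return clocks

bursts = [8, 8, 8, 8]
print("with spill:   ", front_end_clocks(bursts, 2, 8, allow_spill=True))
print("without spill:", front_end_clocks(bursts, 2, 8, allow_spill=False))
```

In this model the spilling front end is never blocked, while the non-spilling one repeatedly waits for the rasterizers to free up cache space. A real GPU would pay a latency cost for the round trip to RAM, which is why the article describes spilling as better than stalling rather than free.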


Red = 6970. Yellow = 5870

Overall, clock for clock tessellation performance is anywhere between 1.5x and 3x that of Cypress. In situations where AMD’s already improved tessellation performance at lower tessellation factors plays a part, AMD approaches 3x performance; while at around a factor of 5 the performance drops to near 1.5x. Elsewhere performance is around 2x that of Cypress, representing the doubling of graphics engines.

Tessellation also factors into AMD’s other major gaming-related improvement: ROP performance. As tessellation produces many tiny triangles, these triangles begin to choke the ROPs when performing MSAA. Although tessellation isn’t the only reason, it certainly plays a part in AMD’s reasoning for improving their ROPs to improve MSAA performance.

The 32 ROPs (the same as Cypress) have been tweaked to speed up processing of certain types of values. In the case of both signed and unsigned normalized INT16s, these operations are now 2x faster. Meanwhile FP32 operations are now 2x to 4x faster depending on the scenario. Finally, similar to shader read ops for compute purposes, ROP write ops for graphics purposes can be coalesced, improving performance by requiring fewer operations.
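The general idea behind write coalescing can be sketched as follows: writes that land in the same aligned memory segment are merged into one operation. The 64-byte segment size and the function name are assumptions for illustration, not Cayman specifications, and real hardware would apply additional constraints that this sketch ignores.

```python
# Illustrative sketch of write coalescing: writes that fall in the same
# aligned memory segment are merged into a single memory operation.
# The 64-byte segment size is an assumption, not a Cayman specification.

SEGMENT_BYTES = 64

def coalesce(write_addresses, segment_bytes=SEGMENT_BYTES):
    """Return the sorted list of distinct segments touched by the byte addresses."""
    return sorted({addr // segment_bytes for addr in write_addresses})

# Sixteen 4-byte pixel writes to consecutive addresses share one segment.
writes = list(range(0, 64, 4))
print(len(writes), "writes ->", len(coalesce(writes)), "memory operation(s)")

# Writes scattered across segments cannot be merged.
scattered = [0, 64, 128]
print(len(scattered), "writes ->", len(coalesce(scattered)), "memory operation(s)")
```

The benefit depends entirely on locality: neighbouring pixels coalesce well, while scattered writes gain nothing.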


168 Comments


  • AnnihilatorX - Thursday, December 16, 2010

    I disagree with you rarson

    This is what sets Anandtech apart: it has quality over quantity.
    Anandtech is the ONLY review site which offers me comprehensive information on the architecture, with helpful notes on the expected future gaming performance. It mentions AMD intended the 69xx to run on 32nm, and made sacrifices. If you go to Guru3D's review, the editor in the conclusion stated that he doesn't know why the performance lacks the wow factor. Anandtech answered that question with the process node.

    If you want to read reviews only, go onto google and search for 6850 review, or go to DailyTech's daily recent hardware review post, you can find over 15 plain reviews. Even easier, just use the Quick Navigation menu or the Table of Contents on the freaking first page of the article. This laziness does not entice sympathy.
  • Quidam67 - Thursday, December 16, 2010

    Rarson's comments may have been a little condescending in their tone, but I think the criticism was actually constructive in nature.

    You can argue the toss about whether the architecture should be in a separate article or not, but personally speaking, I actually would prefer it was broken out. I mean, for those who are interested, simply provide a hyper-link, that way everyone gets what they want.

    In my view, a review is a review and an analysis on architecture can complement that review but should not actually be a part of the review itself. A number of other sites follow this formula, and provide both, but don't merge them together as one super-article, and there are other benefits to this if you read on.

    The issue of spelling and grammar is trivial, but in fact could be symptomatic of a more serious problem, such as the sheer volume of work Ryan has to perform in the time-frame provided, and the level of QA being squeezed in with it. Given the nature of NDAs, perhaps it might take the pressure off if the review did come first, and the architecture second, so the time-pressures weren't quite so restrictive.

    Lastly, employing a professional proof-reader is hardly an insult to the original author. It's no different than being a software engineer (which I am) and being backed up by a team of quality test analysts. It certainly makes you sleep better when stuff goes into production. Why should Ryan shoulder all the responsibility?
  • silverblue - Thursday, December 16, 2010

    I do hope you're joking. :) (can't tell at this early time)
  • Arnulf - Thursday, December 16, 2010

    "... unlike Turbo which is a positive feedback mechanism."

    Turbo is a negative feedback mechanism. If it was a positive feedback mechanism (= a consequence of an action resulting in further action in same direction) the CPU would probably burn up almost instantly after Turbo triggered as its clock would increase indefinitely, ever more following each increase, the higher the temperature, the higher the frequency. This is not how Turbo works.

    A negative feedback mechanism is one where an action results in a reaction (= action in the opposite direction). In the case of CPUs and Turbo it's this reaction to temperature that keeps CPU frequency under control. The higher the temperature, the lower the frequency. This is how Turbo and PowerTune work.

    The fact that Turbo starts at lower frequency and ramps it up and that PowerTune starts at higher frequency and brings it down has no bearing on whether the mechanism of control is called "positive" or "negative" feedback.

    Considering your fondness for Wikipedia (as displayed by the reference in the article) you might want to check out these:

    http://en.wikipedia.org/wiki/Negative_feedback
    http://en.wikipedia.org/wiki/Positive_feedback

    and more specifically:

    http://en.wikipedia.org/wiki/Negative_feedback#Con...
  • Ryan Smith - Thursday, December 16, 2010

    Hi Arnulf;

    Fundamentally you're right, so I won't knock you. I guess you could say I'm going for a very loose interpretation there. The point I'm trying to get across is that Turbo provides a performance floor, while PowerTune is a performance ceiling. People like getting extra performance for "free" more than they like "losing" performance. Hence one experience is positive and one is negative.

    I think in retrospect I should have used positive/negative reinforcement instead of feedback.
  • Soda - Thursday, December 16, 2010

    Anyone noticed the missing edge of the board's 8-pin power connector?

    Apparently AMD made a mistake in the reference design of the board and didn't calculate the space needed by the cooler.

    If you look closely at the power connector in http://images.anandtech.com/doci/4061/6970Open.jpg you'll notice the missing edge.

    For a full story on the matter you can go to http://www.hardwareonline.dk/nyheder.aspx?nid=1060...
    For the English speaking people I suggest the googlish version here http://translate.google.com/translate?hl=da&sl...

    There are some pictures there to back up the claim that AMD made a mistake.

    Though it hasn't been confirmed by AMD whether this is only a mistake on the review boards or on all cards of the 69xx series.
  • versesuvius - Thursday, December 16, 2010

    I have a 3870, on a 17 inch monitor, and everything is fine as far as games go. The hard disk gets in the way sometimes, but that is just about it. All the games run fine. No problem at all. Oh, there's more: They run better on the lousy XBOX. Why the new GPU then? Giant monitors? Three of them? Six of them? (The most fun I had on Anandtech was looking at pictures of AT people trying to stabilize them on a wall). Oh, the "Compute GPU"? Wouldn't that fit on a small PCI card, and act like the old 486 coprocessor, for those who have some use for it? Or is it just a silly excuse for not doing much at all, or rather not giving much to the customers, and still charging the same? The "High End"! In an ideal world the prices of things go down, and more and more people can afford them. That lovely capitalist idea was turned on its head, sometime in the eighties of the last century, and instead the notion of value was reinvented. You get more value, for the same price. You still have to pay $400 for your graphics card, even though you do not need the "Compute GPU", and you do not need the aliased superduper antialiasing that nobody yet knows how to achieve in software. Can we have a cheap 4870? No, that is discontinued. The 58 series? Discontinued. There are hundreds of thousands or, to be sure, millions of people who will pay 50 dollars for one. All ATI or Nvidia need to do is to fine tune the drivers and reduce power consumption. Then again, that must be another "High End" story. In fact the only tale that is being told and retold is "High End"s and "Fool"s, (i.e. "We can do whatever we want with the money that you don't have".) Until better, saner times. For now, long live the console. I am going to buy one, instead of this stupid monstrosity and its equally stupid competitive monstrosity. Cheaper, and gets the job done in more than one way.

    End of Rant.
    God Bless.
  • Necc - Thursday, December 16, 2010

    So True.
  • Ananke - Thursday, December 16, 2010

    Agree. I have a 5850 and it does work fine, and I got it on day one at a huge discount, but still - it is kind of worthless. Our entertainment comes more exclusively from consoles, and a discrete high end card that commands an above-$100 price tag is worthless. It is a nice touch, but I have no application for it in everyday life, and several months later it is already outdated or discontinued.

    My guess: graphics integrated in the CPU will take over, and the mass market discrete cards will have the fate of the dinosaurs very soon.
  • Quidam67 - Thursday, December 16, 2010

    Wonderfully subversive commentary. Loved it.

    Still, the thing I like about the High end (I'll never buy it until my Mortgage is done with) is that it filters down to the middle/low end.

    Yes, lots of discontinued product lines but for example, I thought the HD5770 was a fantastic product. Gave ample performance for mainstream gamers in a small form-factor (you can even get it in single slot) with low heat and power requirements meaning it was a true drop-in upgrade to your existing rig, with a practical upgrade path to Crossfire X.

    As for the xbox, that hardware is so outdated now that even the magic of software optimisation (a seemingly lost art in the world of PC's) cannot disguise the fact that new games are not going to look any better, or run any faster, than those that came out at launch. Was watching GT5 in demo the other day and with all the hype about how realistic it looks (and plays) I really couldn't get past the massive amount of Jaggies on screen. Also, very limited damage modelling, and in my view that's a nod towards hardware limitations rather than a game-design consideration.
