Cypress: What’s New

With our refresher out of the way, let’s discuss what’s new in Cypress.

Starting at the SPU level, AMD has added a number of new hardware instructions to the SPUs and sped up the execution of other instruction, both in order to improve performance and to meet the requirements of various APIs. Among these changes are that some dot products have been reduced to single-cycle computation when they were previously multi-cycle affairs. DirectX 11 required operations such as bit count, insert, and extract have also been added. Furthermore denormal numbers have received some much-needed attention, and can now be handled at full speed.


Perhaps the most interesting instruction added however is an instruction for Sum of Absolute Differences (SAD). SAD is an instruction of great importance in video encoding and computer vision due to its use in motion estimation, and on the RV770 the lack of a native instruction requires emulating it in no less than 12 instructions. By adding a native SAD instruction, the time to compute a SAD has been reduced to a single clock cycle, and AMD believes that it will result in a significant (>2x) speedup in video encoding.

The clincher however is that SAD not an instruction that’s part of either DirectX 11 or OpenCL, meaning DirectX programs can’t call for it, and from the perspective of OpenCL it’s an extension. However these APIs leave the hardware open to do what it wants to, so AMD’s compiler can still use the instruction, it just has to know where to use it. By identifying the aforementioned long version of a SAD in code it’s fed, the compiler can replace that code with the native SAD, offering the native SAD speedup to any program in spite of the fact that it can’t directly call the SAD. Cool, isn’t it?

Last, here is a breakdown of what a single Cypress SP can do in a single clock cycle:

  • 4 32-bit FP MAD per clock
  • 2 64-bit FP MUL or ADD per clock
  • 1 64-bit FP MAD per clock
  • 4 24-bit Int MUL or ADD per clock
  • SFU : 1 32-bit FP MAD per clock

Moving up the hierarchy, the next thing we have is the SIMD. Beyond the improvements in the SPs, the L1 texture cache located here has seen an improvement in speed. It’s now capable of fetching texture data at a blistering 1TB/sec. The actual size of the L1 texture cache has stayed at 16KB. Meanwhile a separate L1 cache has been added to the SIMDs for computational work, this one measuring 8KB. Also improving the computational performance of the SIMDs is the doubling of the local data share attached to each SIMD, which is now 32KB.


At a high level, the RV770 and Cypress SIMDs look very similar

The texture units located here have also been reworked. The first of these changes are that they can now read compressed AA color buffers, to better make use of the bandwidth they have. The second change to the texture units is to improve their interpolation speed by not doing interpolation. Interpolation has been moved to the SPs (this is part of DX11’s new Pull Model) which is much faster than having the texture unit do the job. The result is that a texture unit Cypress has a greater effective fillrate than one under RV770, and this will show up under synthetic tests in particular where the load-it and forget-it nature of the tests left RV770 interpolation bound. AMD’s specifications call for 68 billion bilinear filtered texels per second, a product of the improved texture units and the improved bandwidth to them.

Finally, if we move up another level, here is where we see the cause of the majority of Cypress’s performance advantage over RV770. AMD has doubled the number of SIMDs, moving from 10 to 20. This means twice the number of SPs and twice the number of texture units; in fact just about every statistic that has doubled between RV770 and Cypress is a result of doubling the SIMDs. It’s simple in concept, but as the SIMDs contain the most important units, it’s quite effective in boosting performance.


However with twice as many SIMDs, there comes a need to feed these additional SIMDs, and to do something with their products. To achieve this, the 4 L2 caches have been doubled from 64KB to 128KB. These large L2 caches can now feed data to L1 caches at 435GB/sec, up from 384GB/sec in RV770. Along with this the global data share has been quadrupled to 64KB.


RV770 vs...


Cypress

Next up, the ROPs have been doubled in order to meet the needs of processing data from all of those SIMDs. This brings Cypress to 32 ROPs. The ROPs themselves have also been slightly enhanced to improve their performance; they can now perform fast color clears, as it turns out some games were doing this hundreds of times between frames. They are also responsible for handling some aspects of AMD’s re-introduced Supersampling Anti-Aliasing mode, which we will get to later.

 

Last, but certainly not least, we have the changes to what AMD calls the “graphics engine”, primarily to bring it into compliance with DX11. RV770’s greatly underutilized tessellator has been upgraded to full DX11 compliance, giving it Hull Shader and Domain Shader capabilities, along with using a newer algorithm to reduce tessellation artifacts. A second rasterizer has also been added, ostensibly to feed the beast that is the 20 SIMDs.

A Quick Refresher on the RV770 DirectX11 Redux
Comments Locked

327 Comments

View All Comments

  • SiliconDoc - Monday, September 28, 2009 - link

    When the GTX295 still beats the latest ati card, your wish probably won't come true. Not only that, ati's own 4870x2 just recently here promoted as the best value, is a slap in it's face.
    It's rather difficult to believe all those crossfire promoting red ravers suddenly getting a different religion...
    Then we have the no DX11 released yet, and the big, big problem...
    NO 5870'S IN THE CHANNELS, reports are it's runs hot and the drivers are beta problematic.
    ---
    So, celebrating a red revolution of market share - is only your smart aleck fantasy for now.
    LOL - Awwww...

  • silverblue - Monday, September 28, 2009 - link

    It's nearly as fast as a dual GPU solution. I'd say that was impressive.

    DirectX 11 comes out in less than a month... hardly a wait. It's not as if the card won't do DX9/10.

    Hot card? Designed to be that way. If it was a real issue they'd have made the exhaust larger.

    Beta problematic drivers? Most ATI launches seem to go that way. They'll be fixed soon enough.
  • SiliconDoc - Monday, September 28, 2009 - link

    Gee, I thought the red rooster said nvidia sales will be low for a while, and I pointed out why they won't be, and you, well you just couldn'r handle that.
    I'd say a 60.96% increase in a nex gen gpu is "impressive", and that's what Nvidia did just this last time with GT200.
    http://www.anandtech.com/video/showdoc.aspx?i=3334...">http://www.anandtech.com/video/showdoc.aspx?i=3334...
    --
    BTW - the 4870 to 4890 move had an additional 3M core transistors, and we were told by you and yours that was not a "rebrand".

    BUT - the G80 move to G92 added 73M core transistors, and you couldn't stop shrieking "rebrand".
    ---
    nearly as fast= second best
    DX11 in a month = not now and too early
    hot card -= IT'S OK JUST CLAIM ATI PLANNED ON IT BEING HOT !ROFL, IT'S OK TO LIE ABOUT IT IN REVIEWS, TOO ! COCKA DOODLE DOOO!
    beta drivers = ALL ATI LAUNCHES GO THAT WAY, NOT "MOST"
    ----
    Now, you can tell smart aleck this is a paper launch like the 4870, the 4770, and now this 5870 and no 5850, becuase....
    "YOU'LL PUT YOUR HEAD IN THE SAND AND SCREAM IN CAPS BECAUSE THAT'S HOW YOU ROLL IN RED ROOSERVILLE ! "
    (thanks for the composition Jared, it looks just as good here as when you add it to my posts, for "convenience" of course)
  • ClownPuncher - Monday, September 28, 2009 - link

    It would be awesome if you were to stop posting altogether.
  • SiliconDoc - Monday, September 28, 2009 - link

    It would be awesome if this 5870 was 60.96% better than the last ati card, but it isn't.
  • JarredWalton - Monday, September 28, 2009 - link

    But the 5870 *is* up to 65% faster than the 4890 in the tested games. If you were to compare the GTX 280 to the 9800 GX2, it also wasn't 60% faster. In fact, 9800 GX2 beat the GTX 280 in four out of seven tested games, tied it in one, and only trailed in two games: Enemy Territory (by 13%) and Oblivion (by 3%), making ETQW the only substantial win for the GT200.

    So we're biased while you're the beacon of impartiality, I suppose, since you didn't intentionally make a comparison similar to comparing apples with cantaloupes. Comparing ATI's new card to their last dual-GPU solution is the way to go, but NVIDIA gets special treatment and we only compare it with their single GPU solution.

    If you want the full numbers:

    1) On average, the 5870 is 30% faster than the 4890 at 1680x1050, 35% faster at 1920x1200, and 45% faster at 2560x1600.

    2) Note that the margin goes *up* as resolution increases, indicating the major bottleneck is not memory bandwidth at anything but 2560x1600 on the 5870.

    3) Based on the old article you linked, GTX 280 was on average 5% slower than 9800X2 and 59% faster than the 9800 GTX - the 9800X2 was 6.4% faster than the GTX 280 in the tested titles.

    4) Making the same comparisons, 5870 is only 3.4% faster than the 4870X2 in the tested games and 45% faster than the 4890HD.

    Now, the games used for testing are completely different, so we have to throw that out. DoW2 is a huge bonus in favor of the 5870 and it scales relatively poorly with CF, hurting the X2. But you're still trying to paint a picture of the 5870 as a terrible disappointment when in fact we could say it essentially equals what NVIDIA did with the GTX 280.

    On average, at 2560x1600, if NVIDIA's GT300 were to come out and be 60% faster than the GTX 285, it will beat ATI's 5870 by about 15%. If it's the same price, it's the clear choice... if you're willing to wait a month or two. That's several "ifs" for what amounts to splitting hairs. There is no current game that won't run well on the HD 5870 at 2560x1600, and I suspect that will hold true of the GT300 as well.

    (FWIW, Crysis: Warhead is as bad as it gets, and dropping 4xAA will boost performance by at least 25%. It's an outlier, just like Crysis, since the higher settings are too much for anything but the fastest hardware. "High" settings are more than sufficient.)
  • SiliconDoc - Tuesday, September 29, 2009 - link

    In other words, even with your best fudging and whining about games and all the rest, you can't even bring it with all the lies from the 15-30 percent people are claiming up to 60.96%
    --
    Yes, as I thought.
  • zshift - Thursday, September 24, 2009 - link

    My thoughts exactly ;)

    I knew the 5870 was gonna be great based on the design philosophy that AMD/ATi had with the 4870, but I never thought I'd see anything this impressive. LESS power, with MORE power! (pun intended), and DOUBLE the speed, at that!

    Funny thing is, I was actually considering an Nvidia gpu when I saw how impressive PhysX was on Batman AA. But I think I would rather have near double the frame rates compared to seeing extra paper fluffing around here and there (though the scenes with the scarecrow are downright amazing). I'll just have to wait and see how the GT300 series does, seeing as I can't afford any of this right now (but boy, oh boy, is that upgrade bug itching like it never has before).
  • SiliconDoc - Thursday, September 24, 2009 - link

    Fine, but performance per dollar is on the very low end, often the lowest of all the cards. That's why it was omitted here.
    http://www.techpowerup.com/reviews/ATI/Radeon_HD_5...">http://www.techpowerup.com/reviews/ATI/Radeon_HD_5...
    THE LOWEST overall, or darn near it.
  • erple2 - Friday, September 25, 2009 - link

    So what you're saying then is that everyone should buy the 9500 GT and ignore everything else? If that's the most important thing to you, then clearly, that's what you mean.

    I think that the performance per dollar metrics that are shown are misleading at best and terrible at worst. It does not take into account that any frame rates significantly above your monitor refresh are for all intents and purposes wasted, and any frame rates significantly below 30 should by heavily weighted negatively. I haven't seen how techpowerup does their "performance per dollar" or how (if at all) they weight the FPS numbers in the dollar category.

    SLI/Crossfire has always been a lose-lose in the "performance per dollar" category. Curiously, I don't see any of the nvidia SLI cards listed (other than the 295).

    That sounds like biased "reporting" on your part.

Log in

Don't have an account? Sign up now