Cayman: The New Dawn of AMD GPU Computing

We’ve already covered how the shift from VLIW5 to VLIW4 is beneficial for AMD’s computing efforts: narrower SPUs are easier to fully utilize, FP64 performance improves to 1/4th FP32 performance, and the space savings give AMD room to lay down additional SIMDs to improve performance. But if Cayman is meant to be a serious effort by AMD to relaunch themselves in to the GPU computing market and to grab a piece of NVIDIA’s pie, it takes more than just new shaders to accomplish the task. Accordingly, AMD has been hard at work to round out the capabilities of their latest GPU to make it a threat for NVIDIA’s Fermi architecture.

AMD’s headline compute feature is called asynchronous dispatch, a long word that actually does a pretty good job of describing what it does. To touch back on Fermi for a moment, with Fermi NVIDIA introduced support for parallel kernels, giving Fermi the ability to execute multiple kernels at once. AMD in turn is following NVIDIA’s approach of executing multiple kernels at once, but is going to take it one step further.

The limit of NVIDIA’s design is that while Fermi can execute multiple kernels at once, each one must come from the same CPU thread. Independent threads/applications for example cannot issue their own kernels and have them execute in parallel, rather the GPU must context switch between them. With asynchronous dispatch AMD is going to allow independent threads/applications to issue kernels that execute in parallel. On paper at least, this would give AMD’s hardware a significant advantage in this scenario (context switching is expensive), one that would likely eclipse any overall performance advantages NVIDIA had.

Fundamentally asynchronous dispatch is achieved by having the GPU hide some information about its real state from applications and kernels, in essence leading to virtualization of GPU resources. As far as each kernel is concerned it’s running in its own GPU, with its own command queue and own virtual address space. This places more work on the GPU and drivers to manage this shared execution, but the payoff is that it’s better than context switching.

For the time being the catch for asynchronous dispatch is that it requires API support. As DirectCompute is a fixed standard this just isn’t happening – at least not with DirectCompute 11. Asynchronous dispatch will be exposed under OpenCL in the form of an extension.

Meanwhile the rest of AMD’s improvements are focusing on memory and cache performance. While the fundamental architecture is not changing, there are several minor changes here to improve compute performance. The Local Data Store attached to each SIMD is now able to bypass the cache hierarchy and Global Data Store by having memory fetches read directly in to the LDS. Meanwhile Cayman is getting a 2nd DMA engine, improving memory reads & writes by allowing Cayman to execute two at once in each direction.

Finally, read ops from shaders are being sped up a bit. Compared to Cypress, Cayman can coalesce them in to fewer operations.

As today’s launch is primarily about the Radeon HD 6900 series AMD isn’t going too much in depth on the compute side of things, so everything here is a fairly high level overview of the architecture. Once AMD has Firestream cards ready to go with Cayman in them, there will likely be more to talk about.

VLIW4: Finding the Balance Between TLP, ILP, and Everything Else Advancing Primitives: Dual Graphics Engines & New ROPs
POST A COMMENT

168 Comments

View All Comments

  • 529th - Sunday, December 19, 2010 - link

    Great job on this review. Excellent writing and easy to read.

    Thanks
    Reply
  • marc1000 - Sunday, December 19, 2010 - link

    yes, that's for sure. we will have to wait a little to see improvements from VLIW4. but my point is the "VLIW processors" count, they went up by 20%. with all other improvements, I was expecting a little more performance, just that.

    but in the other hand, I was reading the graphs, and decided that 6950 will be my next card. it has double the performance of 5770 in almost all cases. that's good enough for me.
    Reply
  • Iketh - Friday, December 24, 2010 - link

    This is how they've always reviewed new products? And perhaps the biggest reason AT stands apart from the rest? You must be new to AT?? Reply
  • WhatsTheDifference - Sunday, December 26, 2010 - link

    the 4890? I see every nvidia config, never a card overlooked there, ever, but the ATI's (then) top card is conspicuously absent. long as you include the 285, there's really no excuse for the omission. honestly, what's the problem? Reply
  • PeteRoy - Friday, December 31, 2010 - link

    All games released today are in the graphic level of the year 2006, how many games do you know that can bring the most out of this card? Crysis from 2007? Reply
  • Hrel - Tuesday, January 11, 2011 - link

    So when are all these tests going to be re-run at 1920x1080 cause quite frankly that's what I'm waiting for. I don't care about any resolution that doesn't work on my HDTV. I want 1920x1080, 1600x900 and 1280x720. If you must include uber resolutions for people with uber money then whatever; but those people know to just buy the fastest card out there anyway so they don't really need performance numbers to make up their mind. Money is no object so just buy nvidia's most expensive card and ur off. Reply
  • AKP1973 - Thursday, October 13, 2011 - link

    Have you guys noticed the "load GPU temp" of the 6870 in XFIRE?... It produced so very low heat than any enthusiast card in a multi-GPU setup. That's one of the best XFIRE card in our time today if you value price, performance, cool temp, and silence.! Reply
  • Travisryno - Wednesday, April 26, 2017 - link

    It's dishonest referring to enhanced 8x as 32x. There are industry standards for this, which AMD, NEC, 3DFX, SGI, SEGA AM2, etc..(everybody) always follow(ed), then nVidia just makes their own...
    Just look how convoluted it is..
    Reply

Log in

Don't have an account? Sign up now