XDMA: Improving Crossfire

Over the past year or so a lot of noise has been made over AMD’s Crossfire scaling capabilities, and for good reason. With the evolution of frame capture tools such as FCAT it finally became possible to easily and objectively measure frame delivery patterns. The results of course weren’t pretty for AMD, showing that while Crossfire was generating plenty of frames, in most cases it was doing a very poor job of delivering them.

AMD for their part acknowledged the problem and began rolling out improvements in a plan that would see Crossfire fixed in multiple phases. Phase 1, deployed in August, saw a revised Crossfire frame pacing scheme implemented for single monitor resolutions (2560x1600 and below), which generally resolved AMD’s frame pacing in those scenarios. Phase 2, which is scheduled for next month, will address multi-monitor and high resolution scaling, which faces a different set of problems and requires a different set of fixes than what went into phase 1.

The fact that there’s even a phase 2 brings us to our next topic of discussion, which is a new hardware DMA engine in GCN 1.1 parts called XDMA. First utilized on Hawaii, XDMA is intended to be the final solution to AMD’s frame pacing woes, and in the process it redefines how Crossfire is implemented on the 290X and future cards. Specifically, AMD is forgoing the Crossfire Bridge Interconnect (CFBI) entirely and moving all inter-GPU communication over the PCIe bus, with XDMA being the hardware engine that makes this both practical and efficient.

But before we get too far ahead of ourselves, it would be best to put the current Crossfire situation in context before discussing how XDMA deviates from it.

In AMD’s current CFBI implementation, which itself dates back to the X1900 generation, a CFBI link directly connects two GPUs and has 900MB/sec of bandwidth. In this setup the purpose of the CFBI link is to transfer completed frames to the master GPU for display purposes, and to do so in a direct GPU-to-GPU manner to complete the job as quickly and efficiently as possible.

For single monitor configurations and today’s common resolutions the CFBI excels at its task. AMD’s software frame pacing algorithms aside, the CFBI has enough bandwidth to pass around complete 2560x1600 frames at over 60Hz, allowing it to handle the scenarios laid out in AMD’s phase 1 frame pacing fix.

The issue with the CFBI is that while it’s an efficient GPU-to-GPU link, it hasn’t been updated to keep up with the greater bandwidth demands generated by Eyefinity, and more recently by 4K monitors. For a 3x1080p setup frames are now just shy of 20MB each, and for a 4K setup frames are larger still at almost 24MB each. With frames this large the CFBI doesn’t have enough bandwidth to transfer them at high framerates – realistically you’d top out at 30Hz or so for 4K – requiring that AMD go over the PCIe bus for their existing cards.
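
To put those numbers in context, the arithmetic is straightforward. Here’s a quick back-of-the-envelope sketch (our own illustration, assuming 24bpp frames and the nominal 900MB/sec CFBI rate; real transfers carry additional overhead):

```python
# Back-of-the-envelope CFBI math: 24bpp (3 bytes/pixel) frames against
# the nominal 900MB/sec CFBI link rate. Protocol overhead is ignored.

CFBI_BYTES_PER_SEC = 900e6

resolutions = {
    "2560x1600":         2560 * 1600,
    "3x1080p Eyefinity": 5760 * 1080,
    "4K (3840x2160)":    3840 * 2160,
}

for name, pixels in resolutions.items():
    frame_mb = pixels * 3 / 2**20                  # frame size in MB (binary)
    transfers = CFBI_BYTES_PER_SEC / (pixels * 3)  # raw frame transfers/sec
    print(f"{name}: {frame_mb:.1f}MB/frame, ~{transfers:.0f} transfers/sec")
```

The raw numbers work out to roughly 73 transfers/sec at 2560x1600 but only about 36 at 4K, and once real-world overhead is factored in, 4K lands at the 30Hz or so figure quoted above.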

Going over the PCIe bus is not inherently a problem, but pre-GCN 1.1 hardware lacks any specialized hardware to help with the task. Without an efficient way to move frames, and specifically a way to DMA transfer frames directly between the cards without involving CPU time, AMD has to resort to much uglier methods of moving frames between the cards, which are in part responsible for the poor frame pacing we see today on Eyefinity/4K setups.

CFBI Crossfire At 4K: Still Dropping Frames

For GCN 1.1 and Hawaii in particular, AMD has chosen to solve this problem by continuing to use the PCIe bus, but by doing so with hardware dedicated to the task. Dubbed the XDMA engine, the purpose of this hardware is to allow CPU-free DMA based frame transfers between the GPUs, thereby allowing AMD to transfer frames over the PCIe bus without the ugliness and performance costs of doing so on pre-GCN 1.1 cards.

With that in mind, the specific role of the XDMA engine is relatively simple. Located within the display controller block (the final destination for all completed frames), the XDMA engine allows the display controllers within each Hawaii GPU to directly talk to each other and their associated memory ranges, bypassing the CPU and large chunks of the GPU entirely. Within that context the purpose of the XDMA engine is to be a dedicated DMA engine for the display controllers and nothing more. Frame transfers and frame presentations are still directed by the display controllers as before – which in turn are directed by the algorithms loaded up by AMD’s drivers – so the XDMA engine is not, strictly speaking, a standalone device, nor is it a hardware frame pacing device (which is something of a misnomer anyhow). Meanwhile this setup also allows AMD to implement their existing Crossfire frame pacing algorithms on the new hardware rather than starting from scratch, and of course to continue iterating on those algorithms as time goes on.
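
To illustrate the difference conceptually, consider the following toy sketch of the two transfer paths (purely our own illustration of the data flow; none of this is AMD’s driver code, and all of the names are made up):

```python
from dataclasses import dataclass, field

@dataclass
class GPU:
    """Toy model of a GPU, just enough to illustrate the two paths."""
    name: str
    vram: dict = field(default_factory=dict)  # frame id -> frame bytes

def transfer_pre_gcn11(slave: GPU, master: GPU, frame_id: int) -> int:
    # Pre-GCN 1.1 over PCIe: without a dedicated DMA engine the frame is
    # staged through host memory, costing CPU time on every transfer.
    staged = slave.vram[frame_id]       # copy 1: GPU -> host (CPU involved)
    master.vram[frame_id] = staged      # copy 2: host -> GPU (CPU involved)
    return 2                            # CPU-assisted copies incurred

def transfer_xdma(slave: GPU, master: GPU, frame_id: int) -> int:
    # XDMA: the master's display controller DMA-reads the slave's memory
    # range directly over PCIe; the CPU never touches the frame data.
    master.vram[frame_id] = slave.vram[frame_id]  # peer-to-peer DMA
    return 0                                      # no CPU involvement

slave, master = GPU("Hawaii #1"), GPU("Hawaii #0")
slave.vram[42] = b"...completed frame..."
print("pre-GCN 1.1 CPU copies:", transfer_pre_gcn11(slave, master, 42))
print("XDMA CPU copies:       ", transfer_xdma(slave, master, 42))
```

The point of the sketch is simply that the XDMA path removes the host-memory round trip; everything else about AFR remains as it was.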

Of course by relying solely on the PCIe bus to transfer frames there are tradeoffs to be made, both for the better and for the worse. The benefits are of course the vast increase in available bandwidth (PCIe 3.0 x16 has 16GB/sec available versus 0.9GB/sec for the CFBI), not to mention allowing Crossfire to be implemented without those pesky Crossfire bridges. The downside to relying on the PCIe bus is that it’s not a dedicated, point-to-point connection between GPUs, so there will be bandwidth contention, and the latency of the PCIe bus will be higher than that of the CFBI. How much worse depends on the configuration; PCIe bridge chips for example can both improve and worsen latency depending on where in the chain the bridges and the GPUs are located, not to mention the generation and width of the PCIe link. But, as AMD tells us, any latency can be overcome by measuring it and planning frame transfers around it.
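
That measure-and-compensate approach amounts to simple scheduling. A minimal sketch of the idea (our own illustration under assumed names and constants, not AMD’s actual algorithm):

```python
class FrameTransferScheduler:
    """Illustrative sketch: kick off each DMA transfer early enough that
    measured PCIe latency doesn't delay frame presentation."""

    def __init__(self, margin_s: float = 0.5e-3):  # 0.5ms margin (assumed)
        self.latency_s = 0.0   # exponentially smoothed latency estimate
        self.margin_s = margin_s

    def record_transfer(self, start_s: float, end_s: float) -> None:
        # Smooth the measurements so a single noisy transfer
        # doesn't skew the estimate.
        self.latency_s = 0.9 * self.latency_s + 0.1 * (end_s - start_s)

    def transfer_start_time(self, present_s: float) -> float:
        # Start early by the measured latency plus the safety margin so
        # the frame reaches the master's display controller in time.
        return present_s - self.latency_s - self.margin_s

# Example: with ~2ms observed transfers, a frame due for presentation at
# t=100.0s should start moving at roughly t=99.9975s.
sched = FrameTransferScheduler()
for _ in range(50):
    sched.record_transfer(0.000, 0.002)
print(f"start transfer at t={sched.transfer_start_time(100.0):.4f}s")
```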

Ultimately AMD’s goal with the XDMA engine is to make PCIe based Crossfire just as efficient, performant, and compatible as CFBI based Crossfire, and despite the initial concerns we had over the use of the PCIe bus, based on our test results AMD appears to have delivered on their promises.

The XDMA engine alone can’t eliminate the variation in frame times, but in its first implementation it’s already as good as the CFBI in single monitor setups, and being free of the Eyefinity/4K frame pacing issues that still plague the CFBI, it is nothing short of a massive improvement in those scenarios. True to their promises, AMD has delivered a PCIe based Crossfire implementation that incurs no performance penalty versus the CFBI, and on the whole fully resolves AMD’s outstanding frame pacing issues. The downside of course is that XDMA won’t help the 280X or other pre-GCN 1.1 cards, but at the very least AMD has finally demonstrated that going forward they have frame pacing fully under control.

On a side note, looking at our results it’s interesting to see that despite the general reuse of frame pacing algorithms, the XDMA Crossfire implementation doesn’t exhibit any of the distinct frame time plateaus that the CFBI implementation does. The plateaus were more an interesting artifact than a problem, but their absence does mean that AMD’s XDMA Crossfire implementation is much more “organic”, like NVIDIA’s, rather than strictly enforcing a minimum frame time as appeared to be the case with the CFBI.

Comments

  • Antiflash - Thursday, October 24, 2013 - link

    I've usually preferred Nvidia cards, but they had it well deserved when they decided to price GK110 in the stratosphere just "because they can" with no competition. That's a poor way to treat your customers, taking advantage of fanboys. Full implementations of Tesla and Fermi were always priced around $500. Pricing Kepler GK110 at $650+ was stupid. It's silicon after all; you should get more performance for the same price each year, not more performance at a premium price as Nvidia tried to do this generation. AMD is not doing anything extraordinary here, they are just not following Nvidia's price gouging practices, and $550 puts their flagship GPU at historical market prices. We would not be having this discussion if Nvidia had done the same with GK110.
  • blitzninja - Saturday, October 26, 2013 - link

    OMG, why won't you people get it? The Titan is a COMPUTE-GAMING HYBRID card; it's for professionals who run PRO apps (i.e. the Adobe media product line, 3D modeling, CAD, etc.) but are also gamers and don't want to have SLI setups for gaming + compute, or can't afford to do so.

    A Quadro card is $2500; this card has one less SMX unit and no PRO customer driver support, but it's $1000 and does both gaming AND compute. As far as low-level professionals are concerned this thing is the very definition of a steal. Heck, you SLI two of these things and you're still up $500 on a K6000.

    What usually happens is the company they work at will have Quadro workstations and at home the employee has a Titan. Sure it's not as good but it gets the job done until you get back to work.

    Please check your shit. Everyone saying the R9 290X (and yes, I agree that for gaming it's got some real good price/performance) destroys the Titan is ignorant and needs to do some good long research into:
    A. How well the Titan sold
    B. The size of the compute market and MISSING PRICE POINTS in said market.
    C. The amount of people doing compute who are also avid gamers.
  • chimaxi83 - Thursday, October 24, 2013 - link

    Impressive. This card beats Nvidia on EVERY level! Price, performance, features, power..... every level. Nvidia paid the price for gouging its customers; they are going to lose a ton of market share. I doubt they have anything to match this for at least a year.
  • Berzerker7 - Thursday, October 24, 2013 - link

    Sounds like a bot. The card is worse than a Titan on every point except high resolution (read: 4K), including power, temperature and noise.
  • testbug00 - Thursday, October 24, 2013 - link

    Er, the Titan beats it on being higher priced, looking nicer, having a better cooler and using less power.

    Even at 1080p a 290X approximately ties the Titan (slightly ahead, by 4%, according to TechPowerUp).

    Well, it's a $550 card that can tie a $1000 card at a resolution that a card this fast really shouldn't be bought for (seriously, if you are playing at 1200p or less there is no reason to buy any GPU over $400 unless you plan to upgrade screens soon).
  • Sancus - Thursday, October 24, 2013 - link

    The Titan was a $1000 card when it was released.... 8 months ago. So for 8 months nvidia has had the fastest card and been able to sell it at a ridiculous price premium(even at $1000, supply of Titans was quite limited, so it's not like they would have somehow benefited from setting the price lower... in fact Titan would probably have made more money for Nvidia at an even HIGHER price).

    The fact that ATI is just barely matching Nvidia at regular resolutions and slightly beating them at 4k, 8 months later, is a baseline EXPECTATION. It's hardly an achievement. If they had released anything less than the 290X they would have completely embarrassed themselves.

    And I should point out that they're heavily marketing 4k resolution for this card and yet frame pacing in Crossfire even with their 'fixes' is still pretty terrible, and if you are seriously planning to game at 4k you need Crossfire to be actually usable, which it has never really been.
  • anubis44 - Thursday, October 24, 2013 - link

    The margin of victory for the R9 290X over the Titan at 4K resolutions is not 'slight', it's substantial. HardOCP says it's 10-15% faster on average. That's a $550 card that's 10-15% faster than a $1000 card.

    What was that about AMD being embarrassed?
  • Sancus - Thursday, October 24, 2013 - link

    By the time more than 1% of the people buying this card even have 4K monitors, 20nm cards will have been on sale for months. Not only that, but you would basically go deaf next to a Crossfire 290X setup, which is what you need for 4K. And anyway, the 290X is faster only because it's been monstrously overclocked beyond the ability of its heatsink to cool it properly. The 780/Titan are still far more viable 2/3/4 GPU cards because of their superior noise and power consumption.

    All 780s overclock to considerably faster than this card at ALL resolutions, so the GTX 780 Ti is probably just an OCed 780, and it will outperform the 290X while still being 10dB quieter.
  • DMCalloway - Thursday, October 24, 2013 - link

    You mention monstrously OC'ing the 290X yet have no problem OC'ing the 780 in order to create a 780 Ti. Everyone knows that aftermarket coolers will keep the noise and temps in check when released. Let's deal with the here and now, not speculate on future cards. Face it: AMD at least matches or beats a card costing $100 more, which will force Nvidia to launch the 780 Ti at less than current 780 prices.
  • Sancus - Thursday, October 24, 2013 - link

    You don't understand how pricing works. AMD is 8 months late to the game. They've released a card that is basically the GTX Titan, except it uses more than 50W more power and has a bargain basement heatsink. That's why it's $100 cheaper. Because AMD is the one who are far behind and the only way for them to compete is on price. They demonstrably can't compete purely based on performance, if the 290X was WAY better than the GTX Titan, AMD would have priced it higher because guess what, AMD needs to make a profit too -- and they consistently have lost money for years now.

    The company that completely owned the market to the point that it could charge $1000 for a video card is the winner here, not the one that arrived out of breath at the finish line 8 months later.

    I would love for AMD to be competitive *at a competitive time* so that we didn't have to pay $650 for a GTX 780, but the fact of the matter is that they're simply not.
