XDMA: Improving Crossfire

Over the past year or so a lot of noise has been made over AMD’s Crossfire scaling capabilities, and for good reason. With the evolution of frame capture tools such as FCAT, it finally became possible to easily and objectively measure frame delivery patterns. The results were not pretty for AMD, showing that while Crossfire was generating plenty of frames, in most cases it was doing a very poor job of delivering them.

AMD for their part responded by rolling out improvements in a plan that would see Crossfire fixed in multiple phases. Phase 1, deployed in August, implemented a revised Crossfire frame pacing scheme for single monitor resolutions (2560x1600 and below), which generally resolved AMD’s frame pacing in those scenarios. Phase 2, scheduled for next month, will address multi-monitor and high resolution scaling, which face a different set of problems and require a different set of fixes than what went into phase 1.

The fact that there’s even a phase 2 brings us to our next topic of discussion: a new hardware DMA engine in GCN 1.1 parts called XDMA. First utilized on Hawaii, XDMA is intended to be the final solution to AMD’s frame pacing woes, and in the process it redefines how Crossfire is implemented on the 290X and future cards. Specifically, AMD is forgoing the Crossfire Bridge Interconnect (CFBI) entirely and moving all inter-GPU communication over the PCIe bus, with XDMA being the hardware engine that makes this both practical and efficient.

But before we get too far ahead of ourselves, it would be best to put the current Crossfire situation in context before discussing how XDMA deviates from it.

In AMD’s current CFBI implementation, which itself dates back to the X1900 generation, a CFBI link directly connects two GPUs with 900MB/sec of bandwidth. In this setup the purpose of the CFBI link is to transfer completed frames to the master GPU for display, and to do so in a direct GPU-to-GPU manner to complete the job as quickly and efficiently as possible.

For single monitor configurations and today’s common resolutions the CFBI excels at its task. AMD’s software frame pacing algorithms aside, the CFBI has enough bandwidth to pass around complete 2560x1600 frames at over 60Hz, allowing the CFBI to handle the scenarios laid out in AMD’s phase 1 frame pacing fix.

The issue with the CFBI is that while it’s an efficient GPU-to-GPU link, it hasn’t been updated to keep up with the greater bandwidth demands generated by Eyefinity, and more recently 4K monitors. For a 3x1080p setup frames are now just shy of 20MB/each, and for a 4K setup frames are larger still at almost 24MB/each. With frames this large CFBI doesn’t have enough bandwidth to transfer them at high framerates – realistically you’d top out at 30Hz or so for 4K – requiring that AMD go over the PCIe bus for their existing cards.
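For those who want to check the math, a quick back-of-the-envelope calculation bears these numbers out. The sketch below assumes 24-bit (3 bytes/pixel) frames and treats the 900MB/sec figure as a hard cap; both are simplifications, and real-world transfer overhead would only lower these ceilings further.

```python
# Back-of-the-envelope check on CFBI bandwidth limits.
# Assumes 24-bit (3 bytes/pixel) framebuffers; "MB" here means
# binary megabytes, and transfer overhead is ignored.

BYTES_PER_PIXEL = 3
CFBI_BANDWIDTH = 900 * 1024**2  # 900MB/sec, treated as bytes/sec

def frame_size_mb(width, height):
    """Size of one completed frame in MB."""
    return width * height * BYTES_PER_PIXEL / 1024**2

def max_fps(width, height):
    """Upper bound on frames/sec the CFBI could move at this resolution."""
    return CFBI_BANDWIDTH / (width * height * BYTES_PER_PIXEL)

for name, w, h in [("2560x1600", 2560, 1600),
                   ("3x1080p", 5760, 1080),
                   ("4K", 3840, 2160)]:
    print(f"{name}: {frame_size_mb(w, h):.1f}MB/frame, "
          f"~{max_fps(w, h):.0f} fps max over CFBI")
```

The 4K case lands at roughly 38 fps before any overhead, which squares with the 30Hz-or-so practical ceiling cited above, while 2560x1600 comfortably clears 60Hz.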

Going over the PCIe bus is not in and of itself inherently a problem, but pre-GCN 1.1 hardware lacks any specialized hardware to help with the task. Without an efficient way to move frames, and specifically a way to DMA transfer frames directly between the cards without involving CPU time, AMD has to resort to much uglier methods of moving frames between the cards, which are in part responsible for the poor frame pacing we see today on Eyefinity/4K setups.

CFBI Crossfire At 4K: Still Dropping Frames

For GCN 1.1 and Hawaii in particular, AMD has chosen to solve this problem by continuing to use the PCIe bus, but by doing so with hardware dedicated to the task. Dubbed the XDMA engine, the purpose of this hardware is to allow CPU-free DMA based frame transfers between the GPUs, thereby allowing AMD to transfer frames over the PCIe bus without the ugliness and performance costs of doing so on pre-GCN 1.1 cards.

With that in mind, the specific role of the XDMA engine is relatively simple. Located within the display controller block (the final destination for all completed frames), the XDMA engine allows the display controllers within each Hawaii GPU to directly talk to each other and their associated memory ranges, bypassing the CPU and large chunks of the GPU entirely. Within that context the purpose of the XDMA engine is to be a dedicated DMA engine for the display controllers and nothing more. Frame transfers and frame presentations are still directed by the display controllers as before – which in turn are directed by the algorithms loaded up by AMD’s drivers – so the XDMA engine is not strictly speaking a standalone device, nor is it a hardware frame pacing device (which is something of a misnomer anyhow). Meanwhile, this setup also allows AMD to implement their existing Crossfire frame pacing algorithms on the new hardware rather than starting from scratch, and to continue iterating on those algorithms as time goes on.
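As a purely conceptual sketch of that division of labor – every class and method name below is our own hypothetical illustration, not AMD’s actual driver or hardware interface – the DMA engine simply moves a completed frame across the bus, while the pacing logic decides when each frame should actually be presented:

```python
# Conceptual sketch of the XDMA division of labor described above.
# All names here are hypothetical illustrations, not AMD's real
# driver or hardware interfaces.

import time

class XDMAEngine:
    """Stand-in for the DMA engine: copies a completed frame from the
    slave GPU's memory to the master's display controller (with no CPU
    involvement in the real hardware)."""
    def transfer(self, frame):
        return frame  # just the copy; pacing is decided elsewhere

class FramePacer:
    """Stand-in for the driver's pacing algorithm: presents frames at
    evenly spaced intervals rather than as fast as they arrive."""
    def __init__(self, target_interval_s):
        self.interval = target_interval_s
        self.next_present = time.monotonic()

    def present(self, frame, engine):
        frame = engine.transfer(frame)          # move the frame across the bus
        delay = self.next_present - time.monotonic()
        if delay > 0:
            time.sleep(delay)                   # hold the frame to pace it
        self.next_present = time.monotonic() + self.interval
        return frame

pacer = FramePacer(target_interval_s=1 / 60)    # pace presentation to 60Hz
engine = XDMAEngine()
for i in range(3):
    pacer.present(f"frame-{i}", engine)
```

The point of the sketch is only the separation of concerns: the transfer mechanism is dumb and fast, and the pacing intelligence lives in software, which is why AMD can reuse and keep iterating on its existing algorithms.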

Of course, by relying solely on the PCIe bus to transfer frames there are tradeoffs to be made, both for the better and for the worse. The benefits are the vast increase in bandwidth (PCIe 3.0 x16 offers 16GB/sec versus 0.9GB/sec for the CFBI), not to mention allowing Crossfire to be implemented without those pesky Crossfire bridges. The downside to relying on the PCIe bus is that it’s not a dedicated, point-to-point connection between GPUs, so there will be bandwidth contention, and the latency of the PCIe bus will be higher than that of the CFBI. How much higher depends on the configuration; PCIe bridge chips, for example, can both improve and worsen latency depending on where in the chain the bridges and the GPUs are located, not to mention the generation and width of the PCIe link. But, as AMD tells us, any latency can be overcome by measuring it and planning frame transfers around it.
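Some quick math puts that bandwidth gap in perspective. The sketch below again assumes 24-bit frames and, importantly, ignores contention from other traffic on the bus – which is precisely the complication noted above – so these are best-case ceilings rather than achievable framerates.

```python
# Rough headroom comparison: CFBI vs PCIe 3.0 x16 for moving completed
# 4K frames. Assumes 24-bit frames and ignores competing PCIe traffic.

FRAME_4K = 3840 * 2160 * 3           # bytes per 24-bit 4K frame
CFBI_BW = 0.9e9                      # ~0.9GB/sec
PCIE3_X16_BW = 16e9                  # ~16GB/sec

cfbi_fps = CFBI_BW / FRAME_4K        # ceiling over the bridge
pcie_fps = PCIE3_X16_BW / FRAME_4K   # ceiling over the bus

print(f"CFBI 4K ceiling: ~{cfbi_fps:.0f} fps")
print(f"PCIe 3.0 x16 4K ceiling: ~{pcie_fps:.0f} fps")
```

Even with heavy contention from other bus traffic, the PCIe link has more than an order of magnitude of headroom over the bridge at 4K, which is what makes the tradeoff workable in the first place.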

Ultimately AMD’s goal with the XDMA engine is to make PCIe based Crossfire just as efficient, performant, and compatible as CFBI based Crossfire, and despite the initial concerns we had over the use of the PCIe bus, based on our test results AMD appears to have delivered on their promises.

The XDMA engine alone can’t eliminate the variation in frame times, but in its first implementation it’s already as good as the CFBI in single monitor setups, and being free of the Eyefinity/4K frame pacing issues that still plague the CFBI, it is nothing short of a massive improvement in those scenarios. True to their promises, AMD has delivered a PCIe based Crossfire implementation that incurs no performance penalty versus the CFBI, and on the whole fully resolves AMD’s outstanding frame pacing issues. The downside is that XDMA won’t help the 280X or other pre-GCN 1.1 cards, but going forward AMD has finally demonstrated that they have frame pacing fully under control.

On a side note, looking at our results it’s interesting to see that despite the general reuse of frame pacing algorithms, the XDMA Crossfire implementation doesn’t exhibit any of the distinct frame time plateaus that the CFBI implementation does. The plateaus were more an interesting artifact than a problem, but it does mean that AMD’s XDMA Crossfire implementation is much more “organic” like NVIDIA’s, rather than strictly enforcing a minimum frame time as appeared to be the case with CFBI.


396 Comments


  • Sandcat - Friday, October 25, 2013

    That depends on what you define as 'acceptable frame rates'. Yeah, you do need a $500 card if you have a high refresh rate monitor and use it for 3D games, or just improved smoothness in non-3D games. A single 780 with my brother's 144Hz Asus monitor is required to get ~90 fps (i7-930 @ 4.0) in BF3 on Ultra with MSAA.

    The 290X almost requires liquid...the noise is offensive. Kudos to those with the equipment, but really, AMD cheaped out on the cooler in order to hit the price point. Good move, imho, but too loud for me.
  • hoboville - Thursday, October 24, 2013

    Yup, and it's hot. It will be worth buying once the manufacturers can add their own coolers and heat pipes.

    AMD has always been slower at lower res, but better in the 3x1080p to 6x1080p arena. They have always aimed for high-bandwidth memory, which always performs better at high res. This is good for you as a buyer because it means you'll get better scaling at high res. It's essentially forward-looking tech, which is good for those who will be upgrading monitors in the next few years when 1440p IPS starts to be more affordable. At low res the bottleneck isn't RAM, but compute power. Regardless, buying a Titan / 780 / 290X for anything less than 1440p is silly, you'll be way past the 60-70 fps human eye limit anyway.
  • eddieveenstra - Sunday, October 27, 2013

    Maybe 60-70 fps is the limit, but at 120Hz 60 fps will give noticeable lag. 75 is about the minimum. That or I'm having eagle eyes. The 780 GTX still dips into low framerates at 120Hz (1920x1080). So the whole debate about the Titan or 780 being overkill @1080p is just nonsense. (780 GTX 120Hz gamer here)
  • hoboville - Sunday, October 27, 2013

    That really depends a lot on your monitor. When they talked about G-Sync and frame lag and smoothness, they mentioned that when FPS doesn't exactly match the refresh rate you get latency and bad frame timing. That you have this problem with a 120 Hz monitor is no surprise, as at anything less than 120 FPS you'll see some form of stuttering. When we talk about FPS > refresh rate then you won't notice this. At home I use a 2048x1152 @ 60 Hz and beyond 60 FPS all the extra frames are dropped, whereas in your case you'll have some frames "hang" when you are getting less than 120 FPS, because the frames have to "sit" on the screen for an interval until the next one is displayed. This appears to be stuttering, and you need to get a higher FPS from the game in order for the frame delivery to appear smoother. This is because apparent delay decreases as a ratio of [delivered frames (FPS) / monitor refresh speed]. Once the ratio is small enough, you can no longer detect apparent delay. In essence 120 Hz was a bad idea, unless you get G-Sync (which means a new monitor).

    Get a good 1440p IPS at 60 Hz and you won't have that problem, and the image fidelity will make you wonder why you ever bought a monitor with 56% of 1440p pixels in the first place...
  • eddieveenstra - Sunday, October 27, 2013

    To be honest, I would never think about going back to 60Hz. I love 120Hz but don't know a thing about IPS monitors. Thanks for the response....

    Just checked it and that sounds good. When it becomes more affordable I will start thinking about that. Seems like IPS monitors are better with colors and have less blur @60Hz than TN. link: http://en.wikipedia.org/wiki/IPS_panel
  • Spunjji - Friday, October 25, 2013

    Step 1) Take data irrespective of different collection methods.

    Step 2) Perform average of data.

    Step 3) Completely useless results!

    Congratulations, sir; you have broken Science.
  • nutingut - Saturday, October 26, 2013

    But who cares if you can play at 90 vs 100 fps?
  • MousE007 - Thursday, October 24, 2013

    Very true, but remember, the only reason Nvidia prices their cards where they are is because they could. (e.g. Intel CPUs vs AMD) Having said that, I truly welcome the competition as it makes it better for all of us, regardless of which side of the fence you sit on.
  • valkyrie743 - Thursday, October 24, 2013

    The card runs at 95C and sucks power like no tomorrow. It only beats the 780 by a very little, and does not overclock well.

    http://www.youtube.com/watch?v=-lZ3Z6Niir4
    and
    http://www.youtube.com/watch?v=3OHKWMgBhvA

    http://www.overclock3d.net/reviews/gpu_displays/am...

    I like his review. It's purely honest and shows the facts. I'm not an Nvidia fanboy nor am I an AMD fanboy, but I'll take Nvidia right now over AMD.

    I do like how this card is priced and the performance for the price. It makes the Titan not worth 1000 bucks (or the 850 bucks it goes for used on forums). But as for the 780: if you get a non-reference 780, it will be faster than the 290X and put out LESS heat and LESS noise, as well as use less power.

    Plus the GTX 780 Ti is coming out in mid November, which will probably cut the cost of the current 780 to 550, and that card would probably be around 600 and beat this card even more.
  • jljaynes - Friday, October 25, 2013

    You say the review sticks with the facts - he starts off talking about how ugly the card is so it needs to beat a Titan, and then in the next sentence he says the R9 290X will cost $699.

    he sure seems to stick with the facts.