XDMA: Improving Crossfire

Over the past year or so a lot of noise has been made over AMD’s Crossfire scaling capabilities, and for good reason. With the evolution of frame capture tools such as FCAT it finally became possible to easily and objectively measure frame delivery patterns.  The results of course weren’t pretty for AMD, showcasing that Crossfire may have been generating plenty of frames, but in most cases it was doing a very poor job of delivering them.

AMD for their part doubled down on the situation and began rolling out improvements in a plan that would see Crossfire improved in multiple phases. Phase 1, deployed in August, saw a revised Crossfire frame pacing scheme implemented for single monitor resolutions (2560x1600 and below) which generally resolved AMD’s frame pacing in those scenarios. Phase 2, which is scheduled for next month, will address multi-monitor and high resolution scaling, which faces a different set of problems and requires a different set of fixes than what went into phase 1.

The fact that there’s even a phase 2 brings us to our next topic of discussion, which is a new hardware DMA engine in GCN 1.1 parts called XDMA. Being first utilized on Hawaii, XDMA is the final solution to AMD’s frame pacing woes, and in doing so it is redefining how Crossfire is implemented on 290X and future cards. Specifically, AMD is forgoing the Crossfire Bridge Interconnect (CFBI) entirely and moving all inter-GPU communication over the PCIe bus, with XDMA being the hardware engine that makes this both practical and efficient.

But before we get too far ahead of ourselves, it would be best to put the current Crossfire situation in context before discussing how XDMA deviates from it.

In AMD’s current CFBI implementation, which itself dates back to the X1900 generation, a CFBI link directly connects two GPUs and has 900MB/sec of bandwidth. In this setup the purpose of the CFBI link is to transfer completed frames to the master GPU for display purposes, and to so in a direct GPU-to-GPU manner to complete the job as quickly and efficiently as possible.

For single monitor configurations and today’s common resolutions the CFBI excels at its task. AMD’s software frame pacing algorithms aside, the CFBI has enough bandwidth to pass around complete 2560x1600 frames at over 60Hz, allowing the CFBI to handle the scenarios laid out in AMD’s phase 1 frame pacing fix.

The issue with the CFBI is that while it’s an efficient GPU-to-GPU link, it hasn’t been updated to keep up with the greater bandwidth demands generated by Eyefinity, and more recently 4K monitors. For a 3x1080p setup frames are now just shy of 20MB/each, and for a 4K setup frames are larger still at almost 24MB/each. With frames this large CFBI doesn’t have enough bandwidth to transfer them at high framerates – realistically you’d top out at 30Hz or so for 4K – requiring that AMD go over the PCIe bus for their existing cards.

Going over the PCIe bus is not in and of itself inherently a problem, but pre-GCN 1.1 hardware lacks any specialized hardware to help with the task. Without an efficient way to move frames, and specifically a way to DMA transfer frames directly between the cards without involving CPU time, AMD has to resort to much uglier methods of moving frames between the cards, which are in part responsible for the poor frame pacing we see today on Eyefinity/4K setups.

CFBI Crossfire At 4K: Still Dropping Frames

For GCN 1.1 and Hawaii in particular, AMD has chosen to solve this problem by continuing to use the PCIe bus, but by doing so with hardware dedicated to the task. Dubbed the XDMA engine, the purpose of this hardware is to allow CPU-free DMA based frame transfers between the GPUs, thereby allowing AMD to transfer frames over the PCIe bus without the ugliness and performance costs of doing so on pre-GCN 1.1 cards.

With that in mind, the specific role of the XDMA engine is relatively simple. Located within the display controller block (the final destination for all completed frames) the XDMA engine allows the display controllers within each Hawaii GPU to directly talk to each other and their associated memory ranges, bypassing the CPU and large chunks of the GPU entirely. Within that context the purpose of the XDMA engine is to be a dedicated DMA engine for the display controllers and nothing more. Frame transfers and frame presentations are still directed by the display controllers as before – which in turn are directed by the algorithms loaded up by AMD’s drivers – so the XDMA engine is not strictly speaking a standalone device, nor is it a hardware frame pacing device (which is something of a misnomer anyhow). Meanwhile this setup also allows AMD to implement their existing Crossfire frame pacing algorithms on the new hardware rather than starting from scratch, and of course to continue iterating on those algorithms as time goes on.

Of course by relying solely on the PCIe bus to transfer frames there are tradeoffs to be made, both for the better and for the worse. The benefits are of course the vast increase in memory bandwidth (PCIe 3.0 x16 has 16GB/sec available versus .9GB/sec for CFBI) not to mention allowing Crossfire to be implemented without those pesky Crossfire bridges. The downside to relying on the PCIe bus is that it’s not a dedicated, point-to-point connection between GPUs, and for that reason there will bandwidth contention, and the latency for using the PCIe bus will be higher than the CFBI. How much worse depends on the configuration; PCIe bridge chips for example can both improve and worsen latency depending on where in the chain the bridges and the GPUs are located, not to mention the generation and width of the PCIe link. But, as AMD tells us, any latency can be overcome by measuring it and thereby planning frame transfers around it to take the impact of latency into account.

Ultimately AMD’s goal with the XDMA engine is to make PCIe based Crossfire just as efficient, performant, and compatible as CFBI based Crossfire, and despite the initial concerns we had over the use of the PCIe bus, based on our test results AMD appears to have delivered on their promises.

The XDMA engine alone can’t eliminate the variation in frame times, but in its first implementation it’s already as good as CFBI in single monitor setups, and being free of the Eyefinity/4K frame pacing issues that still plague CFBI, is nothing short of a massive improvement over CFBI in those scenarios. True to their promises, AMD has delivered a PCie based Crossfire implementation that incurs no performance penalty versus CFBI, and on the whole fully and sufficiently resolves AMD’s outstanding frame pacing issues. The downside of course is that XDMA won’t help the 280X or other pre-GCN 1.1 cards, but at the very least going forward AMD finally has demonstrated that they have frame pacing fully under control.

On a side note, looking at our results it’s interesting to see that despite the general reuse of frame pacing algorithms, the XDMA Crossfire implementation doesn’t exhibit any of the distinct frame time plateaus that the CFBI implementation does. The plateaus were more an interesting artifact than a problem, but it does mean that AMD’s XDMA Crossfire implementation is much more “organic” like NVIDIA’s, rather than strictly enforcing a minimum frame time as appeared to be the case with CFBI.

Hawaii: Tahiti Refined PowerTune: Improved Flexibility & Fan Speed Throttling
Comments Locked

396 Comments

View All Comments

  • ninjaquick - Thursday, October 24, 2013 - link

    so 4-5% faster than Titan?
  • Drumsticks - Thursday, October 24, 2013 - link

    If the 780Ti is $599, then that means the 780 should see at least a $150 (nearly 25%!) price drop, which is good with me.
  • DMCalloway - Thursday, October 24, 2013 - link

    So, what you are telling me is Nvidia is going to stop laughing- all- the- way- to-the-bank and price the 780ti for less than current 780 prices? Current 780 owners are going to get HOT and flood the market with used 780's.
  • dragonsqrrl - Thursday, October 24, 2013 - link

    Why is it that this is only ever the case when Nvidia performs a massive price drop? Nvidia price drop = early adopters getting screwed (even though 780 has been out for ~6 months now). AMD price drop = great value for enthusiasts, go AMD! ... lolz.
  • Minion4Hire - Thursday, October 24, 2013 - link

    Titan is a COMPUTE card. A poor man's (relatively speaking) proper compute solution. The fact that it is also a great gaming card is almost incidental. No one needs a 6GB frame buffer for gaming right now. The Titan comparisons are nearly meaningless.

    The "nearly" part is the unknown 780 TI. Nvidia could enable the remaining CUs on 780 to at least give the TI comparable performance to Titan. But who cares that Titan is $1000? It isn't really relevant.
  • ddriver - Thursday, October 24, 2013 - link

    Even much cheaper radeons compeltely destroy the titan as well as every other nvidia gpu in compute, do not be fooled by a single, poorly implemented test, the nvidia architecture plainly sucks in double precision performance.
  • ShieTar - Thursday, October 24, 2013 - link

    Since "much cheaper" Radeons tend to deliver 1/16th DP performance, you seem to not really know what you are talking about. Go read up on a relevant benchmark suite on professional and compute cards, e.g. http://www.tomshardware.com/reviews/best-workstati... The only tasks where AMD cards shine are those implemented in OpenCL.
  • ddriver - Thursday, October 24, 2013 - link

    "Much cheaper" relative to the price of the titan, not entry level radeons... You clutched onto a straw and drowned...

    OpenCL is THE open and portable industry standard for parallel computing, did you expect radeons to shine at .. CUDA workloads LOL, I'd say OpenCL performance is all I really need, it has been a while since I played or cared about games.
  • Pontius - Tuesday, October 29, 2013 - link

    I'm in the same boat as you ddriver, all I care about is OpenCL in these articles. I go straight to that section usually =)
  • TheJian - Friday, October 25, 2013 - link

    You're neglecting the fact that everything you can do professionally in openCL you can already do faster in cuda. Cuda is taught in 600+ universities for a reason. It is in over 200 pro apps and has been funded for 7+yrs unlike opencl which is funded by a broke company hoping people will catch on one day :) Anandtech refuses to show cuda (gee they do have an AMD portal after all...LOL) but it exists and is ultra fast. You really can't name a pro app that doesn't have direct support or support via plugin for Cuda. And if you're buying NV and running opencl instead of cuda (like anand shows calling it compute crap) you're an idiot. Why don't they run Premiere instead of Sony crap for video editing? Because Cuda works great for years in it. Same with Photoshop etc...

    You didn't look at folding@home DP benchmark here in this review either I guess. 2.5x faster than 290x. As you can see it depends on what you do and the app you use. I consider F@H stupid use of electricity but that's just me...LOL. Find anything where OpenCL (or any AMD stuff, directx, opengl) beats CUDA. Compute doesn't just mean OpenCL, it means CUDA too! Dumb sites just push openCL because its OPEN...LOL. People making money use CUDA and generally buy quadro or tesla (they own 90% of the market for a reason, or people would just buy radeons right?).
    http://www.anandtech.com/show/7457/the-radeon-r9-2...
    DP in F@H here. Titan sort of wins right? 2.5x or so over 290x :) It's comic both here and toms uses a bunch of junk synthetic crap (bitmining, Asics do that now, basemark junk, F@H, etc) to show how good AMD is, but forget you can do real work with Cuda (heck even bitmining can be done with cuda)

    When you say compute, I think CUDA, not opencl on NV. As soon as you toss in Cuda the compute story changes completely. Unfortunately even Toms refuses to pit OpenCL vs. Cuda just like here at anandtech (but that's because both love OpenCL and hate proprietary stuff). But at least they show you in ShieTar's link (which craps out, remove the . at the end of the link) that Titan kills even the top quadro cards (it's a Tesla remember for $1500 off). It's 2x+ faster than quadro's in almost everything they tested. So yeah, Titan is very worth it for people who do PRO stuff AND game.
    http://www.tomshardware.com/reviews/best-workstati...
    For the lazy, fixed ShieTar's link.

    All these sites need to do is fire up 3dsmax, cinema4d, Blender, adobe (pick your app, After Effect, Premiere, Photoshop) and pit Cuda vs. OpenCL. Just pick an opencl plugin for AMD (luxrender) and Octane/furryball etc for NV then run the tests. Does AMD pay all these sites to NOT do this? I comment and ask on every workstation/vid card article etc at toms, they never respond...LOL. They run pure cuda, then pure opencl, but act like they never meet. They run crap like basemark for photo/video editing opencl junk (you can't make money on that), instead of running adobe and choosing opencl(or directx/opengl) for AMD and Cuda for NV. Anandtech runs Sony Vegas which a quick google shows has tons of problems with NV. Heck pit Sony/AMD vs. Adobe/NV. You can run the same tests in both on video, though it would be better to just use adobe for both but they won't do that until AMD gets done optimizing for the next rev...ROFL. Can't show AMD in a bad light here...LOL. OpenCL sucks compared to Cuda (proprietary or not...just the truth).

Log in

Don't have an account? Sign up now