A Brief History of Multi-GPU with Dissimilar GPUs

Before we dive into our results, let’s talk briefly about the history of efforts to render games with multiple, dissimilar GPUs. After the development of PCI Express brought about the (re)emergence of NVIDIA’s SLI and AMD’s CrossFire, both companies eventually standardized their multi-GPU rendering efforts around the same basic technology. Using alternate frame rendering (AFR), NVIDIA and AMD would have the GPUs in a multi-GPU setup each render a separate frame. With the drivers handing off frames to each GPU in an alternating manner, AFR was the most direct and most compatible way to offer multi-GPU rendering, as it didn’t significantly disrupt traditional game rendering paradigms. There would simply be two (or more) GPUs rendering frames instead of one, with much of the work abstracted by the GPU drivers.

Using AFR allowed for relatively rapid multi-GPU support, but it came with tradeoffs as well. Alternating frames meant that inter-frame dependencies needed to be tracked and handled, which in turn meant that driver developers had to add support for games on a game-by-game basis. Furthermore, the nature of distributing the work meant that care needed to be taken to ensure each GPU rendered at an even pace so that the resulting in-game motion was smooth, a problem AMD had to face head-on in 2013. Finally, because AFR had each GPU rendering whole frames, it worked best when GPUs were as identical as possible in performance; a performance gap would at best require the faster card to spend time waiting on the slower card, and at worst exacerbate the aforementioned frame pacing issues. As a result, NVIDIA only allows identical cards to be paired up in SLI, and AMD only allows a slightly wider variance (typically cards using the same GPU).

In 2010 LucidLogix set out to do one better, leveraging their graphics expertise to develop their Hydra technology. By using a combination of hardware and software, Hydra could intercept DirectX and OpenGL calls and redistribute them, splitting up rendering over multiple – and, for the first time, dissimilar – GPUs. Mixing dissimilar GPUs had long been a dream within the PC gaming space (and the subject of a few jokes), and the possibilities Hydra opened up were immense – pairing up GPUs not only from different vendors, but of differing performance as well – resolving some of AFR’s shortcomings while allowing gamers to do things such as reuse old video cards and still receive a performance benefit.

However, in the long run the Hydra technology failed to catch on. The process of splitting up API calls, having multiple GPUs render them, and compositing them back together into a single frame proved to be harder than LucidLogix expected, and as a result Hydra’s compatibility was poor and its performance gains were limited. Coupled with the cost of the hardware and licensing, and the fact that Hydra boards were never SLI certified (preventing typical NVIDIA SLI operation), this ensured that Hydra made a quick exit from motherboards.

In the end what LucidLogix was attempting was a valiant effort, but in retrospect a misguided one. Working at the back-end of the rendering chain and manipulating API calls can work, but it requires a massive amount of effort and leaves hardware developers aiming at a moving target, with constant work needed to keep up with new games. AMD and NVIDIA’s driver-level optimizations don’t fare much better in this respect; there are vendor-specific shortcuts such as NVAPI that simplify matters somewhat, but even AMD and NVIDIA have to work to keep up with new games. This is why they need to issue driver updates and profile updates so frequently in order to get the best performance out of CrossFire and SLI.

But what if there was a better way to manage multiple GPUs and assign work to them? Would it be possible to do a better job working from the front-end of the rendering chain? This is something DirectX 12 sets out to answer with its multi-adapter modes.

DirectX 12 Multi-GPU

In DirectX 12 there are technically three different modes for multi-adapter operation. The simplest of these modes is what Microsoft calls Implicit Multi-Adapter. Implicit Multi-Adapter is essentially the lowest rung of multi-adapter operation, intended to allow developers to use the same AFR-friendly techniques as they did with DirectX 11 and before. This model retains the same limited ability for game developers to control the multi-GPU rendering process, which limits the amount of power they have, but also limits their responsibilities. Consequently, just as with DirectX 11 multi-GPU, in implicit mode much of the work is offloaded to the drivers (and practically speaking, to AMD and NVIDIA).

While the implicit model has the most limitations, the lack of developer responsibilities also means it’s the easiest to implement. In an era where multi-platform games are common, even after developers make the switch to DirectX 12 they may not want to undertake the effort to support Explicit Multi-Adapter, as the number of PC owners with multiple high-powered GPUs is a fraction of the total PC gaming market. In that case, with help from driver developers, implicit mode is the fastest path towards supporting multiple GPUs.

What’s truly new to DirectX 12 then are its Explicit Multi-Adapter (EMA) modes. As implied by the name, these modes require game developers to explicitly program for multi-GPU operation, specifying how work will be assigned to each GPU, how memory will be allocated, how the GPUs will communicate, and so on. With explicit control over the process, developers have the best chance of extracting the most multi-GPU performance out of a system, as they have near-absolute control over both the API and the game, giving them more control and more information than any of the previously discussed multi-GPU methods. The cost of using explicit mode is resources: with great power comes great responsibility, and unlike implicit mode, game developers must put in a fair bit of work to make explicit mode work, and more work yet to make it work well.

Within EMA there are two different ways to address GPUs: linked mode and unlinked mode. Unlinked mode is essentially the baseline mode for EMA, and offers the bulk of EMA’s features. Linked mode on the other hand builds on top of unlinked mode, offering yet more functionality in exchange for much tighter restrictions on which adapters can be used.

The ultimate purpose of unlinked mode is to allow developers to take full advantage of all DirectX 12-capable GPU resources in a system, at least so long as they are willing to do all of the work required to manage those resources. Unlinked mode, as opposed to linked mode and implicit multi-adapter, can work with DX12 GPUs from any vendor, providing just enough abstraction to allow GPUs to exchange data but putting everything else in the developer’s hands. Depending on what developers want to do, unlinked mode can be used for anything from pairing up two dGPUs to pairing up a dGPU with an iGPU, with the GPUs being a blank slate of sorts for developers to use as they see fit for whatever algorithms and technologies they opt to employ.

As the base mode for DirectX 12 multi-GPU, unlinked mode presents each GPU as its own device, with its own memory, its own command processor, and more, accurately representing the layout of the physical hardware. What DirectX 12’s EMA brings to the table that’s new is that it allows developers to exchange data between GPUs, going beyond just finished, rendered images and potentially exchanging partially rendered frames, buffers, and other forms of data. It’s the ability to exchange multiple data types that gives EMA its power and its flexibility, as without it, it wouldn’t be possible to implement much more than AFR. EMA is the potential for multiple GPUs to work together, be they similar or disparate; no more and no less.
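
To make the unlinked model a bit more concrete, below is a minimal sketch in C++ of what this looks like at the API level: each DirectX 12-capable adapter gets its own device, and a cross-adapter shared heap provides the channel for exchanging data between them. This is purely illustrative – it loosely follows the pattern from Microsoft’s public multi-adapter samples rather than anything from a shipping engine – and error handling, resource placement, and the actual rendering work are all omitted.

```cpp
// Minimal sketch of DirectX 12 unlinked explicit multi-adapter setup.
// Illustrative only; link against d3d12.lib and dxgi.lib.
#include <windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <vector>

using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory2(0, IID_PPV_ARGS(&factory));

    // In unlinked mode every GPU is its own device, regardless of vendor.
    // (A real implementation would also skip the software/WARP adapter.)
    std::vector<ComPtr<ID3D12Device>> devices;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
            devices.push_back(device);
    }
    if (devices.size() < 2)
        return 0; // This sketch needs at least two DX12-capable adapters.

    // Create a shared, cross-adapter heap on GPU 0. 32MB is an arbitrary size.
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes     = 32 * 1024 * 1024;
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    heapDesc.Flags           = D3D12_HEAP_FLAG_SHARED | D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER;
    ComPtr<ID3D12Heap> heapOnGpu0;
    devices[0]->CreateHeap(&heapDesc, IID_PPV_ARGS(&heapOnGpu0));

    // Open the same heap on GPU 1 through a shared NT handle. Resources placed
    // in this heap can then be used to pass intermediate data (partial frames,
    // buffers, etc.) between the two GPUs over the PCIe bus.
    HANDLE sharedHandle = nullptr;
    devices[0]->CreateSharedHandle(heapOnGpu0.Get(), nullptr, GENERIC_ALL, nullptr,
                                   &sharedHandle);
    ComPtr<ID3D12Heap> heapOnGpu1;
    devices[1]->OpenSharedHandle(sharedHandle, IID_PPV_ARGS(&heapOnGpu1));
    CloseHandle(sharedHandle);

    return 0;
}
```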

If this sounds very vague, that’s because it is, and that in turn is because the explicit API outstrips what today’s hardware is capable of. Compared to on-board memory, any operations taking place over PCI Express are relatively slow and high latency. Some GPUs handle this better than others, but at the end of the day the PCIe bus is still a bottleneck, running at a fraction of the speed of local memory. That means that while GPUs can work together, they must do so intelligently, as we’re not yet at the point where GPUs can quickly transfer large amounts of data between each other.

Because EMA is a blank slate, it ultimately falls to developers to put it to good use; DirectX 12 just supplies the tools. Traditional AFR implementations are one such option, as is splitting up the workload in other fashions such as split-frame rendering (SFR), or even schemes where no single GPU renders a complete frame (or a fixed fraction of one), with frames instead handed off to different GPUs at different stages of the rendering process.
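
As a toy illustration of the kind of work-distribution decisions EMA leaves in developers’ hands, the sketch below shows how whole frames (AFR) or slices of a frame (SFR) might be assigned to GPUs. The helper names are hypothetical and the logic is deliberately trivial; a real renderer would also weight the split by each GPU’s relative performance.

```cpp
// Toy sketch of two EMA work-distribution strategies. Hypothetical helpers,
// not taken from any real engine.
#include <cstdint>

// Alternate frame rendering (AFR): whole frames alternate between GPUs.
uint32_t AfrGpuForFrame(uint64_t frameIndex, uint32_t gpuCount)
{
    return static_cast<uint32_t>(frameIndex % gpuCount);
}

// Split frame rendering (SFR): each GPU renders a horizontal band of every
// frame. Here the bands are equal; in practice they would be sized to match
// each GPU's performance.
struct ScreenBand { uint32_t yStart; uint32_t yEnd; };

ScreenBand SfrBandForGpu(uint32_t gpuIndex, uint32_t gpuCount, uint32_t height)
{
    const uint32_t rows   = height / gpuCount;
    const uint32_t yStart = gpuIndex * rows;
    const uint32_t yEnd   = (gpuIndex + 1 == gpuCount) ? height : yStart + rows;
    return { yStart, yEnd };
}
```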

But practically speaking, a lot of the early focus on EMA development and promotion is on dGPU + iGPU pairings, and this is because the vast majority of PCs with a dGPU also have an iGPU. Relative to even a $200 dGPU, an iGPU is going to offer a fraction of the performance, but it’s also a GPU resource that is otherwise going unused. Epic Games has been experimenting with using EMA to have iGPUs do post-processing, as finished frames are relatively small (a full FP32 frame at 1080p60 works out to only about 2GB/sec, a fraction of PCIe 3.0 x16’s bandwidth), post-processing is fairly lightweight in its resource requirements, and it typically has a predictable processing time.
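
For reference, the math behind that bandwidth figure is straightforward (assuming a 4-channel FP32 render target, which is on the heavy end for a finished frame):

```cpp
// Back-of-the-envelope check of the ~2GB/sec figure quoted above.
// Assumptions: 1920x1080 pixels, 4 channels x 4 bytes (FP32) per pixel, 60fps.
#include <cstdio>

int main()
{
    const double bytesPerPixel  = 4 * 4;                            // RGBA at FP32
    const double bytesPerFrame  = 1920.0 * 1080.0 * bytesPerPixel;  // ~33MB per frame
    const double bytesPerSecond = bytesPerFrame * 60.0;             // ~1.99GB/sec
    const double pcie3x16       = 15.75e9;                          // theoretical PCIe 3.0 x16

    std::printf("frame: %.1f MB, stream: %.2f GB/s (%.0f%% of PCIe 3.0 x16)\n",
                bytesPerFrame / 1e6, bytesPerSecond / 1e9,
                100.0 * bytesPerSecond / pcie3x16);
    return 0;
}
```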

Moving on, building on top of unlinked mode is EMA’s linked mode. Linked mode is by and large the equivalent of SLI/CrossFire for EMA, and is designed for systems where all GPUs being used are near-identical. Within linked mode all of the GPUs are pooled and presented to applications as a single GPU, just with multiple command processors and multiple memory pools due to the limits of the PCIe bus. Because linked mode is restricted to similar GPUs, developers gain even more power and control, as linked GPUs will be from the same vendor and use the same data formats at every step.
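
This pooled arrangement also shows up directly in the API: a linked-adapter system is exposed as a single DirectX 12 device with multiple “nodes,” and work is steered to a particular GPU through a node mask. Below is a minimal, illustrative sketch (not taken from any shipping engine) of creating one direct command queue per node:

```cpp
// Sketch of DirectX 12 linked-adapter ("node") addressing: one device,
// one command queue per physical GPU, selected via NodeMask.
#include <d3d12.h>
#include <wrl/client.h>
#include <vector>

using Microsoft::WRL::ComPtr;

std::vector<ComPtr<ID3D12CommandQueue>> CreateQueuePerNode(ID3D12Device* device)
{
    std::vector<ComPtr<ID3D12CommandQueue>> queues;
    const UINT nodeCount = device->GetNodeCount(); // >1 only on linked adapters

    for (UINT node = 0; node < nodeCount; ++node)
    {
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
        desc.NodeMask = 1u << node; // each set bit selects one GPU in the link

        ComPtr<ID3D12CommandQueue> queue;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
        queues.push_back(queue);
    }
    return queues;
}
```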

Broadly speaking, linked mode will be both easier and harder for developers to use relative to unlinked mode. On the easier side, certain assumptions can be made about the hardware and what it’s capable of, and developers won’t need to juggle the complications of using GPUs from multiple vendors at once. On the other hand, this is also the most powerful mode because of all of the options it presents to developers, and more complex rendering techniques will likely be necessary to extract the full performance benefit of linked mode.

Ultimately, one point that Microsoft and developers have continually reiterated in their talks is that, like so many other low-level aspects of DirectX 12, explicit multi-adapter is largely up to developers to put to good use. The API provides a broad set of capabilities – tempered a bit by hardware limitations and how quickly GPUs can exchange data – but unlike DirectX 11 and implicit multi-adapter, it’s developers that define how GPUs should work together. So whether a game supports any kind of EMA operation, and whether this means combining multiple dGPUs from the same vendor, multiple dGPUs from different vendors, or a dGPU and an iGPU, is a question of software more than it is of hardware.

Comments

  • jimjamjamie - Tuesday, October 27, 2015 - link

    [pizza-making intensifies]
  • geniekid - Monday, October 26, 2015 - link

    On one hand the idea of unlinked EMA is awesome. On the other hand, I have to believe 95% of developers will shy away from implementing anything other than AFR in their game due to the sheer amount of effort the complexity would add to their QA/debugging process. If Epic manages to pull off their post-processing offloading I would be very impressed.
  • DanNeely - Monday, October 26, 2015 - link

    I'd guess it'd be the other way around. SLI/XFire AFR is complicated enough that it's normally only done for big budget AAA games. Other than replacing two vendor APIs with a single OS API DX12 doesn't seem to offer a whole lot of help there; so I don't expect to see a lot change.

    Handing off the tail end of every frame seems simpler; especially since the frame pacing difficulties that make AFR so hard and require a large amount of per game work won't be a factor. This sounds like something that could be baked into the engines themselves, and that shouldn't require a lot of extra work on the game devs part. Even if it ends up only being a modest gain for those of us with mid/high end GPUs; it seems like it could end up being an almost free gift.
  • nightbringer57 - Monday, October 26, 2015 - link

    That's only half relevant.
    I wonder how much can be implemented at the engine level. This kind of thing may be at least partially transparent to devs if says Unreal Engine and Unity get compatibility for it... I don't know how much it can do, though.
  • andrewaggb - Monday, October 26, 2015 - link

    Agreed, I would hope that if the Unreal Engine, Unity, Frostbite etc support it that maybe 50% or more of new games will support it.

    We'll have to see though. The idea of having both an AMD and NVIDIA card in the same machine is both appealing and terrifying. Occasionally games work better on one than the other, so you might avoid some pain sometimes, but I'm sure you'd get a whole new set of problems sometimes as well.

    I think making use of the iGPU and discrete cards is probably the better scenario to optimize for. (Like Epic is apparently doing)
  • Gigaplex - Monday, October 26, 2015 - link

    Problems such as NVIDIA intentionally disabling PhysX when an AMD GPU is detected in the system, even if it's not actively being used.
  • Friendly0Fire - Monday, October 26, 2015 - link

    It really depends on a lot of factors I think, namely how complex the API ends up being.

    For instance, I could really see shadow rendering being offloaded to one GPU. There's minimal crosstalk between the two GPUs, the shadow renderer only needs geometry and camera information (quick to transfer/update) and only outputs a single frame buffer (also very quick to transfer), yet the process of shadow rendering is slow and complex and requires extremely high bandwidth internally, so it'd be a great candidate for splitting off.

    Then you can also split off the post-processing to the iGPU and you've suddenly shaved maybe 6-8ms off your frame time.
  • Oogle - Monday, October 26, 2015 - link

    Yikes. Just one more exponential factor to add when doing benchmarks. More choice is great for us consumers. But reviews and comparisons are going to start looking more complicated. I'll be interested to see how guys will make recommendations when it comes to multi-gpu setups.
  • tipoo - Monday, October 26, 2015 - link

    Wow, seems like a bigger boost than I had anticipated. Will be nice to see all that unused silicon (in dGPU environments) getting used.
  • gamerk2 - Monday, October 26, 2015 - link

    As this test is a smaller number of combinations it’s not clear where the bottlenecks are, but it’s none the less very interesting how we get such widely different results depending on which card is in the lead. In the GTX 680 + HD 7970 setup, either the GTX 680 is a bad leader or the HD 7970 is a bad follower, and this leads to this setup spinning its proverbial wheels. Otherwise letting the HD 7970 lead and GTX 680 follow sees a bigger performance gain than we would have expected for a moderately unbalanced setup with a pair of cards that were never known for their efficient PCIe data transfers. So long as you let the HD 7970 lead, at least in this case you could absolutely get away with a mixed GPU pairing of older GPUs.


    Drivers. Pretty much that simple. Odds are, the NVIDIA drivers are treating the HD 7970 the same way it's treating the 680 GTX, which will result in performance problems. AMD and NVIDIA use very different GPU architectures, and you're seeing it here. NVIDIA is probably attempting to utilize the 7970 in a way it just can't handle.

    I'd be very interested to see something like 680/Titan, or some form of lower/newer setup, which is what most people would actually use this for (GPU upgrade).
