A Brief History of Multi-GPU with Dissimilar GPUs

Before we dive into our results, let’s talk briefly about the history of efforts to render games with multiple, dissimilar GPUs. After the development of PCI Express brought about the (re)emergence of NVIDIA’s SLI and AMD’s CrossFire, both companies eventually standardized their multi-GPU rendering efforts across the same basic technology. Using alternate frame rendering (AFR), NVIDIA and AMD would have the GPUs in a multi-GPU setup each render a separate frame. With the drivers handing off frames to each GPU in an alternating manner, AFR was the most direct and most compatible way to offer multi-GPU rendering as it didn’t significantly disrupt the traditional game rendering paradigms. There would simply be two (or more) GPUs rendering frames instead of one, with much of the work abstracted by the GPU drivers.

Using AFR allowed for relatively rapid multi-GPU support, but it did come with tradeoffs as well. Alternating frames meant that inter-frame dependencies needed to be tracked and handled, which in turn meant that driver developers had to add support for games on a game-by-game basis. Furthermore, the nature of distributing the work meant that care needed to be taken to ensure each GPU rendered at an even pace so that the resulting in-game motion was smooth, a problem AMD had to face head-on in 2013. Finally, because AFR had each GPU rendering whole frames, it worked best when GPUs were as identical as possible in performance; a performance gap would at best require the faster card to spend some time waiting on the slower card, and at worst exacerbate the aforementioned frame pacing issues. As a result NVIDIA only allows identical cards to be paired up in SLI, and AMD only allows a slightly wider variance (typically cards using the same GPU).

In 2010 LucidLogix set out to do one better, leveraging their graphics expertise to develop their Hydra technology. By using a combination of hardware and software, Hydra could intercept DirectX and OpenGL calls and redistribute them to split up rendering over multiple, and for the first time, dissimilar GPUs. Mixing dissimilar GPUs had long been a dream within the PC gaming space (and the subject of a few jokes), and the possibilities Hydra opened up were immense – pairing up GPUs not only from different vendors, but of differing performance as well – resolving some of AFR’s shortcomings while allowing gamers to do things such as reuse old video cards and still receive a performance benefit.

However in the long run the Hydra technology failed to catch on. The process of splitting up API calls, having multiple GPUs render them, and compositing them back together into a single frame proved to be harder than LucidLogix expected, and as a result Hydra’s compatibility was poor and performance gains were limited. Coupled with the cost of the hardware and licensing, and the fact that Hydra boards were never SLI certified (preventing typical NVIDIA SLI operation), this meant that Hydra made a quick exit from motherboards.

In the end what LucidLogix was attempting was a valiant effort, but in retrospect one that was misguided. Working at the back-end of the rendering chain and manipulating API calls can work, but it is a massive amount of effort and it has hardware developers aiming at a moving target, requiring constant effort to keep up with new games. AMD and NVIDIA’s driver-level optimizations don’t fare too much better in this respect; there are vendor-specific shortcuts such as NVAPI that simplify this some, but even AMD and NVIDIA have to work to keep up with new games. This is why they need to issue driver updates and profile updates so frequently in order to get the best performance out of CrossFire and SLI.

But what if there was a better way to manage multiple GPUs and assign work to them? Would it be possible to do a better job working from the front-end of the rendering chain? This is something DirectX 12 sets out to answer with its multi-adapter modes.

DirectX 12 Multi-GPU

In DirectX 12 there are technically three different modes for multi-adapter operation. The simplest of these modes is what Microsoft calls Implicit Multi-Adapter. Implicit Multi-Adapter is essentially the lowest rung of multi-adapter operation, intended to allow developers to use the same AFR-friendly techniques as they did with DirectX 11 and before. This model retains the same limited ability for game developers to control the multi-GPU rendering process, which limits the amount of power they have, but also limits their responsibilities. Consequently, just as with DirectX 11 multi-GPU, in implicit mode much of the work is offloaded to the drivers (and practically speaking, AMD and NVIDIA).

While the implicit model has the most limitations, the lack of developer responsibilities also means it’s the easiest to implement. In an era where multi-platform games are common, even after developers make the switch to DirectX 12 they may not want to undertake the effort to support Explicit Multi-Adapter, as the number of PC owners with multiple high-powered GPUs is a fraction of the total PC gaming market. And in that case, with help from driver developers, implicit mode is the fastest path towards supporting multiple GPUs.

What’s truly new to DirectX 12 then are its Explicit Multi-Adapter (EMA) modes. As implied by the name, these modes require game developers to explicitly program for multi-GPU operation, specifying how work will be assigned to each GPU, how memory will be allocated, how the GPUs will communicate, and so on. By giving developers explicit control over the process, EMA gives them the best chance to extract the most multi-GPU performance out of a system, as they have near-absolute control over both the API and the game, and can work with more information than any of the previously discussed multi-GPU methods allow. The cost of using explicit mode is resources: with great power comes great responsibility, and unlike implicit mode, game developers must put in a fair bit of work to make explicit mode work, and more work yet to make it work well.

Within EMA there are two different ways to address GPUs: linked mode and unlinked mode. Unlinked mode is essentially the baseline mode for EMA, and offers the bulk of EMA’s features. Linked mode on the other hand builds on unlinked mode by offering yet more functionality in exchange for much tighter restrictions on what adapters can be used.

The ultimate purpose of unlinked mode is to allow developers to take full advantage of all DirectX 12 capable GPU resources in a system, at least so long as they are willing to do all of the work required to manage those resources. Unlinked mode, as opposed to linked mode and implicit multi-adapter, can work with DX12 GPUs from any vendor, providing just enough abstraction to allow GPUs to exchange data but putting everything else in the developer’s hands. Depending on what developers want to do, unlinked mode can be used for anything from pairing up two dGPUs to pairing up a dGPU with an iGPU, with the GPUs being a blank slate of sorts for developers to use as they see fit for whatever algorithms and technologies they opt to use.
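To make that a bit more concrete, the sketch below shows roughly what the first step of an unlinked setup looks like at the API level: every DirectX 12 capable adapter in the system is enumerated and given its own independent device. This is an illustrative sketch rather than code from any shipping engine; function names are our own and error handling is omitted for brevity.

```cpp
// Illustrative sketch: enumerate every DirectX 12 capable adapter and create an
// independent device for each, as unlinked explicit multi-adapter requires.
#include <vector>
#include <wrl/client.h>
#include <dxgi1_4.h>
#include <d3d12.h>

using Microsoft::WRL::ComPtr;

std::vector<ComPtr<ID3D12Device>> CreateUnlinkedDevices()
{
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    std::vector<ComPtr<ID3D12Device>> devices;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE)
            continue; // skip the WARP software rasterizer

        // Each physical adapter becomes its own device, with its own memory and
        // command queues; in unlinked mode the vendor doesn't matter.
        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
            devices.push_back(device);
    }
    return devices;
}
```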

As the base mode for DirectX 12 multi-GPU, unlinked mode presents each GPU as its own device, with its own memory, its own command processor, and more, accurately representing the layout of the physical hardware. What DirectX 12’s EMA brings to the table that’s new is that it allows developers to exchange data between GPUs, going beyond just finished, rendered images and potentially exchanging partially rendered frames, buffers, and other forms of data. It’s the ability to exchange multiple data types that gives EMA its power and its flexibility, as without it, it wouldn’t be possible to implement much more than AFR. EMA is the potential for multiple GPUs, be they similar or disparate, to work together; no more and no less.
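That data exchange is built on DirectX 12’s cross-adapter shared heaps. The sketch below, again illustrative rather than taken from any shipping title (and with error handling omitted), shows the basic mechanism: one device creates a heap flagged for cross-adapter sharing and exports it as a handle, a second device imports that handle, and resources placed in the heap can then be copied between the two GPUs.

```cpp
// Illustrative sketch: share a heap between two unrelated DirectX 12 devices so that
// buffers or partially rendered frames can be exchanged between the GPUs.
#include <windows.h>
#include <wrl/client.h>
#include <d3d12.h>

using Microsoft::WRL::ComPtr;

ComPtr<ID3D12Heap> ShareHeapAcrossAdapters(ID3D12Device* deviceA, ID3D12Device* deviceB)
{
    // A heap on device A, flagged as shareable across adapters. 64MB is an arbitrary
    // example size; a real renderer would size this for the data being exchanged.
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = 64 * 1024 * 1024;
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    heapDesc.Flags = D3D12_HEAP_FLAG_SHARED | D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER;

    ComPtr<ID3D12Heap> heapA;
    deviceA->CreateHeap(&heapDesc, IID_PPV_ARGS(&heapA));

    // Export the heap from device A as an NT handle, then import it on device B.
    // Resources placed in this heap can be written by one GPU and read by the other,
    // with the data travelling over PCI Express.
    HANDLE sharedHandle = nullptr;
    deviceA->CreateSharedHandle(heapA.Get(), nullptr, GENERIC_ALL, nullptr, &sharedHandle);

    ComPtr<ID3D12Heap> heapB;
    deviceB->OpenSharedHandle(sharedHandle, IID_PPV_ARGS(&heapB));
    CloseHandle(sharedHandle);

    return heapB; // device B's view of the shared memory
}
```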

If this sounds very vague that’s because it is, and that in turn is because the explicit API outstrips what today’s hardware is capable of. Compared to on-board memory, any operations taking place over PCI Express are relatively slow and high latency. Some GPUs in turn handle this better than others, but at the end of the day the PCIe bus is still a bottleneck at a fraction of the speed of local memory. That means while GPUs can work together, they must do so intelligently, as we’re not yet at the point where GPUs can quickly transfer large amounts of data between each other.

Because EMA is a blank slate, it ultimately falls to developers to put it to good use; DirectX 12 just supplies the tools. Traditional AFR implementations are one such option, as is splitting up workloads in other fashions such as split-frame rendering (SFR), or even methods where a single GPU doesn’t render a complete frame (or a fixed fraction of one) at all, with frames instead passed off to different GPUs at different stages of rendering.

But practically speaking, a lot of the early focus on EMA development and promotion is on dGPU + iGPU, and this is because the vast majority of PCs with a dGPU also have an iGPU. Relative to even a $200 dGPU, an iGPU is going to offer a fraction of the performance, but it’s also a GPU resource that is otherwise going unused. Epic Games has been experimenting with using EMA to have iGPUs do post-processing, as finished frames are relatively small (a 1080p60 stream of FP32 frames is only about 2GB/sec, a fraction of PCIe 3.0 x16’s bandwidth), post-processing is fairly lightweight in resource requirements, and it typically has a predictable processing time.
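For reference, that bandwidth figure works out as follows, assuming a four-channel (RGBA) FP32 render target:

1920 × 1080 pixels × 4 channels × 4 bytes × 60 frames/sec ≈ 2.0 GB/sec

which is roughly an eighth of the ~16GB/sec of bandwidth a PCIe 3.0 x16 link offers in each direction.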

Moving on, building on top of unlinked mode is EMA’s linked mode. Linked mode is by and large the equivalent of SLI/CrossFire for EMA, and is designed for systems where all GPUs being used are near-identical. Within linked mode all of the GPUs are pooled and presented to applications as a single GPU, just with multiple command processors and multiple memory pools due to the limits of the PCIe bus. Because linked mode is restricted to similar GPUs, developers gain even more power and control, as linked GPUs will be from the same vendor and use the same data formats at every step.
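At the API level, the physical GPUs in a linked group show up as nodes on that single device, selected through NodeMask bitfields on queues, command lists, heaps, and resources. Below is a minimal sketch of per-node setup, assuming a device that represents a linked adapter group; as before, the function name is our own and error handling is omitted.

```cpp
// Minimal sketch of linked-node addressing: one ID3D12Device represents the whole
// linked group, and each physical GPU is a node selected via a NodeMask bit.
#include <vector>
#include <wrl/client.h>
#include <d3d12.h>

using Microsoft::WRL::ComPtr;

std::vector<ComPtr<ID3D12CommandQueue>> CreatePerNodeQueues(ID3D12Device* device)
{
    const UINT nodeCount = device->GetNodeCount(); // physical GPUs in the link

    std::vector<ComPtr<ID3D12CommandQueue>> queues(nodeCount);
    for (UINT node = 0; node < nodeCount; ++node)
    {
        D3D12_COMMAND_QUEUE_DESC queueDesc = {};
        queueDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
        queueDesc.NodeMask = 1u << node; // bit n targets physical GPU n
        device->CreateCommandQueue(&queueDesc, IID_PPV_ARGS(&queues[node]));
    }
    return queues; // one direct command queue per GPU in the linked group
}
```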

Broadly speaking, linked mode will be both easier and harder for developers to use relative to unlinked mode. Unlike unlinked mode, there are certain assumptions that can be made about the hardware and what it’s capable of, and developers won’t need to juggle the complications of using GPUs from multiple vendors at once. On the other hand, this is the most powerful mode because of all of the options it presents to developers, and more complex rendering techniques will likely be necessary to extract the full performance benefit of linked mode.

Ultimately, one point that Microsoft and developers have continually reiterated in their talks is that with explicit multi-adapter, as with so many other low-level aspects of DirectX 12, it’s largely up to the developers to put the technology to good use. The API provides a broad set of capabilities – tempered a bit by hardware limitations and how quickly GPUs can exchange data – but unlike DirectX 11 and implicit multi-adapter, it’s developers that define how GPUs should work together. So whether a game supports any kind of EMA operation, and whether this means combining multiple dGPUs from the same vendor, multiple dGPUs from different vendors, or a dGPU and an iGPU, is a question of software more than it is of hardware.
