Drilling Down: DX11 And The Multi-Threaded Game Engine
In spite of the fact that multi-threaded programming has been around for decades, mainstream programmers didn't start focusing on parallel programming until multi-core CPUs started coming along. Much general purpose code is straightforward as a single thread; extracting performance via parallel programming can be difficult and isn't always obvious. Even with talented programmers, Amdahl's Law is a bitch: your speed up from parallelization is limited by the percent of code that is necessarily sequential.
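To put a number on Amdahl's Law, here is a minimal C++ sketch of the usual formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. The function names are made up for illustration and are not part of any API discussed here.

```cpp
#include <cassert>

// Amdahl's Law: estimated speedup on `cores` cores when `parallel_fraction`
// of the total work can be spread across them and the rest stays sequential.
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}

// Upper bound on speedup as the core count grows without limit.
// Only defined when some part of the work is sequential.
double amdahl_limit(double parallel_fraction) {
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

With three quarters of the work parallelized, four cores give a speedup of only about 2.3x, and no number of cores can ever push it past 4x. That ceiling is why a sequential renderer hurts so much.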
Currently, in game development, rendering is one of those "necessarily" sequential tasks. DirectX 10 isn't set up to appropriately handle multiple threads all throwing commands at the GPU. That doesn't mean parallelization of renderers can't happen, but it does limit speed up because costly synchronization techniques or management threads need to be implemented in order to make sure nothing steps out of line. All this limits the benefit of parallelization and discourages programmers from trying too hard. After all, it's a better idea to put more of your effort into areas where performance can be improved more significantly. (John Carmack put it really well once, but I can't remember the quote... and I'm doing too much benchmarking to go look for it now. :-P)
No matter what anyone does, some stuff in the renderer will need to be sequential. Programs, textures, and resources must be loaded up; geometry happens before pixel processing; draw calls intended to be executed while a certain state is active must have that state set first and not changed until completion. Even in such a massively parallel machine, order must be maintained for many things. But order doesn't always matter.
By making more things thread-safe through an extended device interface with multiple contexts, and by making much of the synchronization overhead the responsibility of the API and/or graphics driver, Microsoft has made it easier for game developers to thread not only their rendering code, but their game code as well. These features will also work on DX10 hardware running on a system with DX11, though some missing hardware optimizations will reduce the performance benefit. But the fundamental ability to write code differently will go a long way toward getting programmers more used to and better at parallelization. Let's take a look at the tools available to accomplish this in DX11.
First up is free threaded asynchronous resource loading. That's a bit of a mouthful, but this feature gives developers the ability to upload programs, textures, state objects, and all resources in a thread-safe way and, if desired, concurrent with the rendering process. This doesn't mean that all this stuff will get pushed up in parallel with rendering, as the driver will manage what gets sent to the GPU and when based on priority, but it does mean the developer no longer has to think about synchronizing or manually prioritizing resource loading. Multiple threads can start loading whatever resources they need whenever they need them. The fact that this can also be done concurrently with rendering could improve performance for games that stream in data for massive open worlds in addition to enabling multi-threaded opportunities.
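As a rough illustration of what "free threaded" buys the developer, the sketch below is a toy model in standard C++, not Direct3D code. `ToyDevice`, `CreateResource`, and `load_in_parallel` are hypothetical names invented for this example. The mutex inside the toy device stands in for the synchronization that, per the description above, the API and driver take on, so the loader threads themselves contain no locking at all.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a free-threaded device: any thread may create
// resources at any time, and the device keeps its own bookkeeping consistent.
class ToyDevice {
public:
    // Registers a resource and returns its handle. Safe to call concurrently.
    int CreateResource(const std::string& name) {
        std::lock_guard<std::mutex> lock(mutex_);
        const int handle = next_handle_++;
        resources_.emplace(handle, name);
        return handle;
    }

    std::size_t ResourceCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return resources_.size();
    }

private:
    mutable std::mutex mutex_;
    int next_handle_ = 0;
    std::unordered_map<int, std::string> resources_;
};

// Spawns several loader threads that all create resources on the same device
// without coordinating with one another, then reports how many were created.
std::size_t load_in_parallel(ToyDevice& device, int loader_threads,
                             int resources_per_thread) {
    std::vector<std::thread> loaders;
    for (int t = 0; t < loader_threads; ++t) {
        loaders.emplace_back([&device, t, resources_per_thread] {
            for (int i = 0; i < resources_per_thread; ++i) {
                device.CreateResource("thread" + std::to_string(t) +
                                      "_res" + std::to_string(i));
            }
        });
    }
    for (auto& loader : loaders) {
        loader.join();
    }
    return device.ResourceCount();
}
```

The point of the model is the shape of the calling code: the loaders simply ask for what they need when they need it, and nothing in the game code has to arbitrate between them.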
In order to enable this and other threading, the D3D device interface is now split into three separate interfaces: the Device, the Immediate Context, and the Deferred Context. Resource creation is done through the Device. The Immediate Context is the interface for setting device state, draw calls, and queries. There can only be one Device and one Immediate Context. The Deferred Context is another interface for state and draw calls, but many can exist in one program and can be used as the per-thread interface (Deferred Contexts themselves are not thread-safe, though, so each should be used by only one thread at a time). Deferred Contexts and the free threaded resource creation through the Device are where DX11 gets its multi-threaded benefit.
Multiple threads submit state and draw calls to their Deferred Context, which compiles a display list that is eventually executed by the Immediate Context. Games will still need a render thread, and this thread will use the Immediate Context to execute state and draw calls and to consume the display lists generated by Deferred Contexts. In this way, the ultimate destination of all state and draw calls is the Immediate Context, but fine grained synchronization is handled by the API and the display driver so that parallel threads can be better used to contribute to the rendering process. Some limitations on Deferred Contexts include the fact that they cannot query the device and they can't download or read back anything from the GPU. Deferred Contexts can, however, consume the display lists generated by other Deferred Contexts.
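The record-then-execute flow can be sketched with a toy model in standard C++. This is not Direct3D code: `ToyDeferredContext`, `ToyImmediateContext`, and `render_frame` are hypothetical names, and commands are just strings. In the D3D11 API itself the corresponding calls are, as far as we know, `ID3D11Device::CreateDeferredContext`, `ID3D11DeviceContext::FinishCommandList`, and `ID3D11DeviceContext::ExecuteCommandList`, with the display lists described above exposed as command lists.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// A recorded list of commands; a toy analog of a display list.
using CommandList = std::vector<std::string>;

// Hypothetical per-thread recorder. Like the real thing, it is not
// thread-safe by itself; each worker thread owns exactly one.
class ToyDeferredContext {
public:
    void SetState(const std::string& state) { recorded_.push_back("state:" + state); }
    void Draw(const std::string& mesh) { recorded_.push_back("draw:" + mesh); }

    // Hands back everything recorded so far and resets the recorder.
    CommandList Finish() {
        CommandList out = std::move(recorded_);
        recorded_.clear();
        return out;
    }

private:
    CommandList recorded_;
};

// Hypothetical single consumer, used only from the render thread.
class ToyImmediateContext {
public:
    void Execute(const CommandList& list) {
        executed_.insert(executed_.end(), list.begin(), list.end());
    }
    const std::vector<std::string>& Executed() const { return executed_; }

private:
    std::vector<std::string> executed_;
};

// Workers record in parallel; the calling (render) thread then executes the
// finished lists in a fixed order, so state always precedes its draw call.
std::vector<std::string> render_frame(int worker_count) {
    std::vector<CommandList> lists(worker_count);
    std::vector<std::thread> workers;
    for (int i = 0; i < worker_count; ++i) {
        workers.emplace_back([&lists, i] {
            ToyDeferredContext context;  // one recorder per thread
            context.SetState("pass" + std::to_string(i));
            context.Draw("mesh" + std::to_string(i));
            lists[i] = context.Finish();  // each worker writes only its own slot
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }

    ToyImmediateContext immediate;
    for (const auto& list : lists) {
        immediate.Execute(list);
    }
    return immediate.Executed();
}
```

Recording can finish in any order across threads, but submission order stays under the render thread's control. That is how ordering survives where it matters while the expensive work of building the lists is spread across cores.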
The end result of all this is that the future will be more parallel friendly. As two and four core CPUs become more and more popular and 8 and 16 (logical) core CPUs are on the horizon, we need all the help we can get when trying to extract performance from parallelism. This is a good move for DirectX and we hope it will help push game engines to more fully utilize more than two or even four cores when the time comes.