Preemption Improved: Fine-Grained Preemption for Time-Critical Tasks

Continuing our discussion of asynchronous compute and concurrency, the Pascal architecture includes another major feature update related to how work is scheduled. If you caught our earlier coverage of NVIDIA’s Pascal P100 accelerator, then you should already have an idea of what this is, as NVIDIA touted it as an HPC feature as well. I am of course speaking about fine-grained preemption.

Before we start, in writing this article I spent some time mulling over how to best approach the subject of fine-grained preemption, and ultimately I’m choosing to pursue it on its own page, and not on the same page as concurrency. Why? Well although it is an async compute feature – and it’s a good way to get time-critical independent tasks started right away – its purpose isn’t to improve concurrency.

Asynchronous compute is in a sense a catch-all term, as the asynchronous execution of tasks has a number of different uses. But for consumer products, it’s important to make a distinction between those features that improve concurrency and allow a GPU to get more work done – which is the best understood feature of asynchronous compute – and other features that make more novel use of async. Fine-grained preemption is distinctly in the latter category.

So what is preemption then? In a nutshell, it’s the ability to interrupt an active task (context switch) on a processor and replace it with another task, with the further ability to later resume where you left off. Historically this is a concept that’s more important for CPUs than GPUs, especially back in the age of single core CPUs, as preemption was part of how single core CPUs managed to multitask in a responsive manner. GPUs, for their part, have supported context switching and basic preemption for quite some time; however, until the last few years it has not been a priority, as GPUs are meant to maximize throughput, in part by rarely switching tasks.


(Robert Berger/EE Times: Getting real (time) about embedded GNU/Linux)

Preemption in turn is defined by granularity: how quickly a processor can actually context switch. This ranges from coarse-grained context switching, which essentially allows a context switch only at certain points in execution, out to fine-grained context switching that allows for an almost immediate switch at any point in time. What’s new for Pascal then is that preemptive context switching just got a lot finer grained, especially for compute.
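To build some intuition for why granularity matters, consider that a preemption request arriving just after a unit of work starts must wait for the whole unit to drain. The sketch below models that worst case; all of the per-unit costs are hypothetical round numbers chosen purely for illustration, not measurements of any real GPU.

```python
# Illustrative only: worst-case preemption latency is bounded by the size
# of the unit of work that must drain before the context switch can begin.
# All timing figures below are hypothetical, chosen for intuition.

def worst_case_latency_us(unit_cost_us: float) -> float:
    """A preemption request arriving just after a unit of work starts
    must wait for that whole unit to finish before switching."""
    return unit_cost_us

draw_call_us   = 5000.0  # a large draw call: milliseconds of work
pixel_us       = 1.0     # a single pixel/thread: on the order of microseconds
instruction_us = 0.001   # a single instruction: on the order of nanoseconds

for name, cost in [("draw call", draw_call_us),
                   ("pixel/thread", pixel_us),
                   ("instruction", instruction_us)]:
    print(f"{name:>12}: up to {worst_case_latency_us(cost):g} us before switching")
```

The three orders-of-magnitude spread between the rows is the whole story: the finer the boundary, the tighter the bound on how long a time-critical task can be kept waiting.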

Diving deeper, to understand what NVIDIA has done, let’s talk about prior generation architectures. Previously, with Maxwell 2 and older architectures, NVIDIA only offered very coarse grained preemption. For graphics tasks or mixed graphics/compute tasks, NVIDIA could only preempt at the boundary of draw calls – collections of triangles and other draw commands, potentially encompassing dozens of polygons and hundreds (if not thousands) of pixels and threads.


Preemption Circa Maxwell 2

The great thing about preempting at a draw call boundary is that it’s relatively clean, as draw calls are a very large and isolated unit of work. The problem with preempting at a draw call boundary is that draw calls are a very large unit of work; just because you call for the preemption doesn’t mean you’re going to get it any time soon. This extends to the compute side as well, where whole thread blocks need to complete before a preemption can take place.

In fact it’s probably the compute side that’s a better real-world example of the problem with coarse-grained preemption. If you’ve ever run a GPU compute program and had your display driver reboot due to Timeout Detection and Recovery (TDR), then you’ve experienced coarse-grained preemption. For end-user responsiveness and quality purposes, Windows will reset a GPU if it doesn’t respond for more than around 2 seconds. With coarse-grained preemption, it is possible to schedule a compute task that takes so long to run that a block can’t complete in time for the Windows check-in, triggering the TDR reset. Consequently, there is good reason to support finer grained preemption for both graphics and compute tasks.

This in turn is where Pascal steps in. Along with the aforementioned improvements to how Pascal can fill up its execution pipelines, Pascal also implements a radically improved preemption ability. Depending on whether it’s a graphics task or a pure compute task, Pascal can now preempt at the thread level or the instruction level, respectively.

Starting with the case of a graphics task or a mixed graphics + compute task, Pascal can now interrupt at the thread level. For a compute workload this is fairly self-explanatory. Meanwhile for a graphics workload the idea is very similar. Though we’re accustomed to working with pixels as the fundamental unit in a graphics workload, under the hood the pixel is just another thread. As a result the ability to preempt at a thread boundary has very similar consequences for both a graphics workload and the compute threads mixed in with a graphics workload.

With Maxwell 2 and earlier architectures, the GPU would need to complete the whole draw call before preempting. Now with Pascal it can preempt at the pixel level within a triangle, within a draw call. When a preemption request is received, Pascal will stop rasterizing new pixels, let the currently rasterized pixels finish going through the CUDA cores, and finally initiate the context switch once the above is done. NVIDIA likes to call this “Pixel Level Preemption.”
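The drain-then-switch sequence above can be sketched schematically. This is pseudocode-in-Python: every name here is invented for illustration and does not correspond to any real driver or hardware API.

```python
# Schematic of the pixel-level preemption sequence described above.
# All names are hypothetical; this is not a real driver interface.

class Rasterizer:
    def __init__(self):
        self.running = True
    def stop(self):
        self.running = False

def pixel_level_preempt(rasterizer, in_flight_pixels):
    """Return the ordered steps taken for a pixel-level context switch."""
    steps = []
    rasterizer.stop()                 # 1. stop rasterizing new pixels
    steps.append("stop_raster")
    for _ in in_flight_pixels:        # 2. let pixels already in flight
        steps.append("drain_pixel")   #    finish going through the cores
    steps.append("save_context")      # 3. save the (now small) state
    steps.append("switch")            # 4. hand the SMs to the new task
    return steps

print(pixel_level_preempt(Rasterizer(), range(3)))
```

The key property is step 2: only the handful of pixels already in flight need to drain, rather than the remainder of an entire draw call.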

In terms of absolute time the benefit of course varies from game to game, and depends a bit on luck – where in the draw call you are when the preemption request is made. But in general, draw call size and complexity have been going up over the years, due to a combination of CPU limits (draw calls are expensive under DX11/OpenGL, which is why we have DX12/Vulkan) and the fact that pixel shaders continue to get longer, consequently taking more time for a given pixel/thread to fully execute. The end result is that Pascal can now execute a preemptive context switch for graphics much more rapidly than Maxwell 2 could.

Meanwhile in a pure compute scenario (i.e. running a CUDA program) Pascal takes things one step further. Not satisfied with preempting at the thread level, Pascal can actually preempt at the lowest level of them all, the instruction level. This means preempting a thread mid-flow, before the next instruction begins. The thread doesn’t even need to reach completion; with instruction level preemption, the thread can be pulled virtually immediately, Vaudeville Hook style.

This kind of fine-grained preemption is the most powerful, but it exists on the GPU primarily for compute purposes. The greatest significance to consumers is that it allows CUDA compute applications to run without triggering the TDR watchdog, as they can now stop on the spot and respond to the Windows check-in. This, not-so-coincidentally, has been one of Microsoft’s long-term goals with WDDM development, as they’ve been pushing GPU vendors to support finer-grained preemption for this very reason. More broadly speaking, NVIDIA’s plans are greater than just WDDM and the TDR watchdog, enabling functionality like interactive debugging of CUDA programs on a single-GPU system (previously there were greater limitations, and you’d often need multiple GPUs).

The trade-off for finer-grained preemption is that the deeper you go – the more “in the middle of things” you allow the interruption – the more work it is to context switch. Preempting at the draw call level involves very little state information, preempting at the thread level involves a fair bit of state, and preempting at the instruction level involves a massive amount of state, including the full contents of the L1 caches and the register files. In the latter case you’re looking at a minimum of several megabytes of state information that have to be recorded and sent to VRAM for storage. So finer-grained preemption is a very big deal at the implementation level, requiring new hardware to save and load state quickly, and then get it off to VRAM without skipping a beat.
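As a back-of-the-envelope check on that "several megabytes" figure – this is our own estimate, not a number from NVIDIA – we can total up the per-SM storage on a GP104-class GPU such as the GTX 1080:

```python
# Our rough estimate (not NVIDIA's figure) of the on-chip state an
# instruction-level preemption must spill to VRAM on a GP104-class GPU.

SM_COUNT   = 20   # GTX 1080 (GP104) SM count
REGFILE_KB = 256  # 65,536 x 32-bit registers per SM
L1_KB      = 48   # L1/texture cache per SM
SHARED_KB  = 96   # shared memory per SM

per_sm_kb = REGFILE_KB + L1_KB + SHARED_KB
total_mb = SM_COUNT * per_sm_kb / 1024
print(f"~{total_mb:.1f} MB of on-chip state")  # → ~7.8 MB of on-chip state
```

Nearly 8MB written out to VRAM (and later read back) on every instruction-level switch makes clear why dedicated save/restore hardware is needed to keep the operation fast.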

The actual time cost for preemption varies with the workload, but at the most basic level, when the GPU is ready to execute the context switch, NVIDIA tells us that it can be done in under 100us (0.1ms), or about 170,000 clock cycles. Relative to the GPU this is not an insignificant amount of time, and while it’s much faster than the total context switch time from Maxwell 2, it does mean that context switching is still a somewhat expensive operation (roughly 50-100x more so than on a modern Intel CPU). So context switching still needs to be used intelligently, and, for best performance, infrequently.
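NVIDIA's two numbers are easy to sanity-check against each other, and against the rough CPU comparison. The ~1.5us CPU context switch cost below is an assumed ballpark figure, not a measurement:

```python
# Sanity-checking NVIDIA's figures: 170,000 cycles in 100us implies a
# core clock of about 1.7 GHz, in line with Pascal's boost clocks.

switch_time_s = 100e-6
cycles = 170_000
implied_clock_ghz = cycles / switch_time_s / 1e9
print(f"{implied_clock_ghz:.1f} GHz")  # → 1.7 GHz

# Against an assumed ~1.5us CPU context switch (ballpark, not measured),
# the "50-100x more expensive" characterization checks out:
cpu_switch_s = 1.5e-6
print(f"{switch_time_s / cpu_switch_s:.0f}x")  # → 67x
```
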

Meanwhile I’ll quickly note that while the current Pascal drivers only implement thread/pixel level preemption for graphics and mixed workloads, as you might suspect from Pascal’s instruction level preemption capabilities for compute, this is something of an arbitrary decision. Because instruction level preemption is a fundamental Pascal ability, it could be made to do so for graphics as well as compute. The reason why NVIDIA has gone this route is one of cost/benefit tradeoffs; the DirectX team doesn’t believe that preemption at this level is necessary for graphics, in part because pixel shaders aren’t prone to long loops like dedicated compute kernels are, so the next thread boundary will come up relatively quickly. And preempting at the thread boundary is easier (i.e. there’s less state) than at the instruction level. This is something that NVIDIA could theoretically change course on in the future, but at least for now the belief is that the need/demand for such fine-grained preemption in a graphics context isn’t there.

Finally, speaking of graphics, let’s talk about what pixel/thread level preemption is going to be used for. Overall there aren’t too many cases where you want to context switch in the middle of a graphics task, but there is one case that NVIDIA considers very important: asynchronous time warp. We’ve covered asynchronous time warp before, so I won’t fully rehash it here, but in short it’s a neat hack for VR where the frame being prepared for a headset is warped at the last possible moment, using up-to-the-millisecond positioning data, in order to give the illusion that the frame is newer than it actually is. The benefit is that it reduces the perceived input lag of VR, which is one of the causes of simulator sickness.

NVIDIA has supported async time warp since Oculus enabled the feature in their SDK; however, given the above limitations of Maxwell 2, the architecture wasn’t able to execute an async time warp very efficiently. Assuming there weren’t any free SMs to work on a time warp right away, the need to wait until the end of a draw call meant that it could potentially be a relatively large amount of time until the context switch took place. This required the software stack to be relatively conservative with the preemption call, making the call early in order to ensure that the time warp could execute before it was too late.

With Pascal’s much improved preemption capabilities, this significantly changes how conservative async time warp needs to be. Pascal’s finer-grained preemption means that the preemption request can come much later. The benefit is that this both better ensures that the async time warp actually occurs in time – there’s virtually no chance of a thread taking too long to finish – and it means that much less GPU time is wasted from the GPU idling due to the conservative preemption request. The end result is that the performance impact of async time warp is reduced on Pascal as compared to Maxwell 2, allowing the GPU to spend more time in every refresh interval doing productive work.

All of that said, I don’t have a good feel for the benefit in terms of numbers; like so many other cases it’s going to depend on the specific game in use, and for that matter how well the game is saturating the GPU given the fixed 90Hz update interval. And though this puts Maxwell 2 in a lesser light, at least so far I haven’t personally noticed Maxwell 2 struggling to execute an async time warp. That Maxwell 2 does as well as it does is likely a testament to the viability of conservative preemption requests, which is to say that it will be interesting if we ever get a chance to see just how much performance Maxwell 2 had to leave on the table to reliably execute async time warps.
