AMD Comments on GPU Stuttering, Offers Driver Roadmap & Perspective on Benchmarking
by Ryan Smith on March 26, 2013 2:28 AM EST
The Start: The Rendering Pipeline In Detail
Before we can even discuss the concept of stuttering and other frame timing anomalies, we first need to take a high-level look at the Windows rendering pipeline. The pipeline isn’t particularly complex, but understanding which stages of the process are in the hands of Windows, the CPU, the driver, and the video card is necessary to see where bottlenecks and delays can occur.
At its most fundamental level, rendering a frame is a three-part process. An application needs to pass data to Windows, Windows needs to manage the process and interface with the drivers, and finally, once Windows and driver preparation is complete, the frame can be passed off to the GPU for final rendering and display.
At the top of the chain is the application itself. This is where user input is handled and, in the context of a game, where the simulation is executed. From a technical perspective, the application is the first arbiter of game smoothness; applications are responsible for adjusting the simulation rate in order to keep the flow of frames smooth. If the application cannot ensure an even rate, then nothing else that follows will really matter.
The reality of course is that this is harder than it sounds. It is not an insurmountable problem, but PCs are devices with a wide spectrum of performance and capabilities. A dual-core processor with an iGPU performs very differently from a hex-core processor with a small army of GPUs, and an application needs to be able to accommodate this so that the simulation operates as evenly as possible in both CPU- and GPU-bottlenecked scenarios.
Ultimately any timing model is going to be reactive, adjusting itself in response to prior events and how long previous frames took to render. Another option is to shortcut this process entirely and operate at a fixed (or capped) simulation rate, either by basing a game around 30Hz/60Hz operation or by decoupling rendering from the simulation entirely. Anyone who has uncapped id Software’s Rage, for example, will find that the game simply does not behave correctly without its 60Hz cap.
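To make the fixed-rate option concrete, below is a minimal sketch in C++ of a fixed-timestep loop that keeps the simulation ticking at an even 60Hz while letting rendering run as fast as the rest of the pipeline allows. UpdateSimulation and RenderFrame are hypothetical placeholders standing in for a game's own code; none of this comes from any particular engine.

```cpp
#include <chrono>

// Hypothetical placeholders for the game's own simulation and rendering code.
void UpdateSimulation(double dt) { /* advance the game state by dt seconds */ }
void RenderFrame()               { /* issue the frame's Direct3D draw calls */ }

void GameLoop()
{
    using clock = std::chrono::steady_clock;
    const std::chrono::duration<double> kStep(1.0 / 60.0); // fixed 60Hz simulation tick

    auto previous = clock::now();
    std::chrono::duration<double> accumulator(0.0);

    for (;;)
    {
        auto now = clock::now();
        accumulator += now - previous;
        previous = now;

        // Advance the simulation in fixed steps, regardless of how long the
        // previous frame took to render.
        while (accumulator >= kStep)
        {
            UpdateSimulation(kStep.count());
            accumulator -= kStep;
        }

        // Rendering is decoupled from the simulation rate and runs as fast
        // as the pipeline allows.
        RenderFrame();
    }
}
```

The accumulator absorbs frame-to-frame variance in rendering time, so the simulation advances at a constant rate even when individual frames take longer than expected.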
Static or dynamic, once a simulation has a suitable timing model in place we can then begin to look further down the chain, which is where we first encounter Direct3D, Windows’ primary 3D rendering API. Direct3D is nothing short of an enormous, complex structure of API calls and features. We tend to reduce it to version numbers and marquee features for the sanity of ourselves and our readers – as we will here – but it goes without saying that Direct3D takes years to master; and for a GPU manufacturer it’s made all the more complex by the simultaneous existence of the modern iteration of Direct3D (DX10+) and the classic iteration that is DX9 and its predecessors.
For the purpose of the rendering pipeline Direct3D has a few different jobs. First and foremost, it is collecting draw calls from the application, combining them, and processing them for further work. Once a complete frame’s worth of draw calls has been collected, Direct3D passes its processed work over to the first component of the video card driver stack, the User Mode Driver (UMD).
It’s the UMD that is primarily responsible for taking the output of Direct3D and turning it into work batches the GPU can handle. These work batches, command buffers (aka Display Lists), are collections of instructions and data suitable for processing by the target GPU. Among other things, the UMD is responsible for shader compilation and assigning rendering elements to the correct (and best) surface formats for the GPU.
A logical view of a single command buffer; from Microsoft's Direct3D documentation
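For a sense of what is actually being batched, the application-side work that the Direct3D runtime collects and the UMD compiles looks something like the following. This is a rough D3D11 sketch, not code from any shipping title; the device context, buffers, shaders, and input layout are assumed to have been created at initialization, and the names are illustrative.

```cpp
#include <d3d11.h>

// Illustrative helper: issues the per-frame state changes and draw call for one
// mesh. The D3D11 runtime records these calls; the UMD later compiles the
// accumulated work into a GPU-specific command buffer.
void DrawMesh(ID3D11DeviceContext* context,
              ID3D11Buffer* vertexBuffer, UINT stride, UINT vertexCount,
              ID3D11InputLayout* inputLayout,
              ID3D11VertexShader* vs, ID3D11PixelShader* ps)
{
    UINT offset = 0;

    context->IASetInputLayout(inputLayout);
    context->IASetVertexBuffers(0, 1, &vertexBuffer, &stride, &offset);
    context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    context->VSSetShader(vs, nullptr, 0);
    context->PSSetShader(ps, nullptr, 0);

    context->Draw(vertexCount, 0);  // one of many draw calls collected for the frame
}
```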
When the UMD’s work is complete, it passes its command buffer back over to Direct3D. Direct3D in turn passes that command buffer to the context queue, our first real bottleneck. We’ll get back to why this is a bottleneck in a bit, but briefly, the context queue is responsible for queuing the individual command buffers in order to smooth out the rendering process. Queuing command buffers at this stage increases frame rendering latency, but by providing a buffer of buffers it allows the rendering pipeline to absorb any variances in rendering time or simulation time to more smoothly render frames.
The context queue has also gone by other names over the years, such as the flip queue and the pre-rendered frames queue. This is the source of the 3 frame render-ahead limit in Windows that is sometimes exposed in games and drivers, as Windows will by default queue up to 3 frames in this manner. This can be controlled by application developers, but most will leave it at 3 so long as a game is smoothly moving along.
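For DXGI-based titles (DX10 and later), the usual knob for this limit is IDXGIDevice1::SetMaximumFrameLatency. A rough sketch, assuming an already-created D3D11 device:

```cpp
#include <d3d11.h>
#include <dxgi.h>

// Shrinks the render-ahead (pre-rendered frames) queue from the Windows
// default of 3, trading buffering against variance for lower latency.
void LimitFrameLatency(ID3D11Device* device, UINT maxFrames)
{
    IDXGIDevice1* dxgiDevice = nullptr;
    if (SUCCEEDED(device->QueryInterface(__uuidof(IDXGIDevice1),
                                         reinterpret_cast<void**>(&dxgiDevice))))
    {
        dxgiDevice->SetMaximumFrameLatency(maxFrames); // e.g. 1 or 2 instead of 3
        dxgiDevice->Release();
    }
}
```

Lowering the limit reduces input-to-display latency, at the cost of the buffering that helps absorb variance in rendering and simulation time.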
Beyond the context queue we have Windows’ GPU scheduler, which is what regulates the popping of command buffers off of the context queue to be fed to the kernel mode GPU driver (KMD). Beyond this point the rest of the pipeline is rather simple, with the KMD taking the command buffer and feeding it to the GPU, all the while the KMD and GPU work together to manage the operation of the GPU. When a frame is finally completed, the GPU generates an interrupt to inform the KMD and OS about the completion.
At the end of this process we have a rendered frame sitting in the GPU’s back-buffer, but the frame itself is not displayed automatically. At the end of a batch of command buffers – effectively marking the end of one frame and the beginning of the next – is the Direct3D Present() call. Present is the command that is responsible for telling the GPU to flip the back buffer to the front and present the rendered frame to the user. Only once the Present call executes does a frame get displayed. The Present call, though not a command buffer object, still follows the same rendering path as the command buffers, including queuing up in the context queue.
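On the application side, the end of a frame therefore amounts to little more than the following sketch; 'swapChain' is assumed to be the application's existing swap chain, and the sync interval controls whether the flip waits for vertical sync.

```cpp
#include <dxgi.h>

// End of frame: Present() travels down the same path as the command buffers,
// and once the GPU executes it the back buffer is flipped to the front.
void EndFrame(IDXGISwapChain* swapChain, bool vsync)
{
    // SyncInterval 1 waits for the next vertical blank; 0 presents
    // immediately and may tear.
    swapChain->Present(vsync ? 1 : 0, 0);
}
```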
103 Comments
Shark321 - Wednesday, March 27, 2013 - link
Overall a good article, but it has one huge problem. Ryan, you are repeating about 10 times that there is no good tool to replace the Fraps measuring, which is inaccurate. But there is. PcPerformance introduced a new microstutter measuring method weeks ago: http://www.pcper.com/reviews/Graphics-Cards/Frame-...
rickcain2320 - Wednesday, March 27, 2013 - link
I just bought an AMD/ATI card and not only do I have stuttering, I have that horrid POWERPLAY kicking in all the time with screen tearing. I'm pulling my hair out and wondering why I didn't buy GeForce. My old 8800GTS was doing great but it finally gave up the ghost one day. I should have stuck with at least something consistent in performance.
Deo Domuique - Wednesday, March 27, 2013 - link
This is the main problem on Anand's end: they need to sit down with a manufacturer first, in order to give us at least some valid graphs. It's understandable to a point, you don't bite the hand that feeds you, but... only to a point. On the other hand, I trust TechReport's graphs... Actually TR is one of the very few websites I trust.
lally - Wednesday, March 27, 2013 - link
There's actually been a lot of research on frame jitter's effects on people. You measure how well people do a specific task with different amounts of it, and compare their performance on the task to the jitter.
http://lmgtfy.com/?q=virtual+reality+frame+rate+ji...
NerdT - Wednesday, March 27, 2013 - link
First of all, it's a very good read. Thanks. Regarding the stated problem with GPUView, "Furthermore it still doesn't show us when a GPU buffer swap actually takes place and the user sees a new frame, and that remains the basis of any kind of fine-grained look into stuttering":
It can actually show you a "flip queue" in yellow, where you can see when the frame started to be flipped with the front buffer, the end of the flip process, and the wait time until it reaches the VSync signal, which is when the user sees the frame. Not sure why you mentioned this; better to revise it. I have been using GPUView for about two years and it's really unique, no other tool can yet compete with it.
mikato - Wednesday, March 27, 2013 - link
Nvidia: ok, we knew our ride here would end sometime. No more competitive advantage "secret bonus" in performance.
AMD fanboy: argh, as usual my AMD parts will perform better with time, and not get the respect they deserve since all the benchmarks were done already.
JeBarr - Thursday, March 28, 2013 - link
What a long, drawn-out way of helping AMD in the PR department.
Unlike most commenters, what I took away from this article is the fact that Ryan Smith is no longer qualified to conduct GPU benchmarks.
GPUView too complicated? Seriously?
lol.
Death666Angel - Thursday, March 28, 2013 - link
First of all: Great read! Very technical, but very interesting and still easy to understand. :)
Concerning V-Sync: I always enable it when I start playing a game for the first time. But 3 times out of 5, the gameplay gets too sluggish (that would probably be the added latency). So I have to turn it off and live with screen tearing and too many frames being rendered. It's a shame.
And reading all this and the issues involved, it makes me wonder how Oculus and the involved parties are getting around this problem. They are working on minimizing latency left and right. I would like to see their input on this and if they are only optimizing for a few hardware setups. :)
LoccOtHaN - Wednesday, April 3, 2013 - link
Mirillis Action! That program is an alternative to Fraps (no stuttering, and it's very light). RECOMMENDED by Ne01
KilledByAPixel - Thursday, April 4, 2013 - link
It is great to finally see someone deconstructing the issue of stutter in games; it drives me nuts! I also wrote an article that actually offers a solution to this problem. I developed a simple system that allows games to smooth out their delta by predicting the time when a frame will be rendered rather than using the measured delta from the update.
http://frankforce.com/?p=2636