Of the GPU and Shading

This is my favorite part, really. After the CPU has started sending draw calls to the graphics card, the GPU can begin actually rendering the frame. That frame contains input that was generated somewhere in the vicinity of 3ms to 21ms ago, depending on the software (add another 1ms to 7ms for a slower mouse). Modern, complex games will tend to push toward the long end of that range, while older games (or games that aren't designed to do a lot of realistic simulation, like twitch shooters) will have lower latency.

Again, the actual latency during this stage depends greatly on the complexity of the scene and the techniques used in the game.

These days, geometry processing and vertex shading tend to be pretty fast (geometry shading is slower but less frequently used), thanks to features like instancing and the fact that the majority of detail is introduced via the pixel shader (which is really a fragment shader, but we'll dispense with the nitpicking for now). If the use of tessellation catches on after the introduction of DX11, we could see even less time spent on geometry, as the current level of detail could be achieved with fewer triangles (or we could improve quality with the same load). This step could still take a millisecond or two with modern techniques.

When it comes to actual fragment generation from the geometry data (called rasterization), the fixed function hardware and early z / z culling techniques used make this step pretty fast (yet this can be the limiting factor in how much geometry a GPU can realistically handle per frame).

Most of our time will likely be spent processing pixel shader programs. This is the step where every pinpoint spot on every triangle that falls behind the area of a screen space pixel (these pinpoint spots are called fragments) is processed and its color determined. During this step, texture maps are filtered and applied, and work is done on those textures based on things like the fragment's location, the angle of the underlying triangle to the screen, and constants set for the fragment. Lighting is also part of the pixel shading process.

Lighting tends to be one of the heaviest loads in an already heavily loaded portion of the pipeline. Realistic lighting can be very GPU intensive. Getting into the specifics is beyond the scope of this article, but lighting alone can take a good handful of milliseconds for an entire frame. The rest of the pixel shading process will likely also take multiple milliseconds.
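To give a rough feel for the per-fragment work described above, here is a minimal sketch in Python. It is not how any particular game or GPU does it: the function names, the overdraw parameter, and the choice of a plain Lambertian diffuse term are all assumptions made for illustration, and real pixel shaders run on the GPU and do far more than this.

```python
def shade_fragment(texel_rgb, normal, light_dir, light_rgb=(1.0, 1.0, 1.0)):
    """Color one fragment: filtered texture color scaled by a diffuse light term.

    Assumes `normal` and `light_dir` are unit-length 3-vectors (an assumption
    of this sketch). The diffuse term is the clamped dot product N . L.
    """
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(t * c * n_dot_l for t, c in zip(texel_rgb, light_rgb))


def fragments_per_frame(width, height, overdraw=1.0):
    """Estimate fragments shaded per frame.

    `overdraw` (hypothetical parameter) is the average number of fragments
    shaded per screen pixel; 1.0 means every pixel is shaded exactly once.
    """
    return int(width * height * overdraw)


# A surface facing the light keeps its full texture color.
print(shade_fragment((1.0, 0.5, 0.25), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)))

# Even with no overdraw, 1920x1080 means over two million fragments per frame.
print(fragments_per_frame(1920, 1080))
```

The point is only that this small function has to run for every fragment, every frame, which is why the pixel shader tends to dominate GPU time.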

After it's all said and done, with the pixel shader as the bottleneck in modern games, we're looking at something like 6ms to 25ms. In fact, the latency of the pixel shaders can hide a lot of the processing time of other parts of the GPU. For instance, pixel shaders can start executing before all the geometry is processed (pixel shaders are kicked off as fragments start coming out of the rasterizer). The color/z hardware (render outputs, render backends or ROPs depending on what you want to call them) can start processing final pixels in the framebuffer while the pixel shader hardware is still working on the majority of the scene. The only real latency that is added by the geometry/vertex processing portion of the pipeline is the latency that happens before the first pixels begin processing (which isn't huge). The only real latency added by the ROPs is the processing time for the last batch of pixels coming out of the pixel shaders (which is usually not huge unless really complicated blending and stencil techniques are used).
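The overlap described above can be sketched with a little arithmetic. The numbers below are hypothetical, chosen only to sit inside the ranges quoted in this article (a millisecond or two of geometry work, 6ms to 25ms of pixel shading); the function names and the split into "startup" and "tail" portions are assumptions of this sketch, not measurements.

```python
def overlapped_gpu_latency_ms(geometry_startup_ms, pixel_shading_ms, rop_tail_ms):
    """GPU latency when stages overlap.

    Only the geometry work done before the first fragments appear, and the
    ROP work on the last batch of pixels, add to the pixel shading time.
    """
    return geometry_startup_ms + pixel_shading_ms + rop_tail_ms


def serial_gpu_latency_ms(geometry_ms, pixel_shading_ms, rop_ms):
    """GPU latency if each stage had to finish before the next one started."""
    return geometry_ms + pixel_shading_ms + rop_ms


# Hypothetical frame: 2 ms of geometry in total (0.5 ms before the first
# fragments), 15 ms of pixel shading, 3 ms of ROP work in total (0.5 ms of
# it left over after the last fragments are shaded).
overlapped = overlapped_gpu_latency_ms(0.5, 15.0, 0.5)
serial = serial_gpu_latency_ms(2.0, 15.0, 3.0)

print(f"overlapped: {overlapped:.1f} ms, serial: {serial:.1f} ms")
```

With these made-up figures, most of the geometry and ROP time disappears behind the pixel shader, which is the sense in which the pixel shader "hides" the other stages.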

With the pixel shader as the bottleneck, we can expect that the entire GPU pipeline will add somewhere between 10ms and 30ms. This follows from the fact that most modern games, at the resolutions people run them, produce something between 33 FPS and 100 FPS, which works out to roughly 30ms down to 10ms per frame.
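The conversion between frame rate and time per frame is just a reciprocal. A small sketch (the function names are ours, not from any library):

```python
def frame_time_ms(fps):
    """Milliseconds spent on each frame at a given frame rate."""
    return 1000.0 / fps


def fps_from_frame_time(ms):
    """Frame rate that corresponds to a given time per frame."""
    return 1000.0 / ms


for fps in (33, 60, 100):
    print(f"{fps:>3} FPS -> {frame_time_ms(fps):.1f} ms per frame")
```

So 100 FPS is 10ms per frame and 33 FPS is a touch over 30ms per frame, which is where the 10ms to 30ms range comes from when the GPU is the limit.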

But wait, you might say, how can our framerate be 33 to 100 FPS if our graphics card latency is between 10ms and 30ms: don't the input and CPU time latencies add to the GPU time to lower framerate?

The answer is no. When we are talking about the total input lag, then yes, we do have to add these latencies together to find out how long it has been since our input was gathered. After the GPU, we are up to something between 13ms and 58ms of input lag (the upper figure includes the extra 7ms for a slower mouse). But the cool thing is that human response happens in parallel to input gathering, which happens in parallel to CPU time spent processing game logic and draw calls (which can happen in parallel to each other on multicore CPUs), which happens in parallel to the GPU rendering frames. There is a sequential path from input to the screen, but we can almost look at this like a heavily pipelined path where each stage operates in parallel on a different upcoming frame.

So we have the GPU rendering the previous frame while simulation and game logic are executing and input is being gathered for the next frame. In this way, the CPU can be ready to send more draw calls to the GPU as soon as the GPU is ready (provided only that we are not CPU limited).
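One way to keep the two ideas apart is a toy pipeline model: input lag is the sum of the stage times, while frame rate is set by the slowest stage alone. The sketch below is a simplification under stated assumptions. It lumps input gathering and CPU work into a single stage using the 3ms to 21ms range from this article, treats the slow-mouse penalty as its own entry, and assumes the stages overlap perfectly; the dictionary keys and function names are ours.

```python
def input_lag_ms(stage_ms):
    """Latency from input to finished frame: every stage adds to it."""
    return sum(stage_ms.values())


def frame_rate_fps(stage_ms):
    """Throughput of a fully pipelined path: limited by the slowest stage."""
    return 1000.0 / max(stage_ms.values())


# Best and worst cases built from the ranges quoted in this article.
best = {"input_and_cpu": 3.0, "gpu": 10.0}
worst = {"input_and_cpu": 21.0, "slow_mouse": 7.0, "gpu": 30.0}

for name, stages in (("best", best), ("worst", worst)):
    print(f"{name}: {input_lag_ms(stages):.0f} ms of lag "
          f"at {frame_rate_fps(stages):.0f} FPS")
```

In both cases the GPU is the slowest stage, so it alone sets the frame rate (100 FPS and about 33 FPS), while the lag figures (13ms and 58ms) are the sums. If the CPU stage were the longest one instead, we would be CPU limited and it would set the frame rate.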

So what happens after the frame is finished? The easy answer is a buffer swap and scanout. The subtle answer is mounds of potential input lag.


  • aguilpa1 - Thursday, July 16, 2009 - link

    Lots of variables that we never consider when trying to do fast gaming. I would be curious how much lag is in a racing sim like GRID or Colin McRae: DiRT. Those are graphics-intensive games and demand the fastest everything to keep you from going in the ditch. I have noticed I compensate at times by estimating when to start turning before the turn arrives.
  • crimson117 - Thursday, July 16, 2009 - link

    Isn't part of that just intentional skidding / drift in racing games, to mimic the "lag" of rubber catching asphalt at high speeds?
  • hechacker1 - Thursday, July 16, 2009 - link

    You say TF2 is GPU limited, but with my 4850 I find the first core is pegged at 100%. The same applies to my older 3850.

    With a Core i7 920 @ 166x20 = 3320MHz and +166 for Turbo mode, Hyper-Threading on, I see TF2 using 6 cores. The first is pegged out at 100%, the second and third vary from 50-100% depending on the action (32 player server). The other three hover around 25%.

    1920x1080. Benq 2400G (bought for its low input-lag)

    All highest settings, 4xMSAA, Aniso 8x, Disable vsync, FOV 90

    My framerate hovers around 100FPS for most Valve maps.

    I use this autoexec to get more threading and higher quality textures:

    r_lod 0
    mat_picmip -10
    cl_new_impact_effects 1
    mp_usehwmmodels 1
    mp_usehwmvcds 1
    cl_burninggibs 1
    mat_specular 1
    mat_parallaxmap 1
    r_threaded_particles 1
    r_threaded_renderables 1
    cl_ragdoll_collide "1"
    jpeg_quality 100
    r_threaded_client_shadow_manager 1

    Most people say TF2 is a CPU limited game. Perhaps that only applies to ATI?

    Even without the autoexec.cfg, I see the game use 100% on the first core.

    Very good article though. I hope this shuts up the false info that 60fps is too fast for humans to notice.

  • DerekWilson - Thursday, July 16, 2009 - link

    even if a core is pegged at 100% that doesn't mean the game's performance is CPU limited.

    at 2560x1600 we were hovering around 110fps but at 1152x864 we were constantly well over 200 fps. As lowering resolution doesn't change the load on the CPU, this clearly indicates that we were GPU bound -- at least at 25x16.

    For our 1152x864@120Hz test, we might have been CPU bound, but I don't have the data to know for sure here (I didn't test any near resolutions).
  • hechacker1 - Thursday, July 16, 2009 - link

    Oh yeah.

    Flip queue to 0.
    ATI A.I. at Low or "standard" (I've read "high" mode can use more CPU?)

    Latest driver. Windows 7 x64 7201.
  • Qiasfah - Thursday, July 16, 2009 - link

    In the article you stated that TF2 was GPU limited (and it was in the situations you were testing), however you should find that in battle situations with other active characters present it becomes heavily CPU limited. It would be interesting to see if there was a difference in input lag due to this in the midst of battles rather than sitting idle.

    I run an i7 920, and even with multicore enabled (an option which will very commonly double a person's TF2 framerate) I get the same dips in FPS regardless of graphical settings. It would be interesting to see how overclocking affects the performance of this game.
  • DerekWilson - Thursday, July 16, 2009 - link

    More than likely, in TF2, you'll be bottlenecked at the network when it comes to performance ...

    But the way Valve does things is with local prediction (running code on the client) and then checking predictions on the server. This should mean that our test shows what you can expect to actually /see/ whether or not what actually /happens/ is the same (if you are very laggy on the network or if there are lots of players or whatever).
  • codestrong - Thursday, July 16, 2009 - link

    "Beyond that, GPU is the next most important faster (factor?), and a mouse that can do at least 500 reports per second is a good idea." Nice work by the way. I've been interested in this since Carmack mentioned input lag during his work on quake live.
  • DerekWilson - Thursday, July 16, 2009 - link

    yeah, i meant factor. thanks.
  • SiliconDoc - Tuesday, July 21, 2009 - link

    Yes, nice article and nice work on getting the job done without a super expensive camera, on an interesting subject for gamers.
