Parsing Input in Software and the CPU Limit

Before we get into software, for the sake of sanity, we are going to ignore context switching: we'll pretend that only the operating system kernel and the game are running, that each always gets processor time exactly when it needs it for as long as it's needed, and that they never need it at the same time. On real desktop operating systems, especially on single core processors, there will be added delay due to process scheduling between our game and other tasks (which is handled by the operating system) and OS background tasks. These delays (in extreme cases called starvation) can range from a handful of nanoseconds to the microsecond level on modern systems, depending on process prioritization, what else is happening, and how the scheduler is implemented.

Once the mouse has sent its report over USB to the PC and the USB root hub receives the data, it is up to the OS (for our purposes, MS Windows) to handle the data next. Our report travels from the USB root hub over the system bus (southbridge through the northbridge to the CPU, taking some variable number of nanoseconds depending on load), is placed on an input stack (in this case the HID (Human Interface Device) stack), and a Windows OS message (WM_INPUT) is generated to let any user space software monitoring raw mouse input know that new data has arrived. Software written to take full advantage of the hardware will handle the WM_INPUT message by reading the appropriate data directly from the HID stack as soon as it is notified that data is waiting.

This particular part of the process (checking Windows messages and handling the WM_INPUT message) happens quite fast and should be on the order of microseconds at worst. This is a hard delay to track down, as the real time taken depends on what the programmer actually does; latencies here are not guaranteed by either the motherboard chipset or Windows.

Once the software has the data (at least 1ms plus some microseconds after the physical movement), it needs to do something with it. This is hugely variable, as developers can choose to consume input at any of a number of points in the process of updating the game state for the next frame. The approach that makes the most sense to me would be to run the AI based on the previous input data, step through any scripted actions, update physics per object based on the last state and AI decisions, and then collect user input and update player state/physics based on the previous state and the current input.
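As a rough sketch, the per-frame ordering described above might look something like this. All names and the string-tagging here are my own illustration, not taken from any particular engine; the point is just that AI consumes last frame's input while the player update sees the fresh sample.

```python
# Minimal sketch of one frame of the update loop ordered as described above:
# AI and scripts run against the previous frame's input, then physics, and
# only then is fresh user input sampled for the player update.

def run_frame(state, read_input):
    state["ai"] = f"ai_from_{state['input']}"          # AI uses previous input
    state["scripts_stepped"] = True                    # step scripted actions
    state["physics"] = f"physics_from_{state['ai']}"   # physics from last state + AI
    state["input"] = read_input()                      # only now: fresh input
    state["player"] = f"player_from_{state['input']}"  # player reacts to new input
    return state

state = {"input": "frame0_input"}
state = run_frame(state, lambda: "frame1_input")
```

Note that under this ordering the AI is always reacting to input that is one frame stale, which is one of the reasons the choice of where input is sampled matters for perceived lag.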

There are cases or design decisions that may require getting user input before doing some of these other tasks, so the order I describe might not always be practical. This whole part of the pipeline can be quite long, as highly intelligent AI and immersive physics (along with other game scripting and state updates) can require massive amounts of work. At the least we have lots of sorting, branching, and necessarily serial computations to worry about.

Depending on when input is collected and the depth and breadth of the simulation, we could see input lag increase by up to several milliseconds. This is highly game dependent, but it isn't something the end user has any control over beyond getting the fastest possible CPU (and even this likely won't change things in a perceivable way, as there are memory and system latencies to consider and the GPU is largely the bottleneck in modern games). Some games are designed to be highly responsive and some games are designed to be highly accurate. While always having both cranked up to 11 would be great, there are trade-offs to be made.

Unfortunately, that leaves us with a highly variable situation. The only way to really determine the input lag caused by game code itself is to profile the code (which requires access to the source to be done right) or to ask a developer. But knowing the specifics isn't as necessary as knowing that there's not much the gamer can do to mitigate this issue. For the purposes of this article, we will consider game logic to typically add somewhere between 1ms and 10ms of input lag in modern games. This assumes things like decoupling simulation and AI threads from rendering and doing work in parallel, among other things; if everything were done serially, it would very likely take longer.

When we've got our game state updated, we then set up graphics for rendering. This involves using our game state to update geometry and display lists on the CPU side before the GPU can start work on the next frame. The speed of this step is again dependent on the implementation, on the complexity of the scene, and on the number of triangles required, and it can take up a good bit of time. While this is highly dependent on the game and what's going on, we can typically expect something between 1ms and 10ms for this part of the process as well, if we include the time it takes to upload geometry and other data to the GPU.
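A toy sketch of that CPU-side setup stage: walk the game state, cull what isn't visible, and sort the survivors into a draw list before submitting to the GPU. The culling and the sort-by-material here are illustrative assumptions, not how any particular engine does it; real engines are far more involved.

```python
# Toy sketch of CPU-side render setup: cull objects outside the view,
# then sort the visible ones by material to minimize GPU state changes
# when the draw list is submitted. Entirely illustrative.

def build_draw_list(objects, view_min, view_max):
    visible = [o for o in objects if view_min <= o["x"] <= view_max]
    # Stable sort by material so draw calls sharing state are batched together
    return sorted(visible, key=lambda o: o["material"])

scene = [
    {"name": "crate",  "x": 5,   "material": "wood"},
    {"name": "ogre",   "x": 50,  "material": "skin"},
    {"name": "barrel", "x": 7,   "material": "wood"},
    {"name": "tower",  "x": 999, "material": "stone"},  # out of view: culled
]
draw_list = build_draw_list(scene, view_min=0, view_max=100)
```

Every millisecond spent in work like this happens before the GPU can begin the frame, which is why it counts toward input lag.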

Now, all the issues we've covered on this page go into making up a key element of game performance: CPU time. The total latency from front to back in this stage of a game engine creates a CPU limit on performance. When what comes after this (rendering on the GPU) takes less time than everything up to this point, we have hit the CPU limit. We can typically see the CPU limit when we drop resolution down to something ridiculously low on a high end card without seeing any real performance gain between that and the next highest resolution.

From the examples I've given here, if both the game logic and the graphics/geometry setup come in at the minimum latencies I've suggested should be typical, we could be CPU limited at as much as 500 frames per second. On the flip side, if both portions of this process push up to the 10ms level, we would never see a frame rate over 50 FPS no matter how fast the GPU rendered anything.
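The arithmetic behind those numbers is simple: the CPU-limited frame rate is just the reciprocal of the total CPU time spent per frame. A quick check using the per-stage estimates from the text:

```python
# CPU-limited frame rate is the reciprocal of total CPU time per frame,
# here taken as game logic time plus graphics/geometry setup time.

def cpu_limited_fps(logic_ms, setup_ms):
    return 1000.0 / (logic_ms + setup_ms)

best = cpu_limited_fps(1, 1)     # both stages at their 1ms minimum
worst = cpu_limited_fps(10, 10)  # both stages at their 10ms maximum
```

With both stages at 1ms we get a 500 FPS ceiling; with both at 10ms, 50 FPS, no matter how fast the GPU is.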

Obviously there is variability in games, and sometimes we see a CPU limit at less than 60 FPS even at the lowest resolution on the highest end hardware. Likewise, we can see framerates hit over 2000 FPS when drawing a static image (where game logic and display lists don't need to be updated) with a menu in front of it (like when a user hits escape in Oblivion with vsync off). And, again, multi-threaded software design on multi-core CPUs really muddies up the situation. But this is near enough to illustrate the point.

And now it's on to the portion of realtime 3D graphics that typically incurs the most input lag before we leave the computer: the graphics hardware.

84 Comments

  • DerekWilson - Sunday, July 19, 2009 - link

    It was bound to happen wasn't it?

    This has been around for a few years now, but (for obvious reasons) never made it into the mainstream gaming community. And, really, now that high performance mice are much more available it isn't as much of an issue.
  • Kaihekoa - Saturday, July 18, 2009 - link

    From the conclusion this point wasn't clear to me.
  • DerekWilson - Sunday, July 19, 2009 - link

    at present triple buffering in DirectX == a 1 frame flip queue in all cases ...

    so ... it is best to disable triple buffering in DirectX if you are over refresh rate in performance (60FPS generally) ...

    and it is better to enable triple buffering in DirectX if you are under 60 FPS.
  • Squall Leonhart - Wednesday, March 30, 2011 - link

    This is not always the case actually; there are some DirectX engines, the Age of Empires 3 engine being one example, that hitch when moving around the map unless triple buffering is forced on the game.
  • billythefisherman - Saturday, July 18, 2009 - link

    First of all I'd like to say well done on the article you're probably the first person outside of game industry developers to have looked at this rather complex topic and certainly the first to take into account the whole hardware pipeline as well.

    Sadly though there are some gaping holes in your analysis, mainly focused around the CPU stage. Your CPU isn't going to run any faster than your GPU (and actually the same is true in reverse), as one is dependent on the other (the GPU is dependent on the CPU). As such the CPU may finish all of its tasks faster than the GPU, but the CPU will have to wait for the GPU to finish rendering the last frame before it can start on the next frame of logic.

    No game team in the world developing for a console is going to triple buffer their GPU command list.

    I intentionally added 'developing for a console' as this is also an important factor I'd say around 75% (being very conservative) of mainstream PC games now are based on cross platform engines. As such developers will more than likely gear their engines to the consoles as these make up the largest market segment by far.

    The consoles all have very limited memory capacities in comparison to their computational power, so developers will more than likely try to save memory over computation; thus a double buffered command list is the norm. Some advanced console-specific engines actually drop down to a single command buffer and use CPU-GPU synchronisation techniques because the CPU is faster than the GPU. This kind of thing isn't going to happen on the PC because the GPU is invariably faster than the CPU.

    When porting a game to PC a developer is very unlikely to spend the money re-engineering the core pipeline because of the massive problems that can cause. This can be seen in most 'DirectX 10' games, as they simply add a few more post processing effects to soak up the extra power. You may call it lazy coding; I don't, it's just commercial reality, these are businesses at the end of the day.

    So both your diagrams on the last page are wrong with regards to the CPU stage, as it will take roughly the same amount of time as the GPU in the vast majority of frames because of frame locality, i.e. one frame differs little from the next as the player tends not to jump around in space, so neighbouring frames take similar amounts of time to render.

    Onto my next complaint :
    "If our frametime is just longer than 16.67ms with vsync enabled, we will add a full additional frame of latency (with no work being done on the GPU) before we are able to swap the finished buffer to the front for scanout. The wasted work can cause our next frame not to come in before the next vsync, giving us up to two frames of latency (one because we wait to swap and one because of the delay in starting the next frame)."

    What are you talking about man!?! You don't drop down to 20fps (ie two more frames of latency) because you take 17ms to render your frame - you drop down to 30fps! With vsync enabled your graphics processor will be stalled until the next frame, but that's all, and you could possibly kick off your CPU to calculate the next frame to take advantage of that time. Not that that's going to make the slightest jot of difference if you're GPU bound, because you have to wait for the GPU to finish with the command buffer it's rendering (as you don't know where in the command buffer the GPU is).

    As I've said, on the consoles there are tricks you can do to synchronise the GPU with the CPU, but you don't have that low level control of the GPU on the PC as Nvidia/ATI don't want the internals of their drivers exposed to one another.

    And as I've said not that you'd want to do such a thing on PC as the CPU is usually going to be slower than the GPU and cause the GPU to stall constantly hence the reason to double buffer the command buffer in the first place.

    I've also tried to explain in my posts to your triple buffering article why there's a lot of cobblers in the next few paragraphs.
  • DerekWilson - Sunday, July 19, 2009 - link

    Fruit pies? ... anyway...

    Thanks for your feedback. On the first issue, the console development is one of growing importance as much as I would like for it not to be. At some point, though, I expect there will be an inflection point where it will just not be possible to build certain types of games for consoles that can be built on PCs ... and we'll have this before the next generation of consoles. Maybe it's a pipedream, but I'm hoping the development focus will shift back to the PC rather than continue to pull away (I don't think piracy is a real factor in profitability though I do believe publishers use the issue to take advantage of developers and consumers).

    And I get that with GPU as bottleneck you have that much time to use the CPU as well ... but you /could/ decouple CPU and GPU and gain performance or reduce lag. Currently, it may make sense that if we are GPU limited the CPU stage will effectively equal the GPU stage in latency -- and likewise that if we are CPU limited, the GPU stage effectively equals the CPU stage (because of stalling) in input latency.

    Certainly it is a more complex topic than I illustrated, and if I didn't make that clear then I do apologize. I just wanted to get across the general idea rather than a "this is how it always is" kind of thing ... clearly Fallout 3 has even more input lag than any of my worst case scenarios account for, even with 2 frames of image processing on the monitor ... I have no idea what they are doing ...

    ...

    As for the second issue -- you can get up to two frames of INPUT LAG with vsync enabled and 17ms GPU time.

    you will get up to these two frames (60Hz frames) of input lag at 30FPS ...

    I'm not talking about the frame rate dropping to 2 frames then 1 frame (20 FPS) ... I'm talking about the fact that, at best, your input is gathered 17ms before your frame completes on the GPU (1 frame of input lag) and (because it missed vsync) it will take another frame for that to hit the screen (for a total of two).
  • billythefisherman - Monday, July 20, 2009 - link

    I have to reiterate: well done on tackling this rather complex issue, I applaud you! (I just wish you hadn't whipped up your punters so much about the benefits of triple buffering!)
  • Gastra - Saturday, July 18, 2009 - link

    For information (quite a lot if you follow the links) on what an optical mouse sees:
    http://hackedgadgets.com/2008/10/15/optical-mouse-...
  • DerekWilson - Sunday, July 19, 2009 - link

    That's pretty cool stuff ... And it lines up pretty well with our guess at mouse sensor resolution for the G9x.

    It'd still be a lot nicer if we could get the specs straight from the manufacturer though ...
  • PrinceGaz - Friday, July 17, 2009 - link

    "For input lag reduction in the general case, we recommend disabling vsync. For NVIDIA card owners running OpenGL games, forcing triple buffering in the driver will provide a better visual experience with no tearing and will always start rendering the same frame that would start rendering with vsync disabled."

    I'm going to ask this again I'm afraid :) Are you sure, Derek? Does nVidia's triple-buffer OpenGL driver implementation do that, or is it just the same as what most people take triple-buffer rendering to be, that is, having one additional back buffer to render to so as to provide a steady supply of frames when the framerate dips below the refresh rate? Have you got confirmation, either from screenshots or something else (like nVidia saying that is how it works), that OpenGL triple-buffering is any different from Direct3D rendering, or from how AMD handle it?

    Because if you don't, then all you are saying is that triple-buffering is a second back-buffer which is filled to prevent lags when the framerate falls below the refresh rate. Do you know for sure that nVidia OpenGL drivers render constantly when in triple-buffer mode or are you only assuming they do so?
