Threading Models

There are four basic threading models available for computer programmers. Some of them are easier to implement than others, and some of them are more beneficial for certain types of work than others. The four threading models are:
  • Single Threading
  • Coarse Threading
  • Fine-Grained Threading
  • Hybrid Threading
Single threading is what everyone generally understands. You write a program as a sequential series of events that are executed. It's easy to do, and this is what most programmers learn in school or when they first start "hacking". Unfortunately, single threading is becoming obsolete for a lot of tasks. Simple programs that generally wait for user input or other events can still get by with taking this approach, but any application that depends on CPU computational power is going to miss out on a large amount of performance if it isn't designed to run in a multithreaded fashion. There's not much more to say about single threading.


The next step up from single threading is to look at coarse threading. Next to avoiding threads altogether, this is the simplest approach to multithreading. The idea is basically to find discrete systems of work that can be done separate from each other. The example we gave earlier of having two cooks working a single dish, with each person preparing a portion of the dish is an example of coarse threading. If one person is working on slicing up and grilling some chicken while a second person is preparing a Hollandaise sauce, you have two separate systems that can easily be accomplished at the same time.

A classic example of this in the computing world is an application that uses a client-server architecture. You would simply run the client code as one thread and the server code as a second thread, each would handle certain tasks, there would be a minimal amount of interaction between the two threads, and hopefully you would get a net increase in performance. This type of architecture is in fact what has allowed two of the current multithreaded games to begin to take advantage of multiple processors. Quake 4 and Call of Duty 2 both stem from id Software code that used a client-server design. The server handles such tasks as networking code, artificial intelligence and physics calculations, and perhaps a few other tasks related to maintaining the world state. Meanwhile the client handles such things as user input, graphics rendering, audio and some other tasks as well. (We might have the breakdown wrong, but it's the general idea we're interested in.)

Depending on the game, breaking up a client-server design into two separate threads can provide anywhere from ~0% to 20% or more performance. 0% - as in no gain at all? Unfortunately, as many of you are already aware, many games are bottlenecked by one specific area: graphics performance. If you reach the point where the graphics card is slowing everything else down, the CPU might have more than enough time available to run all of the other tasks while waiting for the graphics card to finish its work. If that's the case, having a faster CPU -- or even more CPU cores -- would provide little or no benefit in a coarse threaded client-server game. This is why Quake 4 and Call of Duty 2 see performance increases for enabling SMP support that start out high when you run at lower resolutions using fast graphics cards, but as the resolution increases and the GPU becomes the bottleneck, the benefit of SMP begins to disappear. Note also that both of these games only use two threads, so if you have something like a quad core processor or two dual core processors, you still aren't able to effectively use even half of the total computational resources available.

This is not to say that coarse threading can't be advantageous. If you have an application with two tasks that need to be computed, and both tasks take the same amount of time -- or nearly the same amount of time -- using a coarse threaded approach allows you to easily utilize all of the available CPU power. There are many tasks that fall into this category, although likewise there are plenty of tasks that do not fit this model. It is still a better solution than single threading in a multiprocessor world, but in many instances it's a baby step in the right direction.


Fine-grained threading takes the opposite approach of coarse threading. Instead of looking for entire systems that can be broken into separate threads, fine-grained threading looks at a single task and tries to break it into smaller pieces. A couple of the best examples of fine-grained threading providing a dramatic increase in performance are 3D rendering applications and video encoding applications. If you try to take a coarse threading approach to 3D rendering, you will usually end up with one system that handles the actual 3D rendering, and it will consume well over 90% of the application time. The same can be said of video encoding. Luckily, both tasks -- 3D rendering and video encoding -- are relatively simple to break up into smaller pieces. You simply take a divide and conquer approach, so if you have two processor cores you split the scene in half, or with four cores you split the scene into four pieces, etc. You can further optimize things by allowing any cores that finish early to begin working on a portion of one of the unfinished sections.

Thinking back to our kitchen analogy, what exactly is fine-grained threading? The best example we can come up with would be the preparation of a large quantity of food. Let's say you have a large barbecue where there will be hundreds of people present. Everyone wants their food to be warm, and hopefully the food doesn't get overcooked, undercooked, or any one of a number of other potential problems. If you're preparing something like hamburgers or steaks, cooking several at a time makes sense, but there's only so much one person can do. The answer is to add additional barbecues with additional cooks, so they can all be working on making hamburgers at the same time.

The problem with fine-grained threading is that not all tasks easily lend themselves to this approach. There are other types of foods that could be served at a barbecue where you might not need to have a bunch of chefs doing the same task, for example. It is definitely more difficult (overall) than coarse threading, but you can run into problems when you have a lot of work to be done but you don't know exactly how long each piece will take. Performance scaling and resource management also tend to become issues the more you split up a task - you wouldn't want 300 chefs, 300 barbeques, all cooking one hamburger each, right? You have to maintain a balance between spending time breaking a task into smaller pieces versus simply getting the work done. Typically, there is a minimum size beyond which it's not worth splitting up a task any further, and when you reach this point you just let all of the assigned pieces complete. Things like memory bandwidth can also become a performance bottleneck, so if you have a task that was really easy to break into pieces but each piece demanded a lot of memory bandwidth, you might discover that more than four pieces shows no performance improvements on current computer systems due to memory bandwidth limitations.

The net result is that fine-grained threading often has diminishing returns as you add more processors. With 3D rendering for example, having two processor cores is often about 90% faster than having only one processor core, whereas four processor cores is only about 80% faster than two cores, and eight cores might only be 70% faster than four cores. So if one processor takes 20 minutes to complete a 3D rendering task, two cores might be able to complete the task in 11 minutes, four cores might complete the task in 6 minutes, and eight cores could complete the task in only 3.5 minutes. The benefits of fine-grained threading will vary depending on the work that is being performed, and it can place a larger burden on the programmer to get things properly optimized, but the performance increases can still be substantial.


Hybrid threading really doesn't add much new to the table. As the name implies, this is a hybrid approach where you basically take both coarse and fine-grained threading and utilize them as appropriate to the task at hand. This requires the most work, as now you have to deal with synchronizing coarse system threads along with potentially individual threads within the systems. Hybrid threading will typically have more threads to worry about than either of the other approaches. The good news is that when done properly, hybrid threading can make maximum use of the available CPU power.

What Is Multithreading? Building a Threaded Game Engine
Comments Locked

55 Comments

View All Comments

  • JarredWalton - Tuesday, November 7, 2006 - link

    What's with the octal posting? Too many CPU cores running? ;)

    I deleted the other 7 identical posts for you. Careful with that Post Comment button!
  • saratoga - Tuesday, November 7, 2006 - link

    Server kept timing out when I hit post, so I assumed it wasn't committing :)
  • exdeath - Tuesday, November 7, 2006 - link

    You can see my recent comments on this topic here:

    http://www.dailytech.com/Article.aspx?newsid=4847&...">http://www.dailytech.com/Article.aspx?newsid=4847&...

    In my experience relying on atomic CPU swap operations isn't enough as it only works with a single value (32 bit word for example).

    While you lock and swap a 32 bit Y value, someone else has just finished reading the newly written X value but beat you to the lock to read the old Y value before you've updated. Clearly whole data structures need to be coherent, not just small atomic values.

    Also it’s unusual to modify objects observable states mid frame. Even if you avoided the above example so that the X,Y pair was always updated together, you'd still have different objects interpreting the position as a whole of that object in different places at different times. State data must be held constant to all observers throughout the context of a single frame.
  • exdeath - Tuesday, November 7, 2006 - link

    Even if you avoided the above example so that the X,Y pair was always updated together, you'd still have different objects interpreting the position as a whole of that object in different places at different times in the same frame.
  • JarredWalton - Tuesday, November 7, 2006 - link

    I'm assuming your comment is in regards to the PS3/Cell comments on the last page? It's sort of sounds like you're arguing about the way Valve has chosen to go about doing things, or that you disagree with some of the opinions they've expressed concerning other hardware. We have only tried to provide a very high-level overview of what Valve is doing, and we hardly touched the low-level details -- Valve didn't spend a lot of time on specific implementation issues either. All they did was provide us with some information about what they are doing, and a bit of opinion on what they think of the rest of the hardware options.

    Preventing anything else from doing write operations to the world state during an entire frame in order to keep things coherent is a big problem with multithreading. Apparently Valve has found a way around that, or at least found a way to do it more efficiently, using lock free and wait free algorithms. No, I can't honestly say I really understand what those algorithms do, but if they say it worked better for their code base I'm willing to trust them.

    As far as the PS3/Cell processor goes, Valve did say that they have various thoughts on how to properly utilize the architecture. It is simply going to be more difficult to do relative to Xbox 360 and PC. It's not impossible, and companies are definitely going to tackle this problem. As far as how they tackle it, I'm more than a bit rusty on my coding background, and other than high-level details I'm not too concerned how they improve their multithreading code on any specific platform, just that they do it.
  • exdeath - Tuesday, November 7, 2006 - link

    The other issue is OS support.

    Compiler add-on's or third party APIs can only serve to hide the details or make things look cleaner. But no matter what, the final barrier between the application and the OS are the API calls provided by the OS threading model. Thus no third party implementation can be better than the OS thread model itself in terms of performance and overhead. All those can do is make it easier to use at the top by handling the OS details.

    I imagine threading APIs on popular OSes will start to evolve, just like graphics APIs have, once everyone gets on the multi-core bandwagon and starts to get a feel for what's available in the OS APIs and what they'd rather have. So far, Vista's thread pool API looks good, but I still don't see an API to determine such basic things as checking if the work queue is empty and all threads are idle, etc.

    Currently I find it's easier to implement my own thread pool manager which does atomic increments and decrements on a 'task count' variable as tasks are entered or completed in the queue. Checking if all tasks are done involves testing that task count against 0 and signaling an event flag that wakes any management threads sleeping until all its work tasks to complete. It also allows for more flexibility in 'before and after' housekeeping as work threads move from task to task and that kind of control isn't offered in the XP’s built in thread pool API, nor Vista’s as far as I can tell.

  • exdeath - Tuesday, November 7, 2006 - link

    Not arguing their methods, a lot of things in this article are in line with my own opinions on multithreading, pretty much the best way to got about it. I'm just pointing out that atomic lock/swap operations in hardware are very primitive and typically operate only on CPU word size values, not entire data structures. Thus it's possible between doing two atomic operations on two variables on one core, another core can get an old version of one variable and a new version of another.

    core1: compute X
    core2: ...

    core1: lock/write x
    core2: read x, get newly written version

    core1: compute Y
    core2: read Y, get old y before the update

    core1: lock/write Y
    core2: ...

    The task on core2 is working with inconsistent data, the new X and the old Y. If the task on core2 only uses the data as input, i.e.: AI tracking another AI entity, it has the wrong position, and won't know about it since it has no need to perform its own lock/write (so it never gets the exception that says the value changed). Even if it did, it would have to throw out all work and redo it with the new Y, and then it could possibly change again.

    Looping and retrying seems wasteful. And I’m thinking the only way to catch such a hardware error on a failed lock/write update is via exceptions, and handling a thrown exception on an attempt to write a single 32 bit value is very wasteful of CPU cycles.

    In my own research I have had excellent results with double buffering any modified data. Each threaded task only updates its hidden internal working state for frame n+1 while all reads to the object are read from its external current state for frame n. At the end of the frame when all parallel tasks have completed, the current/working states are swapped, and the work queue is filled again to start the next frame.

    This ensures that throughout the entire computation of frame n+1, the current frame n state will be available to all threads, and guaranteed to not be modified through the duration of current frame. So basically all threads can read anything they want and modify their own data. On PC/360 the time to swap everything is basically nothing; you just swap a few pointers, or a single pointer to an array/structure of current/working data for the frame.

    On the PS3 some data copying and moving will be required, but this is mandatory due to design anyway and assisted by an extremely smart and powerful DMAC.

    One place to be critical about is message passing between objects since it requires posting (writing) data to be picked up by another object. But the time to lock/post/unlock a queue is negligible compared to the time it takes to process the results leading up to the creation of the message. This is similar to the D3D notion of doing as much as you can before you lock and only do the minimal work needed inside the lock and unlock as quickly as possible.
  • GhandiInstinct - Tuesday, November 7, 2006 - link

    Jarred Walton,

    My question: Will Valve's games in 2007 be released with specificaitons such as: "For minimum requirements you need a dual-core cpu, for maximum results you need a quad-core" or anything to that nature? Because I seem to be confused in what Valve is working on dual or quad or both or neither or something different, and what I should get to best utilize their games and multi-core software in general.

    Thanks.
  • JarredWalton - Tuesday, November 7, 2006 - link

    Episode Two should come out sometime in 2007, and before that happens you will get the multithreading patch affecting previous Source engine titles. Right now, it doesn't sound like anything released in the next year or so from valve is going to require dual cores. That's what I was trying to get out on the conclusion page where I mentioned that they are targeting an "equivalent experience" regardless of what sort of processor you are running.

    So just like you could turn down the level of detail in Half-Life 2 and run it on DX8 or even DX7 hardware, Source engine should be able to accommodate single core processors all the way up through N-core processors. The engine will spawn as many threads as you have processor cores, with one main thread serving as the controller and N - 1 helper threads. Xbox 360 for example would have 5 helper threads plus the master thread, because it has three course each capable of executing to threads simultaneously.
  • Patrese - Tuesday, November 7, 2006 - link

    Great article, good to see dual-quad cores being used for something in games. By the way, the kitchen examples made me hungry... :)

Log in

Don't have an account? Sign up now