Valve Hardware Day 2006 - Multithreaded Edition

Author: Jarred Walton

by Jarred Walton on November 7, 2006 6:00 AM EST


Threading Models

There are four basic threading models available for computer programmers. Some of them are easier to implement than others, and some of them are more beneficial for certain types of work than others. The four threading models are:

Single Threading
Coarse Threading
Fine-Grained Threading
Hybrid Threading

Single threading is what everyone generally understands. You write a program as a sequential series of events that are executed. It's easy to do, and this is what most programmers learn in school or when they first start "hacking". Unfortunately, single threading is becoming obsolete for a lot of tasks. Simple programs that generally wait for user input or other events can still get by with taking this approach, but any application that depends on CPU computational power is going to miss out on a large amount of performance if it isn't designed to run in a multithreaded fashion. There's not much more to say about single threading.

The next step up from single threading is coarse threading. Next to avoiding threads altogether, this is the simplest approach to multithreading. The idea is basically to find discrete systems of work that can be done separately from each other. The example we gave earlier of two cooks working on a single dish, with each person preparing a portion of the dish, is an example of coarse threading. If one person is slicing up and grilling some chicken while a second person is preparing a Hollandaise sauce, you have two separate systems that can easily be accomplished at the same time.

A classic example of this in the computing world is an application that uses a client-server architecture. You would simply run the client code as one thread and the server code as a second thread, each would handle certain tasks, there would be a minimal amount of interaction between the two threads, and hopefully you would get a net increase in performance. This type of architecture is in fact what has allowed two of the current multithreaded games to begin to take advantage of multiple processors. Quake 4 and Call of Duty 2 both stem from id Software code that used a client-server design. The server handles such tasks as networking code, artificial intelligence and physics calculations, and perhaps a few other tasks related to maintaining the world state. Meanwhile the client handles such things as user input, graphics rendering, audio and some other tasks as well. (We might have the breakdown wrong, but it's the general idea we're interested in.)
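To make the client-server split concrete, here is a minimal Python sketch of two coarse threads that only talk through a pair of queues. It is not how Quake 4 or Call of Duty 2 are actually written; the function names (`run_server`, `run_client`, `play`) and the one-number "world state" are made up for illustration, and the lockstep exchange is there only to keep the output predictable. A real engine would let the two sides overlap far more.

```python
import queue
import threading

def run_server(commands, snapshots, ticks):
    """Hypothetical 'server' system: owns the world state and advances it."""
    position = 0
    for _ in range(ticks):
        position += commands.get()   # wait for this tick's input from the client
        snapshots.put(position)      # publish the updated world state

def run_client(commands, snapshots, inputs, rendered):
    """Hypothetical 'client' system: sends input, then 'renders' the reply."""
    for move in inputs:
        commands.put(move)
        rendered.append(f"frame at x={snapshots.get()}")

def play(inputs):
    """Run both systems on their own threads; the queues are their only contact."""
    commands, snapshots = queue.Queue(), queue.Queue()
    rendered = []
    server = threading.Thread(target=run_server,
                              args=(commands, snapshots, len(inputs)))
    client = threading.Thread(target=run_client,
                              args=(commands, snapshots, inputs, rendered))
    server.start()
    client.start()
    server.join()
    client.join()
    return rendered

print(play([1, 2, 3]))
```

The point of the design is that each system keeps its own data and the interaction is limited to a couple of well-defined hand-off points, which is what makes coarse threading comparatively easy to retrofit.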

Depending on the game, breaking up a client-server design into two separate threads can improve performance by anywhere from ~0% to 20% or more. 0% - as in no gain at all? Unfortunately, as many of you are already aware, many games are bottlenecked by one specific area: graphics performance. If you reach the point where the graphics card is slowing everything else down, the CPU might have more than enough time available to run all of the other tasks while waiting for the graphics card to finish its work. If that's the case, having a faster CPU -- or even more CPU cores -- would provide little or no benefit in a coarse threaded client-server game. This is why Quake 4 and Call of Duty 2 see performance increases from enabling SMP support that start out high when you run at lower resolutions using fast graphics cards, but as the resolution increases and the GPU becomes the bottleneck, the benefit of SMP begins to disappear. Note also that both of these games only use two threads, so if you have something like a quad core processor or two dual core processors, you still aren't able to effectively use even half of the total computational resources available.
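The GPU bottleneck can be illustrated with a toy frame-time model in Python. Everything here is an assumption made for the sake of the example: the millisecond figures are invented, CPU and GPU work are assumed to overlap completely, and the CPU work is assumed to split evenly across threads, which a real client-server split will not achieve.

```python
def frame_time_ms(cpu_ms, gpu_ms, cpu_threads=1):
    """Toy model: the slower of the CPU side and the GPU side sets the frame time."""
    return max(cpu_ms / cpu_threads, gpu_ms)

def smp_gain_percent(cpu_ms, gpu_ms, cpu_threads=2):
    """Percentage frame-rate gain from spreading the CPU work over more threads."""
    before = frame_time_ms(cpu_ms, gpu_ms, 1)
    after = frame_time_ms(cpu_ms, gpu_ms, cpu_threads)
    return (before / after - 1) * 100

# Hypothetical low-resolution case: the CPU (12 ms) is the limit, the GPU needs 10 ms.
print(round(smp_gain_percent(cpu_ms=12, gpu_ms=10), 1))   # about a 20% gain

# Hypothetical high-resolution case: the GPU now needs 20 ms per frame.
print(round(smp_gain_percent(cpu_ms=12, gpu_ms=20), 1))   # no gain at all
```

In the second case the extra thread finishes the CPU work sooner, but the frame still cannot be shown until the graphics card is done, so the gain disappears.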

This is not to say that coarse threading can't be advantageous. If you have an application with two tasks that need to be computed, and both tasks take the same amount of time -- or nearly the same amount of time -- using a coarse threaded approach allows you to easily utilize all of the available CPU power. There are many tasks that fall into this category, although likewise there are plenty of tasks that do not fit this model. It is still a better solution than single threading in a multiprocessor world, but in many instances it's a baby step in the right direction.

Fine-grained threading takes the opposite approach to coarse threading. Instead of looking for entire systems that can be broken into separate threads, fine-grained threading looks at a single task and tries to break it into smaller pieces. A couple of the best examples of fine-grained threading providing a dramatic increase in performance are 3D rendering applications and video encoding applications. If you try to take a coarse threading approach to 3D rendering, you will usually end up with one system that handles the actual 3D rendering, and it will consume well over 90% of the application time. The same can be said of video encoding. Luckily, both tasks -- 3D rendering and video encoding -- are relatively simple to break up into smaller pieces. You simply take a divide-and-conquer approach, so if you have two processor cores you split the scene in half, or with four cores you split the scene into four pieces, etc. You can further optimize things by allowing any cores that finish early to begin working on a portion of one of the unfinished sections.
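The divide-and-conquer idea can be sketched in Python with a worker pool. `shade_row` is a made-up stand-in for real rendering work, and the image is split into rows rather than halves or quarters so that a worker that finishes early can pick up another row. Note that in standard CPython builds the global interpreter lock keeps threads from running pure-Python code in parallel, so this shows the structure of the decomposition rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def shade_row(y, width):
    """Stand-in for real per-pixel work: returns one row of fake pixel values."""
    return [(x * y) % 256 for x in range(width)]

def render(width, height, workers):
    """Split the image into rows and let a pool of workers share them out."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Every row is queued as its own task; whichever worker is idle pulls
        # the next unfinished row, and map() returns the rows in image order.
        return list(pool.map(lambda y: shade_row(y, width), range(height)))

image = render(width=4, height=3, workers=2)
print(image)
```

Because the rows are independent, the result is the same no matter how many workers are used; only the time taken should change.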

Thinking back to our kitchen analogy, what exactly is fine-grained threading? The best example we can come up with would be the preparation of a large quantity of food. Let's say you have a large barbecue where there will be hundreds of people present. Everyone wants their food to be warm, and hopefully the food doesn't get overcooked, undercooked, or any one of a number of other potential problems. If you're preparing something like hamburgers or steaks, cooking several at a time makes sense, but there's only so much one person can do. The answer is to add additional barbecues with additional cooks, so they can all be working on making hamburgers at the same time.

The problem with fine-grained threading is that not all tasks easily lend themselves to this approach. There are other types of foods that could be served at a barbecue where you might not need to have a bunch of chefs doing the same task, for example. It is definitely more difficult (overall) than coarse threading, and you can run into problems when you have a lot of work to be done but you don't know exactly how long each piece will take. Performance scaling and resource management also tend to become issues the more you split up a task - you wouldn't want 300 chefs and 300 barbecues, all cooking one hamburger each, right? You have to maintain a balance between spending time breaking a task into smaller pieces versus simply getting the work done. Typically, there is a minimum size below which it's not worth splitting up a task any further, and when you reach this point you just let all of the assigned pieces complete. Things like memory bandwidth can also become a performance bottleneck, so if you have a task that was really easy to break into pieces but each piece demanded a lot of memory bandwidth, you might discover that splitting it into more than four pieces shows no performance improvement on current computer systems due to memory bandwidth limitations.
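The "minimum size" rule can be sketched as a small Python helper. The function name `split_work`, the `min_chunk` parameter, and the choice of aiming for four chunks per worker are all assumptions for illustration (with `min_chunk` expected to be at least 1), not something taken from any particular engine.

```python
def split_work(total_items, workers, min_chunk, chunks_per_worker=4):
    """Return (start, end) ranges: a few chunks per worker, never below min_chunk."""
    # Aim for several chunks per worker so early finishers can grab more work...
    target = -(-total_items // (workers * chunks_per_worker))  # ceiling division
    # ...but stop splitting once pieces would drop below the minimum useful size.
    size = max(min_chunk, target)
    return [(start, min(start + size, total_items))
            for start in range(0, total_items, size)]

# 300 hamburgers, 4 cooks: 16 batches of at most 19 burgers each.
print(len(split_work(300, workers=4, min_chunk=10)))

# 300 hamburgers, 300 cooks: the minimum batch size caps it at 30 batches,
# so most of the extra cooks would have nothing to do.
print(len(split_work(300, workers=300, min_chunk=10)))
```

Picking the minimum chunk size is a tuning decision; too small and the overhead of handing out work dominates, too large and some workers sit idle near the end.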

The net result is that fine-grained threading often has diminishing returns as you add more processors. With 3D rendering for example, having two processor cores is often about 90% faster than having only one processor core, whereas four processor cores is only about 80% faster than two cores, and eight cores might only be 70% faster than four cores. So if one processor takes 20 minutes to complete a 3D rendering task, two cores might be able to complete the task in 11 minutes, four cores might complete the task in 6 minutes, and eight cores could complete the task in only 3.5 minutes. The benefits of fine-grained threading will vary depending on the work that is being performed, and it can place a larger burden on the programmer to get things properly optimized, but the performance increases can still be substantial.
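The arithmetic behind those render times is easy to check with a few lines of Python. The only inputs are the figures quoted above (20 minutes on one core, then 90%, 80%, and 70% gains per doubling); the helper name `render_minutes` is just for this example.

```python
def render_minutes(base_minutes, gains):
    """Apply successive 'X% faster per doubling of cores' gains to a base time."""
    times = [base_minutes]
    for gain in gains:
        times.append(times[-1] / (1 + gain))
    return times

times = render_minutes(20, [0.90, 0.80, 0.70])
for cores, minutes in zip([1, 2, 4, 8], times):
    print(f"{cores} core(s): {minutes:.1f} min")

# Overall speedup on eight cores, and how much of the ideal 8x that represents.
speedup = times[0] / times[-1]
print(f"speedup {speedup:.2f}x, efficiency {speedup / 8:.0%}")
```

This works out to roughly 10.5, 5.8, and 3.4 minutes, in line with the rounded figures in the text, and an overall speedup of about 5.8x on eight cores rather than the ideal 8x.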

Hybrid threading really doesn't bring much new to the table. As the name implies, this is a hybrid approach where you basically take both coarse and fine-grained threading and utilize them as appropriate to the task at hand. This requires the most work, as now you have to deal with synchronizing coarse system threads along with potentially individual threads within the systems. Hybrid threading will typically have more threads to worry about than either of the other approaches. The good news is that when done properly, hybrid threading can make maximum use of the available CPU power.
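As a rough Python sketch of the hybrid idea, the example below runs two coarse "systems" on their own threads, and each system farms its per-item work out to a shared worker pool. The system names (`simulate_physics`, `build_audio`) and their trivial workloads are invented for illustration and are not a description of any real engine's design.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def simulate_physics(bodies, pool):
    """Fine-grained: each body's update is an independent task on the pool."""
    return list(pool.map(lambda v: v + 1, bodies))

def build_audio(samples, pool):
    """Fine-grained: each sample is processed as its own task on the pool."""
    return list(pool.map(lambda s: s * 2, samples))

def run_frame(bodies, samples, workers=4):
    """Coarse: one thread per system, both sharing the same worker pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        def system(name, fn, data):
            results[name] = fn(data, pool)   # each system writes only its own key
        threads = [
            threading.Thread(target=system, args=("physics", simulate_physics, bodies)),
            threading.Thread(target=system, args=("audio", build_audio, samples)),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()   # synchronize the coarse systems before the frame ends
    return results

print(run_frame([1, 2, 3], [10, 20]))
```

Even in this tiny example there are two layers of synchronization to think about: the joins between systems and the task hand-offs inside each one, which is where the extra work of hybrid threading comes from.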

55 Comments

Nighteye2 - Wednesday, November 8, 2006 - link
Ok, so that's how Valve will implement multi-threading. But what about other companies, like Epic? How does the latest Unreal Engine multi-thread?
Justin Case - Wednesday, November 8, 2006 - link
Why aren't any high-end AMD CPUs tested? You're testing 2GHz AMD CPUs against 2.6+ GHz Intel CPUs. Doesn't Anandtech have access to faster AMD chips? I know the point of the article is to compare single- and multi-core CPUs, but it seems a bit odd that all the Intel CPUs are top-of-the-line while all AMD CPUs are low end.
JarredWalton - Wednesday, November 8, 2006 - link
AnandTech? Yes. Jarred? Not right now. I have a 5000+ AM2, but you can see that performance scaling doesn't change the situation. 1MB AMD chips do perform better than 512K versions, almost equaling a full CPU bin - 2.2GHz Opteron on 939 was nearly equal to the 2.4GHz 3800+ (both OC'ed). A 2.8 GHz FX-62 still isn't going to equal any of the upper Core 2 Duo chips.
archcommus - Tuesday, November 7, 2006 - link
It must be a really great feeling for Valve knowing they have the capacity and capability to deliver this new engine to EVERY customer and player of their games as soon as it's ready. What a massive and ugly patch that would be for virtually any other developer.

Don't really see how you could hate on Steam nowadays considering things like that. It's really powerful and works really well.
Zanfib - Tuesday, November 7, 2006 - link
While I design software (so not so much programming as GUI design and whatnot), I can remember my University courses dealing with threading, and all the pain threading can bring.

I predicted (though I'm sure many could say this and I have no public proof) that Valve would be one of the first to do such work, they are a very forward thinking company with large resources (like Google--they want to work on ANYthing, they can...), a great deal of experience and, (as noted in the article) the content delivery system to support it all.

Great article about a great subject, goes a long way to putting to rest some of the fears myself and others have about just how well multi-core chips will be used (with the exception of Cell, but after reading a lot about Cell's hardware I think it will always be an insanely difficult chip to code for).
Bonesdad - Tuesday, November 7, 2006 - link
mmmmmmmmm, chicken and mashed potatoes....
Aquila76 - Tuesday, November 7, 2006 - link
Jarred, I wanted to thank you for explaining in terms simple enough for my extremely non-technical wife to understand why I just bought a dual-core CPU! That was a great progression on it as well, going through the various multi-threading techniques. I am saving that for future reference.
archcommus - Tuesday, November 7, 2006 - link
Another excellent article, I am extremely pleased with the depth your articles provide, and somehow, every time I come up with questions while reading, you always seem to answer exactly what I was thinking! It's great to see you can write on a technical level but still think like a common reader so you know how to appeal to them.

With regards to Valve, well, I knew they were the best since Half-Life 1 and it still appears to be so. I remember back in the days when we weren't even sure if Half-Life 2 was being developed. Fast forward a few years and Valve is once again revolutionizing the industry. I'm glad HL2 was so popular as to give them the monetary resources to do this kind of development.

Right now I'm still sitting on a single core system with XP Pro and have lots of questions bustling in my head. What will be the sweet spot for Episode 2? Will a quad core really offer substantially better features than a dual core, or a dual core over a single core? Will Episode 2 be fully DX10, and will we need DX10 compliant hardware and Vista by its release? Will the rollout of the multithreaded Source engine affect the performance I already see in HL2 and Episode 1? Will Valve actually end up distributing different versions of the game based on your hardware? I thought that would not be necessary due to the fact that their engine is specifically designed to work for ANY number of cores, so that takes care of that automatically. Will having one core versus four make big graphical differences or only differences in AI and physics?

Like you said yourself, more questions than answers at this point!
archcommus - Tuesday, November 7, 2006 - link
One last question I forgot to put in. Say it was somehow possible to build a 10 or 15 GHz single core CPU with reasonable heat output. Would this be better than the multi-core direction we are moving towards today? In other words, are we only moving to multi-core because we CAN'T increase clock speeds further, or is this the preferred direction even if we could?
saratoga - Tuesday, November 7, 2006 - link
You got it.

A higher clock speed processor would be better, assuming performance scaled well enough anyway. Parallel hardware is less general than serial hardware at increasing performance because it requires parallelism to be present in the workload. If the work is highly serial, then adding parallelism to the hardware does nothing at all. Conversely, even if the workload is highly parallel, doubling serial performance still doubles performance. Doubling the width of a unit could double the performance of that unit for certain workloads, while doing nothing at all for others. In general, if you can accelerate the entire system equally, doubling serial performance will always double program speed, regardless of the program.

That's the theory anyway. Practice says you can only make certain parts faster. So you might get away with doubling clock speed, but probably not halving memory latency, so your serial performance doesn't scale like you'd hope. Not to mention increasing serial performance is extremely expensive compared to parallel performance. But if it were possible, no one would ever bother with parallelism. It's a huge pain in the ass from a software perspective, and it's becoming big now mostly because we're starting to run out of tricks to increase serial performance.
