Original Link: http://www.anandtech.com/show/2114
Valve Hardware Day 2006 - Multithreaded Edition
by Jarred Walton on November 7, 2006 6:00 AM EST
Valve Hardware Day 2006
Last week Valve Software invited us up to their headquarters in Bellevue, Washington for their Hardware Day event. Valve usually has some pretty interesting stuff to show off, and they are one of several companies that consistently push the boundaries of what your computer hardware can do. As creator of the Half-Life series, Valve is one of the most respected gaming software companies around. Their Steam distribution network has also garnered quite a bit of attention over the years. Last year, the big news was the new HDR rendering that Valve added to their Source engine. So what has Valve been up to for the past year, and how will it affect the future of gaming?
The man himself, Gabe Newell
Some of you may recall some of the statements Valve Software founder Gabe Newell made in regards to the next generation platforms and the move towards multi-processor systems. The short summary is that creating efficient and powerful multithreaded code is extremely difficult, and there's a very real possibility that developers will need to throw away a lot of their existing code base. Both of these things are drawbacks for creating multithreaded games, but there are also important benefits. Perhaps the most important advantage is that if you need additional processing power in the near future, you are much more likely to get it by tapping into the power of multiple processors rather than waiting for clock speeds to increase.
While it is listed as a challenge, one of the points made by Valve is that computer games are typically designed to make maximum use of your system. "You're doing a disservice to the customer if you're not using all of the CPU power." Some might disagree with that sentiment, but at some point the choice has to be made between a game that looks better and/or performs faster and one that uses less computational resources. Then there's a secondary consideration: do you want to make a computer game that merely takes advantage of additional processing cores to enhance the gaming experience, or should a multi-core CPU be required? There are still a large number of single core processors in use today, and many of those people might be unhappy if they were forced to upgrade.
Tom Leonard, Multithreading Project Lead
The costs and challenges associated with creating multithreaded games help to explain why the previous gaming support for multiple processors has been limited at best, but with all of the major processor vendors moving towards multi-core chips, the installed user base has finally become large enough that it makes sense to invest the time and effort into creating a powerful multithreaded gaming engine. Valve Software set out to do exactly that over the past year. The efforts have been spearheaded by Valve programmer Tom Leonard, whose past experience includes work on C++ compilers, system utilities, and artificial intelligence among other things. Other Valve employees that have helped include Jay Stelly, Aaron Seeler, Brian Jacobson, Erik Johnson, and Gabe Newell.
Perhaps the most surprising thing is how much has been accomplished in such a relatively short time, and Valve Software provided the attendees with a couple benchmark applications to demonstrate the power of multi-core systems. Before we get to the actual performance of these utilities, however, let's take a look at what multithreading actually means, the various approaches that can be taken, and the areas that stand to benefit the most. There are of course many ways to accomplish any given task, but for this article we are primarily concerned with Valve's approach to multithreading and what it means to the gaming community.
What Is Multithreading?
Before we get into discussion of how to go about multithreading, it may be beneficial for some if we explain what multithreading means. Most people who use computers are now familiar with the term multitasking. As the name implies, this involves running multiple tasks at the same time. This can be done either in the real world or on computers, and depending on what you're doing you may experience an overall increase in productivity by multitasking.
For example, let's say you're cooking dinner and it will consist of three dishes: roasted chicken, mashed potatoes, and green beans. If you were to tackle this task without any multitasking, you would first cook the chicken, then the potatoes, and finally the green beans. Unfortunately, by the time you're finished cooking the green beans, you might discover that the chicken and potatoes are already cold. So you decide to multitask and do all three at once: first you start boiling some water on the stove for the potatoes, while doing that you pull the chicken out of the refrigerator and place it into a pan and start heating the oven. Then you peel the potatoes. By now the water is boiling, so you put the potatoes into the water and let them cook. The oven is also preheated now, so you put the chicken in and let it begin cooking. The beans won't take too long to cook, so just wash them off and set them to the side for now. Eventually the potatoes are finished cooking, but before finishing those you put the green beans in a steamer and put them on the stove. Then you drain the potatoes and mash them up, add butter and whatever else you want, and now both the beans and chicken are done as well. You put everything onto plates, serve it up, and you're finished.
What's interesting to note is that the above description does not actually involve doing two things at once. Instead, you are actually doing portions of each task and then while you're waiting for certain things to complete you work on other tasks. On the classic single processor computer system, the same situation applies: the processor never really does two things at once; it just switches rapidly between various applications, giving each of them a portion of the available computational power. In order to actually do more than one thing at a time, you need more cooks in the kitchen, or else you need more processors. In the case of our example, you might have two people working on dinner, allowing more elaborate dishes to be prepared along with additional courses. Now while one person works on preparing the main three dishes we mentioned above, a second person could work on something like an appetizer and a dessert.
You could potentially even add more people, so you might have five people each preparing a single dish for a five course meal. Slightly trickier would be to have multiple people working on each dish. Rather than doing something mundane like grilled chicken, you could have a chicken dish with various other items to liven it up, along with a sauce. In extremely complex dishes, you could even break down a dish into more steps that various individuals could work on completing. Obviously, more can be accomplished as you add additional people, but you also run the risk of becoming less efficient so that some people might only be busy half the time.
We started with talking about multitasking, but the last example began to get into the concept of multithreading. In computer terminology, a "thread" is basically a portion of a program that needs to be executed. If you have a task that is computationally intensive and it is written as a single threaded application, it can only take advantage of a single processor core. Running two instances of such an application would allow you to use two processor cores, but if you only need to run one instance you need to figure out a way to take advantage of the additional computational power available. Multithreading is what is required, and in essence it involves breaking a task into two or more pieces which can be solved simultaneously.
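To make "breaking a task into two or more pieces" a little more concrete, here is a minimal sketch in Python. The function names and the sum-of-squares workload are our own illustration, not anything Valve showed. One caveat: in standard CPython builds the global interpreter lock keeps pure-Python threads from truly running at the same time, so this shows the structure of the approach rather than a guaranteed speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(numbers):
    """One piece of work; it runs entirely on whichever thread picks it up."""
    return sum(n * n for n in numbers)


def threaded_sum_of_squares(numbers, pieces=2):
    """Split the input into roughly `pieces` slices and handle each on its own thread."""
    size = max(1, (len(numbers) + pieces - 1) // pieces)  # ceiling division
    slices = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        partial_sums = list(pool.map(sum_of_squares, slices))
    # The pieces are independent, so combining them is a simple final step.
    return sum(partial_sums)


data = list(range(1000))
same_answer = threaded_sum_of_squares(data, pieces=2) == sum_of_squares(data)
```

The single threaded and multithreaded versions produce the same answer; the only difference is that the second one gives additional cores something to do.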
Where multitasking can be important whether or not you have multiple processor cores available, multithreading really only begins to become important when you have the ability to execute more than one thread at a time. If you have a single core processor, multithreading simply adds additional overhead while the processor spends time switching between threads, and it is often better to run most tasks as a single thread on such systems. It's also worth noting that it becomes much easier to write and debug programming code when it is running as a single threaded application, because you know exactly in what order each task will execute.
We will return to this "cooks in the kitchen" example a bit more when we talk about the various types of threading environments. It's a bit simplistic, but hopefully it gives you a bit better idea about what goes on inside computer programs and what it means to break up a task into threads.
There are four basic threading models available for computer programmers. Some of them are easier to implement than others, and some of them are more beneficial for certain types of work than others. The four threading models are:
- Single Threading
- Coarse Threading
- Fine-Grained Threading
- Hybrid Threading
The next step up from single threading is to look at coarse threading. Next to avoiding threads altogether, this is the simplest approach to multithreading. The idea is basically to find discrete systems of work that can be done separately from each other. The earlier example of two cooks working on a single dish, with each person preparing a portion of it, is an example of coarse threading. If one person is working on slicing up and grilling some chicken while a second person is preparing a Hollandaise sauce, you have two separate systems that can easily be accomplished at the same time.
A classic example of this in the computing world is an application that uses a client-server architecture. You would simply run the client code as one thread and the server code as a second thread, each would handle certain tasks, there would be a minimal amount of interaction between the two threads, and hopefully you would get a net increase in performance. This type of architecture is in fact what has allowed two of the current multithreaded games to begin to take advantage of multiple processors. Quake 4 and Call of Duty 2 both stem from id Software code that used a client-server design. The server handles such tasks as networking code, artificial intelligence and physics calculations, and perhaps a few other tasks related to maintaining the world state. Meanwhile the client handles such things as user input, graphics rendering, audio and some other tasks as well. (We might have the breakdown wrong, but it's the general idea we're interested in.)
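As a rough sketch of the idea, the Python below runs a "server" and a "client" as two threads that only talk through a queue. Everything here (the function names, the one-dimensional "world", the snapshot format) is a hypothetical stand-in for illustration; it is not how Quake 4, Call of Duty 2, or Source actually divide their work.

```python
import queue
import threading


def server(ticks, snapshots):
    """Simulation side: advance the world and publish one snapshot per tick."""
    position = 0.0
    velocity = 1.5
    for tick in range(ticks):
        position += velocity  # stand-in for physics and AI work
        snapshots.put((tick, position))
    snapshots.put(None)  # sentinel: no more snapshots are coming


def client(snapshots, frames):
    """Presentation side: consume snapshots and 'render' them."""
    while True:
        snapshot = snapshots.get()
        if snapshot is None:
            break
        tick, position = snapshot
        frames.append(f"frame {tick}: x={position:.1f}")


def run(ticks=3):
    snapshots = queue.Queue()  # the only point of contact between the two threads
    frames = []
    threads = [
        threading.Thread(target=server, args=(ticks, snapshots)),
        threading.Thread(target=client, args=(snapshots, frames)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return frames
```

Note that there are exactly two threads no matter how many cores are available, which is the limitation discussed next.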
Depending on the game, breaking up a client-server design into two separate threads can provide a performance gain of anywhere from ~0% to 20% or more. 0% - as in no gain at all? Unfortunately, as many of you are already aware, many games are bottlenecked by one specific area: graphics performance. If you reach the point where the graphics card is slowing everything else down, the CPU might have more than enough time available to run all of the other tasks while waiting for the graphics card to finish its work. If that's the case, having a faster CPU -- or even more CPU cores -- would provide little or no benefit in a coarse threaded client-server game. This is why Quake 4 and Call of Duty 2 see performance increases for enabling SMP support that start out high when you run at lower resolutions using fast graphics cards, but as the resolution increases and the GPU becomes the bottleneck, the benefit of SMP begins to disappear. Note also that both of these games only use two threads, so if you have something like a quad core processor or two dual core processors, you still aren't able to effectively use even half of the total computational resources available.
This is not to say that coarse threading can't be advantageous. If you have an application with two tasks that need to be computed, and both tasks take the same amount of time -- or nearly the same amount of time -- using a coarse threaded approach allows you to easily utilize all of the available CPU power. There are many tasks that fall into this category, although likewise there are plenty of tasks that do not fit this model. It is still a better solution than single threading in a multiprocessor world, but in many instances it's a baby step in the right direction.
Fine-grained threading takes the opposite approach of coarse threading. Instead of looking for entire systems that can be broken into separate threads, fine-grained threading looks at a single task and tries to break it into smaller pieces. A couple of the best examples of fine-grained threading providing a dramatic increase in performance are 3D rendering applications and video encoding applications. If you try to take a coarse threading approach to 3D rendering, you will usually end up with one system that handles the actual 3D rendering, and it will consume well over 90% of the application time. The same can be said of video encoding. Luckily, both tasks -- 3D rendering and video encoding -- are relatively simple to break up into smaller pieces. You simply take a divide and conquer approach, so if you have two processor cores you split the scene in half, or with four cores you split the scene into four pieces, etc. You can further optimize things by allowing any cores that finish early to begin working on a portion of one of the unfinished sections.
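The divide and conquer approach can be sketched as follows. This is our own toy "renderer" in Python, with a made-up shading function and parameter names. It cuts the image into more bands than there are workers, so a worker that finishes early simply picks up the next unfinished band, which is one simple way to get the load balancing described above.

```python
from concurrent.futures import ThreadPoolExecutor


def shade_row(y, width):
    """Stand-in for the real work of rendering one row of pixels."""
    return [(x * y) % 256 for x in range(width)]


def render_band(band, width):
    """Render a contiguous band of rows; bands never overlap."""
    return [(y, shade_row(y, width)) for y in band]


def render(width, height, workers=4, bands_per_worker=4):
    # More bands than workers: idle workers pull the next band from the pool.
    band_count = workers * bands_per_worker
    band_height = max(1, -(-height // band_count))  # ceiling division
    bands = [range(top, min(top + band_height, height))
             for top in range(0, height, band_height)]
    image = [None] * height
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for rendered in pool.map(lambda band: render_band(band, width), bands):
            for y, row in rendered:
                image[y] = row
    return image
```

As with any Python threading example, standard CPython will not run these bands on several cores at once; the point is how the scene is carved up, not the measured speed.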
Thinking back to our kitchen analogy, what exactly is fine-grained threading? The best example we can come up with would be the preparation of a large quantity of food. Let's say you have a large barbecue where there will be hundreds of people present. Everyone wants their food to be warm, and hopefully the food doesn't get overcooked, undercooked, or any one of a number of other potential problems. If you're preparing something like hamburgers or steaks, cooking several at a time makes sense, but there's only so much one person can do. The answer is to add additional barbecues with additional cooks, so they can all be working on making hamburgers at the same time.
The problem with fine-grained threading is that not all tasks easily lend themselves to this approach. There are other types of foods that could be served at a barbecue where you might not need to have a bunch of chefs doing the same task, for example. It is definitely more difficult (overall) than coarse threading, and you can run into problems when you have a lot of work to be done but you don't know exactly how long each piece will take. Performance scaling and resource management also tend to become issues the more you split up a task - you wouldn't want 300 chefs and 300 barbecues, all cooking one hamburger each, right? You have to maintain a balance between spending time breaking a task into smaller pieces versus simply getting the work done. Typically, there is a minimum size beyond which it's not worth splitting up a task any further, and when you reach this point you just let all of the assigned pieces complete. Things like memory bandwidth can also become a performance bottleneck, so if you have a task that was really easy to break into pieces but each piece demanded a lot of memory bandwidth, you might discover that more than four pieces shows no performance improvements on current computer systems due to memory bandwidth limitations.
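The "minimum size" rule is easy to express in code. The helper below is a hypothetical sketch (the name `split_work` and the default of 64 items are our assumptions, not figures from Valve): it hands out at most one range per worker, but never creates a range smaller than the minimum unless that is all the work there is.

```python
def split_work(item_count, workers, min_chunk=64):
    """Return (start, stop) ranges covering item_count items.

    The number of pieces is capped by both the worker count and the
    minimum chunk size, so tiny jobs are not split at all.
    """
    if item_count <= 0:
        return []
    pieces = max(1, min(workers, item_count // min_chunk))
    base, extra = divmod(item_count, pieces)
    ranges = []
    start = 0
    for i in range(pieces):
        stop = start + base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, stop))
        start = stop
    return ranges
```

With these assumed numbers, 300 hamburgers and 300 available chefs yield only four batches of 75 rather than 300 batches of one.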
The net result is that fine-grained threading often has diminishing returns as you add more processors. With 3D rendering for example, having two processor cores is often about 90% faster than having only one processor core, whereas four processor cores is only about 80% faster than two cores, and eight cores might only be 70% faster than four cores. So if one processor takes 20 minutes to complete a 3D rendering task, two cores might be able to complete the task in 11 minutes, four cores might complete the task in 6 minutes, and eight cores could complete the task in only 3.5 minutes. The benefits of fine-grained threading will vary depending on the work that is being performed, and it can place a larger burden on the programmer to get things properly optimized, but the performance increases can still be substantial.
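The arithmetic behind those render times can be checked with a few lines of Python. The per-doubling gains (90%, 80%, 70%) and the 20 minute baseline come from the paragraph above; the function and variable names are ours.

```python
def render_times(base_minutes, gains):
    """Time after each doubling of cores, given the fractional gain per doubling."""
    times = [base_minutes]
    for gain in gains:
        # "90% faster" means 1.9x the throughput, so the time is divided by 1.9.
        times.append(times[-1] / (1.0 + gain))
    return times


CORES = (1, 2, 4, 8)
TIMES = render_times(20.0, [0.90, 0.80, 0.70])

for cores, minutes in zip(CORES, TIMES):
    speedup = TIMES[0] / minutes
    efficiency = speedup / cores  # fraction of ideal linear scaling
    print(f"{cores} core(s): {minutes:5.2f} min, "
          f"{speedup:4.2f}x, {efficiency:4.0%} of ideal")
```

The times work out to roughly 10.5, 5.8, and 3.4 minutes, which round to the 11, 6, and 3.5 minutes quoted above. Put another way, eight cores in this example deliver about 5.8 times the throughput of one core, or roughly 73% of perfect scaling.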
Hybrid threading really doesn't add much new to the table. As the name implies, this is a hybrid approach where you basically take both coarse and fine-grained threading and utilize them as appropriate to the task at hand. This requires the most work, as now you have to deal with synchronizing coarse system threads along with potentially individual threads within the systems. Hybrid threading will typically have more threads to worry about than either of the other approaches. The good news is that when done properly, hybrid threading can make maximum use of the available CPU power.
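A very small sketch of the hybrid idea, again in Python and again with invented names and placeholder math: two coarse "system" threads run side by side, and each one farms its own items out to a shared worker pool for the fine-grained part.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_frame(particles, bones, workers=4):
    """Run two coarse systems concurrently; each splits its work across one pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:

        def particle_system():
            # Fine-grained: every particle is an independent work item.
            results["particles"] = list(pool.map(lambda p: p + 1, particles))

        def animation_system():
            # Fine-grained: every bone is an independent work item.
            results["bones"] = list(pool.map(lambda b: b * 2, bones))

        # Coarse-grained: the two systems themselves run as separate threads.
        systems = [threading.Thread(target=particle_system),
                   threading.Thread(target=animation_system)]
        for system in systems:
            system.start()
        for system in systems:
            system.join()  # synchronize before the frame is considered done
    return results
```

Even in a toy like this there are two levels of synchronization to get right (joining the systems, and waiting on the pool inside each system), which hints at why hybrid threading requires the most work.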
Building a Threaded Game Engine
Valve Software researched and experimented with all three multithreading models, and some of the results were promising. The first thing they did was to look at coarse threading, being that it is usually the easiest approach. Like Quake 4 and Call of Duty 2, Valve also has a client-server architecture in the Source engine, so that's where they started. The results can be summarized with the following slide:
There were benefits to the work that was done, but long-term such an approach is destined to fall short. If you only have a client and a server thread, you are only using two processor cores at most, and even then it is unlikely that you're making full use of both cores. The decision was ultimately reached that the only approach truly worth pursuing as a long-term solution is hybrid threading. It will require the most effort but it will also give the most benefit. As long as you're already rewriting large portions of your engine, the thinking goes, you might as well make sure you do it right so as to avoid doing the same thing again in the near future. The next step is to come up with a threading framework that will allow the developers to accomplish their goals.
Operating systems and compilers already provide some support for multithreaded programs. In some cases, this support is sufficient, but when it comes to building a real-time gaming engine anything short of a custom threading tool may not provide the desired flexibility. Valve's problems with OS and compiler level threading can be summarized in the following two slides.
We've already stated several times that making multithreaded programs can be difficult, and the problem is exacerbated when you have many programmers -- some of them junior programmers that have never seen threaded code in their life -- all trying to work together. Valve decided that ultimately what they wanted to create was a threading framework and a set of tools that would allow the programmers and content creators to focus on the important thing, actually creating the games, rather than trying to constantly deal with threading issues. Tom Leonard and a couple other helpers were tasked with creating this toolset, and as you can guess by the existence of this article, their efforts are about to reach fruition.
We're not going to try and snowball you with any more information on the low level details at present. Basically, Valve is working hard to provide a toolset that will enable their developers -- as well as mod creators and engine licensees -- to take full advantage of multi-core platforms. They aren't just looking at dual core or even quad core; their goal is to build an engine that is scalable to whatever type of platform users may be running in the future. Of course that doesn't mean they won't be rewriting portions of their engine at some point anyway, but all signs indicate that they should be at the forefront of creating multi-core games and gaming engines.
The Future of Gaming?
The first thing that Valve had to do was look at the threading models available: how can they distribute work to the cores? Having decided on hybrid threading, the next step was to decide what sort of software tools were needed, and to create those tools as necessary. Most of that work should now be done or nearing completion, so the big question is with the threading in place, how are they going to apply the power available? Up to this point we have primarily been concerned with what multithreading is and how Valve has attempted to implement the technology. All this talk about multi-core support sounds good, but without a compelling need for the technology it really isn't going to mean much to most people.
We've already seen quite a few games that are almost entirely GPU bound, so some might argue that the GPU needs to get faster before we even worry about the CPU. As Valve sees things, however, the era of pretty visuals is coming to an end. We have now reached the point where in terms of graphics most people are more than satisfied with what they see. Games like Oblivion look great, but it's still very easy to tell that you're not in a real world. This does not mean that better graphics are not important, but Valve is now interested in taking a look at the rest of the story and what can be done beyond graphics. Valve also feels that their Source engine has traditionally been more CPU limited anyway, so they were very interested in techniques that would allow them to improve CPU performance.
Before we get to the stuff beyond graphics, though, let's take a quick look at what's being done to improve graphics. The hybrid threading approach to the rendering process can be thought of as follows:
At present, everything up to drawing in the rendering pipeline is essentially CPU bound, and even the drawing process can be at least partially CPU bound with tasks such as generating dynamic vertex buffers. The second image above shows the revised pipeline and how you might approach it. For example, spending CPU time building the vertex buffer makes the GPU more efficient, so even in cases where you are mostly GPU limited you can still improve performance by using additional CPU power. Another item related to graphics is the animations and bone transformations that need to be done prior to rendering a scene. Bone transforms can be very time consuming, and as you add more creatures the CPU limitations become more and more prevalent. One of the solutions is to simply cheat, either by reducing the complexity of the animations, cloning one model repeatedly, or by other methods; but with more processor power it becomes possible to actually do full calculations on more units.
A specific Half-Life example that was given is that you might have a scene with 200 Combine soldiers standing in rank, and the animations for that many units requires a huge chunk of time. All of the bone transformations can be done in parallel, so more CPU power available can directly equate to having more entities on screen at the same time.
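The reason this parallelizes so well is that each character's skeleton depends only on its own parent bones. The Python sketch below is a deliberately simplified assumption on our part: bones carry only a 2D offset (no rotation), and the skeleton layout is made up. Within one skeleton the bones are processed in order, but separate skeletons can be handed to separate workers.

```python
from concurrent.futures import ThreadPoolExecutor


def world_positions(skeleton):
    """Resolve one skeleton from local offsets to world positions.

    skeleton: list of (parent_index, (dx, dy)); parents come before their
    children, and the root uses a parent_index of -1.
    """
    world = []
    for parent, (dx, dy) in skeleton:
        px, py = world[parent] if parent >= 0 else (0.0, 0.0)
        world.append((px + dx, py + dy))
    return world


def animate_all(skeletons, workers=4):
    """Each skeleton is independent, so they can all be processed in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(world_positions, skeletons))


# Hypothetical three-bone chain: hip -> spine -> arm.
SOLDIER = [(-1, (0.0, 1.0)), (0, (0.0, 0.5)), (1, (0.25, 0.0))]
squad = animate_all([SOLDIER] * 200)
```

A real engine would use full transformation matrices per bone, but the dependency structure, and therefore the opportunity for parallelism, is the same.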
With more computational power available to do animations, the immersiveness of the game world can also be improved. Right now, the amount of interaction that most creatures have with the environment is relatively limited. If you think about the way people move through the real world, they are constantly bumping into other objects or touching other objects and people. While this would be largely a visual effect, the animations could be improved to show people actually interacting with the environment, so to an onlooker someone running around the side of a house might actually reach out their hand and grab the wall as they turn the corner. Similarly, a sniper lying prone on the top of a hill could actually show their body adjusting to the curvature of the ground rather than simply being a standard "flat" prone position. Two characters running past each other could even bump and react realistically, with arms and bodies being nudged to the side instead of mysteriously gliding past each other.
One final visual aspect that CPU power can influence is the rendering of particle systems. Valve has given us a benchmark that runs through several particle system environments, and we will provide results for the benchmark later. How real world the benchmark is remains to be seen, as it is more of a technology demonstration at present, but the performance increases are dramatic. Not being expert programmers, we also can't say for sure how much of the work that they are doing on the CPU could actually be done elsewhere more efficiently.
Besides the benchmark, though, particle systems can be used to more realistically simulate things like flames and water. Imagine a scene where a campfire gets extinguished by real dynamically generated rain rather than as a canned animation: you could actually see small puffs of smoke and water as individual drops hit the fire, and you might even be able to kick the smoldering embers and watch individual sparks scatter around on the ground. The goal is to create a more immersive world, and the more realistic things look and behave, the more believable the environment. Is all of this necessary? Perhaps not with conventional games that we're used to, but certainly it should open up new gameplay mechanics and that's rarely a bad thing.
Gaming's Future, Continued
So what can be done besides making a world look better? One of the big buzzwords in the gaming industry right now is physics. Practically every new game title seems to be touting amazing new physics effects. Perhaps modern physics are more accurate, but having moderate amounts of physics in a gaming engine is nothing new. Games over a decade ago allowed you to do such things as pick up a stone and throw it (Ultima Underworld), or shoot a bow and have the arrow drop with distance. Even if the calculations were crude, that would still count as having "physics" in a game. Of course, there's a big difference in the amount of physics present in Half-Life and those present in Half-Life 2, and most people would agree that Half-Life 2 was a better experience due to the improved interaction with the environment.
Going forward, physics can only become more important. One of the holy grails of gaming is still the creation of a world that has a fully destructible environment. Why do you need to find a stupid key to get through a wooden door when you're carrying a rocket launcher? Why can't you blow up the side of the building and watch the entire structure fall to the ground, perhaps taking out any enemies that were inside? How about the magical fence that's four inches too tall to jump over - why not just break it down instead of going around? It's true that various games have made attempts in this direction, but it's still safe to say that no one has yet created a gaming environment that allows you to demolish everything as you could in the real world (within reason). Gameplay still needs to play a role in what is allowed, but the more the possibilities for what can be done are increased, the more likely we are to see revolutionary gameplay.
Going along with physics and game world interactions, Valve spoke about the optimizations they've made in a structure called the spatial partition. The spatial partition is essentially a representation of the game world, and it is queried constantly to determine how objects interact. From what we could gather, it is used to allow rough approximations to take place where it makes sense, and it also helps determine where more complex (and accurate) mathematical calculations should be performed. One of the problems traditionally associated with multithreaded programming has been locking access to certain data structures in order to keep the world in a consistent state. For the spatial partition, the vast majority of the accesses are read operations that can occur concurrently, and Valve was able to use lock-free and wait-free algorithms in order to greatly improve performance. A read/write log is used to make sure the return values are correct, and Valve emphasized that the lock-free algorithms were a huge design win when it came to multithreading.
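We don't have the details of Valve's spatial partition, so the following Python sketch is only a generic stand-in: a uniform grid that buckets objects by cell so that a "what is near this point?" query only has to examine a few cells. The class name, cell size, and object names are all our own. It does not implement lock-free or wait-free algorithms or a read/write log; it only illustrates why read-heavy access is attractive, since queries that never modify the structure can run concurrently as long as no writer is active.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class GridPartition:
    """Uniform grid: a rough first pass that narrows down which objects to test."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, name, x, y):
        self.cells[self._cell(x, y)].append((name, x, y))

    def query(self, x, y, radius):
        """Read-only lookup of the names of all objects within radius of (x, y)."""
        cx0, cy0 = self._cell(x - radius, y - radius)
        cx1, cy1 = self._cell(x + radius, y + radius)
        found = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # .get() avoids creating empty cells, so queries never write.
                for name, ox, oy in self.cells.get((cx, cy), ()):
                    if (ox - x) ** 2 + (oy - y) ** 2 <= radius ** 2:
                        found.append(name)
        return sorted(found)


grid = GridPartition(cell_size=10.0)
grid.insert("crate", 5.0, 5.0)
grid.insert("barrel", 12.0, 5.0)
grid.insert("wall", 40.0, 40.0)

# Several readers query the finished structure at the same time.
probes = [(6.0, 5.0, 2.0), (10.0, 5.0, 6.0), (41.0, 41.0, 3.0)]
with ThreadPoolExecutor(max_workers=3) as pool:
    hits = list(pool.map(lambda probe: grid.query(*probe), probes))
```

The grid lookup is the "rough approximation" step; the exact distance check inside the loop plays the role of the more accurate calculation that is only run where it matters.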
Another big area that can stand to see a lot of improvement is artificial intelligence. Oftentimes, AI has a tacked-on feel in current games. You want your adversaries to behave somewhat realistically, but you don't want the game to spend so much computational power figuring out what they should do that everything slows to a crawl. It's one thing to wait a few seconds (or more) for your opponent to make a move in a chess match; it's a completely different story in an action game being rendered at 60 frames per second. Valve discussed the possibilities for having a greater number of simplistic AI routines running, along with a few more sophisticated AI routines (e.g. Alyx in Episode One).
They had some demonstrations of swarms of creatures interacting more realistically with the environment, doing things like avoiding dangerous areas, toppling furniture, swarming opponents, etc. (The action was more impressive than the above screenshots might indicate.) The number of creatures could also be increased depending on CPU power (number of cores as well as clock speed), so where a Core 2 Quad might be able to handle 500 creatures, a single core Pentium 4 could start to choke on only 80 or so creatures.
In the past, getting other creatures in the game world to behave even remotely realistically was sufficient -- "Look, he got behind a rock to get shelter!" -- but there's so much more that can be done. With more computational power available to solve AI problems, we can only hope that more companies will decide to spend the time on improving their AI routines. Certainly, without having spare processor cycles, it is difficult to imagine any action games spending as much time on artificial intelligence as they spend on graphics.
There are a few less important types of AI that could be added as well. One of these is called "Out of Band AI" -- these are AI routines that are independent of the core AI. An example that was given would be a Half-Life 2 scene where Dr. Kleiner is playing chess. They could actually have a chess algorithm running in the background using spare CPU cycles. Useful? Perhaps not that example, unless you're really into chess, but these are all tools to create a more immersive game world, and there is almost certainly someone out there that can come up with more interesting applications of such concepts.
Other Multi-Core Benefits
The benefits of multi-core architectures are not limited to the games themselves. In fact, for many companies the benefits to the in-house developers can be far more important than the improvements that are made to the gaming world. Where actual games have to worry about targeting many different levels of computer systems, the companies that create games often have a smaller selection of high-end systems available. We talked earlier about how content creation is typically a task that can more readily take advantage of multithreading, and the majority of workstation level applications are designed to leverage multiprocessor systems. At the high-end of the desktop computing segment, the line between desktop computers and workstations has begun to blur, especially with the proliferation of dual and now quad core chips. Where in the past one of the major benefits of a workstation was often having dual CPU sockets, you can now get up to four CPU cores without moving beyond a desktop motherboard. There are still reasons to move to a workstation, for example the additional memory capacity that is usually available, but if all you need is more CPU power there's a good chance you can get by with a dual or quad core desktop instead.
Not surprisingly, Valve tends to have computers that are closer to the top end desktop configurations currently available. We asked about what sort of hardware most of their developers were running, and they said a lot of them are using Core 2 Duo systems. However, they also said that they have been holding off upgrading a lot of systems while they waited for Core 2 Quad to become available. They were able to do some initial testing using quad core systems, and for their work the benefits were so tremendous that it made sense to hold off upgrading until Intel launched the new processors. (For those of you that are wondering, on the graphics side of the equation, Valve has tried to stay split about 50-50 between ATI and NVIDIA hardware.)
One of the major tools that Valve uses internally is a service called VMPI (Valve Message Passing Interface). This utility allows Valve to make optimal use of all of the hardware they have present within their offices, somewhat like a distributed computing project, by sending work units to other systems on the network running the VMPI service. Certain aspects of content creation can take a long time, for example the actual compilation (i.e. visibility and lighting calculations) of one of their maps. Anyone that has ever worked with creating levels for just about any first-person shooter can attest to the amount of time this process takes. It still takes a lot of effort to design a map in the first place, but in level design there's an iterative process of designing, compiling, testing, and then back to the drawing board that plays a large role in perfecting a map. The problem is, once you get down to the point where you're trying to just clean up a few last issues, you may only need to spend a couple minutes tweaking a level, and then you need to recompile and test it inside the gaming engine. If you are running on a single processor system -- even one of the fastest single processor systems -- it can take quite a while to recompile a map.
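The work-unit idea behind VMPI can be illustrated with a minimal sketch. This is not Valve's implementation: the function names, the placeholder workload, and the use of a local thread pool in place of networked machines are all assumptions made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_work_unit(unit_id: int) -> int:
    """Stand-in for one slice of a map compile (hypothetical workload)."""
    # A real work unit would run visibility or lighting calculations;
    # here we just do some arithmetic so the example is runnable.
    return sum(i * i for i in range(unit_id * 1000))


def distribute(work_units, workers: int = 4):
    """Hand work units to idle workers and gather the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compile_work_unit, work_units))


results = distribute(range(8))
print(len(results), "work units completed")
```

Threads only stand in for remote machines here. In CPython they will not speed up CPU-bound work, so a real setup would hand the units to separate processes or to other systems on the network, which is the role the text describes for the VMPI service.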
The VMPI service was created to allow Valve to leverage all of the latent computational power that was present in their offices. If a computer is sitting idle, which is often the case for programmers who are staring at lines of code, why not do something useful with the CPU time? Valve joked about how VMPI has become something of a virus around the offices, getting replicated onto all of the systems. (Yes, it can be shut off, for those that are wondering.) Jokes aside, creating a distributed, multithreaded utility to speed up map compilation times has certainly helped the level creators. Valve's internal VRAD testing can be seen below, and we will have the ability to run this same task on individual systems as a benchmark.
Running as a single thread on a Core 2 processor, a 2.67GHz QX6700 is already 36% faster than a Pentium 4 3.2GHz. Enabling multithreading makes the Kentsfield processor nearly 5 times as fast. Looking at distributing the work throughout the Valve offices, 32 old Pentium 4 systems are only ~3 times faster than a single Kentsfield system (!), but more importantly 32 Kentsfield systems are still going to be 5 times faster than the P4 systems. In terms of real productivity, the time it takes Valve's level designers to compile a map can now be reduced to about half a minute, where a couple years back it might have been closer to 30 minutes. Now the level designers no longer have to waste time waiting for the computers to prepare their level for testing; 30 seconds isn't even enough time to run to the bathroom and come back! Over the course of a project, Valve states that they should end up saving "thousands of hours" of time. When you consider how much most employees are being paid, the costs associated with upgrading to a quad core processor could easily be recouped in a year or less. Mod authors with higher end systems will certainly appreciate the performance boost as well. We will take a closer look at performance scaling of the VRAD map compilation benchmark on a variety of platforms in a moment.
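To see how these ratios fit together, here is a rough back-of-the-envelope calculation. It uses only the rounded figures quoted above, and it rests on two assumptions of ours: that "nearly 5 times as fast" is measured against the single Pentium 4, and that the 30 minute figure refers to a single Pentium 4 system.

```python
# Relative throughput, normalised so one single-threaded Pentium 4 3.2GHz = 1.0.
# All ratios are the rounded figures quoted in the text.
P4_SINGLE = 1.0
QX6700_ONE_THREAD = 1.36       # "36% faster" running one thread
QX6700_FOUR_THREADS = 5.0      # "nearly 5 times as fast" (assumed relative to the P4)

p4_farm = 3.0 * QX6700_FOUR_THREADS    # 32 P4 systems are ~3x one Kentsfield
kentsfield_farm = 5.0 * p4_farm        # 32 Kentsfields are ~5x the P4 farm

single_p4_seconds = 30 * 60            # assumed: 30 minutes on one P4
farm_seconds = single_p4_seconds * P4_SINGLE / kentsfield_farm

thread_scaling = QX6700_FOUR_THREADS / QX6700_ONE_THREAD
farm_efficiency = p4_farm / (32 * P4_SINGLE)

print(f"four-thread scaling on QX6700: {thread_scaling:.2f}x")
print(f"32-system P4 farm efficiency: {farm_efficiency:.0%}")
print(f"estimated compile time on 32 Kentsfields: {farm_seconds:.0f} s")
```

Under those assumptions the estimate works out to 24 seconds, which is in line with the "about half a minute" figure. It also suggests that distributing work over 32 machines is far from free, since the Pentium 4 farm delivers less than half of its theoretical 32x throughput.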
Obviously Valve is pretty excited about what can be done with additional processing power, and they have invested a lot of time and resources into building tools that will take advantage of the possibilities. However, Valve is a software developer as opposed to a hardware review site, and our impression is that most of their systems are typical of any business these days: they are purchased from Dell or some other large OEM, which means they are a bit more limited in terms of what kind of hardware is available. That's not to say that Valve hasn't tested AMD hardware, because they have, but as soon as they reached the conclusion that Core 2 Duo/Core 2 Quad would be faster, they probably didn't bother doing a lot of additional testing. We of course are more interested in seeing what these new multiprocessor benchmarks can tell us about AMD and Intel hardware -- past, present, and future -- and we plan on utilizing these tests in future articles. As a brief introduction to these benchmark utilities, however, we thought it would be useful to run them on a few of our current platforms to see how they fare.
In the interest of time, we did not try to keep all of the tested platforms identical in terms of components. Limited testing did show that the processor is definitely the major bottleneck in both benchmarks, with a variance between benchmark runs of less than 5% on all platforms. Besides the processor, the only other area that seems to have any significant impact on benchmark performance is memory bandwidth and timings. We tested both benchmarks three times on each platform, then we threw out the high and low scores and took the remaining median score. In many instances, the first run of the particle simulation benchmark was slightly slower than the next two runs, which were usually equal in performance. The variability between benchmark runs of the map compilation test was less than 1%, so the results were very consistent.
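The scoring method described above is simply the median of three runs. A small sketch, with made-up scores (the numbers below are hypothetical, not our measured results):

```python
from statistics import median


def reported_score(runs):
    """Median of three runs: the same as dropping the high and low scores."""
    if len(runs) != 3:
        raise ValueError("expected exactly three benchmark runs")
    return median(runs)


def spread_percent(runs):
    """Run-to-run variation as a percentage of the lowest score."""
    return (max(runs) - min(runs)) / min(runs) * 100


# Hypothetical particle-simulation scores: the first run is slightly slower.
runs = [41, 43, 43]
print(reported_score(runs), f"{spread_percent(runs):.1f}%")
```

Taking the median rather than the mean means a single slow first run, as we saw in the particle simulation test, does not pull the reported score down.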
Here are the details of the tested systems.
|Athlon 64 3200+ 939|
|CPU||Athlon 64 3200+ (939) - 2.0GHz 512K; OC 3200+ @ 10x240 HTT = 2.40GHz|
|Motherboard||ASUS A8N-VM CSM - nForce 6150|
|Memory||2x1GB OCZ OCZ5001024EBPE - DDR-400 2-3-2-7 1T; OC DDR-480 3-3-2-7 1T|
|HDD||Seagate SATA3.0Gbps 7200.9 250GB 8MB cache 7200 RPM|
|Athlon X2 3800+ 939|
|CPU||Athlon X2 3800+ (939) - 2.0GHz 2x512K; OC 3800+ @ 10x240 HTT = 2.40GHz|
|Motherboard||ASUS A8R32-MVP - ATI Xpress 3200|
|Memory||2x1GB OCZ OCZ5001024EBPE - DDR-400 2-3-2-7 1T; OC DDR-480 3-3-2-7 1T|
|HDD||Western Digital SATA3.0Gbps SE16 WD2500KS 250GB 16MB cache 7200 RPM|
|Athlon X2 3800+ AM2|
|CPU||Athlon X2 3800+ (AM2) - 2.0GHz 2x512K; OC 3800+ @ 10x240 HTT = 2.40GHz|
|Motherboard||Foxconn C51XEM2AA - nForce 590 SLI|
|Memory||2x1GB Corsair PC2-8500C5 - DDR2-800 4-4-4-12; OC DDR2-960 4-4-4-12|
|HDD||Western Digital SATA3.0Gbps SE16 WD2500KS 250GB 16MB cache 7200 RPM|
|Core 2 Duo E6700 NF570|
|CPU||Core 2 Duo E6700 - 2.67GHz 4096K; OC E6700 @ 10x320 FSB = 3.20GHz|
|Motherboard||ASUS P5NSLI - nForce 570 SLI for Intel|
|Memory||2x1GB Corsair PC2-8500C5 - DDR2-800 4-4-4-12; OC DDR2-960 4-4-4-12|
|HDD||Western Digital Raptor 150GB 16MB 10000 RPM|
|Core 2 Quad QX6700 975X|
|CPU||Core 2 Quad QX6700 - 2.67GHz 2 x 4096K; OC QX6700 @ 10x320 FSB = 3.20GHz|
|Motherboard||ASUS P5W DH Deluxe - 975X|
|Memory||2x1GB Corsair PC2-8500C5 - DDR2-800 4-4-4-12; OC DDR2-960 4-4-4-12|
|HDD||2 x Western Digital Raptor 150GB in RAID 0|
|Pentium D 920 945P|
|CPU||Pentium D 920 - 2.8GHz 2 x 2048K; OC 920 @ 14x240 FSB = 3.36GHz|
|Motherboard||ASUS P5LD2 Deluxe - 945P|
|Memory||2x1GB Corsair PC2-8500C5 - DDR2-667 4-4-4-12; OC DDR2-800 4-4-4-12|
|HDD||Western Digital SATA3.0Gbps SE16 WD2500KS 250GB 16MB cache 7200 RPM|
We did test all of the systems with the same graphics card configuration, just to be consistent, but it really made little to no difference. On the Athlon 64 configuration, for example, we got the same results using the integrated graphics as we got with the X1900. We also tested at different resolutions, and found once again that on the graphics cards we used resolution seemed to have no impact on the final score. 640x480 generated the same results as 1920x1200, even when enabling all of the eye candy at the high resolution and disabling everything at the low resolution. To be consistent, all of the benchmarking was done at the default 1024x768 0xAA/8xAF. We tried to stay consistent on the memory that we used within each memory type (DDR or DDR2). The Pentium D test system had issues, however, and would not run the particle simulation benchmark. Finally, to give a quick look at performance scaling, we overclocked all of the tested systems by 20%.
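As a quick sanity check on the 20% figure, the overclocked speeds in the tables above follow from multiplier times base clock. The multipliers and overclocked base clocks are taken from the tables; the stock base clocks (200MHz for the Athlon and Pentium D parts, 266.67MHz for Core 2) are the standard values for these processors rather than something listed in the tables.

```python
# (name, multiplier, stock base clock in MHz, overclocked base clock in MHz)
SYSTEMS = [
    ("Athlon 64 3200+ / X2 3800+", 10, 200.0, 240.0),
    ("Core 2 Duo E6700 / Core 2 Quad QX6700", 10, 800.0 / 3, 320.0),
    ("Pentium D 920", 14, 200.0, 240.0),
]


def overclock_percent(stock_mhz: float, oc_mhz: float) -> float:
    """Percentage increase of the overclocked clock over stock."""
    return (oc_mhz / stock_mhz - 1.0) * 100.0


for name, mult, stock_base, oc_base in SYSTEMS:
    stock, oc = mult * stock_base, mult * oc_base
    print(f"{name}: {stock:.0f}MHz -> {oc:.0f}MHz "
          f"(+{overclock_percent(stock, oc):.0f}%)")
```

Because the multiplier stays fixed, raising the base clock by 20% raises the memory clock by roughly the same ratio, which is why the tables also list faster memory speeds for the overclocked runs.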
For now we are merely providing a short look at what Valve has been working on and some preliminary benchmarks. We intend to use these benchmarks on some future articles as well where we will provide a look at additional system configurations. Note that performance differences of one or two points should not be taken as significant in the particle simulation test, as the granularity of the reported scores is relatively coarse.
Particle Systems Benchmark
The more meaningful of the two benchmarks in terms of end users is going to be the particle simulation benchmark, as this has the potential to actually impact gameplay. The only problem is that the map is a contrived situation with four rooms each showing different particle system simulations. As proof that simulating particle systems can require a lot of CPU processing power, and that Valve can multithread the algorithms, the benchmark is meaningful. How it will actually impact future gaming performance is more difficult to determine. Also note that particle systems are only one aspect of game engine performance that can use more processing cores; artificial intelligence, physics, animation, and other tasks can benefit as well, and we look forward to the day when we have a full gaming benchmark that can simulate all of these areas rather than just particle systems. For now, here's a quick look at the particle system performance results.
There are several interesting things we get from the particle simulation benchmark. First, it scales almost linearly with the number of processor cores, so the Core 2 Quad system ends up being twice as fast as the Core 2 Duo system when running at the same clock speed. We will take a look at how CPU cache and memory bandwidth affect performance in the future, but at present it's pretty clear that Core 2 once again holds a commanding performance lead over AMD's Athlon 64/X2 processors. As for Pentium D, we repeatedly got a program crash when trying to run it, even with several different graphics cards. There's no reason to assume it would be faster than Athlon X2, though, and we did get results with Pentium D on the other test.
Athlon X2 performed the same, more or less, whether running on 939 or AM2 - even with high-end DDR2-800 memory. Our E6700 test system generated inconsistent results when overclocked, likely due to limitations with the nForce 570 SLI chipset. For most of the platforms, the 20% overclock brought on average a 20% performance increase, showing again that we are essentially completely CPU limited. The lack of granularity makes the scores vary slightly from 20% but it's close enough for now. Finally, taking a look at Athlon 64 vs. X2 on socket 939, the second CPU core improves performance by ~90%.
VRAD Map Compilation Benchmark
As more of a developer/content creation benchmark, the results of the VRAD benchmark are not likely to be as interesting to a lot of people. However, keep in mind that better performance in this area can lead to more productive employees, so hopefully that means better games sooner. (Or maybe it just means more stress for the content developers?)
The results we got on the map compilation benchmark support Valve's own research and help to explain why they would be very interested in getting more Core 2 Quad systems into their offices. We don't have a single core Pentium 4 processor represented, but even a Pentium D 920 still ends up taking more than twice as long as a Core 2 Duo E6700 system, and about four times as long as Core 2 Quad. Looking at the CPU speed scaling, a 20% higher clock speed with the Pentium D resulted in 19% higher performance. If Intel had tried to stick with the NetBurst architecture, they would need dual core Pentium D processors running at more than 6.0 GHz in order to match the performance offered by the E6700. We won't even get into discussions about how much power such a CPU would require.
Performance scales almost linearly with clock speed once again, improving by 20% with the overclocking. Moving from single to dual core Athlon chips improves performance by about 92%. Going from a Core 2 Duo to a Core 2 Quad on the other hand improves performance by "only" 84%. It is not too surprising to find that moving to four cores doesn't show scaling equal to that of the single to dual move, but an 84% increase is still very good, roughly equal to what we see in 3D rendering applications.
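One way to put these two scaling figures side by side is Amdahl's law. The sketch below is our own illustration, not part of Valve's benchmark: it derives a parallel fraction from the 92% single-to-dual gain and asks what that fraction would predict for the dual-to-quad step. The two measurements come from different architectures (Athlon and Core 2), so treat the comparison as a rough guide only.

```python
def parallel_fraction(speedup_on_two: float) -> float:
    """Solve Amdahl's law, S = 1 / ((1 - p) + p / 2), for p."""
    return 2.0 * (1.0 - 1.0 / speedup_on_two)


def amdahl_speedup(p: float, cores: int) -> float:
    """Predicted speedup over one core for parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / cores)


p = parallel_fraction(1.92)    # 92% gain going from one core to two
two_to_four = amdahl_speedup(p, 4) / amdahl_speedup(p, 2)
print(f"parallel fraction: {p:.3f}")
print(f"predicted two-to-four core gain: {(two_to_four - 1) * 100:.0f}%")
```

With about 96% of the work running in parallel, the model predicts a gain of roughly 85% when doubling from two cores to four, close to the 84% we measured. In other words, the smaller gain at four cores is about what you would expect from the small serial portion alone.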
At this point, we almost have more questions than we have answers. The good news is that it appears Valve and others will be able to make good use of multi-core architectures in the near future. How much those improvements will really affect the gameplay remains to be seen. In the short term, at least, it appears that most of the changes will be cosmetic in nature. After all, there are still a lot of single core processors out there, and you don't want to cut off a huge portion of your target market by requiring multi-core processors in order to properly experience a game. Valve is looking to offer equivalent results regardless of the number of cores for now, but the presentation will be different. Some effects will need to be scaled down, but the overall game should stay consistent. At some point, of course, we will likely see the transition made to requiring dual core or even multi-core processors in order to run the latest games.
We had a lot of other questions for Valve Software, so we'll wrap up with some highlights from the discussions that followed their presentations. First, here's a quick summary of the new features that will be available in the Source engine in the Episode Two timeframe.
We mentioned earlier that Valve is looking for things to do with gaming worlds beyond simply improving the graphics: more interactivity, a better feeling of immersion, more believable artificial intelligence, etc. Gabe Newell stated that he felt the release of Kentsfield was a major inflection point in the world of computer hardware. Basically, the changes this will bring about are going to be far reaching and will stay with us for a long time. Gabe feels that the next major inflection point is going to be related to the post-GPU transition, although anyone's guess as to what exactly it will be and when it will occur will require a much better crystal ball than we have access to. If Valve Software is right, however, the work that they are doing right now to take advantage of multiple cores will dovetail nicely with future computing developments. As we mentioned earlier, Valve is now committed to buying all quad core (or better) systems for their own employees going forward.
In regards to quad core processing, one of the questions that was raised is how the various platform architectures will affect overall performance. Specifically, Kentsfield is simply two Core 2 Duo dies placed within a single CPU package (similar to Smithfield and Presler relative to Pentium 4). In contrast, Athlon X2 and Core 2 Duo are two processor cores placed within a single die. There are quite a few options for getting quad core processing these days. Kentsfield offers four cores in one package, and in the future we should see four cores in a single die. Alternatively, you could go with a dual socket motherboard with each socket containing a dual core processor. On the extreme end of things, you could even have a four socket motherboard with each socket housing a single core processor. Valve admitted that they hadn't done any internal testing of four socket or even two socket platforms -- they are primarily interested in desktop systems that will see widespread use -- and much of their testing so far has been focused on dual core designs, with quad core being the new addition. Back to the question of CPU package design (two chips on a package vs. a single die), the current indication is that the performance impact of a shared package vs. a single die was "not enough to matter".
One area in which Valve has a real advantage over other game developers is in their delivery system. With Steam comes the ability to roll out significant engine updates; we saw Valve add HDR rendering to the Source engine last year, and their episodic approach to gaming allows them to iterate towards a completely new engine rather than trying to rewrite everything from scratch every several years. That means Valve is able to get major engine overhauls into the hands of the consumers more quickly than other gaming companies. Steam also gives them other interesting options, for example they talked about the potential to deliver games optimized specifically for quad core to those people who are running quad core systems. Because Steam knows what sort of hardware you are running, Valve can make sure in advance that people meet the recommended system requirements.
Perhaps one of the most interesting things about the multi-core Source engine update is that it should be applicable to previously released Source titles like the original Half-Life 2, Lost Coast, Episode One, Day of Defeat: Source, etc. The current goal is to get the technology released to consumers during the first half of 2007 -- sometime before Episode Two is released. Not all of the multi-core enhancements will make it into Episode Two or the earlier titles, but as time passes Source-based games should begin adding additional multi-core optimizations. It is still up to the individual licensees of the Source engine to determine which upgrades they want to use or not, but at least they will have the ability to add multithreading support.
With all of the work that has been done by Valve during the past year, what is their overall impression of the difficulty involved? Gabe put it this way: it was hard, painful work, but it's still a very important problem to solve. There are talented people that have been working on getting multithreading to work rather than on other stuff, so there has been a significant monetary investment. The performance they will gain is going to be useful however, as it should allow them to create games that other companies are currently unable to build. Even more surprising is how the threading framework is able to insulate other programmers from the complexities involved. These "leaf coders" (i.e. junior programmers) are still able to remain productive within the framework, without having to be aware of the low-level threading problems that are being addressed. One of these programmers for example was able to demonstrate new AI code running on the multithreaded platform, and it only took about three days of work to port the code from a single threaded design to the multithreading engine. That's not to say that there aren't additional bugs that need to be addressed (there are certain race conditions and timing issues that become a problem with multithreading that you simply don't see in single threaded code), but over time the programmers simply become more familiar with the new way of doing things.
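For readers who have not run into a race condition before, here is a generic illustration of the kind of bug that only appears once code is multithreaded. It has nothing to do with Valve's framework; the class and function names are ours. Several threads update a shared counter, and a lock makes the read-modify-write step safe.

```python
import threading


class Counter:
    """Shared counter whose read-modify-write is guarded by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads can read the same old value and
        # one of the updates is lost: the classic race condition.
        with self._lock:
            self.value += 1


def run(threads: int = 4, per_thread: int = 10_000) -> int:
    """Increment one shared counter from several threads and return the total."""
    counter = Counter()

    def work() -> None:
        for _ in range(per_thread):
            counter.increment()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value


print(run())
```

A framework that hides this sort of locking from the "leaf coders" is what lets them port code without tracking every shared variable themselves.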
Another area of discussion that brought up several questions was in regards to other hardware being supported. Specifically, there is now hardware like PhysX, as well as the potential for graphics cards to do general computations (including physics) instead of just pure graphics work. Support for these technologies is not out of the question according to Valve. The bigger issue is going to be adoption rate: they are not going to spend a lot of man-hours supporting something that less than 1% of the population is likely to own. If the hardware becomes popular enough, it will definitely be possible for Valve to take advantage of it.
As far as console hardware goes, the engine is already running on Xbox 360, with support for six simultaneous threads. The PC platform and Xbox 360 are apparently close enough that getting the software to run on both of them does not require a lot of extra effort. PS3 on the other hand.... The potential to support PS3 is definitely there, but it doesn't sound like Valve has devoted any serious effort into this platform as of yet. Given that the hardware isn't available for consumer purchase yet, that might make sense. The PS3 Cell processor does add some additional problems in terms of multithreading support. First, unlike Xbox 360 and PC processors, the processor cores available in Cell are not all equivalent. That means they will have to spend additional effort making sure that the software is aware of what cores can do what sort of tasks best (or at all as the case may be). Another problem that Cell creates is that there's not a coherent view of memory. Each core has its own dedicated high-speed local memory, so all of that has to be managed along with worrying about threading and execution capabilities. Basically, PS3/Cell takes the problems inherent with multithreading and adds a few more, so getting optimal utilization of the Cell processor is going to be even more difficult.
One final area that was of personal interest is Steam. Unfortunately, one of the biggest questions in regards to Steam is something Valve won't discuss; specifically, we'd really like to know how successful Steam has been as a distribution network. Valve won't reveal how many copies of Half-Life 2 (or any of their other titles) were sold on Steam. As the number of titles showing up has been increasing quite a bit, however, it's safe to say Steam will be with us for a while. This is perhaps something of a no-brainer, but the benefits Valve gets by selling a game via Steam rather than at retail are quite tangible. Valve did confirm that any purchases made through Steam are almost pure profit for them. Before you get upset with that revelation, though, consider: whom would you rather support, the game developers or the retail channel, publishers, and distributors? There are plenty of people that will willingly pay more money for a title if they know the money will all get to the artists behind the work. Valve also indicated that a lot of the "indie" projects that have made their way over to Steam have become extremely successful compared to their previously lackluster retail sales -- and the creators see a much greater percentage of the gross sales vs. typical distribution (even after Valve gets its cut). So not only does Steam do away with scratched CDs, lost keys, and disc swapping, but it has also started to become a haven for niche products that might not otherwise be able to reach their target audience. Some people hate Steam, and it certainly isn't perfect, but many of us have been very pleased with the service and look forward to continuing to use it.
Some of us (author included) have been a bit pessimistic on the ability of game developers to properly support dual cores, let alone multi-core processors. We would all like for it to be easy to take advantage of additional computational power, but multithreading certainly isn't easy. After the first few "SMP optimized" games came and went, the concerns seemed to be validated. Quake 4 and Call of Duty 2 both were able to show performance improvements courtesy of dual core processors, but for the most part these performance gains only came at lower resolutions. After Valve's Hardware Day presentations, we have renewed hope that games will actually be able to become more interesting again, beyond simply improving visuals. The potential certainly appears to be there, so now all we need to see is some real game titles that actually make use of it. Hopefully, 2007 will be the year that dual core and multi-core gaming really makes a splash. Looking at the benchmark results, are four cores twice as exciting as two cores? Perhaps not yet for a lot of people, but in the future they very well could be! The performance offered definitely holds the potential to open up a lot of new doors.
As a closing comment, there was a lot of material presented, and in the interest of time and space we have not covered every single topic that was discussed. We did try to hit all the major highlights, and if you have any further questions, please join the comments section and we will do our best to respond.