Original Link: http://www.anandtech.com/show/1645
The Quest for More Processing Power, Part Two: "Multi-core and multi-threaded gaming" by Johan De Gelas on March 14, 2005 12:05 AM EST
Introduction
In our first article, we explained that dynamic power, power leakage, the memory wall and wire delay have forced CPU designers to rethink the methods that they use to achieve higher performance CPUs.
In Part 2, we will investigate the advantages and disadvantages of the new market trend: multi-core CPUs. Will dual core enhance your gaming experience? Tim Sweeney, the leading developer behind the Unreal 3 engine, was kind enough to give concise answers to our questions about multi-threaded development. There is more - in the third part of this series, we will investigate what future multi-core and single core architectures will bring. We examine whether the stories about "the new era of multi-threaded multi-core CPUs" are true and whether or not this will really benefit the consumer.
Should you care?
Should you care whether or not we are moving to multi-core and multi-threaded CPUs? After all, over the past decades, we have consistently been able to get more performance for lower prices. However, it is far from clear that multi-core CPUs will benefit all consumers. We will explain this statement in more detail, so that you can judge whether or not they will benefit you. The last spring IDF was all about multi-core CPUs, but there was very little information on how this is going to benefit consumers. Let us take a critical look at this new direction that desktop CPUs have taken.
Multi-core, multi-expensive?
Dual cores are expensive to manufacture. Yield (the fraction of working chips on one wafer) falls as die size grows, so larger, dual core chips will always have lower yields than smaller, single core chips on the same process technology. But that is only a small problem. A bigger and more obvious problem is that you get only half the number of chips per wafer (even slightly less). So, dual cores built from two separate dies (such as Presler) cost at least twice as much to manufacture as a single core chip, and dual cores on one large die (such as Yonah and the Pentium-D) most likely cost more than that.
Dual and multi-cores might not increase the thermal density (dissipated power per mm²), but they do increase the total power. Granted, from the viewpoint of a heat sink designer, it is not much harder to cool a theoretical 206 mm² Pentium-D that dissipates 180 Watt than a 112 mm² Prescott chip that dissipates roughly 90 Watt. However, making sure that those 180 Watts do not cook all the components inside your computer is an almost impossible task for the system designer who wants to build a relatively silent PC. The result is that multi-core CPUs will run at lower clock speeds than their single core counterparts. The Pentium-D, the dual core Prescott, is limited to 130 Watt and 3.2 GHz, while the current Prescott dissipates up to 115 Watt and runs at 3.8 GHz.
And last, but not least, dual core CPUs need more bandwidth than a single core to make a difference, and they increase the latency that each core perceives: maintaining cache coherency and competing for access to the same memory bus both add to the total latency that the CPU sees and thus lower performance.
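To get a feeling for why die area hits cost twice, here is a rough sketch that combines the two effects described above: fewer candidate dies per wafer and a lower yield per candidate. Only the 112 mm² single core die size comes from the text. The Poisson yield model, the 300 mm wafer, the wafer cost, the defect density and the idea that a dual core is simply twice the area are all illustrative assumptions, not manufacturer data.

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Common approximation for gross dies per wafer, including edge loss."""
    radius = wafer_diameter_mm / 2.0
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(gross - edge_loss)

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Poisson yield model: probability that a die contains no defect."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)

def cost_per_good_die(die_area_mm2, wafer_cost, defects_per_cm2):
    """Wafer cost divided by the expected number of working dies."""
    good_dies = dies_per_wafer(die_area_mm2) * poisson_yield(die_area_mm2, defects_per_cm2)
    return wafer_cost / good_dies

WAFER_COST = 5000.0      # hypothetical cost of one processed wafer
DEFECTS_PER_CM2 = 0.5    # hypothetical defect density

SINGLE_AREA = 112.0              # Prescott die size quoted in the article
DUAL_AREA = 2.0 * SINGLE_AREA    # assumption: two cores, twice the area

single_cost = cost_per_good_die(SINGLE_AREA, WAFER_COST, DEFECTS_PER_CM2)
dual_cost = cost_per_good_die(DUAL_AREA, WAFER_COST, DEFECTS_PER_CM2)
cost_ratio = dual_cost / single_cost

print("dies per wafer:", dies_per_wafer(SINGLE_AREA), "vs", dies_per_wafer(DUAL_AREA))
print("cost ratio dual/single: %.2f" % cost_ratio)
```

With any positive defect density, the ratio comes out above 2, because the larger die loses both on count and on yield. The exact figure depends entirely on the assumed inputs.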
Multi-core, multi-performance?
The advantages of multi-core and multi-threaded CPUs far outweigh the disadvantages in the server market. Because most server applications produce a lot of threads and processes, performance scales close to linearly as more cores are added to the die. This is in sharp contrast with the superscalar CPU, where increasingly complex designs require exponentially more transistors and power yet show diminishing returns, especially in server applications where the IPC can drop below 1. While dual core CPUs are more expensive to manufacture, they are far easier to design than turning a single core CPU into an even wider issue, more complex CPU. Development costs for a new CPU design are astronomically high. So, it does not surprise us at all that server CPU manufacturers have turned en masse towards multi-core CPU designs: significant performance gains for a fraction of the time and money invested. And the same can be said about a big part of the HPC market.
For a good example of how well server applications can scale with more CPUs, refer to our DB2 tests, which showed up to a 96% performance increase going from single to dual, and a boost of up to 89% when we increased the number of Opterons from two to four. Most desktop and many workstation applications are single-threaded, however. Or more accurately, they might be multithreaded to be more responsive, but there is only one thread that really needs CPU power.
Even some workstation applications that are supposed to be prime examples of multi-threaded applications are not as multi-core friendly as they appear to be. I ran a lot of Adobe Premiere benchmarks with different video formats, and I found that the second CPU offered a meagre 10% to 40% speed increase in video editing (rendering). 3DSMax only shows big increases when you use very complex scenes. When using a relatively light animation scene, the second CPU adds about 20% to 50%. One of the best scenes, the architecture scene of the Spec test, shows an 89% increase when adding a second Opteron, but two extra Opterons already show some diminishing returns - the performance increase was 72%.
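The scaling figures quoted above can be put into perspective with Amdahl's law. This is our own back-of-the-envelope check, not part of the original benchmarking, and Amdahl's law is a simplification: it assumes that a fixed fraction of the work scales perfectly and it ignores memory bandwidth and synchronization costs. The sketch derives the parallel fraction implied by a measured two-CPU gain and then asks what a four-CPU system could deliver at best.

```python
def parallel_fraction(speedup, cpus):
    """Invert Amdahl's law: which parallel fraction explains a measured speedup?"""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cpus)

def amdahl_speedup(fraction, cpus):
    """Amdahl's law: speedup over one CPU when `fraction` of the work scales."""
    return 1.0 / ((1.0 - fraction) + fraction / cpus)

# Two-CPU gains quoted in the text, expressed as speedup factors.
two_cpu_speedups = {
    "Premiere (worst case)": 1.10,
    "Premiere (best case)": 1.40,
    "3DSMax architecture scene": 1.89,
    "DB2": 1.96,
}

for name, speedup in two_cpu_speedups.items():
    p = parallel_fraction(speedup, 2)
    print("%-27s parallel fraction %.2f, four-CPU limit %.2fx"
          % (name, p, amdahl_speedup(p, 4)))

# DB2: what gain from two to four CPUs does the model predict?
db2_fraction = parallel_fraction(1.96, 2)
db2_gain_2_to_4 = amdahl_speedup(db2_fraction, 4) / 1.96 - 1.0
print("DB2 predicted gain from two to four CPUs: %.0f%%" % (100 * db2_gain_2_to_4))
```

For DB2, the model predicts a gain of roughly 92% when going from two to four CPUs, close to the measured "up to 89%". An application that gains only 10% from a second CPU has very little left to gain from a third or fourth.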
Multitasking scenarios might be another way to use the power of dual and multi-cores. However, many of the CPU heavy applications that desktop and workstation users like to run in the background - archiving, encoding - also operate on the hard disk. And despite the merits of NCQ (Native Command Queuing), high rotation speeds, and lower seek times, disk heavy tasks and especially multithreaded ones can bring a whole system to a crawl when there is too much hard disk activity. So, it is clear that there are big challenges ahead before multi-core CPUs will really bring benefits to most consumers and employees.
Threads & Performance
Threads are a much discussed subject, so we would like to give a short introduction for those of you who are not familiar with them. To understand threads, you must first understand processes. Any decent OS controls the memory allocation of the different programs or processes. A process gets its own private, virtual address space in memory from the OS. Thus, a process cannot communicate or exchange data with other processes without the help of the kernel, the heart of the OS that controls everything. A process can be split up into threads: parallel tasks that share the process's virtual address space and can therefore exchange data (global, static, and instance fields, etc.) very quickly without intervention of the OS.
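A minimal sketch of what "sharing the address space" means in practice: the two worker threads below write their results straight into a dictionary owned by the process, and the main thread reads them back without any pipe, socket or other kernel-mediated channel. The names `shared_results` and `worker` are made up for illustration. Each thread writes to its own key, so no locking is needed in this particular case.

```python
import threading

# Lives in the process's address space, so every thread sees the same object.
shared_results = {}

def worker(name, numbers):
    # Store the partial result directly in shared memory.
    shared_results[name] = sum(numbers)

threads = [
    threading.Thread(target=worker, args=("low", range(0, 500))),
    threading.Thread(target=worker, args=("high", range(500, 1000))),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until both partial sums have been written

total = shared_results["low"] + shared_results["high"]
print("total:", total)
```

Two separate processes would have to hand these partial sums over through a mechanism provided by the OS, which is where the extra overhead comes from.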
The thread is the entity to which the modern operating system (Windows NT based, Solaris, Linux) assigns CPU time. While you could split a CPU intensive program into several processes (which a modern OS sees as processes consisting of one thread each), threads of the same process have much less overhead and synchronize data much more quickly. The operating system assigns CPU time to running threads based on their priority. Performance gains of multi-CPU or multi-core CPU configurations are only high if:
- You have more than one CPU intensive thread;
- The threads are balanced - there is not one very intensive thread and a few others that are hardly CPU intensive;
- Synchronization between threads (shared data) either happens quickly, thanks to fast interconnects, or little synchronization is necessary;
- The OS provides well-tuned, load-balanced scheduling;
- The threads are cache friendly (memory latency!) and do not push the memory bandwidth to its limits.
In that case, you may typically expect a 70% to 99% speed-up, thanks to the second core. Be warned that Intel has already been showing performance increases of "up to 124%", which are not realistic.
The benchmarks compare a Pentium 4 EE 840, a dual core Pentium 4 at 3.2 GHz (1 MB L2), to a 3.73 GHz Pentium 4 EE with 2 MB L2. The last benchmark especially, a game running in the foreground with two PVR (Personal Video Recorder) tuners running in the background, gives a very weird result. How can a slower dual core be more than 100% faster than a single core with a higher clock speed, bigger caches and a faster FSB? When we first asked Intel, they pointed to the platform (newer chipset, etc.), but no new chipset can make up for the single core's 33% faster FSB.
We suspected that different thread priorities (giving the game thread a higher priority) might have been the explanation, but Intel's engineers had another interesting explanation. They pointed out that the Windows scheduler can sometimes be inefficient when running many heavy tasks on a single CPU and might have given the game less CPU time than normal. The Windows scheduler did not have that problem when two CPUs were present: there is less context switching between threads, and no reason to starve the game of CPU time. Prepare for a load of hard-to-interpret benchmarks on the Internet...
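The balance condition listed earlier in this section, and the reason why a gain above 100% on two identical cores needs an explanation outside raw CPU power, can be illustrated with a small model. It assumes an idealized scheduler that always places the next thread on the least busy core, and it ignores synchronization, caches and memory bandwidth entirely. The function name and the workloads are made up for the example.

```python
def ideal_speedup(thread_loads, cores=2):
    """Best-case speedup over one core for independent threads.

    thread_loads: non-empty list of CPU time needed by each thread.
    Threads are placed greedily, heaviest first, on the least busy core.
    """
    core_time = [0.0] * cores
    for load in sorted(thread_loads, reverse=True):
        least_busy = core_time.index(min(core_time))
        core_time[least_busy] += load
    # One core would need the sum; several cores finish when the busiest one does.
    return sum(thread_loads) / max(core_time)

balanced = ideal_speedup([50, 50])       # two equally heavy threads
unbalanced = ideal_speedup([90, 5, 5])   # one heavy thread, two light ones
single = ideal_speedup([100])            # one CPU intensive thread

print("balanced:   %.2fx" % balanced)
print("unbalanced: %.2fx" % unbalanced)
print("single:     %.2fx" % single)
```

In this model, two cores can never do better than 2x, that is, a 100% gain. A measured gain of 124% therefore has to come from something other than the second core's computing power, such as the scheduler behaviour that Intel's engineers described.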
Threads & Programming
Programming with threads brings many advantages, especially on dual cores. Threads with long running, CPU intensive processing should not be able to give the system a sluggish, unresponsive feeling when you want to do something else at the same time. The OS scheduler should take care of that as long as the CPU is fast enough, but the Intel benchmarks above show you that that is only true in theory. Dual and multi-core can definitely help here. Threads make a system more responsive and offer a very nice performance boost on multi-CPU systems. But the other side of the coin is complexity. Running separate tasks that do not need to share data in separate threads is the easiest part of making a program more suitable for multi-core CPUs. But that has been done for a long time, and the real challenge is to handle threads that have to share data. The programmer also has to keep in mind that large numbers of threads introduce overhead in the form of (unnecessary) context switches, even on dual core CPUs.
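One common way to keep the number of threads, and with it the context switching, under control is to hand independent tasks to a pool with no more workers than there are cores. The sketch below uses Python's standard `ThreadPoolExecutor`. The prime counting task and the task sizes are placeholders. Note that CPython's global interpreter lock keeps CPU-bound threads from running truly in parallel, so treat this as a sketch of the structure rather than as a benchmark.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def count_primes(limit):
    """CPU intensive placeholder task: count the primes below `limit`."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

# Independent tasks that share no data: the easy case described above.
tasks = [1000, 2000, 3000, 4000, 5000, 6000]

# No more worker threads than cores, however many tasks are queued.
workers = min(len(tasks), os.cpu_count() or 1)

with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(count_primes, tasks))  # results keep task order

print("workers:", workers)
print("primes below each limit:", results)
```

Six tasks are queued here, but the pool never runs more threads than the machine has cores. The remaining tasks simply wait their turn instead of competing for CPU time.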
A nasty problem that might pop up is a "deadlock": two threads are each waiting for the other to complete, resulting in neither thread ever completing. A "race" between two threads might sound speedy, but it means that the result of a program's operation depends on which of two or more threads completes first. These problems become exponentially worse as more and more threads are able to run into them. Both the Java and .NET ("ThreadPool") platforms provide classes and tools to deal with thread management - programmers are not left on their own. The problem is not creating threads, but debugging multithreaded programs. The result is that multithreading has been used sparingly and with as few threads as possible to keep complexity down. But the right tools are coming, right?
Multi-threading toolbox
Intel does provide a few interesting tools for multithreading.
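Both problems can be shown in a few lines. In the hypothetical example below, two threads move money between the same two accounts in opposite directions. The locks protect the balance updates against a race. Taking the two locks in a fixed order (lowest account number first) is one well-known way to rule out the deadlock that would otherwise be possible, when each thread holds one lock and waits for the other. All class and function names are invented for this sketch.

```python
import threading

class Account:
    def __init__(self, ident, balance):
        self.ident = ident
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always lock the account with the lower ident first. If each thread
    # locked "its" source first, two opposite transfers could each grab one
    # lock and wait forever for the other: a deadlock.
    first, second = sorted((src, dst), key=lambda account: account.ident)
    with first.lock:
        with second.lock:
            # Without the locks, these read-modify-write steps could
            # interleave between threads and lose updates: a race.
            src.balance -= amount
            dst.balance += amount

def many_transfers(src, dst, times):
    for _ in range(times):
        transfer(src, dst, 1)

a = Account(1, 1000)
b = Account(2, 1000)

t1 = threading.Thread(target=many_transfers, args=(a, b, 10000))
t2 = threading.Thread(target=many_transfers, args=(b, a, 10000))
t1.start()
t2.start()
t1.join()
t2.join()

print("balances:", a.balance, b.balance)
```

Even in this toy case, correctness depends on a convention (the lock order) that every piece of code touching the accounts has to respect. That is a large part of what makes multithreaded programs hard to debug.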
OpenMP is the industry standard for "portable" multi-threaded application development, and can do fine grain (loop level) and large grain (function level) threading.
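To make "fine grain (loop level)" threading concrete, the sketch below splits the iterations of one loop over several threads by hand and combines the partial results at the end. This is plain Python threading, not OpenMP, and the function names are our own. It only illustrates the kind of iteration partitioning that such tools automate. As before, CPython's global interpreter lock means that this shows the structure, not a speed-up.

```python
import threading

def sum_squares_chunk(values, start, stop, partial_sums, slot):
    """Run the loop body for one contiguous range of iterations."""
    total = 0
    for i in range(start, stop):
        total += values[i] * values[i]
    partial_sums[slot] = total  # each thread writes only its own slot

def parallel_sum_of_squares(values, num_threads=2):
    """Partition one loop over `num_threads` threads and combine the results."""
    chunk = (len(values) + num_threads - 1) // num_threads  # ceiling division
    partial_sums = [0] * num_threads
    threads = []
    for slot in range(num_threads):
        start = slot * chunk
        stop = min(start + chunk, len(values))
        thread = threading.Thread(
            target=sum_squares_chunk,
            args=(values, start, stop, partial_sums, slot),
        )
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return sum(partial_sums)

data = list(range(1000))
result = parallel_sum_of_squares(data, num_threads=4)
print("sum of squares:", result)
```

Large grain (function level) threading follows the same pattern, except that each thread runs a different function instead of a different slice of the same loop.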
The newest Intel compilers are even capable of Auto-Parallelization. That sounds fantastic - would multithreading be as easy as using the right compiler? After all, Intel's compiler is able to vectorize existing FP code too. Just recompile your FP intensive code with the right compiler flags and you get speed-ups of 100% and more, as the Intel compiler is able to replace x87 instructions with faster SSE2 alternatives.
Let us see what Intel says about auto-parallelization:
"Improve application performance on multiprocessor systems using auto-parallelization for automatic threading of loops. This option detects parallel loops capable of being executed safely in parallel and automatically generates multi-threaded code. Automatic parallelization relieves the user from having to deal with the low-level details of iteration partitioning, data sharing, thread scheduling and synchronizations. It also provides the benefit of the performance available from multiprocessor systems and systems that support Hyper-Threading Technology."
So, it is just a matter of using the right tools? A chicken and egg problem? When the hardware is there, the software will follow? Is it just a matter of having the right tools and enough market penetration of multi-core CPUs? We asked Tim Sweeney, founder of Epic and a multi-threaded game engine programming guru.
Unreal 3
The new Unreal 3 engine is a state of the art game development framework for next-generation consoles and DirectX 9 PCs, but what sparked our interest for this article was the fact that it is probably one of the first multithreaded game engines for the most popular game genre: first person shooters.
AnandTech: The new Unreal Engine 3 is designed for multi-threading, and will make good use of dual core CPUs available when games on the new engine come out. What parts of the game will benefit/be improved, thanks to multiprocessing? What will be the parts that will benefit the most?
Tim Sweeney: For multithreading optimizations, we're focusing on physics, animation updates, the renderer's scene traversal loop, sound updates, and content streaming. We are not attempting to multithread systems that are highly sequential and object-oriented, such as the gameplay.
Implementing a multithreaded system requires two to three times the development and testing effort of implementing a comparable non-multithreaded system, so it's vital that developers focus on self-contained systems that offer the highest effort-to-reward ratio.
AnandTech: What kind of performance improvement (rough estimate) do you expect from a dual core CPU compared to a single core CPU with the same core? (A few percent, a bit more than 10%, tens of percent?) In other words, will a gamer "feel" the difference between a dual core and single core or between a single and dual CPU system running an Unreal 3 engine based game?
Tim Sweeney: It's too early to talk numbers, but we certainly expect Unreal Engine 3 titles to see significant gains on multi-core platforms.
AnandTech: In the past years, games have typically depended more on GPU power than on CPU power (a mid-range CPU with a high end video card was/is faster than a high end CPU with a mid-range video card even at relatively low resolutions). Is the multithreaded nature of the Unreal 3 engine a sign that CPU performance is again playing a more important role in the gaming experience?
Tim Sweeney: Unreal Engine games have always been more CPU-intensive than the norm, for two reasons. First, we're always trying to push the leading edge with physics and other CPU-based features. Second, the Unreal Engine has a much more extensive gameplay scripting interface aimed at empowering mod authors and improving developer productivity by enabling safer and higher-level gameplay development. So we're not going to have any trouble keeping up with increases in CPU power.
Multi-core will be especially valuable because CPU performance scaling due to frequency improvements has tapered off over the past few years.
Clock speed has increased slowly, and real performance hasn't increased in proportion to clocks. But two cores have approximately twice the real aggregate performance as one core, so we're about to see a nonlinear improvement.
Finally, keep in mind that the Windows XP driver model for Direct3D is quite inefficient, to such an extent that in many applications, the OS and driver overhead associated with issuing Direct3D calls approaches 50% of available CPU cycles. Hiding this overhead will be one of the major immediate uses of multi-core.
AnandTech: Did you make use of auto-parallelisation compiler technology (like the auto parallelisation found in Intel C++ compiler) to make the engine multithreaded?
Tim Sweeney: Auto-parallelization of C++ code is not a serious notion. This falls in the same category as the Intel compiler's strip-mining optimizations and other such tricks, which are designed to speed up one particular loop in one particular SpecFP benchmark. These techniques applied to C/C++ programs are completely infeasible on the scale of real applications.
AnandTech: What about OpenMP?
Tim Sweeney: There are two parts to implementing multithreading in an application. The first part is launching the threads and handing data to them; the second part is making the appropriate portions of your 500,000-line codebase thread-safe. OpenMP solves only the first problem. But that's the easy part - any idiot can launch lots of threads and hand data to them. Writing thread-safe code is the far harder engineering problem and OpenMP doesn't help with that.
AnandTech: Programming multiple threads can be complex. Wasn't it very hard to deal with the typical problems of multithreaded programming, such as deadlocks, race conditions and synchronization?
Tim Sweeney: Yes! These are hard problems, certainly not the kind of problems every game industry programmer is going to want to tackle. This is also why it's especially important to focus multithreading efforts on the self-contained and performance-critical subsystems in an engine that offer the most potential performance gain. You definitely don't want to execute your 150,000 lines of object-oriented gameplay logic across multiple threads - the combinatorial complexity of all of the interactions is beyond what a team can economically manage. But if you're looking at handing off physics calculations or animation updates to threads, that becomes a more tractable problem.
We also see middleware as one of the major cost-saving directions for the industry as software complexity increases. It's certainly not economical for hundreds of teams to write their own multithreaded game engines and tool sets. But if a handful of companies write the core engines and tools, and hundreds of developers can reuse that work, then developers can focus more of their time and money on content and design, the areas that really set games apart.
AnandTech: The current OpenGL and DirectX are - AFAIK - not very well adapted to multithreaded programming. How did you solve this problem? Or wasn't it a problem at all?
Tim Sweeney: There is only one GPU in there, and though it is highly parallel at the pixel level, its execution is still serial on the granularity of state changes and triangle submission. So it is natural that the interface to the GPU remain single-threaded, and that part of one CPU thread be dedicated to submitting rendering commands.
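The arrangement Tim describes in his last answer, one thread dedicated to submitting rendering commands, can be sketched as a single consumer fed by a thread-safe queue. This is our own illustration, not code from the Unreal Engine. The idea that other subsystems push commands into the queue from their own threads is an assumption made for the example, the subsystem names are placeholders, and a plain list stands in for the GPU.

```python
import queue
import threading

command_queue = queue.Queue()   # thread-safe hand-off to the render thread
submitted = []                  # stand-in for the GPU; only the render thread touches it
STOP = object()                 # sentinel that tells the render thread to finish

def render_thread():
    # The only thread that talks to the "GPU": commands are submitted serially.
    while True:
        command = command_queue.get()
        if command is STOP:
            break
        submitted.append(command)

def subsystem(name, count):
    # Hypothetical subsystem that produces rendering commands on its own thread.
    for i in range(count):
        command_queue.put((name, i))

renderer = threading.Thread(target=render_thread)
renderer.start()

producers = [
    threading.Thread(target=subsystem, args=(name, 100))
    for name in ("physics", "animation", "sound")
]
for producer in producers:
    producer.start()
for producer in producers:
    producer.join()

command_queue.put(STOP)   # all commands are queued; let the renderer drain and exit
renderer.join()

print("commands submitted:", len(submitted))
```

The producers never touch the "GPU" themselves, so the interface to it stays single-threaded, while the work of generating commands can still be spread over several cores.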
Threads & Gaming
Tim gave us some extremely interesting information. Yes, the extra computing power of multi-cores is welcome in the gaming industry. Better game physics, animation and more intensive and accurate sound effects are made possible with more than one core.
Galactic Civilizations is another example of how game developers can make good use of multithreading. This galactic domination game, which puts a lot of emphasis on diplomacy, research and empire management, needs an AI that can make very complex decisions. By multithreading the engine, the game engine can think while the player is playing instead of working strictly turn by turn. In the coming years, we may expect much better AI. But the price that (game) developers have to pay is high: a multithreaded game engine doubles or even triples the development effort, as Tim told us.
The tools, which Intel advertises in almost every multi-core presentation, are next to useless for the problems that developers face, as Tim explained. Auto-parallelisation is a nice trick to increase the SpecFP score, but it is next to useless for a real world application. The good news for Intel, AMD and others is that the CPU will play a much more important role again. Physics, Artificial Intelligence and animation can be improved significantly by being parallelised and using the extra capabilities of dual core CPUs. But there are limits to Thread Level Parallelism. While increased ILP (Instruction Level Parallelism, IPC) might require exponentially increasing efforts from the manufacturer, using more and more threads, or increased TLP (Thread Level Parallelism), requires exponentially increasing efforts from the developers. Tim clearly emphasizes that only parts of the application can be economically parallelized. Increasing parallelisation beyond that, by using ever more threads, is simply not feasible. There is a pretty hard economic limit to TLP.
Tim Sweeney sums it up:
"You can expect games to take advantage of multi-core pretty thoroughly in late 2006 as games and engines also targeting next-generation consoles start making their way onto the PC.
Writing multithreaded software is very hard; it's about as unnatural to support multithreading in C++ as it was to write object-oriented software in assembly language. The whole industry is starting to do it now, but it's pretty clear that a new programming model is needed if we're going to scale to ever more parallel architectures. I have been doing a lot of R&D along these lines, but it's going slowly."
Conclusion
Writing multithreaded code means much higher software development costs while CPU development gets easier and thus cheaper (compared to even more complex superscalar CPUs). No wonder that the CPU developers are very motivated to hype the multi-core route, but the software development community is probably less enthusiastic.
Intel and other manufacturers should not simply push the costs of getting higher performance onto the software developers because, in the end, it will be the consumer who pays the final price: either more money or buggier software with more crashes and hangs. One way that Intel and others can help to keep multithreaded development costs under control while offering increasing CPU performance is to keep investing in ILP and thus higher IPC cores; another option is to improve inter-CPU communication.
The easiest part of multithreading is using threads that run completely independently and do not share any data. But this source of threading is probably already being used almost to the fullest. In order to tap into a new source of multithreading, such as the largely unused potential of multithreaded AI, physics and animation, it is important that developers do not have to worry about interthread messaging and synchronization lowering performance.
Very fast interprocessor communication, which makes sure that thread synchronization comes with little overhead, will give developers a bigger incentive to invest the extra time in multithreading.
"Most of the current multi-threaded software is developed with an eye to keeping inter-thread messaging and synchronization as low as possible because both have a significant cost. This cost will be lowered by an order of magnitude by multiple cores on a single die, giving in turn more flexibility for the programmers.
The Pentium-D and Presler are examples of how not to do it: just slap two CPUs on the same die and call it a day. High clocked single cores like the upcoming Athlon 64 FX-57 will eat these massive chips for lunch in almost all benchmarks while consuming less energy. With the exception of some special far-fetched benchmarks, it will be pretty hard to justify these dual cores.
Applications which got low speed-ups by going multi-threaded due to the overhead of fine-grained locking mechanisms will be able to exploit multiple-processors with fast interprocessor communications much better."
Luckily, Intel's Yonah and AMD's dual core Athlon 64 show that better multi-core CPUs are on the way. At that point in time, we will be entering the multi-core era for real. And we can only applaud that because it unleashes a massive amount of CPU power upon the developers.
References
Intel multi-core briefing
Stephen L. Smith, Vice President Digital Enterprise group, IDF Spring 2005
Unreal 3 engine
Galactic Civilizations
Gabriele Svelto on dualcore CPUs