CHAPTER 1: The brakes on CPU power

CPU Performance increase hits the brakes.

The growth rate of CPU performance has been spectacular in the past decades. Two legends of computing history, John.L Hennessy and David A. Patterson, have quantified this performance growth to be about 58% per year.

A recent study by the University of göteborg [1] confirmed that the 58% number was true between 1985 and 1996. During the last 7.5 years (1996-2004), the Swedish professors proved that the performance growth has slowed down to an average of 41% per year. Even worse is the conclusion that "there are signs of a continuing decline".

When we focus on Intel's CPUs, the deterioration of CPU performance growth is almost spelling doom. In November 2002, Intel was well ahead of the competition with the introduction of a 3.06 GHz Pentium 4. Intel had doubled the clock speed of its latest x86 architecture within two years, which was quite an accomplishment.

Two and half years later, Intel's Pentium 4 is running at 3.8 GHz, which means that clock speed has increased by only 25%. Of course, we all know that performance does not scale linearly with clock speed. So, let us talk performance.

 CPU  SpecInt2000  SpecFp2000
Pentium 4 3800E 1666 1839
Pentium 4 3060 1167 1096
Pentium 4 1500 560 634

From 2000 to 2002, performance increased by 108%. In the following 3 years, Intel's latest CPU only increased integer performance by 43%. The same does not hold true for SpecFP2000, as the 3.8 GHz Prescott CPU had improved performance by 68%, while the 3.06 GHz was about 73% faster than the first incarnation of the Netburst architecture.

However, SpecFP2000 remains a "special" benchmark, which exaggerates greatly the importance of memory bandwidth as very few other FPU applications behave the same way. The 800 MHz FSB of the 3.8 GHz is 50% faster than the bus to Intel's first Hyperthreaded CPU (3.06 GHz), while the FSB of the latter has only a 33% advantage over the older 1.5 GHz Pentium 4.

Intel's compilers have also improved vastly over the past years, which is positive. However, they have also become better in using special tricks (strip-mining optimizations, for example) to artificially improve the Spec score; tricks that are not usable by developers who need to get real applications to the market. Don't take my word for it, but make sure to read Tim Sweeney's comments in the next article.

These advantages are the main reasons why SpecFP doesn't tell us what most applications do: the pace of CPU performance growth has slowed down significantly, even in FP intensive workloads. Applications such as 3DSMax, Lightwave, Adobe Premiere, video encoding and others show, on average, that the Pentium 4 3.8 GHz is about 20-45% faster than the Pentium 4 3.06 GHz, while the latter is easily between 60% and 90% faster than our 1.5 GHz reference point.

Demystifying the slowdown

It is no mystery that the three main reasons why CPU progress is slowing down are:

  • Total dissipated power
  • Wire Delay
  • "The memory wall"

However, simply stating that these three problems are the reason why it is getting very hard to design CPUs that perform better is an oversimplification. There are decent solutions for each of these problems, and the real reason why they have slowed down CPU progress is more subtle.

We are going to cover the memory wall in more detail later. Suffice it to say, it is well known that DRAM speeds up by about 10% per year, while CPUs run 40% to 60% faster each year.

Power problems

In order to understand power problems, you have to understand the following formula, which describes switching power:

Power ~ ½ CV ² Af

In other words, dissipated power is linear with the effective capacitance, activity and frequency. Power increases quadratically with the CPU's core voltage. Activity is the factor that is influenced by the software you run; the more intensive the software, the higher the amount of the time that the transistors are active.

With each major transition to a new process technology that has a reduction in transistor feature size of 2, the same die area becomes 4 times smaller. For example, Willamette (introduced with 180 nm technology) would have been more or less 4 times smaller using the 90 nm technology. That is simplified of course, but it shows that the die gets smaller and smaller. Now that should not be such a problem as Vdd (Vcore) can also be reduced, and as a result, you can reduce power by a factor of two or even more. Of course, as CPUs extract more ILP and have deeper pipelines, they become more complex and use more transistors. The result is that the power reductions of decreasing Vdd are negated by the increasing amount of transistors.

And there are limitations of the amount of power that you can dissipate through a shrinking die area. But switching power is not the worst problem, as it can be reduced by applying a few clever techniques.

One of them is clock gating, a power-saving technique implemented extensively in the Pentium 4. Clock gating logic will only activate the clocks in a Functional Unit Block (FUB) when it needs to work. Together with other power-saving techniques, switching or dynamic power is more or less under control; over time, it increases linearly, while the amount of transistors used is increasing exponentially.


Index CHAPTER 1 (con't)
POST A COMMENT

65 Comments

View All Comments

  • WhoBeDaPlaya - Thursday, February 10, 2005 - link

    Ain't no way you can get those repeaters out of there - that's already the optimum solution for driving the large load (interconnect). It probably equalizes the stage effort required (you can work out the math and find that for multi-stage logic, the optimal config is that each stage has the exact same effort level). Eg. instead of driving an interconnect with a "unit" inverter, it might be more feasible to drive it with a chain of them, each with different fan in/out. Repeater insertion is tricky and (as far as I know) can't readily be automated.

    Interconnects are getting to tbe point where traversal of a die diagonally can take multiple clock cycles. Some folks are suggesting that a pipelined approach could be extended to interconnects, esp. clock trees. But the most fun problem (for me at least :P) is the handling of inductance extraction - how in the h*ll do you model it accurately? High-speed digital design == Analog design. Long live analog / mixed-signal VLSI designers :P
    Reply
  • fitten - Thursday, February 10, 2005 - link

    [quote]Well-written multicore-aware code should have the number of cores as a _variable_, so you just set it to 1 on a uniprocessor platform.[/quote]

    Sometimes parallel algorithms aren't very good for serial execution. In these cases, you may actually have one algorithm for multiple processors and another algorithm for a single processor.

    [quote]So, if Intel were to use less repeaters the heat output could be lowered significantly. [/quote]

    Well... I'm sure the Intel engineers didn't just up-and-say one day, "Hey, I know something cool to do... let's put some more repeaters into the core." I'm sure there's a reason for them being in there. It would probably take a bit of redesign to get the repeaters out. (I'm pretty sure this is what you meant, but I just wanted to clarify that stuff like repeaters aren't just put into a CPU for no reason. Things like repeaters are put in because there wasn't a more viable solution to some signalling problem that's there.)
    Reply
  • sphinx - Thursday, February 10, 2005 - link

    So, the reason for the Prescott's shortcomings is the use of too many repeaters as shown in the image of the Itanium 2. If I remember correctly, the article said that the repeaters were using too much power as well. So, if Intel were to use less repeaters the heat output could be lowered significantly. Reply
  • AtaStrumf - Thursday, February 10, 2005 - link

    Nice article and pretty easy to understand as well. I'm happy to hear that there may still be hope for controlling the power leakage, because without it I just can't see anybody getting beyond 65 nm, since even 65 nm will, without improvements, leak almost 3 times as much power as 90nm does now.

    Anxiously waiting for E0 A64 to see what AMD has managed to cook up.
    Reply
  • mickyb - Wednesday, February 09, 2005 - link

    There are plenty of multi-threaded apps out there. I am not sure pure single threaded apps exist any more outside of "Hello World" and some old Cobol/FORTRAN ports that are on floppy.

    Quake and UT have been multi-threaded for a while. Quake was multi-threaded when I had a dual Pentium pro. There were even benchmarks. The benefits seen with hyper-threading also show that many apps are multi-threaded. The performance gain was negligible due to the graphics drivers and OpenGL/DirectX not being thread optimized. I am sure that has been worked out by now.

    Multi-threading is not all about making use of multiple CPUs. There are many conditions where a program would be stopped dead in its tracks waiting for a response from some outside program or hardware device. You can solve this with events, multi-process, multi-threading, call-backs, etc. Goal wise, they are related. In the Winders world, threading is the method of choice.

    I really can't believe there are still arguments going on about programs not being multi-threaded. This is not that much of an issue any more. Even if your apps is not threaded, the OS is and it can run on one CPU while your app runs on the other. Or if you have 2 apps, then they can run on different CPUs.

    With all that said, I agree with the thought that creating performance for all applications is better served using a faster single core CPU than dual CPUs. I think this way because when you have a unit of work to be done (even with multiple threads), it is more likely to be done quicker with a single CPU that is capable of the same computing power as 2 CPUs. I single unit of work will ultimately be smaller than a thread in all cases. The smallest is the instruction set.

    Now...with that said, if the limiting factor is technology and they cannot obtain the equivalent performance of a dual core with a single core, then it makes since to go dual core to obtain it, especially with the power leakage. I like the thinking behind dual core on a laptop, but am skeptical about the part that says turning the CPU off and on rapidly to keep it cool and efficient. It will probably work if it isn't turned on and off too quickly, but heat spreads pretty quickly. You wouldn't even get past POST without a heat-sink and that silicon insulator keeps everything pretty cozy.
    Reply
  • NegativeEntropy - Wednesday, February 09, 2005 - link

    Johan, another excellent article, I'm looking forward to part 2. Reply
  • Evan Lieb - Wednesday, February 09, 2005 - link

    It's pretty much impossible to get a "newbie" explanation of CPU architectures without a least a basic understanding of how CPUs work. Rand's suggestions were quite good, you should start there if you're overwhelmed by Johan's explanations IceWindius. It also wouldn't hurt to start with Anand's CPU articles from last year. Reply
  • Rand - Wednesday, February 09, 2005 - link

    "I wish someone like Arstechinca would make something really built ground up like CPU's for morons so I could start understanding this stuff better."

    You may want to read parts 1-5 of "The Secrets of High Performance CPUs"
    http://www.aceshardware.com/list.jsp?id=4
    A bit outdayed now, as it was written in 99' if I recall correctly but it's still broadly relevant and a nice series of articles if your looking to get a better understanding of microprocessors without being drowned in the technical side of things.

    ArsTechnica also has some good articles with a newbie friendly slant.

    There are some excellent articles at RealWorldTech as well, but their definitely written for engineers rather then the average person.
    Unfortunately most of the more noteable books like those by Hennessy & Patterson assume you've already some knowledge of computer architectures.
    Reply
  • stephenbrooks - Wednesday, February 09, 2005 - link

    #46, Well-written multicore-aware code should have the number of cores as a _variable_, so you just set it to 1 on a uniprocessor platform. I also think there already exists a multithreaded version of one of the big engines (Quake, UT?) that apparently does not lose any performance on a single core either.

    But I agree with the main thrust of your post, which is "Buy AMD".
    Reply
  • Noli - Wednesday, February 09, 2005 - link

    Not to belittle dual core development and I know there are a lot of people who run technical programs that will benefit from dual core on this site, but when I spend a small fortune on a pc, the primary driver is being able to play the most advanced games in the world. Unfortunately, I don't feel multi-threaded game code is going to get written for a longggggg time (what's the point of reducing potential customers?). How long till a very large percentage of users have dual cores? End of 2006 at the very earliest? So it's really a just a theoretical interest till then for me... Reply

Log in

Don't have an account? Sign up now