Magic stone

ADDs need to happen quickly (they are very common) and dependent instructions should not wait long for the results that they need. Otherwise, the whole superscalar machine comes to a grinding halt and is not used efficiently. Intel's engineers killed those two birds with one magic stone: the double-pumped adder combined with a very smart Store-to-Load forwarding mechanism.

Consider this:

Instruction 1: C := D+D
Instruction 2: A := B+C

One instruction (2) needs the result (variable "C") of a previous one (1). In theory, the load of "C" for the second instruction cannot happen before the store of the first one has been written back: the x86 ISA demands that stores happen in program order.

So, the second instruction won't have the necessary data before the first instruction has completed almost the entire pipeline (about 30 stages). That means that for roughly 30 clock cycles, other independent instructions need to be found to keep the pipelines working. Modern CPUs try to offset this problem with Store-to-Load Forwarding. The forwarding mechanism kicks in as soon as the execution engine has calculated the result of the first instruction, and doesn't wait until all pipeline stages after execution are done. The data is "forwarded" to the waiting instruction after a few clock cycles (as long as the calculation/execution takes), and way before the result has been written back to the L1 cache. Prescott's Store-to-Load Forwarding handles many store-forward situations that are not forwarded by other, "less intelligent" CPUs.
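
To make this concrete at the code level, here is a minimal C sketch (purely illustrative, reusing the variable names from the example above) in which the compiler is forced to emit a real store followed by a dependent load of the same location, exactly the pattern that Store-to-Load Forwarding accelerates:

#include <stdio.h>

int main(void)
{
    /* "volatile" forces the compiler to really store c to memory and load it
       back, so the dependent store/load pair actually shows up in the
       generated code (illustration only, not how you would normally write it). */
    volatile int c;
    int d = 21, b = 10, a;

    c = d + d;   /* instruction 1: ADD, then store the result to c's memory slot */
    a = b + c;   /* instruction 2: load c again and ADD; without forwarding this
                    load would have to wait until the store reaches the L1 cache */

    printf("a = %d\n", a);   /* prints 52 */
    return 0;
}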

Cool & double pumped

If ADDs need to be executed fast, and store-to-load forwards must happen very early in the pipeline, you need low-latency ADD ALUs (Arithmetic and Logic Units). Therefore, Intel designed a double-pumped ALU. This very simple and small ALU, which can only perform ADDs, runs at no less than 7.6 GHz (!) on a 3.8 GHz Prescott. The advantage is that ADDs are done very quickly and the results can be forwarded quickly. So, while two double-pumped adders can perform up to four ADD operations per clock cycle, that is not really the point, as the Trace cache can sustain, at the very best, about 3 ADD operations per clock cycle; the point is quick store-to-load forwarding. Dependencies are resolved a lot faster, and cause a lot less trouble.
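
A quick back-of-the-envelope sketch in C (using the rounded figures quoted above; this is not a model of the actual hardware) shows why raw throughput is not the point:

#include <stdio.h>

int main(void)
{
    /* Back-of-the-envelope figures quoted in the text (assumed, rounded). */
    const double core_clock_ghz   = 3.8;                    /* Prescott core clock      */
    const double alu_clock_ghz    = 2.0 * core_clock_ghz;   /* double-pumped: 7.6 GHz   */
    const int    fast_alus        = 2;                      /* two double-pumped ALUs   */
    const double trace_cache_uops = 3.0;                    /* sustained uops per cycle */

    double peak_adds_per_cycle = fast_alus * 2.0;           /* 2 ADDs per core cycle each */

    printf("ALU clock            : %.1f GHz\n", alu_clock_ghz);
    printf("Peak ADDs per cycle  : %.1f\n", peak_adds_per_cycle);
    printf("Trace cache can feed : %.1f uops per cycle\n", trace_cache_uops);
    /* The front end cannot keep 4 ADDs per cycle busy, so the real win of the
       fast ALUs is latency (early store-to-load forwarding), not throughput.  */
    return 0;
}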

The fact that the most common instruction (ADD) executes very quickly is an extra bonus.

So, the problem is solved, right? Those 7.6 GHz ADDers produce a lot of heat and that is why Prescott failed? Wrong. Back in 2004, Intel published a PDF [6], which explains that those supercharged ALUs use Low Voltage Swing. The PDF is written for experts, but, simplified, we can say that the ALUs achieve incredible clock speeds by using a technique similar to the one that allows, for example, S-ATA to run at high speeds.

Low Voltage Swing means that instead of using one voltage measuring point (and the ground), you calculate the voltage by subtracting one voltage line (or rail) from another one. The reasoning is that by using the differential between two voltages (SCSI LVD might work in a similar way), errors are cancelled out. For example, if you have an error of +0.3V and your core voltage is 0.5V, the voltage (0.8V) will not be read correctly (not within limits). But if you have two rails/lines, one at 2V and one at 1.5V (the 0.5V difference is the logical 1), running very close to each other, both will be affected by the error, and 2.3V – 1.8V is still 0.5V. This is simplified of course, but it should give you an idea.
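
The same arithmetic in a tiny C sketch, using the numbers from the example above (the functions are purely illustrative, not a model of the actual circuit):

#include <stdio.h>

/* Single-ended read: the signal is referenced to ground, so any
   common-mode error lands directly on the measured value.        */
static double single_ended(double line_v, double error_v)
{
    return line_v + error_v;
}

/* Differential (low voltage swing) read: both rails pick up the same
   error, and the subtraction cancels it out.                          */
static double differential(double rail_hi_v, double rail_lo_v, double error_v)
{
    return (rail_hi_v + error_v) - (rail_lo_v + error_v);
}

int main(void)
{
    const double error = 0.3;   /* the +0.3V error from the example */

    printf("single-ended 0.5V signal reads as %.1fV\n",
           single_ended(0.5, error));          /* 0.8V: out of spec */
    printf("differential 2.0V/1.5V pair reads as %.1fV\n",
           differential(2.0, 1.5, error));     /* still 0.5V        */
    return 0;
}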

The end result is that voltage readings are much more precise and you can get away with much smaller voltage swings. S-ATA, for example, gets away with a voltage swing of about 0.2V (0.6V – 0.4V), while P-ATA needed no less than 3.3V (referenced to ground). You can understand that it is much easier to make a voltage change quickly if the voltage swing is low.

This is most likely very similar to what happens in the double-pumped ALUs. So, while the rest of the core uses something like 1.2 – 1.4V, the double-pumped ALUs work with +/- 0.2V (page 6 of the PDF). Thus, the double-pumped ALUs are one of the coolest spots on the CPU.

Heat?

So, where does all the heat come from? First of all, those complex LVS gates, which are the building blocks of the double-pumped ALU, need very fast-switching transistors compared to a conventional ALU. Fast-switching transistors leak more power. So while the dynamic power of the double-pumped ALU might be low, leakage power is a serious problem.

Secondly, as the clock speed of the ALUs goes up, the voltage swing must also be raised. That means that the core voltage (the LVS rails are derived from the core voltage) rises too, and the rest of the CPU must cope with that higher voltage. Higher voltage means quadratically higher power losses for the rest of the core.
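
The "quadratically" part follows from the classic first-order switching power model, P = C * V^2 * f. A small C sketch with assumed core voltages (1.2V versus 1.4V, not Intel's exact figures) shows how costly even a modest Vcore bump is:

#include <stdio.h>

/* Classic first-order model for dynamic (switching) power: P = C * V^2 * f.
   The effective capacitance C is arbitrary here; only the scaling matters. */
static double dynamic_power(double c_eff, double voltage, double freq)
{
    return c_eff * voltage * voltage * freq;
}

int main(void)
{
    const double c_eff = 1.0;      /* arbitrary units        */
    const double freq  = 3.8e9;    /* 3.8 GHz, held constant */

    double p_low  = dynamic_power(c_eff, 1.2, freq);   /* assumed low Vcore  */
    double p_high = dynamic_power(c_eff, 1.4, freq);   /* assumed high Vcore */

    printf("1.2V -> 1.4V costs %.0f%% more dynamic power\n",
           (p_high / p_low - 1.0) * 100.0);    /* ~36% more for a ~17% voltage bump */
    return 0;
}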

Thirdly, a CPU that needs to run at 4 GHz, with ALUs at 8 GHz, is limited by wire delays. The result is many repeaters and extra pipeline stages just to get the signal across the die. More pipeline stages and more repeaters mean more logic and more power.

The fourth problem is 64-bit support. It must have been an incredibly difficult job to redesign the ALUs to handle 64-bit operations without slowing them down. The result is extra logic, which consumes more power.

The fifth problem is, of course, the Branch Prediction Unit, which is much more complex and generates a lot of heat.

All the small tweaks, EM64T, the more complex BPU and ALUs, probably some non-working functionality (Dynamic multi-threading), built-in self tests (BIST, which make debugging easier) and the larger caches pushed Prescott to 125 million transistors, which of course increased leakage as well.


Comments

  • WhoBeDaPlaya - Thursday, February 10, 2005 - link

    Ain't no way you can get those repeaters out of there - that's already the optimum solution for driving the large load (interconnect). It probably equalizes the stage effort required (you can work out the math and find that for multi-stage logic, the optimal config is that each stage has the exact same effort level). Eg. instead of driving an interconnect with a "unit" inverter, it might be more feasible to drive it with a chain of them, each with different fan in/out. Repeater insertion is tricky and (as far as I know) can't readily be automated.

    Interconnects are getting to the point where traversal of a die diagonally can take multiple clock cycles. Some folks are suggesting that a pipelined approach could be extended to interconnects, esp. clock trees. But the most fun problem (for me at least :P) is the handling of inductance extraction - how in the h*ll do you model it accurately? High-speed digital design == Analog design. Long live analog / mixed-signal VLSI designers :P
  • fitten - Thursday, February 10, 2005 - link

    [quote]Well-written multicore-aware code should have the number of cores as a _variable_, so you just set it to 1 on a uniprocessor platform.[/quote]

    Sometimes parallel algorithms aren't very good for serial execution. In these cases, you may actually have one algorithm for multiple processors and another algorithm for a single processor.

    [quote]So, if Intel were to use less repeaters the heat output could be lowered significantly. [/quote]

    Well... I'm sure the Intel engineers didn't just up-and-say one day, "Hey, I know something cool to do... let's put some more repeaters into the core." I'm sure there's a reason for them being in there. It would probably take a bit of redesign to get the repeaters out. (I'm pretty sure this is what you meant, but I just wanted to clarify that stuff like repeaters aren't just put into a CPU for no reason. Things like repeaters are put in because there wasn't a more viable solution to some signalling problem that's there.)
  • sphinx - Thursday, February 10, 2005 - link

    So, the reason for the Prescott's shortcomings is the use of too many repeaters as shown in the image of the Itanium 2. If I remember correctly, the article said that the repeaters were using too much power as well. So, if Intel were to use less repeaters the heat output could be lowered significantly.
  • AtaStrumf - Thursday, February 10, 2005 - link

    Nice article and pretty easy to understand as well. I'm happy to hear that there may still be hope for controlling the power leakage, because without it I just can't see anybody getting beyond 65 nm, since even 65 nm will, without improvements, leak almost 3 times as much power as 90nm does now.

    Anxiously waiting for E0 A64 to see what AMD has managed to cook up.
  • mickyb - Wednesday, February 9, 2005 - link

    There are plenty of multi-threaded apps out there. I am not sure pure single threaded apps exist any more outside of "Hello World" and some old Cobol/FORTRAN ports that are on floppy.

    Quake and UT have been multi-threaded for a while. Quake was multi-threaded when I had a dual Pentium pro. There were even benchmarks. The benefits seen with hyper-threading also show that many apps are multi-threaded. The performance gain was negligible due to the graphics drivers and OpenGL/DirectX not being thread optimized. I am sure that has been worked out by now.

    Multi-threading is not all about making use of multiple CPUs. There are many conditions where a program would be stopped dead in its tracks waiting for a response from some outside program or hardware device. You can solve this with events, multi-process, multi-threading, call-backs, etc. Goal wise, they are related. In the Winders world, threading is the method of choice.

    I really can't believe there are still arguments going on about programs not being multi-threaded. This is not that much of an issue any more. Even if your app is not threaded, the OS is, and it can run on one CPU while your app runs on the other. Or if you have 2 apps, then they can run on different CPUs.

    With all that said, I agree with the thought that creating performance for all applications is better served using a faster single core CPU than dual CPUs. I think this way because when you have a unit of work to be done (even with multiple threads), it is more likely to be done quicker with a single CPU that is capable of the same computing power as 2 CPUs. A single unit of work will ultimately be smaller than a thread in all cases. The smallest is the instruction set.

    Now...with that said, if the limiting factor is technology and they cannot obtain the equivalent performance of a dual core with a single core, then it makes sense to go dual core to obtain it, especially with the power leakage. I like the thinking behind dual core on a laptop, but am skeptical about the part that says turning the CPU off and on rapidly to keep it cool and efficient. It will probably work if it isn't turned on and off too quickly, but heat spreads pretty quickly. You wouldn't even get past POST without a heat-sink and that silicon insulator keeps everything pretty cozy.
  • NegativeEntropy - Wednesday, February 9, 2005 - link

    Johan, another excellent article, I'm looking forward to part 2.
  • Evan Lieb - Wednesday, February 9, 2005 - link

    It's pretty much impossible to get a "newbie" explanation of CPU architectures without at least a basic understanding of how CPUs work. Rand's suggestions were quite good, you should start there if you're overwhelmed by Johan's explanations, IceWindius. It also wouldn't hurt to start with Anand's CPU articles from last year.
  • Rand - Wednesday, February 9, 2005 - link

    "I wish someone like Arstechinca would make something really built ground up like CPU's for morons so I could start understanding this stuff better."

    You may want to read parts 1-5 of "The Secrets of High Performance CPUs"
    http://www.aceshardware.com/list.jsp?id=4
    A bit outdated now, as it was written in '99 if I recall correctly, but it's still broadly relevant and a nice series of articles if you're looking to get a better understanding of microprocessors without being drowned in the technical side of things.

    ArsTechnica also has some good articles with a newbie friendly slant.

    There are some excellent articles at RealWorldTech as well, but they're definitely written for engineers rather than the average person.
    Unfortunately, most of the more notable books, like those by Hennessy & Patterson, assume you've already got some knowledge of computer architectures.
  • stephenbrooks - Wednesday, February 9, 2005 - link

    #46, Well-written multicore-aware code should have the number of cores as a _variable_, so you just set it to 1 on a uniprocessor platform. I also think there already exists a multithreaded version of one of the big engines (Quake, UT?) that apparently does not lose any performance on a single core either.

    But I agree with the main thrust of your post, which is "Buy AMD".
  • Noli - Wednesday, February 9, 2005 - link

    Not to belittle dual core development and I know there are a lot of people who run technical programs that will benefit from dual core on this site, but when I spend a small fortune on a pc, the primary driver is being able to play the most advanced games in the world. Unfortunately, I don't feel multi-threaded game code is going to get written for a longggggg time (what's the point of reducing potential customers?). How long till a very large percentage of users have dual cores? End of 2006 at the very earliest? So it's really just a theoretical interest till then for me...
