Magic stone

ADDs need to happen quickly (they are very common), and dependent instructions should not wait long for the results they need. Otherwise, the whole superscalar machine grinds to a halt and is not used efficiently. Intel's engineers killed those two birds with one magic stone: the double-pumped adder combined with a very smart Store-to-Load forwarding mechanism.

Consider this:

Instruction 1: C := D+D
Instruction 2: A := B+C

One instruction (2) needs the result (variable "C") of a previous one (1). In principle, the load of variable "C" for the second instruction cannot happen before the store from the first one has been written back: the x86 ISA demands that stores become visible in program order, and a load must see the most recent store to its address.

So, the second instruction won't have the necessary data before the first instruction has completed almost the entire pipeline (about 30 stages). That means that for roughly 30 clock cycles, other independent instructions have to be found to keep the pipelines busy. Modern CPUs try to offset this problem with Store-to-Load Forwarding. The forwarding mechanism kicks in as soon as the execution engine has calculated the result of the first instruction; it does not wait until all the pipeline stages after execution have completed. The data is "forwarded" to the waiting instruction after a few clock cycles (however long the calculation/execution takes), and well before the result has been written back to the L1 cache. Prescott's Store-to-Load Forwarding handles many store-forward situations that "less intelligent" CPUs cannot forward.
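As a rough illustration (a sketch of my own, not code from the article), this is the store-followed-by-dependent-load pattern at the C level; the volatile qualifier forces the compiler to emit a real store and a real reload of C instead of keeping it in a register, which is exactly the situation where store-to-load forwarding matters.

/* Minimal sketch of the dependency from the example above.  Without forwarding,
 * the reload of C would have to wait until the store reaches the L1 cache;
 * with it, the store buffer hands the value straight to the dependent load. */
#include <stdio.h>

volatile int C;            /* C lives in memory; volatile forces a real store and reload */

int add_chain(int B, int D)
{
    C = D + D;             /* instruction 1: store  C := D + D            */
    return B + C;          /* instruction 2: reload C, compute A := B + C */
}

int main(void)
{
    printf("A = %d\n", add_chain(3, 4));   /* prints A = 11 */
    return 0;
}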

Cool & double pumped

If ADDs need to be executed quickly, and store-to-load forwards must happen very early in the pipeline, you need low-latency ADD ALUs (Arithmetic and Logic Units). Therefore, Intel designed a double-pumped ALU. This very simple and small ALU, which can only perform ADDs, runs at no less than 7.6 GHz (!) on a 3.8 GHz Prescott. The advantage is that ADDs complete very quickly and their results can be forwarded quickly. So, while two double-pumped adders can perform up to four ADD operations per clock cycle, that is not really the point, as the Trace Cache can sustain, at the very best, about 3 ADD operations per clock cycle; the point is quick store-to-load forwarding. Dependencies are resolved a lot faster and cause a lot less trouble.

The fact that the most common instruction (ADD) executes very quickly is an extra bonus.
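A small back-of-the-envelope calculation (my own arithmetic, derived only from the 3.8 GHz clock mentioned above; real forwarding latencies depend on the design) shows what the double pumping buys:

/* Back-of-the-envelope timing for a double-pumped ALU on a 3.8 GHz core.
 * Illustrative only: the numbers follow from the clock speed, nothing more. */
#include <stdio.h>

int main(void)
{
    double core_clock_ghz = 3.8;                  /* Prescott core clock        */
    double alu_clock_ghz  = 2.0 * core_clock_ghz; /* double-pumped: 7.6 GHz     */
    double add_latency_ns = 1.0 / alu_clock_ghz;  /* one fast-ALU cycle, in ns  */

    printf("Effective ALU clock : %.1f GHz\n", alu_clock_ghz);
    printf("ADD latency         : ~%.2f ns (half a core cycle)\n", add_latency_ns);
    printf("Peak ADD throughput : %d per core cycle (two double-pumped adders)\n", 2 * 2);
    return 0;
}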

So, the problem is solved, right? Those 7.6 GHz adders produce a lot of heat, and that is why Prescott failed? Wrong. Back in 2004, Intel published a PDF [6], which explains that those supercharged ALUs use Low Voltage Swing. The PDF is written for experts, but, simplified, we can say that the ALUs achieve incredible clock speeds by using a technique similar to the one that allows, for example, S-ATA to run at high speeds.

Low Voltage Swing means that instead of measuring one voltage against ground, you calculate the voltage by subtracting one voltage line (or rail) from another. The reasoning is that, by using the differential between two voltages (SCSI LVD is based on a similar idea), errors cancel out. For example, if you have an error of +0.3V on a single-ended signal of 0.5V, the resulting 0.8V will no longer be read correctly (it falls outside the limits). But if you have two rails, one at 2V and one at 1.5V (the 0.5V difference being the logical 1), routed very close to each other, both will be affected by the same error, yet 2.3V – 1.8V is still 0.5V. This is simplified, of course, but it should give you an idea.
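A toy numerical sketch (my own code, reusing the made-up voltages from the example above) shows why the differential reading survives a common-mode error while a single-ended reading does not:

/* Toy comparison of single-ended vs. differential (low voltage swing) signalling
 * under a +0.3V common-mode error.  The voltages are the example values above,
 * not real Prescott figures. */
#include <stdio.h>

int main(void)
{
    double error = 0.3;                       /* noise picked up by the wires     */

    /* Single-ended: one wire measured against ground. */
    double single = 0.5 + error;              /* intended 0.5V now reads as 0.8V  */

    /* Differential: two closely routed rails pick up the same error. */
    double rail_high = 2.0 + error;
    double rail_low  = 1.5 + error;
    double diff      = rail_high - rail_low;  /* still 0.5V                       */

    printf("single-ended reading: %.1f V (out of spec)\n", single);
    printf("differential reading: %.1f V (correct)\n", diff);
    return 0;
}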

The end result is that voltage readings are much more precise, so you can get away with much smaller voltage swings. S-ATA, for example, gets away with a voltage swing of about 0.2V (0.6V – 0.4V), while P-ATA needed no less than 3.3V (measured against ground). It is, of course, much easier to change a voltage quickly when the swing is small.

This is most likely very similar to what happens in the double-pumped ALUs. So, while the rest of the core uses something like 1.2 – 1.4V, the double-pumped ALUs work with +/- 0.2V (page 6 of the PDF). Thus, the double-pumped ALUs are one of the coolest spots on the CPU.

Heat?

So, where does all the heat come from? First of all, those complex LVS gates, the building blocks of the double-pumped ALU, need much faster switching transistors than a conventional ALU. Fast-switching transistors leak more power. So while the dynamic power of the double-pumped ALU might be low, leakage power is a serious problem.

Secondly, as the clock speed of the ALUs goes up, the voltage swing must be raised as well. That means that the core voltage (the LVS rails derive their minimum and maximum voltages from the core voltage) rises too, so the rest of the CPU must cope with a higher voltage. A higher voltage means quadratically higher power losses for the rest of the core.
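To put a rough number on "quadratically higher": using the textbook CMOS dynamic power relation P ~ C * V^2 * f (a standard formula, not one taken from Intel's paper), even a modest bump in core voltage is expensive:

/* Quadratic voltage dependence of CMOS dynamic power, P ~ C * V^2 * f.
 * The two voltages are the example core voltages mentioned earlier in this
 * article, not exact Prescott specs; capacitance and frequency are held constant. */
#include <stdio.h>

int main(void)
{
    double v_low = 1.2, v_high = 1.4;                    /* example core voltages (V) */
    double ratio = (v_high * v_high) / (v_low * v_low);  /* (1.4/1.2)^2 ~= 1.36       */

    printf("Raising Vcore from %.1f V to %.1f V multiplies dynamic power by %.2f\n",
           v_low, v_high, ratio);
    return 0;
}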

Thirdly, a CPU that needs to run its core at 4 GHz and its ALUs at 8 GHz is limited by wire delays. The result is many repeaters and extra pipeline stages just to get the signal across the die. More pipeline stages and more repeaters mean more logic and more power.

The fourth problem is 64-bit support. Redesigning the ALUs to handle 64-bit operations without slowing them down must have been an incredibly difficult job. The result is extra logic, which consumes more power.

The fifth problem is, of course, the Branch Prediction Unit, which is much more complex and generates a lot of heat.

All the small tweaks, EM64T, the more complex BPU and ALUs, probably some non-working functionality (dynamic multi-threading), built-in self tests (BIST, which make debugging easier) and the larger caches made sure that Prescott ended up with 125 million transistors, which of course increased leakage even further.



65 Comments


  • Zak - Wednesday, August 22, 2007 - link

    I seem to remember reading somewhere, probably a couple of years ago, about research being done on superconductivity at "normal" temperatures. Right now superconductivity occurs only at extremely low temperatures, right? If materials were developed that achieve the same at normal temperatures, it would solve a lot of these issues, like wire delay and power loss, wouldn't it?

    Z.
  • Tellme - Monday, February 21, 2005 - link

    Carl, what I meant was that soon we might not see much improved performance with multi-cores either, because the data arrives at the processor too late for quick execution. (That is true for single cores as well.)

    Did you check the link?
    Their idea is simple:
    "If you can't bring the memory bandwidth to the processor, then bring the processors to the memory."
    Interesting, no?
    Currently the processor spends most of its time waiting for the data it has to process.

  • carl0ski - Saturday, February 19, 2005 - link

    #61 I thought the P4 already had memory bandwidth problems.
    AMD has a temporary workaround (the on-die memory controller), which helps when multiple CPUs/dies use the same FSB to access the RAM.

    Intel has proposed multiple FSBs, one for each CPU/die.

    Does anyone know if that means they will need separate RAM DIMMs for each FSB? Because that would make for an expensive system.
  • carl0ski - Saturday, February 19, 2005 - link

    [quote]59 - Posted on Feb 12, 2005 at 11:28 AM by fitten Reply
    #57 What was the performance comparison of the 1GHz Athlon vs. the 1GHz P3? IIRC, the Athlon was faster by some margin. If this was the case, then there was a little more than tweaking that went on in the Pentium-M line. Because they started out looking at the P3 doesn't mean that what they ended up with was the P3 with a tweak here or there. :)[/quote]

    #59 Didn't the 1GHz P3 run 133MHz SDRAM on a 133MHz FSB?
    The 1GHz Athlon had a nice DDR 266 FSB to support it.

  • Tellme - Monday, February 14, 2005 - link

    Nice article.

    I think dual cores will soon hit the wall, i.e. memory bandwidth.

    Hopefully memory and processors are integrated in the near future.

    See
    http://www.ee.ualberta.ca/~elliott/cram/

  • ceefka - Monday, February 14, 2005 - link

    Though still a little too technical for me, it makes a good read.

    It's good to know that Intel has eaten their words and realized they had to go back to the drawing board.

    I believe that, rather sooner than later, multi-core will mean 4 - 8 cores providing the power to emulate everything that is not necessarily native, like running Mac OS X on an AMD or Intel box. IOW, the CELL will meet its match.
  • fitten - Saturday, February 12, 2005 - link

    #57 What was the performance comparison of the 1GHz Athlon vs. the 1GHz P3? IIRC, the Athlon was faster by some margin. If this was the case, then there was a little more than tweaking that went on in the Pentium-M line. Because they started out looking at the P3 doesn't mean that what they ended up with was the P3 with a tweak here or there. :)
  • avijay - Friday, February 11, 2005 - link

    EXCELLENT article! One of the very best I've ever read. Nice to see all the references at the end as well. Could someone please point me to Johan's first article at AT? Thanks.
    Great Work!
  • fishbreath - Friday, February 11, 2005 - link

    For those of you who don't actually know this:

    1) The Dothan IS a Pentium 3. It was tweaked by Intel in Israel, but its heart and soul is just a PIII.

    1b) All P4s have hyperthreading in them, and always have had. It was a fuse feature that was not announced until there were applications to support it. But anyone who has HT and Windows XP knows that Windows simply has a smoother 'feel' when running on an HT processor!

    2) Complex array processors are already in the pipeline (no pun intended). However, the lack of an operating system or language to support them demands that they make their first appearance in dedicated applications such as H.264 encoders.
  • blckgrffn - Friday, February 11, 2005 - link

    Yay for Very Large Scale Integration (more than 10,000 transistors per chip)! :) I wonder when the historians will put down in the history books that we have hit the fifth generation of computing org....

