Magic stone

ADDs need to execute quickly (they are the most common instruction), and dependent instructions should not have to wait long for the results they need. Otherwise, the whole superscalar machine grinds to a halt and is used inefficiently. Intel's engineers killed those two birds with one magic stone: the double-pumped adder, combined with a very smart store-to-load forwarding mechanism.

Consider this:

Instruction 1: C := D+D
Instruction 2: A := B+C

One instruction (2) needs the result (variable "C") of a previous one (1). In theory, the load of the operands of the second instruction cannot happen before the store of the first one has been written back, because the x86 ISA demands that stores happen in program order.

So, the second instruction won't have the necessary data until the first instruction has travelled through almost the entire pipeline (about 30 stages). That means that for roughly 30 clock pulses, other independent instructions need to be found to keep the pipelines busy. Modern CPUs try to offset this problem with store-to-load forwarding. The forwarding mechanism kicks in as soon as the execution engine has calculated the result of the first instruction; it does not wait until all the pipeline stages after execution are done. The data is "forwarded" to the waiting instruction after a few clock cycles (as long as the calculation/execution takes), well before the result has been written back to the L1 cache. Prescott's store-to-load forwarding handles many store-forward situations that are not forwarded by other, "less intelligent" CPUs.
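
To make the dependency concrete, here is a minimal C sketch (my own illustration, not from the article) of the store-followed-by-load pattern that the forwarding mechanism accelerates:

    #include <stdio.h>

    int main(void)
    {
        volatile int c;          /* force a real store and a real load       */
        int b = 20, d = 5;

        c = d + d;               /* instruction 1: result is stored          */
        int a = b + c;           /* instruction 2: must load C before adding */

        /* Without forwarding, the load of C would wait until the store has  */
        /* travelled far down the pipeline; with forwarding, the stored      */
        /* value is handed straight to the dependent load a few cycles       */
        /* after it is computed.                                             */
        printf("a = %d\n", a);   /* prints a = 30                            */
        return 0;
    }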

Cool & double pumped

If ADDs need to be executed fast, and store-to-load forwards must happen very early in the pipeline, you need low-latency ADD ALUs (Arithmetic Logic Units). Therefore, Intel designed a double-pumped ALU. This very simple and small ALU, which can only perform ADDs, runs at no less than 7.6 GHz (!) on a 3.8 GHz Prescott. The advantage is that ADDs complete very quickly, so their results can be forwarded quickly. And while two double-pumped adders can perform up to four ADD operations per clock cycle, that is not really the point, as the trace cache can sustain, at the very best, about 3 ADD operations per clock cycle; the point is quick store-to-load forwarding. Dependencies are resolved a lot faster and cause a lot less trouble.
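
A quick back-of-the-envelope calculation, sketched below in C with the article's numbers (the layout is my own), makes the point that raw ADD throughput is not the bottleneck:

    #include <stdio.h>

    int main(void)
    {
        double core_clock_ghz = 3.8;                    /* Prescott core clock   */
        double alu_clock_ghz  = 2.0 * core_clock_ghz;   /* double pumped: 7.6 GHz */
        int    adders         = 2;
        double peak_adds      = adders * 2.0;           /* 4 ADDs per core clock  */
        double trace_cache    = 3.0;                    /* ~3 uops/clock sustained */

        printf("ALU clock: %.1f GHz\n", alu_clock_ghz);
        printf("Peak ADD throughput: %.0f per core clock\n", peak_adds);
        printf("Trace cache feeds only ~%.0f uops per clock,\n"
               "so the real win is low forwarding latency, not throughput.\n",
               trace_cache);
        return 0;
    }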

The fact that the most common instruction, the ADD, executes very quickly is an extra bonus.

So, the problem is solved, right? Those 7.6 GHz adders produce a lot of heat, and that is why Prescott failed? Wrong. Back in 2004, Intel published a PDF [6], which explains that those supercharged ALUs use Low Voltage Swing (LVS). The PDF is written for experts, but simplified, we can say that the ALUs achieve incredible clock speeds by using a technique similar to the one that allows, for example, S-ATA to run at high speeds.

Low Voltage Swing means that instead of measuring one voltage against ground, you derive the signal by subtracting one voltage line (or rail) from another. The reasoning is that by using the differential between two voltages (SCSI LVD works in a similar way), errors are cancelled out. For example, if you have an error of +0.3V and your signal voltage is 0.5V, the voltage (0.8V) will not be read correctly (it falls outside the limits). But if you have two rails very close to each other, one at 2V and one at 1.5V (the 0.5V difference being a logical 1), both will be affected by the same error, and 2.3V - 1.8V is still 0.5V. This is simplified, of course, but it should give you an idea.
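
As a quick sanity check of the arithmetic above, here is a small C sketch (using the article's illustrative numbers, not real Prescott rail voltages) showing how a common error cancels out in the differential reading:

    #include <stdio.h>

    int main(void)
    {
        double error = 0.3;                 /* same noise hits every line        */

        /* Single-ended: 0.5 V signal measured against ground */
        double single_ended = 0.5 + error;  /* reads 0.8 V -> out of spec        */

        /* Differential: the noise affects both rails and cancels out */
        double rail_high    = 2.0 + error;  /* 2.3 V                             */
        double rail_low     = 1.5 + error;  /* 1.8 V                             */
        double differential = rail_high - rail_low;   /* still 0.5 V             */

        printf("single-ended reading: %.1f V (wrong)\n", single_ended);
        printf("differential reading: %.1f V (correct)\n", differential);
        return 0;
    }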

The end result is that voltage readings are much more precise, and you can get away with much smaller voltage swings. S-ATA, for example, gets away with a voltage swing of about 0.2V (0.6V - 0.4V), while P-ATA needed no less than 3.3V (measured against ground). You can understand that it is much easier to make a voltage change quickly if the voltage swing is low.

This is most likely very similar to what happens in the double-pumped ALUs. So, while the rest of the core uses something like 1.2 – 1.4V, the double-pumped ALUs work with +/- 0.2V (page 6 of the PDF). Thus, the double-pumped ALUs are one of the coolest spots on the CPU.

Heat?

So, where does all the heat come from? First of all, those complex LVS gates, which are the building blocks of the double-pumped ALU, need much faster-switching transistors than a conventional ALU. Fast-switching transistors leak more power. So while the dynamic power of the double-pumped ALU might be low, leakage power is a serious problem.

Secondly, as the clock speed of the ALUs goes up, the voltage swing must also be raised. That means that the core voltage (from which the LVS derives its minimum and maximum voltages) rises too, which means that the rest of the CPU must cope with a higher voltage. Higher voltage means quadratically higher dynamic power losses for the rest of the core.
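
For reference, dynamic switching power scales roughly with C * V^2 * f. The small C sketch below (with made-up, normalised numbers of my own) shows why even a modest bump in core voltage hurts:

    #include <stdio.h>

    int main(void)
    {
        double c_eff = 1.0, f = 1.0;        /* normalised capacitance and clock */
        double v_low = 1.2, v_high = 1.4;   /* example core voltages            */

        double p_low  = c_eff * v_low  * v_low  * f;
        double p_high = c_eff * v_high * v_high * f;

        /* prints roughly +36% */
        printf("Raising Vcore from %.1f V to %.1f V raises dynamic power by %.0f%%\n",
               v_low, v_high, (p_high / p_low - 1.0) * 100.0);
        return 0;
    }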

Thirdly, a CPU with parts that need to run at 4 GHz and 8 GHz is limited by wire delays. The result is many repeaters and extra pipeline stages just to get signals across the die. More pipeline stages and more repeaters mean more logic and more power.

The fourth problem is 64-bit support. It must have been an incredibly difficult job to redesign the ALUs to handle 64-bit operations without slowing them down. The result is extra logic, which consumes more power.

The fifth problem is, of course, the Branch Prediction Unit, which is much more complex and generates a lot of heat.

All the small tweaks, EM64T, the more complex BPU and ALUs, probably some non-working functionality (dynamic multi-threading), the built-in self tests (BIST, which make debugging easier) and the larger caches pushed Prescott to 125 million transistors, which of course increased leakage even further.


Comments

  • Momental - Wednesday, February 9, 2005 - link

    #41, I understood what he meant when he stated that AMD could only be so lucky to have something which was a technological failure, i.e. Prescott, sell as well as it has. Even the article clearly summarizes that Prescott in and of itself isn't a piece of junk per se, only that it doesn't have the room for evolution that Intel originally had hoped for.

    #36 wasn't saying that it was a flop sales-wise, quite the contrary. The thing has sold like hotcakes!

    I, like many others here, literally got dizzy as I struggled to keep up with all of the technical terminology and mathematical formulas. My brain is, as of this moment, threatening to strike if I don't get it a better health and retirement plan along with a shorter work week. ;)
  • Ivo - Wednesday, February 9, 2005 - link

    1. About the multiprocessing: Of course, there are many (important!) applications which are more than satisfied with the existing mono-CPU performance. Some others will benefit from dual CPUs. Matrix 2CPU+2GPU combinations could be essential, e.g. for stereo-visualization. Probably, desktop machines with enhanced voice/image analytical capabilities could require even more sophisticated CPU matrices. I suppose the mono- and multi-CPU solutions will coexist in the near future.

    2. About the leakage problem: New materials like SOI are part of the solution. Another part is new techniques. Let us take a lesson from nature: our blood-transportation system consists of tiny capillaries and much thicker arteries. Maybe it could make sense to combine 65 nm transistors in, e.g., the cache memory with 90 nm transistors in the ALU?
  • Noli - Wednesday, February 9, 2005 - link

    "Netburst architecture is very innovative and even genial"

    genius-like?
    If by genial you mean 'having a pleasant or friendly disposition', it sounds weird. It can mean 'conducive to growth' in this context but that's not so intuitive because a) it wasn't and b) at best it was only theoretically genial.

    Presumably it's not genial as in 'of or relating to the chin' :)

    Agree monolithic was confusing but it was the intel dude who said it - I thought it meant 'large single unit' rather than 'old (as in technology)' as in: increasing processing power by increasing the size and complexity of a single core is now not as efficient as strapping two cores together - a duallithic unit :)

    Sorry to be a pedantic twat.
  • Xentropy - Wednesday, February 9, 2005 - link

    Some of the verbiage in that final chapter makes me wonder how much better Prescott might have done if Intel had just left out everything 64-bit and developed an entirely different processor for 64-bit. Especially since we won't have a mainstream OS that'll even utilize those instructions for another few months, and it's already been about a year since release, they could have easily gotten away with putting 64-bit off for the next project. It's pretty obvious by now even the 32-bit Prescotts have those 64-bit transistors sitting around. Even if not active, they aren't exactly contributing to the power efficiency of the processor.

    I think one big reason Intel thinks dual core will be the savior of even the Prescott line is supposedly dual cores running at 3GHz only require equivalent power draw to a single core at 3.6GHz and should be just as fast in some situations (multitasking, at least). Dual core at 85% clockspeed will be slower for gaming, though, so dual core Prescott still won't close the gap with AMD for gaming enthusiasts (98% of this site's readership), and may even represent an even further drop in performance per watt. Here's hoping for Pentium-M on the desktop. :>
  • piroroadkill - Wednesday, February 9, 2005 - link

    #36 -- You really didn't read the article and get the point of it. It wasn't a failure from a sales point of view, and this article was not written from a sales point of view, but a technical point of view, and how the Prescott helped in furthering CPU technology.

    Thus, a failure.
  • ViRGE - Wednesday, February 9, 2005 - link

    Although I think I sank more than I swam, that was a very good and informative article Johan. I just have one request for a future article since I'm guessing the next one is on multi-core tech: will someone at AT run the full AT benchmark suite against a SMP Xeon machine so that we can get a good idea ahead of time what dual-core performance will be like against single core? My understanding is that the Smithfields aren't going to be doing much else new besides putting 2 cores on one die (i.e. no cache sharing or other new tech), so SMP benchmarks should be fairly close to dual-core benchmarks.
  • Griswold - Wednesday, February 9, 2005 - link

    Case in point as to why the marketing department is the most important (and powerful) part of any highly successful company. It's not the R&D labs who tell you what works and what comes next, it's the PR team.
  • quidpro - Wednesday, February 9, 2005 - link

    Someone needs to make a new Tron movie so I can understand this better.
  • tore - Wednesday, February 9, 2005 - link

    Great article. On page 3 you talk about a BJT transistor with a base, collector and emitter; since all modern CPUs use MOSFETs, shouldn't you talk about a MOSFET with a gate, source and drain?
  • Questar - Wednesday, February 9, 2005 - link

    "The Pentium 4 "Prescott" is, despite its innovative architecture, a failure."


    AMD wishes they had a "failure" that sold like Prescott.

