31 Stages: What's this, Baskin Robbins?

Flip back a couple of years and remember the introduction of the Pentium 4 at 1.4 and 1.5GHz. Intel went from the 10-stage pipeline of the Pentium III to a 20-stage pipeline, doubling its length. Initially the Pentium 4 at 1.5GHz had a hard time even outperforming the Pentium III at 1GHz, and in some cases was significantly slower.

Fast forward to today and you wouldn't think twice about picking a Pentium 4 2.4C over a Pentium III 1GHz, but back then the decision was not so clear. Does this sound a lot like our CPU design example from before?

The 0.13-micron Northwood Pentium 4 core looked to have a frequency ceiling of around 3.6 - 3.8GHz without going beyond comfortable yield levels. A 90nm shrink, which is what we thought Prescott was originally going to be, would reduce power consumption and allow for even higher clock speeds - but apparently not high enough for Intel's desires.

Intel took the task of a 90nm shrink and complicated it tremendously by making significant microarchitectural changes to Prescott - extending the basic integer pipeline to 31 stages. The full pipeline for an integer instruction (FP instructions go through even more stages) will be even longer than 31 stages, as that number does not include all of the initial decoding stages of the pipeline. Intel informed us that we should not assume that the initial decoding stages of Prescott (before the first of the 31 stages) are identical to Northwood's; the changes to the pipeline have been extensive.

The purpose of significantly lengthening the pipeline: to increase clock speed. A year ago at IDF Intel announced that Prescott would be scalable to the 4 - 5GHz range; apparently this massive lengthening of the pipeline was necessary to meet those targets.

Lengthening the pipeline does bring about significant challenges for Intel, because if all they did was lengthen the pipeline then Prescott would be significantly slower than Northwood on a clock for clock basis. Remember that it wasn't until Intel ramped the clock speed of the Pentium 4 up beyond 2.4GHz that it was finally a viable competitor to the shorter pipelined Athlon XP. This time around, Intel doesn't have the luxury of introducing a CPU that is outperformed by its predecessor - the Pentium 4 name would be tarnished once more if a 3.4GHz Prescott couldn't even outperform a 2.4GHz Northwood.
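
To make the clock-for-clock argument concrete, here is a minimal sketch of a toy throughput model. It is not Intel's data or methodology: the workload numbers (`BASE_CPI`, `BRANCH_FRACTION`, `MISPREDICT_RATE`) are hypothetical, and it assumes, as a simplification, that every mispredicted branch costs one cycle per pipeline stage.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once branch-mispredict flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


def instructions_per_second(clock_hz, cpi):
    """Throughput of a core running at clock_hz with the given average CPI."""
    return clock_hz / cpi


# Hypothetical workload numbers, chosen only for illustration.
BASE_CPI = 1.0          # cycles per instruction with perfect branch prediction
BRANCH_FRACTION = 0.20  # share of instructions that are branches
MISPREDICT_RATE = 0.05  # share of branches that are mispredicted

# Simplifying assumption: a mispredict costs one cycle per pipeline stage.
cpi_20 = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)
cpi_31 = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 31)

# Compare both pipeline depths at the same (arbitrary) clock speed.
same_clock = 3.2e9
relative_throughput = (
    instructions_per_second(same_clock, cpi_31)
    / instructions_per_second(same_clock, cpi_20)
)

# How much faster the deeper pipeline must be clocked just to match.
break_even_clock_ratio = cpi_31 / cpi_20

print(f"relative throughput at equal clock: {relative_throughput:.3f}")
print(f"clock ratio needed to break even:   {break_even_clock_ratio:.3f}")
```

In this toy model the deeper pipeline only pulls ahead once its clock advantage exceeds `break_even_clock_ratio`; lowering `MISPREDICT_RATE` narrows that gap without touching the clock at all.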

The next several pages will go through some of the architectural enhancements that Intel had to make in order to bring Prescott's performance up to par with Northwood at its introductory clock speed of 3.2GHz. Without the enhancements we're about to discuss, Prescott would have spelled the end of the Pentium 4 for good.

One quick note about Intel's decision to extend the Pentium 4 pipeline - it isn't an easy thing to do. We're not saying it's the best decision, but obviously Intel's engineers felt it was. Unlike GPUs, which are generally designed in Hardware Description Languages (HDLs) and built from pre-designed logic gates and cells, CPUs like the Pentium 4 and Athlon 64 are largely designed by hand. This sort of hand-tuned design is why a Pentium 4, with far fewer pipeline stages, can run at multiple GHz while a Radeon 9800 Pro is limited to a few hundred MHz. It would be impossible to put the amount of design effort that a CPU takes into a GPU and still meet 6-month product cycles.

What is the point of all of this? Despite the conspiracy theorist view on the topic, a 31-stage Prescott pipeline was a calculated move by Intel and not a last-minute resort. Whatever their underlying motives for the move, Prescott's design would have had to be decided on at least 1 - 2 years ago in order to launch today (realistically around 3 years if you're talking about not rushing the design/testing/manufacturing process). "Adding a few more stages" to the Pentium 4 pipeline at the last minute is not possible, simply because it isn't the number of stages that allows you to reach a higher clock speed - it is the fine hand tuning that must go into making sure that your slowest stage is as fast as possible. It's a long and drawn-out process, and both AMD and Intel are quite good at it, but it still takes a significant amount of time. Designing a CPU is very different from designing a GPU. This isn't to say that Intel made the right decision back then; it's just to say that Prescott wasn't a panicked move - it was a calculated one.
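
The point about the slowest stage can be sketched in a few lines. This is an illustrative model only: the stage delays below are made-up picosecond figures, not measurements of any real CPU, and `max_clock_hz` is a hypothetical helper that ignores latch overhead, clock skew and everything else a real design has to budget for.

```python
def max_clock_hz(stage_delays_ps):
    """Highest clock a pipeline can run at: one cycle must fit the slowest stage."""
    slowest_ps = max(stage_delays_ps)
    return 1e12 / slowest_ps  # 1e12 picoseconds per second


# Hypothetical 4-stage pipeline, every stage taking 250 ps.
four_stage = [250, 250, 250, 250]

# Naively "adding a stage": three stages got shorter, but one is still 250 ps.
five_stage_unbalanced = [250, 200, 200, 200, 150]

# The same total work, carefully rebalanced so no stage exceeds 200 ps.
five_stage_balanced = [200, 200, 200, 200, 200]

for name, delays in [
    ("4 stages", four_stage),
    ("5 stages, unbalanced", five_stage_unbalanced),
    ("5 stages, balanced", five_stage_balanced),
]:
    print(f"{name:22s} {max_clock_hz(delays) / 1e9:.2f} GHz")
```

Under these assumptions the unbalanced five-stage design clocks no higher than the four-stage one; only the rebalanced version gains frequency, which is why the tuning work matters more than the stage count itself.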

We'll let the benchmarks and future scalability decide whether it was a good move, but for now let's look at the mammoth task Intel brought upon themselves: making an already long pipeline even longer, and keeping it full.

104 Comments

  • mattsaccount - Sunday, February 1, 2004 - link

    From the HardOCP review: "Certainly moving to watercooling helped us out a great deal. In fact it is hard for us to recommend buying a Prescott and cooling it any other way."
  • eBauer - Sunday, February 1, 2004 - link

    I am curious as to why the UT2k3 botmatch scores dropped on all CPU's... Different map?
  • Pumpkinierre - Sunday, February 1, 2004 - link

    Sorry, erratum on #20: it was the 3.0 Northwood result that is out of kilter with the other CPUs in the SYSmark 2004 data analysis.
  • Pumpkinierre - Sunday, February 1, 2004 - link

    JFK, Vietnam, Nixon, Monica, Bush/Gore, Iraq and now this! - what is going on with the leader of the free world? I hope it overclocks well - that's all that's going for it. Maybe Intel should rethink their multiplier-locked policy. AMD must get in there and profit. I still don't understand why the caches are running at half the latency of Northwood's if they are the same speed and structure? Is it a result of doubling the size at the same associativity?

    Good article - needs re-reading after digestion. Last chart in Sysmark2004 (data analysis) has 3.0 Prescott totally outperformed by 2.8 Prescott and all other CPUs. Looks like a benchmark/typing glitch.
  • yak8998 - Sunday, February 1, 2004 - link

    first the error:
    pg 9 -
    The LDDQU instruction is one Intel is particularly proud of as it helps accelerate video encoding and it is implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel’s developer documentation here.

    No link?

    ===
    "What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended?? "

    I'm going to guess a clean running ~350W or so should suffice for a regular system, but I'm not positive with these monster gfx cards out right now...

    "Any of you know what the cache size on the EE's will be?"

    If you're talking about the Northwood (the P4Cs are still considered Northwoods, no?), it's 1MB I believe.
    (still finishing the article. man i love these in-depth technical articles)
  • Tiorapatea - Sunday, February 1, 2004 - link

    I agree, some info on power consumption please.

    Thanks for the article, by the way.

    I guess we'll have to wait and see how Prescott ramps in speed versus 90nm A64.
  • AgaBooga - Sunday, February 1, 2004 - link

    Much better than the P4's original launch...

    All I want to know now is what AMD is going to do soon... They'll probably counteract Prescott with high clock speeds but when and by how much is what matters.

    Any of you know what the cache size on the EE's will be?

    Also, the final CPU's based on Northwood are kind of like a car with the ratio curves or whatever they're called, but basically after a point of revving, going any higher doesn't give you as much of an increase in speed as it would at a lower rpm increasing the same amount.
  • Cygni - Sunday, February 1, 2004 - link

    AMD's roadmap shows a 4000+ Athlon64 by the end of the year... which is the same as Intel's. They are aware, I'm sure.
  • Stlr22 - Sunday, February 1, 2004 - link

    What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended??
  • HammerFan - Sunday, February 1, 2004 - link

    Things are gonna get hairy in '04 and '05!!! My take is that AMD needs to get their marketing up-to-spec or the high-clocked Prescotts are gonna run the show.

    I have a question for Derek and Anand: What kind of temps does the prescott run at? what type of cooler does it have? (there's nothing there to support or refute claims that the prescott is one hot potato)
