CHAPTER 4: The Pentium 4 crash landing

The Prescott failure

The Pentium 4 "Prescott" is, despite its innovative architecture, a failure. Intel expected to scale this Pentium 4 architecture to 5 GHz, and derivatives of it were supposed to come close to 10 GHz. Instead, the Prescott only reached 3.8 GHz after numerous revisions. And even then, the 3.8 GHz part dissipates up to 115 W, of which roughly 35-50% (depending on the source) is lost to leakage power.

The Prescott project failed, but that does not mean that the architecture itself was no good. In fact, the philosophy behind the enhanced Netburst architecture is very innovative and even brilliant. To understand why we say this, let us quickly refresh your memory on the software side of things.

IPC-unfriendly software

First, consider that typical code does not allow the CPU to process many instructions in parallel. To give you an idea, we found that video encoding achieves only about 0.6-0.8 instructions per clock cycle (IPC) on modern CPUs. Secondly, note that almost 20% of the instructions are branches, and about 50% are memory operations. In the case of video encoding, you may have fewer than 10% branches and about 60% memory operations. Most of the instructions that are neither branches nor memory operations are additions ("ADDs"), and some of the memory operations need to use the same units that execute those ADDs.
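
To make that instruction mix concrete, here is a minimal C sketch of the kind of inner loop that dominates video encoding: a hypothetical 16-pixel sum-of-absolute-differences kernel (sad_16 is our own illustrative name, not code from a real encoder). Each iteration performs two loads and a handful of ADD/SUB-class operations, but only one highly predictable branch, which is roughly the load- and ADD-heavy, branch-poor mix described above.

    #include <stdlib.h>

    /* Hypothetical motion-estimation inner loop (sum of absolute differences).
     * Per iteration: two loads, a subtract, an abs and an add, plus a single
     * loop branch, so memory operations and ADD-class work dominate.          */
    int sad_16(const unsigned char *cur, const unsigned char *ref)
    {
        int sum = 0;
        for (int i = 0; i < 16; i++) {   /* one, highly predictable, branch */
            int d = cur[i] - ref[i];     /* two loads + one subtract        */
            sum += abs(d);               /* ADD-class work, no memory op    */
        }
        return sum;
    }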

You should also know that many algorithms contain calculations that need the result of a previous one: a dependency. In that case, you cannot issue the second calculation until the first one is done.
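
A purely illustrative C sketch of such a dependency chain (not taken from any real workload): in chained(), every multiply needs the result of the previous one, so even the widest out-of-order core must execute them one after another, while independent() offers three multiplies with no mutual dependencies that can be issued side by side.

    /* Illustrative only: a serial dependency chain versus independent work. */
    double chained(double x)
    {
        double a = x * 1.1;   /* must complete before the next line starts */
        double b = a * 1.2;   /* depends on a */
        double c = b * 1.3;   /* depends on b */
        return c;
    }

    double independent(double x, double y, double z)
    {
        /* Three independent multiplies: the out-of-order core can execute
         * them in parallel, so IPC is limited by execution units instead. */
        return (x * 1.1) + (y * 1.2) + (z * 1.3);
    }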

Most studies show that, realistically, even a very sophisticated CPU would reach an IPC of only a little more than 2, about twice as much as today's CPUs.

Up close and personal

Now, take a look at the diagram of the Prescott architecture below, and let us see how Prescott tackles the problems mentioned above.


Fig 7. Prescott's architecture.


First of all, you want memory operations to happen quickly. Therefore, Prescott doubles the L1 (data only) cache and the L2 cache of Northwood. It also has two dedicated Address Generation Units (AGUs), one for loads and one for stores.

Built for 4 GHz and more, Prescott has to cope with main-RAM accesses that are very costly in clock cycles (latency), considering that DDR-II 533 runs on a 266 MHz clock. So, Prescott tries to minimize the damage of waiting for cache misses by increasing the number of store buffers from Northwood's 24 to 32 and by doubling the load request buffers. As a result, Prescott can have many cache misses outstanding simultaneously. An intelligent hardware prefetcher is another way to avoid slowdowns caused by high memory latency.
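
To get a feel for the cost, here is a back-of-the-envelope C sketch with assumed numbers: a 3.8 GHz core, DDR-II 533's 266 MHz command clock, and an effective main-memory latency of roughly 50 ns (the 50 ns figure is our assumption for illustration, not an Intel specification).

    #include <stdio.h>

    /* Back-of-the-envelope: what a trip to main RAM costs in core clocks.
     * The 50 ns effective latency is an assumed, illustrative figure.     */
    int main(void)
    {
        const double core_ghz     = 3.8;          /* Prescott's top clock          */
        const double mem_clock_ns = 1e3 / 266.0;  /* one DDR-II 533 clock, ~3.8 ns */
        const double miss_ns      = 50.0;         /* assumed RAM access latency    */

        printf("one memory clock ~ %.0f core cycles\n", mem_clock_ns * core_ghz);
        printf("one cache miss   ~ %.0f core cycles\n", miss_ns * core_ghz);
        return 0;
    }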

To battle branch misprediction, the Prescott branch predictor has been tuned so that it correctly predicts about 10% of the branches that Northwood mispredicts, which can result in up to 20% better performance. And of course, the trace cache makes sure that a mispredicted branch does not need to restart the decoding stages: the misprediction penalty is therefore not 39 stages, but 31. The 8 stages of decoding do not need to happen again because, in most cases, the trace cache already holds the decoded instructions.
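
To see what the shorter penalty buys, here is a rough, illustrative C calculation with assumed rates: the 20% branch share comes from the mix discussed earlier, while the 5% misprediction rate and the 1.4 base CPI are our own assumptions, not measured Prescott data.

    #include <stdio.h>

    /* Illustrative only: how the misprediction penalty feeds into average
     * cycles per instruction (CPI). All rates below are assumed numbers.  */
    int main(void)
    {
        const double branch_fraction = 0.20;  /* ~20% of instructions are branches */
        const double mispredict_rate = 0.05;  /* assumed predictor miss rate       */
        const double base_cpi        = 1.4;   /* assumed CPI without branch stalls */

        for (int penalty = 31; penalty <= 39; penalty += 8) {
            double cpi = base_cpi + branch_fraction * mispredict_rate * penalty;
            printf("penalty %d stages -> CPI %.2f (IPC %.2f)\n",
                   penalty, cpi, 1.0 / cpi);
        }
        return 0;
    }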



