CHAPTER 4: The Pentium 4 crash landing

The Prescott failure

The Pentium 4 "Prescott" is, despite its innovative architecture, a failure. Intel expected to scale this Pentium 4 architecture to 5 GHz, and derivatives of this architecture were supposed to come close to 10 GHz. Instead, the Prescott was only able to reach 3.8 GHz after numerous revisions. And even then, the 3.8 GHz is losing up to 115 Watt, and about 35-50% (depending on the source) is lost to leakage power.

The Prescott project failed, but that doesn't mean the architecture itself was no good. In fact, the philosophy behind the enhanced NetBurst architecture is very innovative, even brilliant. To understand why we say this, let us quickly refresh your memory on the software side of things.

IPC unfriendly software

First, consider that typical code does not allow the CPU to process many instructions in parallel. To give you an idea, we found that video encoding achieves only about 0.6-0.8 instructions per clock cycle (IPC) on modern CPUs. Secondly, note that in typical code almost 20% of the instructions are branches and about 50% are memory operations. In the case of video encoding, you may have less than 10% branches and about 60% memory operations. Most of the instructions that are neither branches nor memory operations are additions ("ADD"s). Some of the memory operations need to use the same units that perform the ADD instructions.

You should also know that many algorithms contain calculations that need the result of a previous one: a dependency. In that case, you cannot issue the second calculation until the first is done.
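To make the dependency problem concrete, here is a minimal C sketch (the function names and the four-way unrolling are our own illustrative assumptions, not taken from the article): in the first loop, every multiplication needs the result of the previous one, so the CPU cannot overlap the multiplications no matter how many execution units it has; in the second loop, the four partial sums are independent of each other, so a superscalar CPU can keep several additions in flight at once and the achieved IPC goes up.

```c
#include <stddef.h>

/* Dependent chain: each multiply needs the previous result,
 * so only one multiply can be started once the previous one finishes. */
double product(const double *x, size_t n)
{
    double p = 1.0;
    for (size_t i = 0; i < n; i++)
        p *= x[i];          /* p depends on the previous value of p */
    return p;
}

/* Independent accumulators: the four partial sums have no dependencies
 * on each other, so several additions can be in flight simultaneously. */
double sum(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* handle the leftover elements */
        s0 += x[i];
    return s0 + s1 + s2 + s3;
}
```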

Most studies show that, realistically, a very sophisticated CPU would be able to reach an IPC of a little more than 2, about twice what CPUs achieve today.

Up close and personal

Now, take a look at the diagram of the Prescott architecture below. Let us see how Prescott attacks the problems mentioned above.


Fig 7. Prescott's architecture.


First of all, you want to make sure that memory operations happen quickly. Therefore, Prescott doubles both the L1 (data only) cache and the L2 cache. It also has two dedicated Address Generation Units (AGUs), one for stores and one for loads.

Built for 4 GHz and more, accesses to main RAM are going to be costly in terms of clock cycles (latency), considering that DDR-II 533 runs at a 266 MHz clock; a trip to main memory can easily cost hundreds of CPU cycles. So, Prescott tries to minimize the damage of waiting for cache misses by increasing Northwood's store buffers from 24 to 32 and doubling the load request buffers. As a result, Prescott can have many cache misses outstanding simultaneously. An intelligent hardware prefetcher is another way to avoid slowdowns due to high memory latency.
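As a rough illustration of why the prefetcher and the extra load/store buffers matter, consider the hypothetical C sketch below (the data structures and function names are our own, purely for illustration): the array walk has a predictable, sequential access pattern that a hardware prefetcher can stream into the caches ahead of time, and several outstanding misses can overlap in the buffers; the linked-list walk only learns the next address after the current load has completed, so each cache miss is paid in full, one after the other.

```c
#include <stddef.h>

/* Sequential access: addresses are predictable, so the hardware
 * prefetcher can fetch upcoming cache lines before they are needed,
 * and multiple outstanding misses can overlap in the load buffers. */
long sum_array(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Pointer chasing: the address of the next node is only known after
 * the current node has been loaded, so cache misses cannot overlap
 * and each one costs the full main-memory latency. */
struct node { long value; struct node *next; };

long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```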

To battle branch misprediction, Prescott's branch predictor has been tuned so that it correctly predicts about 10% of the branches that Northwood mispredicts. That can result in up to 20% better performance! And of course, the trace cache makes sure that a mispredicted branch does not need to restart the decoding stages. As a result, the misprediction penalty is not 39 stages, but 31 stages: the 8 decoding stages do not need to be repeated because, in most cases, the trace cache already holds the decoded instructions.
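The following hypothetical sketch (our own example, not from the article) shows the kind of code where this matters: with random input data, the branch in the first version is mispredicted often, and each miss costs Prescott a pipeline refill of roughly 31 stages; the second version computes the same result without a conditional jump, so there is nothing for the predictor to get wrong.

```c
#include <stddef.h>

/* Branchy version: if 'a[i] < threshold' follows no pattern,
 * the branch is hard to predict and every miss forces a pipeline refill. */
long count_below_branchy(const int *a, size_t n, int threshold)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] < threshold)       /* data-dependent branch */
            count++;
    }
    return count;
}

/* Branchless version: the comparison result (0 or 1) is added directly,
 * so the compiler can emit a compare/set sequence instead of a
 * conditional jump, and there is no branch to mispredict. */
long count_below_branchless(const int *a, size_t n, int threshold)
{
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += (a[i] < threshold);
    return count;
}
```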


Comments

  • AnnoyedGrunt - Tuesday, February 8, 2005 - link

    It's possible that 22 was referring solely to the grammar of the sentence, which could potentially make more sense if it was rewritten as, "while other applications will REQUIRE exponential investments in development....."

    Very good article overall, but some portions could be polished a bit perhaps to make it easier for people only slightly familiar with processor details (people like myself) to understand.

    Really looking forward to part 2!

    -D'oh!
  • JarredWalton - Tuesday, February 8, 2005 - link

    23 - Not at all. Have you ever tried writing multi-threaded code? If it takes 12 months to write and debug a single-threaded program that handles a task, and you try to do the same thing in multi-threaded code, I would expect 24 to 36 months to get everything done properly.

    Let's not even get into the discussion of the fact that not all code really *can* benefit from multi-threadedness. I had a similar conversation with several others in the Dual Core AMD Roadmap article. You can read the comments there for additional insight, I hope:

    http://www.anandtech.com/talkarticle.aspx?i=2303
  • cosmotic - Tuesday, February 8, 2005 - link

    "while the other applications will see exponential investments in development time to achieve the same performance increase." Thats a really stupid statement.
  • cosmotic - Tuesday, February 8, 2005 - link

    That first image really sucks. You should at least make it look decent. It looks like crap now.
  • IceWindius - Tuesday, February 8, 2005 - link

    Math hurts, and thus my head hurts.......


    Either way, Intel finally admits they fucked up and AMD spanked them for it. Justice is served.
  • faboloso112 - Tuesday, February 8, 2005 - link

    only about halfway through the article but this is a damn good article.

    not a fanboi of any sort but i certainly do hate intel's pr team.

    i think the reason amd has done well for itself is because it doesn't pride itself on nor rely on fake product specs and their exaggerated capabilities and scalability...unlike intel...and i'll admit...i got caught up in the hype too with the whole 10ghz thing at the time because based on moore's law and how things had been going w/ the clock speed jumps...i thought one day it would be possible...but look at where the prescott stands now...and look at how instead of blabbing about 10ghz..they talk of multi-core cpu.

    i think ill stop talking now and return to the article...
  • erikvanvelzen - Tuesday, February 8, 2005 - link

    i eat up these sorts of articles about cpus, memory and the like which have references to hardware that i actually use.

    If you like this, check out these articles by John 'Hannibal' Stokes @ arstechnica.com:
    http://arstechnica.com/cpu/index.html
    http://arstechnica.com/articles/paedia/cpu.ars
  • jbond04 - Tuesday, February 8, 2005 - link

    AWESOME article, Johan. Good to see someone do some real research regarding the Prescott processor. Keep up the good work!
  • Oxonium - Tuesday, February 8, 2005 - link

    Johan used to write very good articles for Ace's Hardware. I'm glad to see him writing those same high-quality articles for Anandtech. Keep up the good work!
  • BlackMountainCow - Tuesday, February 8, 2005 - link

    Wow, very interesting read. Finally some stuff based on real facts and not some "Prescott just sux" stuff. Two thumbs up!
