Pipelining: 101

It seems like every time Intel releases a new processor, we have to revisit the topic of pipelining to help explain why a 3GHz P4 performs like a 2GHz Athlon 64. With a 55% longer pipeline than Northwood, Prescott forces us to revisit this age-old topic once again.

You've heard it countless times before: pipelining is to a CPU as the assembly line is to a car plant. A CPU's pipeline is not a physical pipe that data goes into and comes out the end of; instead, it is a collection of "things to do" in order to execute instructions. Every instruction must go through the same steps, and we call these steps stages.

The stages of a pipeline do things like find out what instruction to execute next, find out what two numbers are going to be added together, find out where to store the result, perform the add, and so on.

The most basic CPU pipeline can be divided into 5 stages:

1. Instruction Fetch
2. Decode Instructions
3. Fetch Operands
4. Execute
5. Store to Cache
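To make the assembly-line picture concrete, here is a minimal sketch of how instructions would move through the five stages listed above, one stage per clock. It assumes an idealized pipeline with no stalls; the function name `pipeline_schedule` and the data layout are hypothetical choices for illustration only.

```python
# Stage names taken from the five-stage list above.
STAGES = [
    "Instruction Fetch",
    "Decode Instructions",
    "Fetch Operands",
    "Execute",
    "Store to Cache",
]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one dict per clock cycle mapping stage name -> instruction index.

    Idealized model (an assumption): a new instruction enters every cycle
    and nothing ever stalls. A stage holding no instruction maps to None.
    """
    total_cycles = len(stages) + num_instructions - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for position, name in enumerate(stages):
            # The instruction in stage `position` entered `position` cycles ago.
            instr = cycle - position
            row[name] = instr if 0 <= instr < num_instructions else None
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(3)):
        print(cycle, row)
```

In this model the first instruction only reaches the last stage on the fifth cycle, so the early cycles leave most stages empty.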

You'll notice that those five stages are very general in their description. You could just as well make a longer pipeline with more specific stages:

1. Instruction Fetch 1
2. Instruction Fetch 2
3. Decode 1
4. Decode 2
5. Fetch Operands
6. Dispatch
7. Schedule
8. Execute
9. Store to Cache 1
10. Store to Cache 2

Both pipelines have to accomplish the same task: instructions come in, results go out. The difference is that each of the five stages of the first pipeline must do more work than each of the ten stages of the second pipeline.

If all else were the same, you'd want a 5-stage pipeline like the first case, simply because it's easier to fill 5 stages with data than it is to fill 10. And if your pipeline is not constantly full of data, you're losing precious execution power - meaning your CPU isn't running as efficiently as it could.

The only reason you would want the second pipeline is if, by making each stage simpler, you can make the time it takes to complete each stage significantly shorter than in the previous design. Your slowest (most complicated) stage determines the clock period for the whole pipeline, since every stage has to finish within one clock - keep that in mind.
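The point about the slowest stage can be sketched in a few lines. The per-stage latencies below are made-up numbers for illustration, and the function names are hypothetical; the only idea taken from the text is that the longest stage sets the clock period.

```python
def clock_period_ns(stage_latencies_ns):
    """The clock can tick no faster than the slowest stage allows."""
    return max(stage_latencies_ns)


def max_frequency_ghz(stage_latencies_ns):
    """Frequency in GHz is the reciprocal of the period in nanoseconds."""
    return 1.0 / clock_period_ns(stage_latencies_ns)


if __name__ == "__main__":
    # Hypothetical latencies: five stages, the slowest taking 1.0 ns.
    five_stage = [1.0, 0.8, 0.9, 1.0, 0.7]
    # Hypothetical ten-stage split where one stage was only cut to 0.8 ns.
    ten_stage_uneven = [0.5] * 9 + [0.8]

    print(max_frequency_ghz(five_stage))
    print(max_frequency_ghz(ten_stage_uneven))
```

Under these assumed numbers, a single stage left at 0.8 ns holds the ten-stage design well below the clock speed its other 0.5 ns stages could support.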

Let's say that each stage of the first pipeline takes 1ns to complete. If each stage takes 1 clock cycle to execute, we can build a 1GHz processor (1/1ns = 1GHz) using this pipeline. Now, in order to make up for the fact that we have more stages (and thus have a harder time keeping the pipeline full), the second design must have a significantly shorter clock period (the amount of time each stage takes to complete) in order to offer performance equal to or greater than the first design. Thankfully, since we're doing less work per clock, we can reduce the clock period significantly. Assuming that we've done our design homework well, let's say we get the clock period down to 0.5ns for the second design.

Design 2 can now scale to 2GHz, twice the clock speed of the original CPU, and we will get twice the performance - assuming we can keep the pipeline filled at all times. Reality sets in, and it becomes clear that without some fancy footwork we can't keep that pipeline full all the time - and all of a sudden our 2GHz CPU isn't performing twice as fast as our 1GHz part.
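A rough way to see why the 2GHz part falls short of 2x is to charge each design a refill cost whenever the pipeline empties. The sketch below uses the 5-stage/1ns and 10-stage/0.5ns figures from the text; everything else is an assumption for illustration: the model that an emptied pipeline costs (stages - 1) cycles to refill, the `flush_every` interval, and the function names.

```python
def run_time_ns(num_instructions, stages, clock_period_ns, flush_every=None):
    """Time to run a stream of instructions through a simple pipeline.

    Assumed model: one instruction completes per cycle once the pipeline
    is full, filling an empty pipeline costs (stages - 1) cycles, and the
    pipeline empties once every `flush_every` instructions (never if None).
    """
    fill_cycles = stages - 1
    flushes = 0 if flush_every is None else num_instructions // flush_every
    total_cycles = fill_cycles + num_instructions + flushes * fill_cycles
    return total_cycles * clock_period_ns


def speedup(num_instructions, flush_every=None):
    """Speedup of design 2 (10 stages, 0.5 ns) over design 1 (5 stages, 1 ns)."""
    design1 = run_time_ns(num_instructions, 5, 1.0, flush_every)
    design2 = run_time_ns(num_instructions, 10, 0.5, flush_every)
    return design1 / design2


if __name__ == "__main__":
    # Pipeline always full: speedup is very close to 2.
    print(speedup(1_000_000))
    # Hypothetical: pipeline empties every 20 instructions; roughly 1.66.
    print(speedup(1_000_000, flush_every=20))
```

The deeper pipeline pays more cycles for every refill, so the more often it empties, the further the real gain drops below the 2x clock speed advantage.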

Make sense? Now let's relate this to the topic at hand.

104 Comments


  • mattsaccount - Sunday, February 1, 2004

    From the HardOCP review: "Certainly moving to watercooling helped us out a great deal. In fact it is hard for us to recommend buying a Prescott and cooling it any other way."
  • eBauer - Sunday, February 1, 2004

    I am curious as to why the UT2k3 botmatch scores dropped on all CPUs... Different map?
  • Pumpkinierre - Sunday, February 1, 2004

    Sorry, erratum on #20: that was the 3.0 Northwood result that is out of kilter with the other CPUs in the SYSmark 2004 data analysis.
  • Pumpkinierre - Sunday, February 1, 2004

    JFK, Vietnam, Nixon, Monica, Bush/Gore, Iraq and now this! - what is going on with the leader of the free world? I hope it overclocks well - that's all that's going for it. Maybe Intel should rethink their multiplier-locked policy. AMD must get in there and profit. I still don't understand why the caches are running at half the latency as Northwood if they are the same speed and structure? Is it as a result of a doubling in size for the same associativity?

    Good article - needs re-reading after digestion. Last chart in SYSmark 2004 (data analysis) has 3.0 Prescott totally outperformed by 2.8 Prescott and all other CPUs. Looks like a benchmark/typing glitch.
  • yak8998 - Sunday, February 1, 2004

    first the error:
    pg 9 -
    The LDDQU instruction is one Intel is particularly proud of as it helps accelerate video encoding and it is implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel’s developer documentation here.

    No link?

    ===
    "What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended?? "

    I'm going to guess a clean running ~350W or so should suffice for a regular system, but I'm not positive with these monster gfx cards out right now...

    "Any of you know what the cache size on the EE's will be?"

    If you're talking about the Northwood (the P4Cs are still considered Northwoods, no?), it's 1MB I believe.
    (still finishing the article. man i love these in-depth technical articles)
  • Tiorapatea - Sunday, February 1, 2004

    I agree, some info on power consumption please.

    Thanks for the article, by the way.

    I guess we'll have to wait and see how Prescott ramps in speed versus 90nm A64.
  • AgaBooga - Sunday, February 1, 2004

    Much better than the P4's original launch...

    All I want to know now is what AMD is going to do soon... They'll probably counteract Prescott with high clock speeds but when and by how much is what matters.

    Any of you know what the cache size on the EE's will be?

    Also, the final CPUs based on Northwood are kind of like a car with the ratio curves or whatever they're called: after a certain point of revving, going any higher doesn't give you as much of an increase in speed as the same increase would at a lower rpm.
  • Cygni - Sunday, February 1, 2004

    AMD's roadmap shows a 4000+ Athlon64 by the end of the year... which is the same as Intel's. They are aware, I'm sure.
  • Stlr22 - Sunday, February 1, 2004

    What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended??
  • HammerFan - Sunday, February 1, 2004

    Things are gonna get hairy in '04 and '05!!! My take is that AMD needs to get their marketing up-to-spec or the high-clocked Prescotts are gonna run the show.

    I have a question for Derek and Anand: What kind of temps does the Prescott run at? What type of cooler does it have? (There's nothing there to support or refute claims that the Prescott is one hot potato.)
