Intel's Pentium 4 E: Prescott Arrives with Luggage

Name: Intel's Pentium 4 E: Prescott Arrives with Luggage
Item: Intel's Pentium 4 E: Prescott Arrives with Luggage
Author: Anand Lal Shimpi & Derek Wilson

by Anand Lal Shimpi & Derek Wilson on February 1, 2004 3:06 PM EST

Posted in
CPUs

104 Comments | Add A Comment

104 Comments

An Impatient Prescott: Scheduler Improvements

Prescott can’t keep any more operations in flight (active in the pipeline) than Northwood, but because of the longer pipeline Prescott must work even harder to make sure that it is filled.

We just finished discussing branch predictors and their importance in determining how deep of a pipeline you can have, but another contributor to the equation is a CPU’s scheduling windows.

Let’s say you’ve got a program that is 3 operations long:

1. D = B + 1
2. A = 3 + D
3. C = A + B

You’ll notice that the 2nd operation can’t execute until the first one is complete, as it depends on the outcome (D) of the first operation. The same is true for the 3rd operation, it can’t execute until it knows what the value of A is. Now let’s say our CPU has 3 ALUs, and in theory could execute three adds simultaneously, if we just had this stream of operations going through the pipeline, we would only be using 1/3 of our total execution power - not the best situation. If we just upgraded from a CPU with 1 ALU, we would be getting the same throughput as our older CPU – and no one wants to hear that.

Luckily, no program is 3 operations long (even print “Hello World” is on the order of 100 operations) so our 3 ALUs should be able to stay busy, right? There is a unit in all modern day CPUs whose job it is to keep execution units, like ALUs, as busy as possible – as much of the time as possible. This is the job of the scheduler.

The scheduler looks at a number of operations being sent to the CPU’s execution cores and attempts to extract the maximum amount of parallelism possible from the operations. It does so by placing pending operations as soon as they make it to the scheduling stage(s) of the pipeline into a buffer or scheduling window. The size of the window determines the amount of parallelism that can be extracted, for example if our CPU’s scheduling window were only 3 operations large then using the above code example we would still only use 1/3 of our ALUs. If we could look at more operations, we could potentially find code that didn’t depend on the values of A, B or D and execute that in parallel while we’re waiting for other operations to complete. Make sense?

Because Intel increased the size of the pipeline on Prescott by such a large amount, the scheduling windows had to be increased a bit. Unfortunately, present microarchitecture design techniques do not allow for very large scheduling windows to be used on high clock speed CPUs – so the improvements here were minimal.

Intel increased the size of the scheduling windows used to buffer operations going to the FP units to coincide with the increase in pipeline.

There is also parallelism that can be extracted out of load and store operations (getting data out of and into memory). Let’s say that you have the following:

A = 1 + 3
Store A at memory location X
…
…
Load A from memory location X

The store operation actually happens as two operations (further pipelining by splitting up a store into two operations): a store address operation (where the data is going) and a store data operation (what the data actually is). The problem here is that the scheduler may try to parallelize the store operations and the load operation without realizing that the two are dependent on one another. Once this is discovered, the load will not execute and a performance penalty will be paid because the CPU’s scheduler just wasted time getting a load ready to execute and then having to get rid of it. The load will eventually execute after the store operations have completed, but at a significant performance penalty.

If a situation like the one mentioned above does crop up, long pipeline designs will suffer greatly – meaning that Prescott wants this to happen even less than Northwood. In Prescott, Intel included a small, very accurate, predictor to predict whether a load operation is likely to require data from a soon-to-be-executed store and hold that load until the store has executed. Although the predictor isn’t perfect, it will reduce bubbles of no-execution in the pipeline – a killer to Prescott and all long pipelined architectures.

Don’t look at these enhancements to improve performance, but to help balance the lengthened pipeline. A lot of the improvements we’ll talk about may sound wonderful but you must keep in mind that at this point, Prescott needs these technologies in order to equal the performance of Northwood so don’t get too excited. It’s an uphill battle that must be fought.

Prescott's Crystal Ball (continued) Execution Core Improvements

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

104 Comments

View All Comments

Stlr22 - Sunday, February 1, 2004 - link
post*
Stlr22 - Sunday, February 1, 2004 - link
KristopherKubicki

Earlier you said that I should read the article.
What was your point? What was it about my first pot that you disagreed with?
KristopherKubicki - Sunday, February 1, 2004 - link
#7:

I agree 100% with Anand and Derek. This processor will be a non-event until we get in the 3.6GHz range. Similar to Northwood's launch.

#10:

Check out our price engine. We have already been listing the processor a week!

http://www.anandtech.com/guides/priceguide.htm

http://www.monarchcomputer.com/Merchant2/merchant....
cliffa3 - Sunday, February 1, 2004 - link
In the table on page 14 it shows that the 90nm P4@2.8 will have a 533 MHz FSB, but is that the case? I did some quick google research and can't find anything to support that...please confirm or correct, thanks.
NFactor - Sunday, February 1, 2004 - link
Yes, I must agree this is an amazing article, one of the best i have ever read. Thanks.
Xentropy - Sunday, February 1, 2004 - link
VERY interesting article. Thank you Anand and Derek! One of the best I've read on Anandtech, and I consider yours the best hardware site on the net!

One correction, on page 7, you say, "if you want to multiply a number in binary by 2 you can simply shift the bits of the number to the right by 1 bit," but don't you mean shift to the left one bit (and place a zero at the end)? It's much like multiplying a decimal number by ten for obvious reasons.

Anyway, it looks like the Prescott is somewhat of a non-event at this time. Just new cores that perform fundamentally the same as the current ones at current speeds. The real news will come later; Intel has just positioned itself for one hell of a speed ramp to come. Northwood was clearly at the end of the line. One analogy, I suppose, would be that Intel didn't fire any shots in the CPU war today, but they loaded their guns in preparation to fire.

The coming year will be an exciting one for us hardware geeks. I'm interested in seeing how higher clocked Prescotts play out as well as whether anything 64-bit shows up before 2005 to support AMD's stance that we need it NOW.

Again, thanks for a very thorough article!
Stlr22 - Sunday, February 1, 2004 - link
KristopherKubicki

So what's your take on these new Prescotts?
KristopherKubicki - Sunday, February 1, 2004 - link
Anand scolded me for not reading the article :( I only read the conclusion and the graphs. Turns out the decision making isnt as clearcut as it sounds.

As for the thing with the inquirer. Well, lots of people had prescotts. We had one back in August I believe. The thing is they were horribly slow - 533FSB 2.8GHz. Everyone drew the conclusion that these were purposely slowed processors that were jsut for engineering purposes. While the inq benched this processor, most people didnt just becuase they were under the impression this was not to be the final production model. Hope that clears up some discrepancy about the validity.

Cheers,

Kristopher
wicktron - Sunday, February 1, 2004 - link
Hehe, I guess the Inq was right about this one. Where are all the Inq bashers and their claim of "fake" benchies? Haha, I laugh.
Stlr22 - Sunday, February 1, 2004 - link
KristopherKubicki - "read the article..."

lol that might be a good idea, as I only broswed it and read the conclusion. :D

Intel's Pentium 4 E: Prescott Arrives with Luggage

An Impatient Prescott: Scheduler Improvements

Post Your Comment

104 Comments

View All Comments

Stlr22 - Sunday, February 1, 2004 - link

Stlr22 - Sunday, February 1, 2004 - link

KristopherKubicki - Sunday, February 1, 2004 - link

cliffa3 - Sunday, February 1, 2004 - link

NFactor - Sunday, February 1, 2004 - link

Xentropy - Sunday, February 1, 2004 - link

Stlr22 - Sunday, February 1, 2004 - link

KristopherKubicki - Sunday, February 1, 2004 - link

wicktron - Sunday, February 1, 2004 - link

Stlr22 - Sunday, February 1, 2004 - link

Log in

Don't have an account? Sign up now