Prescott's New Crystal Ball: Branch Predictor Improvements

We’ve said it before: before you can build a longer pipeline or add more execution units, you need a powerful branch predictor. The branch predictor (more specifically, its accuracy) determines how many operations you can have working their way through the CPU before you hit a stall. Intel extended the basic integer pipeline by 11 stages, so it needs a corresponding increase in the accuracy of Prescott’s branch predictor; otherwise performance will inevitably tank.

Intel admits that the majority of the branch predictor unit remains unchanged in Prescott, but there have been some key modifications to help balance performance.

For those of you who aren’t familiar with the term, the role of a branch predictor in a processor is to predict the path code will take. If you’ve ever written code before, it boils down to predicting which part of a conditional statement (if-then, loops, etc.) will be executed. Present-day branch predictors work on a simple principle: if a branch was taken in the past, it is likely to be taken in the future. So the branch predictor keeps track of the code being executed on the CPU and updates counters that record how often branches at particular addresses were taken. Once enough data has accumulated in these counters, the branch predictor can predict branches as taken or not taken with relatively high accuracy, assuming it is given enough room to store all of this data.
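To make the counter idea concrete, here is a minimal sketch of a table of two-bit saturating counters indexed by branch address. The two-bit width, the table size, and the names (`CounterPredictor`, `predict`, `update`, `accuracy`) are assumptions chosen for illustration; this is a textbook-style toy, not a description of Prescott’s actual hardware.

```python
class CounterPredictor:
    """Toy dynamic predictor: one 2-bit saturating counter per table slot."""

    def __init__(self, entries=4096):
        # Table size is an assumption for illustration only.
        self.entries = entries
        # Counters range from 0 to 3; start at 1 ("weakly not taken").
        self.counters = [1] * entries

    def _index(self, address):
        # With too few entries, different branches share (and disturb) a slot.
        return address % self.entries

    def predict(self, address):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, address, outcomes):
    """Fraction of outcomes predicted correctly for a single branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(outcomes)
```

Fed the outcomes of a loop branch that is taken nine times and then falls through, repeated many times over, this sketch settles into mispredicting only the loop exit. That is the "enough data has accumulated" behaviour described above.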

One way of improving the accuracy of a branch predictor, as you may guess, is to give the unit more space to keep track of previously taken (or not taken) branches. AMD improved the accuracy of its branch predictor in the Opteron by increasing the amount of space available to store branch data; Intel has chosen not to do so with Prescott. Prescott’s Branch Target Buffer remains unchanged at 4K entries and it doesn’t look like Intel has increased the size of the Global History Counter either. Instead, Intel focused on tuning the efficiency of its branch predictor using less die-space-consuming methods.

Loops are very common in code; they are useful for zeroing data structures, printing characters, or simply as part of a larger algorithm. Although you may not think of them as branches, loops are inherently filled with branches: before you start a loop and on every iteration, you must find out whether you should continue executing the loop. Luckily, these types of branches are relatively easy to predict; you can generally assume that if a branch would take you to an earlier point in the code (called a backwards branch), you are dealing with a loop and the branch predictor should predict taken.
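The "backwards means loop" rule is simple enough to write down directly. The sketch below is only an illustration of that rule: the function name `predict_backward_taken` is made up, and plain integers stand in for instruction addresses.

```python
def predict_backward_taken(branch_address, target_address):
    """Static rule: predict taken if the target is at or before the branch."""
    # A target earlier in the code suggests the bottom of a loop.
    return target_address <= branch_address


# A loop-closing branch at address 14 that jumps back to address 11.
loop_branch_prediction = predict_backward_taken(14, 11)

# A forward branch at address 10 that skips ahead to address 20.
forward_branch_prediction = predict_backward_taken(10, 20)
```

Note that this rule needs no history at all, which is why it is useful the first time a branch is seen.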

As you would expect, not all backwards branches should be taken, because not all of them are at the end of a loop. Backwards branches that aren’t loop-ending branches are sometimes the result of error handling in code: if an error is generated, you back up and start over again. If no error is generated in the application, the prediction should be not-taken. But how do you specify this while keeping the hardware simple?

Code Fragment A

Line 10: while (i < 10) do
Line 11: A;
Line 12: B;
Line 13: increment i;
Line 14: if i is still < 10, then go back to Line 11

Code Fragment B

Line 10: A;
Line 11: B;
Line 12: C;
...
Line 80: if (error) then go back to Line 11

Line 14 is a backwards branch at the end of a loop - should be taken!
Line 80 is a backwards branch not at the end of a loop - should not be taken!
Example of the two types of backwards branching

It turns out that loop-ending branches and these error branches, both backwards branches, differ in the amount of code that separates the branch from its target. Loops are generally small, so only a handful of instructions separate the branch from its target; error-handling branches generally send the CPU back across much more code. In the fragments above, the loop branch at Line 14 jumps back only three lines, while the error branch at Line 80 jumps back 69.

Prescott includes a new algorithm that looks at how far the branch target is from the actual branch instruction and uses that distance to better decide whether to predict the branch as taken. These enhancements are for static branch prediction, which looks at certain scenarios and always makes the same prediction when those scenarios occur. Prescott also includes improvements to its dynamic branch prediction.
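As a rough sketch of the distance idea, the function below predicts a backwards branch as taken only when its target is close by. The cutoff `MAX_LOOP_DISTANCE` is a hypothetical value picked for this example (Intel’s actual threshold is not given here), the function name is made up, and the line numbers from Code Fragments A and B stand in for instruction addresses.

```python
# Hypothetical cutoff: no real Prescott threshold is given in the text.
MAX_LOOP_DISTANCE = 32


def predict_with_distance(branch_address, target_address,
                          max_loop_distance=MAX_LOOP_DISTANCE):
    """Predict taken only for backwards branches with a nearby target."""
    distance = branch_address - target_address
    if distance < 0:
        # Forward branch: predict not taken.
        return False
    # Short backwards jump looks like a loop; a long one looks like
    # error handling, so predict not taken.
    return distance <= max_loop_distance


# Fragment A: the branch on Line 14 jumps back to Line 11 (distance 3).
fragment_a = predict_with_distance(14, 11)

# Fragment B: the branch on Line 80 jumps back to Line 11 (distance 69).
fragment_b = predict_with_distance(80, 11)
```

With this cutoff the loop branch in Fragment A is predicted taken and the error branch in Fragment B is predicted not taken, matching the outcomes the fragments call for. The hardware stays simple because the check is a single subtraction and comparison.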


104 Comments

  • mattsaccount - Sunday, February 1, 2004 - link

    From the HardOCP review: "Certainly moving to watercooling helped us out a great deal. In fact it is hard for us to recommend buying a Prescott and cooling it any other way."
  • eBauer - Sunday, February 1, 2004 - link

    I am curious as to why the UT2k3 botmatch scores dropped on all CPU's... Different map?
  • Pumpkinierre - Sunday, February 1, 2004 - link

    Sorry errata on #20 that was 3.0 Northood result is out of kilter with other cpus in dtata analysis sysmark 2004.
  • Pumpkinierre - Sunday, February 1, 2004 - link

    JFK,Vietnam,Nixon,Monica,Bush/Gore,Iraq and now this! - what is going on with the leader of the free world.I hope it overclocks well- that's all that's going for it. Maybe Intel should rethink their multiplier locked policy. AMD must get in there and profit. I still dont understand why the caches are running at half the latency as Northood if they are the same speed and structure? Is it as a result of a doubling in size for the same associativity?

    Good article- needs re-rereading after digestion. Last chart in Sysmark2004 (data analysis) has 3.0 Prescott totally outperformed by 2.8 Prescott and all other cpus. Look like a benchmark/typing glitch.
  • yak8998 - Sunday, February 1, 2004 - link

    first the error:
    pg 9 -
    The LDDQU instruction is one Intel is particularly proud of as it helps accelerate video encoding and it is implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel’s developer documentation here.

    No link?

    ===
    "What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended?? "

    I'm going to guess a clean running ~350W or so should suffice for a regular system, but I'm not positive with these monster gfx cards out rite now...

    "Any of you know what the cache size on the EE's will be?"

    If your talking about the Northwood (the p4c's are still considered northwoods, no?), its 1mb I believe.
    (still finishing the article. man i love these in-depth technical articles)
  • Tiorapatea - Sunday, February 1, 2004 - link

    I agree, some info on power consumption please.

    Thanks for the article, by the way.

    I guess we'll have to wait and see how Prescott ramps in speed versus 90nm A64.
  • AgaBooga - Sunday, February 1, 2004 - link

    Much better than the P4's origional launch...

    All I want to know now is what AMD is going to do soon... They'll probably counteract Prescott with high clock speeds but when and by how much is what matters.

    Any of you know what the cache size on the EE's will be?

    Also, the final CPU's based on Northwood are kind of like a car with the ratio curves or whatever they're called, but basically after a point of revving, going any higher doesn't give you as much of an increase in speed as it would at a lower rpm increasing the same amount.
  • Cygni - Sunday, February 1, 2004 - link

    AMD's roadmap shows a 4000+ Athlon64 by the end of the year... which is the same as Intel's. They are aware, im sure.
  • Stlr22 - Sunday, February 1, 2004 - link

    What's the power consumption like on these new bad boys?

    Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended??
  • HammerFan - Sunday, February 1, 2004 - link

    Things are gonna get hairy in '04 and '05!!! My take is that AMD nees to get their marketing up-to-spec or the high-clocked prescotts are gonna run the show.

    I have a question for Derek and Anand: What kind of temps does the prescott run at? what type of cooler does it have? (there's nothing there to support or refute claims that the prescott is one hot potato)
