Prescott's New Crystal Ball: Branch Predictor Improvements

We’ve said it before: before you can build a longer pipeline or add more execution units, you need a powerful branch predictor. The branch predictor (more specifically, its accuracy) determines how many operations can work their way through the CPU before you hit a stall. Intel extended the basic integer pipeline by 11 stages, so a corresponding increase in the accuracy of Prescott’s branch predictor was necessary; otherwise, performance would inevitably tank.

Intel admits that the majority of the branch prediction unit remains unchanged in Prescott, but there have been some key modifications to help claw back the performance lost to the longer pipeline.

For those of you who aren’t familiar with the term, the role of a branch predictor is to predict the path code will take. If you’ve ever written code, it boils down to predicting which path of a conditional statement (if-then, loops, etc.) will be taken. Present-day branch predictors work on a simple principle: branches that were taken in the past are likely to be taken in the future. So the branch predictor keeps track of the code being executed on the CPU and increments counters that record how often branches at particular addresses were taken. Once enough data has accumulated in these counters, the branch predictor can classify branches as taken or not-taken with relatively high accuracy, assuming it is given enough room to store all of this data.
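The counter mechanism described above can be sketched in a few lines. This is an illustrative model only, not Prescott’s actual hardware; the 2-bit saturating counter is the textbook structure, and the 4096-entry table size is an assumption borrowed from the 4K-entry BTB figure mentioned below.

```python
# Illustrative model only (not Prescott's actual hardware): a 2-bit
# saturating counter, the classic structure behind "branches taken in
# the past are likely to be taken in the future". States 0-1 predict
# not-taken, states 2-3 predict taken; each real outcome nudges the
# counter one step toward that outcome.

class TwoBitCounter:
    def __init__(self):
        self.state = 1  # start weakly not-taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A table of counters indexed by branch address; 4096 entries is an
# assumption here, chosen to mirror the 4K-entry BTB figure.
TABLE_SIZE = 4096
table = [TwoBitCounter() for _ in range(TABLE_SIZE)]

def predict_branch(address):
    return table[address % TABLE_SIZE].predict()

def train_branch(address, taken):
    table[address % TABLE_SIZE].update(taken)
```

Run against a loop branch that is taken nine times and then falls through, this model mispredicts only the first and last iterations; the saturating counter absorbs the single not-taken outcome without flipping its prediction.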

One way of improving the accuracy of a branch predictor, as you may guess, is to give the unit more space to keep track of previously taken (or not-taken) branches. AMD improved the accuracy of the Opteron’s branch predictor by increasing the amount of space available to store branch data; Intel has not chosen to do so with Prescott. Prescott’s Branch Target Buffer remains unchanged at 4K entries, and it doesn’t look like Intel has increased the size of the Global History Counter either. Instead, Intel focused on tuning the efficiency of the branch predictor using methods that consume less die space.

Loops are very common in code; they are useful for zeroing data structures, printing characters, or simply as part of a larger algorithm. Although you may not think of them as branches, loops are inherently filled with branches: before you start a loop and at every iteration, you must find out whether you should continue executing it. Luckily, these types of branches are relatively easy to predict: you can generally assume that if the outcome of a branch takes you to an earlier point in the code (called a backwards branch), you are dealing with a loop, and the branch predictor should predict taken.

As you would expect, not all backwards branches should be taken; not all of them sit at the end of a loop. Backwards branches that aren’t loop-ending branches are sometimes the result of error handling: if an error is generated, the code backs up and starts over. If no error is generated, the prediction should be not-taken. But how do you distinguish the two cases while keeping the hardware simple?

Code Fragment A

Line 10: while (i < 10) do
Line 11: A;
Line 12: B;
Line 13: increment i;
Line 14: if i is still < 10, then go back to Line 11

Code Fragment B

Line 10: A;
Line 11: B;
Line 12: C;
...
Line 80: if (error) then go back to Line 11

Line 14 is a backwards branch at the end of a loop - should be taken!
Line 80 is a backwards branch not at the end of a loop - should not be taken!
Example of the two types of backwards branching

It turns out that loop-ending branches and these error branches, both backwards branches, differ in the amount of code that separates the branch from its target. Loops are generally small, so only a handful of instructions separate the branch from its target; error-handling branches generally send the CPU back many more lines of code. The code fragments above illustrate this.
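The distance rule can be written down directly. This is a sketch of the heuristic, not Intel’s implementation, and the threshold value is made up for illustration; Intel has not published Prescott’s actual cutoff.

```python
# Sketch of the static distance heuristic described above. The
# threshold is hypothetical; Intel has not disclosed the real value.
LOOP_DISTANCE_THRESHOLD = 16  # instructions between branch and target

def static_predict(branch_addr, target_addr):
    """Return True to predict taken, False to predict not-taken."""
    if target_addr >= branch_addr:
        return False  # forward branch: default to not-taken
    distance = branch_addr - target_addr
    # short backwards branch -> likely loop-ending -> predict taken
    # long backwards branch  -> likely error handling -> predict not-taken
    return distance <= LOOP_DISTANCE_THRESHOLD
```

Applied to the fragments above, the loop-ending branch in Code Fragment A (line 14 back to line 11, distance 3) is predicted taken, while the error-handling branch in Code Fragment B (line 80 back to line 11, distance 69) is predicted not-taken.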

Prescott includes a new algorithm that looks at how far the branch target is from the branch instruction itself and uses that distance to make a better call on whether to take the branch. These enhancements apply to static branch prediction, which looks at certain scenarios and always makes the same prediction when those scenarios occur. Prescott also includes improvements to its dynamic branch prediction.
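Dynamic prediction combines the counter tables with recent branch history. The snippet below is a textbook gshare-style predictor, shown only to illustrate how a global history register feeds into the table index; it is not a description of Prescott’s (undisclosed) internals.

```python
# Textbook gshare-style dynamic predictor (illustrative only, not
# Prescott's actual design). XORing the branch address with a register
# of recent global outcomes lets the same branch map to different
# counters depending on the path taken to reach it.

HISTORY_BITS = 12
MASK = (1 << HISTORY_BITS) - 1
counters = [1] * (1 << HISTORY_BITS)  # 2-bit counters, weakly not-taken
history = 0  # last HISTORY_BITS branch outcomes, newest in the low bit

def predict(address):
    return counters[(address ^ history) & MASK] >= 2

def update(address, taken):
    global history
    index = (address ^ history) & MASK
    if taken:
        counters[index] = min(3, counters[index] + 1)
    else:
        counters[index] = max(0, counters[index] - 1)
    history = ((history << 1) | int(taken)) & MASK
```

Because the history register distinguishes the two phases of a strictly alternating branch, this scheme learns a taken/not-taken pattern that a per-address counter alone would always get half wrong.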

104 Comments

  • INTC - Tuesday, February 3, 2004 - link

    CRAMITPAL you must be a disgruntled ex-Intel employee with all of the rage and hatred against the company in your messages. Prescott is a year late? Get serious! Here's the earliest article I could find, dated 2/27/02, which says that Prescott was due to launch in the 2nd half of 2003: http://news.com.com/2100-1001-846382.html. Worst case scenario - even if you count that 2nd half as starting July 1st, then Prescott is 7 months and 1 day delayed - a far cry from a year, and a very far cry from the delays of that other chip *cough* *cough* Hammer *cough*! "Special cooling" CRAM? That's probably what your brain needs. I can't seem to find the requirement for special cooling in any of the reviews that have been written thus far - mostly they just used the included HSF in the retail box, which even allows for some overclocking too. As far as being slower than Athlon 64, you must need some air cooling on the brain or you must have your selective blinders on again. Page 17 of the Anandtech review http://www.anandtech.com/cpu/showdoc.html?i=1956&a... shows the Prescott 3.2 beating both the Athlon64 3400+ and FX51 in 8 of 9 tests and tying the FX51 in the 9th test - and that's on an Intel 875PBZ that is hobbled in performance compared to an Abit IC7-Max3 or Asus P4C800-E. There's also the Aquamark CPU score, DivX, 3dsmax, and Lightwave, and in case you didn't read any of the other sites' reviews, you may want to look at MPEG encoding, Photoshop 8, SPECviewperf, oh and real multitasking. I gotta give it to the Athlon64 and FX in games, where anything past 30 fps looks just like 30 fps, and in Microsoft Word and Excel, where the program is usually waiting for human input, but to say that Prescott "STILL doesn't come close to matching A64 32 bit performance" is ....... well, let's just say it's a good thing that you're not marketing director for all of the companies below:

    HP plans to offer Prescott chips in HP Pavilion and Compaq Presario desktops that are sold direct to customers, at first. It will start taking orders on them Wednesday.

    A Compaq Presario 6000T desktop, for example, will come with a 2.8EGHz Prescott chip, 256MB of RAM, an 80GB hard drive and a CD-ROM for $749 before rebates, Oliver said.

    Gateway will also offer Prescott Pentium 4s in its 510 and 710 desktops, without raising its prices. A 510G desktop will feature a 2.8EGHz Prescott and start at $1,099, the company said.

    Dell plans to fit some of the new chips into its Dimension desktops and also won't increase prices. Its Dimension XPS game machine will be offered with either the 3.2EGHz Pentium 4, the 3.4GHz Northwood Pentium 4 or the 3.4GHz Pentium 4 Extreme Edition. With the 3.2EGHz chip, the machine will start at $1,799.

    Dell will offer the 3.4GHz Northwood Pentium 4 on its Dimension 8300 at first, and will add the 3EGHz and 3.2EGHz Prescott chips by the middle of February, the company said. The 3.4GHz Dimension will start near $1,350.

    A number of other PC makers, ranging from IBM to Micro Center, will add desktops with Prescott chips as well.

    source: http://news.com.com/2100-1006_3-5151363.html?tag=n...
  • TrogdorJW - Tuesday, February 3, 2004 - link

    In regards to #78, the reason for increasing the pipeline length was to allow for higher clock speeds by doing less work in each pipeline stage. (As the Anandtech article mentions.) A 20 stage Northwood core on 90nm process would probably end up maxed out at around 4.0 GHz, with Intel's typically conservative binning. (You could maybe OC to 4.4 GHz.) With the 31 stage pipeline, it becomes much easier to reach 5.0 GHz.

    Think about this: at 5 GHz, each clock cycle is .2 ns, or 200 ps. Light can travel a "whopping" 6 cm in that amount of time - in a vacuum! In a copper wire, I think 4 cm might be a better estimate. Now you have to wait for voltages to stabilize and signals to propagate through the transistors. I would think that waiting for the voltages to stabilize probably constitutes the majority of the time taken, so the signals can probably only travel 1 cm.

    If that's the case, it becomes pretty clear why they have to have longer and longer pipelines. You can't get signals to stabilize through millions of transistors in 200 picoseconds. Well, maybe you can, but if each stage is cut down to 2 million transistors (~60 million transistors in the Prescott core, with 31 stages total, gives about 2 million per stage) it would definitely take less time for signals to become stable than if you have 3 million transistors per stage (20 stage pipeline with 60 million transistors in the core).

    Of course, if the Northwood core is 30 million transistors (29 million, really), a 20 stage pipeline would give 1.5 million transistors per stage. Hmmm... So once again we're back to the 64-bit conspiracy, because where are those extra 30 million transistors being used?
  • Icewind - Tuesday, February 3, 2004 - link

    For some reason, I have a REAL hard time believing a company like Intel would "secretly" put 64 bit extensions in a new CPU core. Especially one that has pretty much shown it is no better than the current Northwood core.

    As far as I'm concerned, Intel goes back to the drawing board and AMD owns the first part of 2004.
  • Pumpkinierre - Tuesday, February 3, 2004 - link

    Sorry, got that wrong: should be 43%, so the discrepancy is even larger (i.e. the area-wise reduction from .13 to .09 um should be 52%, not 43%).
  • Pumpkinierre - Tuesday, February 3, 2004 - link

    #77 Trogdor, you beat me to it and with more detail, but the same estimate - I won't say great minds etc. The increased density of the cache may also explain the increased latency. However, 13^2 is 169 and 9^2 is 81, which translates to a 52% decrease area-wise, which is close to the 47% decrease quoted, allowing for other factors like strained silicon.
  • Pumpkinierre - Tuesday, February 3, 2004 - link

    Maybe Intel ARE going to bring out a 64bit Prescott in a coupla weeks to make up for this letdown. Aces reckons there are 30 million transistors unaccounted for when factoring in the bigger caches (Northwood 55 million transistors, Prescott 125 million). Some of this is debugging hardware, but that can't be the whole story.

    With the exception of the caches, the Prescott tweaks are good. Why didn't they just apply them to the 20 stage pipeline Northwood core? They would have gotten 30 to 50% more power for the same clock speed and less heat. Geez, I'm happy I bought my Northwood in June '03, and I'll probably upgrade to one or a Gallatin (if the price drops) unless they sort this heat problem out.
  • TrogdorJW - Tuesday, February 3, 2004 - link

    Interesting article. Frankly, I'm *SHOCKED* that Intel really went with 31 pipeline stages. I had heard the rumors, but I figured someone was using the FP pipeline and not the integer pipeline. Damn... that's a serious penalty to pay for branch mispredictions!

    What I really want to know, however, is what else the Prescott can do that Intel isn't telling us yet. I've heard all the rumors about 64-bit capability being hidden, but I discarded them. Now, though, with the specifications released, I honestly have to reconsider. After all, the 30-stage pipeline "rumor" was pretty accurate, so these 64-bit rumors might be as well!

    Before you scoff, let me give you some very compelling reasons for Prescott to have hidden 64-bit functionality. Let's start with a quote from the Anandtech article (from page 8): "With Prescott Intel debuted their highest density cache ever – each SRAM cell (the building blocks of cache) is now 43% smaller than the cells used in Northwood. What this means is that Intel can pack more cache into an even smaller area than if they had just shrunk the die on Prescott."

    Okay, you got that? As far as I can tell, this means that Intel has improved their SRAM design in the Prescott so that it is smaller - i.e. uses fewer transistors - than their old SRAM in the Northwood. Sounds reasonable, right? Now, let's reference a different section of the article: on page 11, look at the chart at the bottom. (For a more complete chart, here's a link to THG with both AMD and Intel CPUs: http://www.tomshardware.com/cpu/20040201/images/cp...

    Looking at that chart (both Anand and THG have the same numbers, so I'm quite sure they're correct), how many transistors does the P4 Northwood require? The answer is 29 million for the *core*, plus whatever is required for the L2 cache. So the Willamette was 42 mil (13 mil for the 256K L2 cache) and the Northwood is 55 mil (26 mil for the 512K L2 cache). How much space is required for L2 cache, then, based off of Intel's *old* techniques? Apparently, 13 million transistors per 256K of cache. Reasonable enough, since AMD is pretty close to that, judging by the transistor count increase when they went to Barton.

    How many transistors would be required, then, for Intel to produce a 1024K L2 cache? In this scenario, 52 million, right? Granted, all caches are not the same: the 2MB L3 cache of the P4EE/Xeon is 30.75 million transistors per 512K, or 15.375 million per 256K, so it's not as "efficient" as the L2 cache design. Still, if we go with 52 million for the 1024K L2 on the Prescott, we end up with 73 million transistors remaining for the CPU core. Even if we go with 61.5 million transistors for the 1024K cache (using the L3 Xeon numbers), we still have 63.5 million transistors left for the core.

    So, the original P4 core was 20 stages and 29 million transistors. The Prescott core is 31 stages and somewhere between 60 and 75 million transistors. Even with all of the changes mentioned in the article, I don't see Intel using 30 million transistors just in increasing the pipeline, adding 13 new instructions, and modifying the branch prediction and hyper threading. I suppose I could be wrong, but I am really starting to think that the Prescott might have some unannounced 64-bit capabilities. Rumors often have a kernel of truth in them, you know?

    Some other thoughts: Athlon 64 is very much based off of Athlon XP, only with 64-bit extensions and SSE2 support, right? Looking at AMD's chart, the Athlon core took about 22 million transistors, and AMD needed between 16 and 17 million transistors per 256K of L2. If they stuck with those values, a 1024K L2 in the Athlon 64 would require 64 million transistors. The K8 is 105.9 million transistors, so we end up with 42 million remaining transistors in the core. Some of that also had to be used on the newly integrated memory controller. Still, *worst* case, AMD used at most 20 million transistors to add a memory controller, SSE2 support, and 64-bit support to the Athlon XP core. What could Intel possibly be doing with 30 to 40 million transistors, I wonder?

    Yes, this is speculation. However, it's speculation based on facts. Maybe Intel doesn't have 64-bit support in Prescott, but I will be really surprised if they don't announce *something* at IDF in a few weeks. 64-bit seems like the likely choice, but maybe there's something else that I missed. Anyone else have any thoughts on this?

    Now, some other thoughts. First, how many people have built an Athlon 64 rig? I just built my first this past weekend, and let me tell you, all is NOT sunshine and roses for AMD. I purchased Geil PC3200 Golden Dragon 2-3-3-6 timing RAM - 1 GB in a paired set. Nothing but trouble getting it to work on the AMD!!! Okay, so it was an MSI Neo-FIS2R board; maybe that was the problem? Anyway, I've used the same RAM in P4 systems with no problems.

    Running at 2.5-3-3-6 didn't help, although I was able to install Windows XP (it would crash at the 2-3-3-6 timings that were specified in the SPD); once installed, I couldn't complete any benchmarks without crashes. I tried other timings as well; 3-4-4-8 failed to POST and I had to clear the CMOS. Maybe 2.5-4-4-8 would work? I got tired of trying, though. The solution that DID work, unfortunately, was to run the RAM at DDR333 speed and auto (2-3-3-6) timings.

    Okay, that said, Athlon 64 3000+ was still plenty fast, and most people won't notice the difference between the top systems except in HPC environments or benchmarks. And the new heatsink, although more difficult to install, is much appreciated. The heat spreader is a welcome addition also. Overall, I was frustrated with the memory problems, but A64 is okay. My advice is to check closely on motherboards and the RAM you'll be using before jumping into the "wonderful" world of Athlon 64. A great page for this (although it will definitely become outdated over time) is at THG:
    http://www.tomshardware.com/motherboard/20040112/m...
  • destaccado - Monday, February 2, 2004 - link

    Well, normally I wouldn't agree with Cramitpal just because he is so biased towards AMD but:
    The message is clear: Intel has failed!
  • CRAMITPAL - Monday, February 2, 2004 - link

    Intel road maps said Prescott would be released a year ago... Intel press releases claimed all was fine with 90 nano and "ahead of schedule". Intel is not to be trusted. They released the Enema Edition THREE times with paper launches. The 3.4 Gig. Prescott ain't even available. They are selling CPU rejects IMNHO that will not run at the 3.4 Gig. and faster design speed.

    Any company that would release what in my opinion and that of others is a defective CPU design to market for naive, gullible sheep to buy is committing fraud. If they couldn't fix this dog, at least don't mislead consumers by releasing an over-heating piece of crap that is SLOWER than the Northwood, uses more electrical power, needs special cooling, STILL doesn't come close to matching A64 32 bit performance, and doesn't do 64 bit at all.
  • Stlr22 - Monday, February 2, 2004 - link

    Would there be a difference in a "server environment"?

    Seems to me like the choice is still obvious. Northwood is the way to go for now.
