Typical branches have one of two outcomes: either the branch is not taken and execution falls through to the next instruction, or it is taken and the CPU jumps to the target instruction and begins executing code there:

A Typical Branch

...
Line 24: if (a == b)
Line 25: execute this code;
Line 26: otherwise
Line 27: go to line 406;
...

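In C terms, the pseudocode above is just an ordinary if/else. Here is a minimal sketch (our own illustration, not code from the article):

#include <stdio.h>

int main(void) {
    int a = 1, b = 2;
    /* The compiler turns this if/else into a compare followed by a
     * conditional jump whose target is fixed at compile time: one path
     * falls through, the other is reached by taking the jump, so the
     * predictor only has to guess taken vs. not-taken. */
    if (a == b)
        puts("equal");
    else
        puts("not equal");
    return 0;
}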

There is a third type of branch, called an indirect branch, that complicates prediction a bit more. Instead of encoding the target directly in the instruction, an indirect branch tells the CPU to look in a register or at a location in main memory that contains the address of the instruction the CPU should branch to. An indirect branch predictor, originally introduced in the Pentium M (Banias), has been included in Prescott to predict these types of branches.

An Indirect Branch

...
Line 113: if (z < 2)
Line 114: execute this code;
Line 115: otherwise
Line 116: go to memory location F and retrieve the address of where to start executing

...

Conventionally, an indirect branch is predicted somewhat haphazardly: the CPU is simply sent to wherever the branch has most often ended up going. It's sort of like needing to ask your boss what he wants you to do, but instead of asking, just walking into the computer lab because that's where most of your work ends up anyway. This method of indirect branch prediction works well for a lot of cases, but not all. Prescott's indirect branch predictor features algorithms to handle these remaining cases, although the exact details of the algorithms are not publicly available. The fact that the Prescott team borrowed this idea from the Pentium M team is a further testament to the impressive amount of work that went into the Pentium M, and to what continues to make it one of Intel's best-designed chips of all time.
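To make this concrete, here is a minimal C sketch (our own illustration, not code from Intel or SPEC): a call through a function pointer compiles to an indirect branch, because the target address is read from a table in memory at run time rather than being encoded in the instruction itself.

#include <stdio.h>

static void handle_add(void) { puts("add"); }
static void handle_sub(void) { puts("sub"); }

int main(void) {
    /* Dispatch table: the branch target lives in memory, not in the
     * instruction, so the CPU must predict an entire address rather
     * than a single taken/not-taken bit. */
    void (*handlers[2])(void) = { handle_add, handle_sub };
    int opcode = 1;         /* imagine this value is computed at run time */
    handlers[opcode]();     /* indirect branch through the table */
    return 0;
}

Interpreter dispatch loops, including Perl's, are full of exactly this pattern, which goes a long way toward explaining why 253.perlbmk responds so strongly to a better indirect branch predictor.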

Prescott’s indirect branch predictor is almost directly responsible for the 55% decrease in mispredicted branches in the 253.perlbmk SPEC CPU2000 test. Here’s what the test does:

253.perlbmk is a cut-down version of Perl v5.005_03, the popular scripting language. SPEC's version of Perl has had most of its OS-specific features removed. In addition to the core Perl interpreter, several third-party modules are used: MD5 v1.7, MHonArc v2.3.3, IO-stringy v1.205, MailTools v1.11 and TimeDate v1.08.

The reference workload for 253.perlbmk consists of four scripts:

The primary component of the workload is the freeware email-to-HTML converter MHonArc. Email messages are generated from a set of random components and converted to HTML. In addition to MHonArc, which was lightly patched to avoid file I/O, this component also uses several standard modules from the CPAN (Comprehensive Perl Archive Network).

Another script (which also uses the mail generator for convenience) exercises a slightly-modified version of the 'specdiff' script, which is a part of the CPU2000 tool suite.

The third script finds perfect numbers using the standard iterative algorithm (a brief sketch of this algorithm appears after this description). Both native integers and the Math::BigInt module are used.
Finally, the fourth script tests only that the pseudo-random numbers are coming out in the expected order, and does not really contribute very much to the overall runtime.

The training workload is similar, but not identical, to the reference workload. The test workload consists of the non-system-specific parts of the actual Perl 5.005_03 test harness.

In the case of the mail-based benchmarks, a line with salient characteristics (number of header lines, number of body lines, etc.) is output for each message generated.

During processing, MD5 hashes of the contents of output "files" (in memory) are computed and output.

For the perfect number finder, the operating mode (BigInt or native) is output, along with intermediate progress and, of course, the perfect numbers.
Output for the random number check is simply every 1000th random number generated.
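As an aside, the "standard iterative algorithm" for perfect numbers mentioned above is simple enough to sketch. Here is a minimal C version (our own illustration; the benchmark itself implements this in Perl, with Math::BigInt handling the large-number mode):

#include <stdio.h>

/* A number is perfect when it equals the sum of its proper divisors,
 * e.g. 6 = 1 + 2 + 3. This is the straightforward iterative check. */
static int is_perfect(unsigned n) {
    unsigned sum = 0;
    for (unsigned d = 1; d <= n / 2; d++)
        if (n % d == 0)
            sum += d;
    return n > 1 && sum == n;
}

int main(void) {
    for (unsigned n = 2; n <= 10000; n++)
        if (is_perfect(n))
            printf("%u\n", n);  /* prints 6, 28, 496, 8128 */
    return 0;
}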

As you can see, the performance improvement is in a real-world algorithm. As is common practice among microprocessor designers, Intel measured the effectiveness of Prescott's branch prediction enhancements using SPEC CPU2000 and came up with an overall reduction in mispredicted branches of about 13%:

Percentage Reduction in Mispredicted Branches for Prescott over Northwood (higher is better)

Benchmark        Reduction
164.gzip             1.94%
175.vpr              8.33%
176.gcc             17.65%
181.mcf              9.63%
186.crafty           4.17%
197.parser          17.92%
252.eon             11.36%
253.perlbmk         54.84%
254.gap             27.27%
255.vortex         -12.50%
256.bzip2            5.88%
300.twolf            6.82%
Overall             12.78%

The improvements seen above aren't bad at all; remember, however, that this sort of reduction is necessary just to make up for the fact that we're now dealing with a 55% longer pipeline in Prescott.
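A rough back-of-envelope calculation (our own estimate, not an Intel figure) puts that trade-off in perspective: if the misprediction penalty scales roughly linearly with pipeline depth, Prescott pays about 1.55 times Northwood's cost per mispredicted branch. Cutting the number of mispredictions by 12.78% then leaves roughly 0.87 × 1.55 ≈ 1.35 times the total misprediction stall time. In other words, the smarter predictors soften the blow of the longer pipeline but don't erase it; the rest has to come from the higher clock speeds the design is meant to enable.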

The areas that received the largest improvements (more than 10% fewer mispredicted branches) were 176.gcc, 197.parser, 252.eon, 253.perlbmk and 254.gap. The 176.gcc test is a compiler test, an area where the Pentium 4 has clearly lagged behind the Athlon 64. 197.parser is a word processing test, also an area where the Pentium 4 has done poorly in the past thanks to branch-happy integer code. 252.eon is a ray tracer, and we already know about 253.perlbmk; improvements in 254.gap could have positive ramifications for Prescott's performance in HPC applications, as it simulates performance in math-intensive distributed data computation.

The benefit of under-the-hood improvements like the branch prediction algorithms we've discussed here is that present-day software takes advantage of them immediately, with no recompiling and no patches. Keep this in mind when we investigate performance later on.

We'll close this section off with another interesting fact: some of the enhancements found in Prescott were actually first introduced in later revisions of the Northwood core. Not all Northwood cores are created equal, but all of the enhancements present in the first Hyper-Threading-enabled Northwoods are also featured in Prescott.

Comments

  • sprockkets - Monday, February 2, 2004 - link

    Hmmm... on Intel's website on the new processor news: "Thermal Monitoring: Allows motherboards to be cost-effectively designed to expected application power usages rather than theoretical maximums."

    Not sure what it means. I'm thinking clock throttling, so that if your particular chip runs hotter than it should, it will still run on under-engineered motherboards/coolers.

    This chip dissipates around the same heat as Northwood does clock for clock! And of course, Intel style is to wait 6-12 months, then the new stuff will actually be good. Still, is it really that important to increase performance so much that heat becomes an issue? I.e., will Dell be able to make the cooling whisper quiet? They can with the processor sitting at 80-90C, but now that it's almost there with normal cooling, what will they do? Why can't we just have new processors that run so cool that we can just use heatsinks without fans? Oh well.
  • Novaoblivion - Monday, February 2, 2004 - link

    Great article :) I found it very interesting. I don't think I'll be buying a Prescott till they hit about 4GHz. My 2.4C is nice and fast for now.
  • CRAMITPAL - Monday, February 2, 2004 - link


    http://www.theinquirer.net/?article=13927


    http://www.theinquirer.net/?article=13947
  • johnsonx - Monday, February 2, 2004 - link

    To Vanners, #38:

    "if you halve the time for a stage in the pipeline and double the number of stages. Yes this means you can run at 2GHz instead of 1GHz but the reality is you're still taking 5ns to complete the pipe."

    Yes and no... In the example, you're right that a single instruction takes the same 5ns to complete. But you're not just executing a single instruction... rather, thousands to millions! The 10-stage pipe has twice as many instructions in flight as the 5-stage pipe. Therefore, in the example, you get one result out of the 5-stage/1GHz CPU every 1ns, but TWO results out of the 10-stage/2GHz CPU in the same 1ns... twice as many.

    What I find interesting is that as pipelines get longer and longer, we might have to start talking about Instruction Latency: the number of clocks and ns between the time an instruction goes in and when the result comes out. It'll never be anything a human could notice directly, but it might come into play in high-performance realtime apps that deal with input from the outside world, and have to produce synchronized output. Any CPU calculates somewhat "back-in-time" as instructions fly down the pipe... right now, a Prescott calculates about twice as far behind 'reality' as an A64 does. I don't know if there is any real-world application where this really could make a difference, or if there ever will be, but it's interesting to ponder, particularly if the pipeline lengths of Intel vs. AMD continue to diverge.
  • cliffa3 - Monday, February 2, 2004 - link

    I don't see how a 4+GHz Prescott will match up with Intel's new picoBTX form factor... with that much heat (using air cooling), you need to keep a safe zone around the proc unless you like your RAM DDR+BBQ.
    I'd have to say that a lot of enthusiasts are younger and live in limited space conditions... might work well for people up north who don't want to run the heater, but as for me in Texas, I have all the cool air pumping into my bedroom and it still takes a lot to keep it cool. Can you imagine a university or corporation having a room full of those? If they think about that, then it's no bueno for Dell and others as well.
    I'd also have to agree with the others about the heat/power being a major part of the article that was left out... otherwise a tremendous read; thanks for all the effort that goes into these.
  • tfranzese - Monday, February 2, 2004 - link

    But - I need to add - the correction was needed and is welcome. Not trying to pick a bone with the editors.
  • tfranzese - Monday, February 2, 2004 - link

    #55, you read what I read. I'll vouch for you.
  • Icewind - Monday, February 2, 2004 - link

    #55
    Better go back to sleep me thinks :)
  • Spearhawk - Monday, February 2, 2004 - link

    Is it just me (who was extremely tired yesterday) or has the pipeline 101 part changed since the article was put up?
    I seem to remember reading something about how a 5-stage CPU at 1GHz should be exactly as fast as a 2GHz CPU with 10 stages (all else being equal, of course), and that the secret of getting any profit out of going to more stages was to make sure that it could not only scale to 2GHz but to 3GHz or more.
  • Icewind - Monday, February 2, 2004 - link

    I think Shuttle owners are SOL with Prescott.
