Branch Prediction and a Deeper Pipeline

Bulldozer will use a deeper pipeline with less logic per stage compared to current Phenom II/Opteron processors. AMD argues that this will ensure clock speed won’t be a problem with the design and we should expect to see Bulldozer based products at similar if not higher clock speeds than what we have today with Phenom II.

With a deeper pipe, branch prediction becomes more important and Bulldozer has a significant change in the way branch prediction works.

In Phenom II, the branch prediction and instruction fetch logic are run in lockstep - when one stalls, the other also stalls. Branches are predicted as they are encountered. If the fetch logic grabs an x86 branch instruction, the prediction logic works in parallel to predict the likely target of that branch. However if the branch is incorrectly predicted, subsequent branches aren’t predicted until the current mispredict is correctly resolved. As a result, the fetch logic and prefetchers can’t work and potential performance is lost.

In Bulldozer the branch prediction and fetch logic are decoupled. The predictor now produces a queue of future fetch addresses. Even if there’s a mispredict the branch predictor can continue to fill its prediction queue with targets. The fetch logic can then check this queue of addresses against what’s in the instruction cache to avoid future misses in L1.

Prefetchers

With Phenom AMD implemented comparable prefetching logic to what Intel did with Core. In Bulldozer, AMD is ramping up the aggressiveness of those prefetchers. There are independent prefetchers at both the L1 and L2 levels that support larger numbers of strides and large stride sizes (both compared to what exists in current AMD architectures). There’s also a non-strided data prefetcher that looks at correlated cache misses and uses that data to prefetch into the caches.

AMD unfortunately didn’t go into more detail on its prefetchers other than to promise that they are much more aggressive than what we have today. Aggressive prefetching usually means there’s a good amount of memory bandwidth available so I’m wondering if we’ll see Bulldozer adopt a 3 - 4 channel DDR3 memory controller in high end configurations similar to what we have today with Gulftown.

Power Gating & Real Turbo Mode

Each Bulldozer module in a processor can be clocked and power gated independently. This has two implications. You can now power off cores (in sets of two) that aren’t in use and save tons of idle power. You can also use the power savings to drive up the frequency of other cores in a Bulldozer CPU. With Bulldozer, AMD should have something functionally equivalent to Intel’s Turbo Boost modes. Since clock speed and power gating is controlled at the module level and not the core level there will still be some differences between the two but this should be much better than AMD’s current Core Turbo technology.

There’s of course extensive clock gating around the chip, but obviously the big change is power gating which AMD hasn’t had up to this point (Bobcat is also power gated).

Performance and Availability

While Bobcat is going to be in production in Q4 of this year, with system availability in Q1 of 2011 - Bulldozer is still a 2011 project and AMD isn’t giving any guidance as to when in 2011.

Parts are already back and in AMD’s labs but we have no indication of performance or rollout schedule. Given Bobcat’s schedule, I’d say that the first Bulldozer CPUs will be out no earlier than Q2 2011 and AMD’s unwillingness to specify what half of the year would imply that it’ll be a late Q2/early Q3 launch.

The first Bulldozer parts will be server focused, with high end desktop CPUs following but still in 2011.

A Real Redesign Final Words
Comments Locked

76 Comments

View All Comments

  • Mr Perfect - Wednesday, August 25, 2010 - link

    It sounds like AMD will be selling by the integer core though, not by module. There's this from Page 4:

    "Processors may implement anywhere from one to four Bulldozer modules and will be referred to as 2 to 8 core CPUs."

    So they will be referring to four module APUs as having eight cores, rather then a quad core with HyperThreading.
  • silverblue - Wednesday, August 25, 2010 - link

    Sorry, I did mean to tackle the part of your thread dealing with different versions of Bulldozer. Valencia is a server version of Zambezi, i.e. 4 modules/8 threads. Interlagos is 8 modules/16 threads.

    From AMD's own figures, each module is 1.8 times the speed of a current K10.5 core at the same clock speed. It is a little unfair to compare "core" to core due to the way they're designed and implemented. Considering each K10.5 core has three ALUs and Bulldozer has two per integer core, 90% of that integer performance is very good - for a quad core CPU in the current sense, Bulldozer would theoretically outpace Phenom II by 80% in integer work by only having 33% more integer resources, assuming the chip is well fed. If the rumours about a quad-channel memory bus are correct, you'd hope it would be.
  • jeremyshaw - Wednesday, August 25, 2010 - link

    I believe Intel also delegated some Atom production to TSMC, unless if I am wrong?
  • Penti - Thursday, August 26, 2010 - link

    TSMC also does manufacture VIAs / Centaur Tech x86 processor.

    Probably a few others too. There's some x86 SoCs for embedded stuff from other vendors.
  • Perisphetic - Wednesday, August 25, 2010 - link

    It's time to kick ass and chew bubble gum... and AMD is all outta gum.
  • NaN42 - Wednesday, August 25, 2010 - link

    At first: I think AMD made a huge progress with Bulldozer.
    But I'm wondering how the FPU will work exactly. A look at the latencies (especially of fma-instructions) would be interesting too. Another question is, if it is possible to start one independent multiply and one addition at the same time in a FMAC-unit. Furthermore the throughput is of interest. Is it one mul and add instruction per cycle? Is there any advantage to use 256 bit AVX-instructions, besides shorter code?
    I appreciate that AMD will drop most 3Dnow-instructions because these are just outdated. Perhaps they could also drop MMX instructions but maintain x87-instructions because these are sometimes useful and needed.

    I expect the decoder besides the FPU (compared to Sandy Bridge) to be another bottleneck because the 4-wide decoder has to feed two nearly independent cores and todays 3-wide decoders (except those in Nehalem/Westmere) are sometimes a bottleneck in a single core design.

    @Ontario: I expect this platform to be much more powerful than the Atom platforms. Perhaps it will even be much more efficient than Atom. A direct comparison between Ontario and VIA Nano 3000 might be interesting especially when VIA releases dual core chips.
  • GourdFreeMan - Thursday, August 26, 2010 - link

    It seems that AMD is ceding the traditional laptop and desktop market to Intel and chasing the server market and Atom/ARM's market with Bulldozer and Bobcat respectively. Lower theoretical peak IPC and greater parallelism target well the high level of data and transaction level parallelism in the server market, but existing consumer software excepting video encoding and a handful of games still tend to favor single threaded performance over parallelism. I suppose we should wait for benchmarks in actual applications to see how well architectural improvements have impacted the performance of AMD's new designs, but I imagine some people are already disappointed. Too bad the resources in both integer cores in a module can't work on a single thread, otherwise we could have had a very serious contender on the desktop...
  • silverblue - Thursday, August 26, 2010 - link

    He sure seemed confusing on the comments page of his blog a few weeks back. Understandably evasive considering he's a server tech guy, not consumer tech, plus AMD were yet to reveal these details, but he was comparing 16 Bulldozer cores to 12 Magny Cours cores, which is technically incorrect as they're not comparable UNLESS you're talking about integer cores. At least, that's my interpretation.

    AMD will probably market Zambezi as an 8-core CPU in order to woo the more-is-better crowd, but regardless of how it handles multi-threading, I still view a module as an actual core virtue of the fact that the "cores" are not independant of the module they belong to. I know I'm wrong and that's fine, but it helps in understanding the technology better - eight cores that exist in pairs and share additional resources might serve to confuse.
  • gruffi - Thursday, August 26, 2010 - link

    A 12-core Magny-Cours has 12 "integer cores" and 12 128-bit FPUs. A 16-core Interlagos has 16 "integer cores" and 16 128-bit FMACs. Why is it technically not comparable? At least you know you are wrong. ;)
  • silverblue - Friday, August 27, 2010 - link

    The implementation is very different to what AMD have done before, that's what I'm trying to get at. Everyone knew that despite Intel and AMD having different types of quad core processor prior to Nehalem, they were still classed the same so I suppose it doesn't matter in the grand scheme of things. There's nothing to stop AMD from releasing a 24-"core" Bulldozer; it shouldn't be any larger than Magny-Cours - perhaps slightly smaller in the end - yet its integer performance would be through the roof.

    However, people are bemoaning the fact that for 33% more "cores", AMD are only getting 50% extra performance - it's worth bearing in mind that AMD does this with 4 less, albeit better utilised ALUs than Magny-Cours (32 compared to 36). Make no mistake, Bulldozer is far more efficient and capable in this scenario, but I can't help wondering how strong Phenom II may have been if it'd had a slightly more elegant design.

Log in

Don't have an account? Sign up now