Branch Prediction and a Deeper Pipeline

Bulldozer will use a deeper pipeline with less logic per stage compared to current Phenom II/Opteron processors. AMD argues that this will ensure clock speed won’t be a problem with the design and we should expect to see Bulldozer based products at similar if not higher clock speeds than what we have today with Phenom II.

With a deeper pipe, branch prediction becomes more important and Bulldozer has a significant change in the way branch prediction works.

In Phenom II, the branch prediction and instruction fetch logic are run in lockstep - when one stalls, the other also stalls. Branches are predicted as they are encountered. If the fetch logic grabs an x86 branch instruction, the prediction logic works in parallel to predict the likely target of that branch. However if the branch is incorrectly predicted, subsequent branches aren’t predicted until the current mispredict is correctly resolved. As a result, the fetch logic and prefetchers can’t work and potential performance is lost.

In Bulldozer the branch prediction and fetch logic are decoupled. The predictor now produces a queue of future fetch addresses. Even if there’s a mispredict the branch predictor can continue to fill its prediction queue with targets. The fetch logic can then check this queue of addresses against what’s in the instruction cache to avoid future misses in L1.

Prefetchers

With Phenom AMD implemented comparable prefetching logic to what Intel did with Core. In Bulldozer, AMD is ramping up the aggressiveness of those prefetchers. There are independent prefetchers at both the L1 and L2 levels that support larger numbers of strides and large stride sizes (both compared to what exists in current AMD architectures). There’s also a non-strided data prefetcher that looks at correlated cache misses and uses that data to prefetch into the caches.

AMD unfortunately didn’t go into more detail on its prefetchers other than to promise that they are much more aggressive than what we have today. Aggressive prefetching usually means there’s a good amount of memory bandwidth available so I’m wondering if we’ll see Bulldozer adopt a 3 - 4 channel DDR3 memory controller in high end configurations similar to what we have today with Gulftown.

Power Gating & Real Turbo Mode

Each Bulldozer module in a processor can be clocked and power gated independently. This has two implications. You can now power off cores (in sets of two) that aren’t in use and save tons of idle power. You can also use the power savings to drive up the frequency of other cores in a Bulldozer CPU. With Bulldozer, AMD should have something functionally equivalent to Intel’s Turbo Boost modes. Since clock speed and power gating is controlled at the module level and not the core level there will still be some differences between the two but this should be much better than AMD’s current Core Turbo technology.

There’s of course extensive clock gating around the chip, but obviously the big change is power gating which AMD hasn’t had up to this point (Bobcat is also power gated).

Performance and Availability

While Bobcat is going to be in production in Q4 of this year, with system availability in Q1 of 2011 - Bulldozer is still a 2011 project and AMD isn’t giving any guidance as to when in 2011.

Parts are already back and in AMD’s labs but we have no indication of performance or rollout schedule. Given Bobcat’s schedule, I’d say that the first Bulldozer CPUs will be out no earlier than Q2 2011 and AMD’s unwillingness to specify what half of the year would imply that it’ll be a late Q2/early Q3 launch.

The first Bulldozer parts will be server focused, with high end desktop CPUs following but still in 2011.

A Real Redesign Final Words
Comments Locked

76 Comments

View All Comments

  • Zoomer - Wednesday, August 25, 2010 - link

    Basically you'll need 2x the power for much less than 2x performance increase. Modern branch predictors can have very good hit rates ~90%+. It simply made more sense to use the second int unit for another thread.

    However, if you need the absolutely best single threaded int performance at all costs, imho, what you suggest wouldn't be bad. In fact,
  • Edison5do - Tuesday, August 24, 2010 - link

    Finally besides the price competition, we will be able to see some tech competition, we have to raise our praise for AMD not to reject the ATI btand because New and HiTech CPU´s, should be paired with HiQuality, nice priced, Radeon GPU´s.

    I really dont think People are ready to see "AMD" Brand as a Head-toHead Competitor to "INTEL" Brand, by this i mean that they should rely on ATI for being well accepted by the public for more time before they even star thinking about that.
  • angrysand - Tuesday, August 24, 2010 - link

    they may have had the on die memory controller, but Atom basically created the netbook market. AMD is just improving on what Intel help create (and that remains to be seen).

    I had to see AMD go because I like having resonable performance for reasonable price. But they had better get their act together and put out faster CPU's.
  • ABR - Wednesday, August 25, 2010 - link

    Atom did not create the netbook market, some convergence of wireless data and increasing use of the web by non-computer folk did. The first "netbook" products were the Crusoe-based mini-notebooks starting in 2001. Unfortunately for Transmeta, interest in the high-portability / long battery life model was low, only a couple of models even came out, and they ended up having to compete with Intel for scraps of the low-end laptop market. They lost, and Intel only finally caught up with their technology later with the Atom, when, coincidentally or not, the market was finally ready.
  • Nehemoth - Tuesday, August 24, 2010 - link

    Why Bodcat will be manufactured in the 40nm process instead of 32nm is cause the GPU?.

    Why will be manufactured on TSMC instead of GlobalFoundries?.

    I supposed that this could be a problem with GF not being ready in 32nm but can we see a switch from TSMC to GlobalFoundries after Bulldozer begin to be manufacture?.
  • iwod - Wednesday, August 25, 2010 - link

    TSMC has much higher 40nm capacity then GF's 32nm. Bobcat is going to be a low end product which will hopefully generate high volume of sales. TSMC in this case will be a much better fit then GF.
  • moozoo - Wednesday, August 25, 2010 - link

    I wonder how hard it would be to make a version has two Floating point cores and one integer core.

    Will AMD have a product to match Intel MIC's (Larrabee) .
    (http://www.anandtech.com/show/3749/intel-mic-22nm-...
  • YuryMalich - Wednesday, August 25, 2010 - link

    Hi,
    There is a mistake on page 5 on this picture http://images.anandtech.com/reviews/cpu/amd/hotchi...
    There were drawn two 128-bit FMAC units on Phenom II Microarchitecture.
    But K10 processor doesn't have FMAC units at all! It has 1 FMUL and one FADD and one FMISC(FLOAD) units.
    The FMAC (multiple-add) units are new in Bulldozer microarchitecture.
  • Jack Sparow - Wednesday, August 25, 2010 - link

    "Ivo August 25, 2010
    How many threads everyone processor (“Interlagos”, “Valencia” and “Zambezi”) can do simultaneously per core compare with Phenom II processor?

    Reply
    John Fruehe August 25, 2010
    One thread per core."

    This quote is from AMD blogs home. :)
  • silverblue - Wednesday, August 25, 2010 - link

    I think I touched on this before once on a THQ news article - John Fruehe is being confusing. The correct definition of a complete Bulldozer core is a module, which is a monolithic dual-integer core package also consisting of other shared resources - the top image on page 4 of this article is a great guide. So, a four module (or quad core as we currently term them) Bulldozer will handle eight threads concurrently as those four cores possess eight integer cores.

    As such, I don't see non-SMT Bulldozer cores ever coming out.

Log in

Don't have an account? Sign up now