Branch Prediction and a Deeper Pipeline

Bulldozer will use a deeper pipeline with less logic per stage compared to current Phenom II/Opteron processors. AMD argues that this will ensure clock speed won’t be a problem with the design and we should expect to see Bulldozer based products at similar if not higher clock speeds than what we have today with Phenom II.

With a deeper pipe, branch prediction becomes more important and Bulldozer has a significant change in the way branch prediction works.

In Phenom II, the branch prediction and instruction fetch logic are run in lockstep - when one stalls, the other also stalls. Branches are predicted as they are encountered. If the fetch logic grabs an x86 branch instruction, the prediction logic works in parallel to predict the likely target of that branch. However if the branch is incorrectly predicted, subsequent branches aren’t predicted until the current mispredict is correctly resolved. As a result, the fetch logic and prefetchers can’t work and potential performance is lost.

In Bulldozer the branch prediction and fetch logic are decoupled. The predictor now produces a queue of future fetch addresses. Even if there’s a mispredict the branch predictor can continue to fill its prediction queue with targets. The fetch logic can then check this queue of addresses against what’s in the instruction cache to avoid future misses in L1.

Prefetchers

With Phenom AMD implemented comparable prefetching logic to what Intel did with Core. In Bulldozer, AMD is ramping up the aggressiveness of those prefetchers. There are independent prefetchers at both the L1 and L2 levels that support larger numbers of strides and large stride sizes (both compared to what exists in current AMD architectures). There’s also a non-strided data prefetcher that looks at correlated cache misses and uses that data to prefetch into the caches.

AMD unfortunately didn’t go into more detail on its prefetchers other than to promise that they are much more aggressive than what we have today. Aggressive prefetching usually means there’s a good amount of memory bandwidth available so I’m wondering if we’ll see Bulldozer adopt a 3 - 4 channel DDR3 memory controller in high end configurations similar to what we have today with Gulftown.

Power Gating & Real Turbo Mode

Each Bulldozer module in a processor can be clocked and power gated independently. This has two implications. You can now power off cores (in sets of two) that aren’t in use and save tons of idle power. You can also use the power savings to drive up the frequency of other cores in a Bulldozer CPU. With Bulldozer, AMD should have something functionally equivalent to Intel’s Turbo Boost modes. Since clock speed and power gating is controlled at the module level and not the core level there will still be some differences between the two but this should be much better than AMD’s current Core Turbo technology.

There’s of course extensive clock gating around the chip, but obviously the big change is power gating which AMD hasn’t had up to this point (Bobcat is also power gated).

Performance and Availability

While Bobcat is going to be in production in Q4 of this year, with system availability in Q1 of 2011 - Bulldozer is still a 2011 project and AMD isn’t giving any guidance as to when in 2011.

Parts are already back and in AMD’s labs but we have no indication of performance or rollout schedule. Given Bobcat’s schedule, I’d say that the first Bulldozer CPUs will be out no earlier than Q2 2011 and AMD’s unwillingness to specify what half of the year would imply that it’ll be a late Q2/early Q3 launch.

The first Bulldozer parts will be server focused, with high end desktop CPUs following but still in 2011.

A Real Redesign Final Words
Comments Locked

76 Comments

View All Comments

  • Dustin Sklavos - Tuesday, August 24, 2010 - link

    Comments like this really bother me. You may not care about netbooks, but a lot of people do. Current ones don't pass the grandma test - your grandmother can do whatever task she needs to on them, like check e-mail, browse the internet, watch HD video - and any advance here is welcome.

    Generally speaking a netbook is not supposed to be your main machine, but something you can chuck into your bag and take with you and do a little work on here and there. I write a lot, and have to work on other peoples' computers from time to time, so a netbook that doesn't completely suck is invaluable to me. Netbook performance is dismal right now, but Bobcat could successfully fix this market segment.

    So no, you're not interested in netbooks and you'd rather be raked through hot coals than purchase one. But that just means they're not useful - TO YOU. There are a lot of people here interested in what Bobcat can do for these portables, and I count myself among them.
  • Lonbjerg - Wednesday, August 25, 2010 - link

    I don't care that many people care for mediocore performance in a crappy format.
    Not matter what you do with a netbook, it will alway be lacking.

    I don't care what gandma wants (she will buy intel BTW, due to Intel's brand recognition)

    I don't care for Atom either.
    Or i3
    Or i5
    Or Phenom
    I do care about a replacement for my i7 @ 3.5GHz...
  • Dustin Sklavos - Wednesday, August 25, 2010 - link

    I'm trying to figure out why you're commenting on any of this at all.
  • flipmode - Tuesday, August 24, 2010 - link

    Seriously Anand, it is crummy that I cannot find a whole section of your website. I hate to spam an entirely separate article, but how completely lame it is to have to spend 15 minutes doing a Google advanced search to find the Anandtech article I'm looking for.

    One of the very, very few truly Class A+ hardware sites on the internet - you can count all the members of that class on one hand - and you make it seriously hard to find past articles and you completely OMIT a link to an entire category of your reviews. Insane.

    Please put a link to the "System" section somewhere. Please!
  • JarredWalton - Tuesday, August 24, 2010 - link

    Our system section hasn't had a lot of updates, but you can get there via:
    http://www.anandtech.com/tag/systems

    In fact, most common tags can be put there (i.e. /AMD, /Intel, /NVIDIA, /HP, /ASUS, etc.) The only catch is that many of the tags will only bring up articles since the site redesign, so you'll want to stick with the older main topics for some areas. Hope that helps.
  • mino - Tuesday, August 24, 2010 - link

    "so I’m wondering if we’ll see Bulldozer adopt a 3 - 4 channel DDR3 memory controller"

    Bulldozer will use current G34 platform. Hoe that answers your wonder :)
  • VirtualLarry - Tuesday, August 24, 2010 - link

    BullDozer sounds like amazing stuff. I wonder, if the way that they have arranged int units into modules, if that means that we will be getting more cores for our dollars, compared to Intel. More REAL cores, I mean. I'm just a little disappointed that the int pipelines went from 3 ALU to 2 ALU, I hope that doesn't affect performance too much.
  • gruffi - Thursday, August 26, 2010 - link

    Integer instruction pipelines are increased from 3 to 4. That's 33% more peak throughput. The number of ALUs/AGUs to keep these pipelines busy is meaningless without knowing details. K10 has 3 ALUs and 3 AGUs, but they are bottlenecked and partially idling most of the time. Bulldozer can do more operations per cycle while drawing less power, even with only 2 ALUs and 2 AGUs. How can that be disappointing?
  • ezodagrom - Tuesday, August 24, 2010 - link

    I think Bulldozer has the potential to be really competitive, mainly because Sandy Bridges looks quite unimpressive.
    In a recent leaked powerpoint from Intel, apparently until Q3 2011 the best Intel CPU is still going to be Gulftown based, possibly Core i7 990X. According to Intel benchmarks on the leaked powerpoint, the best Sandy Bridge, that is, Core i7 2600, apparently will be around 15% to 25% better than the i7 870, with the i7 980X being 25% to 35% better than the i7 2600.
  • Mat3 - Tuesday, August 24, 2010 - link

    I have a question.. it was earlier speculated that BD would have four ALU pipelines per integer core. It was thought that one way they could make use of them was to send a branch down two pipes and take the correct result. Obviously this isn't the case, but my question is, why not? Wouldn't it be better to do that and just discard the branch predictors entirely? Why isn't that better?

Log in

Don't have an account? Sign up now