A Real Redesign

When we first met Phenom we were disappointed that it didn’t introduce the major architectural changes AMD needed to keep up with Intel. The front end and execution hardware remained largely unchanged from the K8, and as a result Intel pulled ahead significantly in performance per clock over the past few years. With Bulldozer, we finally got the redesign that we’ve been asking for.

If we look at Westmere, Intel has a 4-issue architecture that’s shared among two threads. At the front end, a single Bulldozer module is essentially the same. The fetch logic in Bulldozer can grab instructions from two threads and send them to the decoder. Note that either thread can occupy the full width of the front end if necessary.

The instruction fetcher pulls from a 64KB 2-way instruction cache, unchanged from the Phenom II.

The decoder is now 4-wide, an increase from the 3-wide front end that AMD has had since the K7 all the way up to Phenom II. AMD can now also fuse x86 branch instructions, similar to Intel’s macro-op fusion, to increase the effective width of the machine. At a high level, AMD’s front end has finally caught up to Intel, but here’s where AMD moves into the passing lane.

The 4-wide decode engine feeds three independent schedulers: two for the integer cores and one for the shared floating point hardware.


Bulldozer, 2 threads per module

Each integer scheduler is now unified. In the Phenom II and previous architectures AMD had individual schedulers for math and address operations, but with Bulldozer it’s all treated as one.


Phenom II, 1 thread per core

Each scheduler has four ports that feed a pair of ALUs and a pair of AGUs. This is down one ALU and one AGU from Phenom II, which had 3 ALUs and 3 AGUs and could do any mix of 3. AMD insists that the 3rd address generation unit wasn’t necessary in Phenom II and was only kept around for symmetry with the ALUs and to avoid redesigning that part of the chip; the integer execution core is something AMD has kept around since the K8. The 3rd ALU does have some performance benefits, and AMD canned it to reduce die size, but AMD mentioned that the 4-wide front end, fusion and other enhancements more than make up for this reduction. In other words, while there are fewer single-thread integer execution resources in Bulldozer than in Phenom II, single-threaded integer performance should still be higher.
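To put those unit counts side by side, here is a minimal tally in Python that uses only the figures quoted above. The class and field names are invented for illustration, and `issue_per_clock` is our reading of "four ports" and "any mix of 3" rather than a number AMD supplied. It counts hardware only and says nothing about real-world throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegerCore:
    """Integer resources seen by one thread (illustrative names, not AMD's)."""
    name: str
    alus: int
    agus: int
    issue_per_clock: int  # assumption: scheduler ports / ops issued per clock
    decode_width: int     # width of the decoder feeding this core


phenom_ii = IntegerCore("Phenom II core", alus=3, agus=3,
                        issue_per_clock=3, decode_width=3)
bulldozer = IntegerCore("Bulldozer integer core", alus=2, agus=2,
                        issue_per_clock=4, decode_width=4)

for core in (phenom_ii, bulldozer):
    print(f"{core.name}: {core.alus} ALUs, {core.agus} AGUs, "
          f"{core.issue_per_clock} issued per clock, "
          f"{core.decode_width}-wide decode")

# One Bulldozer module contains two integer cores behind a shared front end.
module_alus = 2 * bulldozer.alus
module_agus = 2 * bulldozer.agus
print(f"Bulldozer module: {module_alus} ALUs, {module_agus} AGUs for two threads")
```

Per thread, Bulldozer gives up one ALU and one AGU; per module, it has one more of each than a single Phenom II core, split across two threads.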

Each integer core has its own 16KB L1 data cache. The L1 caches are segmented by thread so the shared FP core chooses which L1 cache to pull from depending on what thread it’s working on.

I asked AMD if the small L1 data cache was going to be a problem for performance, but it mentioned that in modern out-of-order machines it’s quite easy to hide the latency to L2, and thus this isn’t as big of an issue as you’d think. Given how aggressive AMD has been in the past with ramping up L1 cache sizes, this is a definite change of pace, which further indicates how significant a departure Bulldozer is from the norm at AMD.

While there are two integer schedulers in a single Bulldozer module (one for each thread), there’s only one FP scheduler. There’s some hardware duplication at the FP scheduler to allow two threads to share the execution resources behind it. While each integer core behaves like an independent core, the FP resources work as they would in an SMT (Hyper Threading) system.

The FP scheduler has four ports to its FPUs. There are two 128-bit FMAC pipes and two 128-bit packed integer pipes. Like Sandy Bridge, AMD’s Bulldozer will support SSE all the way up to 4.2 as well as Intel’s new AVX instructions. The 256-bit AVX ops will be handled by the two 128-bit FMAC units in each Bulldozer module.
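As a purely conceptual sketch of what it means for two 128-bit pipes to handle one 256-bit operation, the Python below splits an eight-lane single-precision vector into two four-lane halves. The function names are hypothetical, multiply-add is chosen as the example operation only because the pipes are FMACs, and AMD has not described here how the hardware actually splits or schedules the halves. Python floats are doubles and the multiply-add is not fused, so rounding behavior is not modeled.

```python
LANES_128 = 4  # four 32-bit floats fit in one 128-bit register


def fmac_128(a, b, c):
    """Model one 128-bit FMAC pipe: a*b + c on four lanes (hypothetical helper)."""
    assert len(a) == len(b) == len(c) == LANES_128
    return [x * y + z for x, y, z in zip(a, b, c)]


def fma_256(a, b, c):
    """Model a 256-bit multiply-add as two 128-bit halves, one per FMAC pipe."""
    assert len(a) == len(b) == len(c) == 2 * LANES_128
    low = fmac_128(a[:LANES_128], b[:LANES_128], c[:LANES_128])
    high = fmac_128(a[LANES_128:], b[LANES_128:], c[LANES_128:])
    return low + high  # concatenate the halves back into eight lanes


result = fma_256([1.0] * 8, [2.0] * 8, [3.0] * 8)
print(result)
```

The practical consequence is that a 256-bit op occupies both FMAC pipes at once, so while it executes there is no second FMAC pipe left over for the module's other thread.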

Each Bulldozer module has its own private L2 cache shared by both integer cores and the FP execution hardware.

76 Comments

  • Mr Perfect - Wednesday, August 25, 2010 - link

    It sounds like AMD will be selling by the integer core though, not by module. There's this from Page 4:

    "Processors may implement anywhere from one to four Bulldozer modules and will be referred to as 2 to 8 core CPUs."

    So they will be referring to four module APUs as having eight cores, rather than a quad core with HyperThreading.
  • silverblue - Wednesday, August 25, 2010 - link

    Sorry, I did mean to tackle the part of your thread dealing with different versions of Bulldozer. Valencia is a server version of Zambezi, i.e. 4 modules/8 threads. Interlagos is 8 modules/16 threads.

    From AMD's own figures, each module is 1.8 times the speed of a current K10.5 core at the same clock speed. It is a little unfair to compare "core" to core due to the way they're designed and implemented. Considering each K10.5 core has three ALUs and Bulldozer has two per integer core, 90% of that integer performance is very good - for a quad core CPU in the current sense, Bulldozer would theoretically outpace Phenom II by 80% in integer work by only having 33% more integer resources, assuming the chip is well fed. If the rumours about a quad-channel memory bus are correct, you'd hope it would be.
  • jeremyshaw - Wednesday, August 25, 2010 - link

    I believe Intel also delegated some Atom production to TSMC, unless if I am wrong?
  • Penti - Thursday, August 26, 2010 - link

    TSMC also manufactures VIA's / Centaur Tech's x86 processors.

    Probably a few others too. There's some x86 SoCs for embedded stuff from other vendors.
  • Perisphetic - Wednesday, August 25, 2010 - link

    It's time to kick ass and chew bubble gum... and AMD is all outta gum.
  • NaN42 - Wednesday, August 25, 2010 - link

    First off: I think AMD made huge progress with Bulldozer.
    But I'm wondering how the FPU will work exactly. A look at the latencies (especially of fma-instructions) would be interesting too. Another question is, if it is possible to start one independent multiply and one addition at the same time in a FMAC-unit. Furthermore the throughput is of interest. Is it one mul and add instruction per cycle? Is there any advantage to use 256 bit AVX-instructions, besides shorter code?
    I appreciate that AMD will drop most 3Dnow-instructions because these are just outdated. Perhaps they could also drop MMX instructions but maintain x87-instructions because these are sometimes useful and needed.

    I expect the decoder, besides the FPU (compared to Sandy Bridge), to be another bottleneck because the 4-wide decoder has to feed two nearly independent cores and today's 3-wide decoders (except those in Nehalem/Westmere) are sometimes a bottleneck in a single core design.

    @Ontario: I expect this platform to be much more powerful than the Atom platforms. Perhaps it will even be much more efficient than Atom. A direct comparison between Ontario and VIA Nano 3000 might be interesting especially when VIA releases dual core chips.
  • GourdFreeMan - Thursday, August 26, 2010 - link

    It seems that AMD is ceding the traditional laptop and desktop market to Intel and chasing the server market and Atom/ARM's market with Bulldozer and Bobcat respectively. Lower theoretical peak IPC and greater parallelism target well the high level of data and transaction level parallelism in the server market, but existing consumer software excepting video encoding and a handful of games still tend to favor single threaded performance over parallelism. I suppose we should wait for benchmarks in actual applications to see how well architectural improvements have impacted the performance of AMD's new designs, but I imagine some people are already disappointed. Too bad the resources in both integer cores in a module can't work on a single thread, otherwise we could have had a very serious contender on the desktop...
  • silverblue - Thursday, August 26, 2010 - link

    He sure seemed confusing on the comments page of his blog a few weeks back. Understandably evasive considering he's a server tech guy, not consumer tech, plus AMD were yet to reveal these details, but he was comparing 16 Bulldozer cores to 12 Magny Cours cores, which is technically incorrect as they're not comparable UNLESS you're talking about integer cores. At least, that's my interpretation.

    AMD will probably market Zambezi as an 8-core CPU in order to woo the more-is-better crowd, but regardless of how it handles multi-threading, I still view a module as an actual core by virtue of the fact that the "cores" are not independent of the module they belong to. I know I'm wrong and that's fine, but it helps in understanding the technology better - eight cores that exist in pairs and share additional resources might serve to confuse.
  • gruffi - Thursday, August 26, 2010 - link

    A 12-core Magny-Cours has 12 "integer cores" and 12 128-bit FPUs. A 16-core Interlagos has 16 "integer cores" and 16 128-bit FMACs. Why is it technically not comparable? At least you know you are wrong. ;)
  • silverblue - Friday, August 27, 2010 - link

    The implementation is very different to what AMD have done before, that's what I'm trying to get at. Everyone knew that despite Intel and AMD having different types of quad core processor prior to Nehalem, they were still classed the same so I suppose it doesn't matter in the grand scheme of things. There's nothing to stop AMD from releasing a 24-"core" Bulldozer; it shouldn't be any larger than Magny-Cours - perhaps slightly smaller in the end - yet its integer performance would be through the roof.

    However, people are bemoaning the fact that for 33% more "cores", AMD are only getting 50% extra performance - it's worth bearing in mind that AMD does this with 4 less, albeit better utilised ALUs than Magny-Cours (32 compared to 36). Make no mistake, Bulldozer is far more efficient and capable in this scenario, but I can't help wondering how strong Phenom II may have been if it'd had a slightly more elegant design.
