A Real Redesign

When we first met Phenom we were disappointed that it didn’t introduce the major architectural changes AMD needed to keep up with Intel. The front end and execution hardware remained largely unchanged from the K8, and as a result Intel pulled ahead significantly in performance per clock over the past few years. With Bulldozer, we finally got the redesign that we’ve been asking for.

If we look at Westmere, Intel has a 4-issue architecture that’s shared between two threads. At the front end, a single Bulldozer module is essentially the same. The fetch logic in Bulldozer can grab instructions from two threads and send them to the decoder. Note that either thread can occupy the full width of the front end if necessary.

The instruction fetcher pulls from a 64KB 2-way instruction cache, unchanged from the Phenom II.

The decoder is now 4-wide, an increase from the 3-wide front end that AMD has had since the K7 all the way up to Phenom II. AMD can now fuse x86 branch instructions, similar to Intel’s macro-op fusion, to increase the effective width of the machine as well. At a high level, AMD’s front end has finally caught up to Intel, but here’s where AMD moves into the passing lane.

The 4-wide decode engine feeds three independent schedulers: two for the integer cores and one for the shared floating point hardware.


Bulldozer, 2 threads per module

Each integer scheduler is now unified. In the Phenom II and previous architectures AMD had individual schedulers for math and address operations, but with Bulldozer it’s all treated as one.


Phenom II, 1 thread per core

Each scheduler has four ports that feed a pair of ALUs and a pair of AGUs. This is down one ALU and one AGU from Phenom II (it had 3 ALUs and 3 AGUs and could do any mix of 3). AMD insists that the 3rd address generation unit wasn’t necessary in Phenom II and was only kept around for symmetry with the ALUs and to avoid redesigning that part of the chip - the integer execution core is something AMD has kept around since the K8. The 3rd ALU does have some performance benefits, and AMD canned it to reduce die size, but AMD mentioned that the 4-wide front end, fusion and other enhancements more than make up for this reduction. In other words, while there are fewer single-threaded integer execution resources in Bulldozer than in Phenom II, single-threaded integer performance should still be higher.
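To make the unit counts above easier to compare, here is a small sketch that tallies the integer execution resources quoted in this article. The class and field names are made up for illustration, and it only counts units; it says nothing about actual throughput or performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegerCluster:
    """Integer execution units behind one scheduler (counts from the article)."""
    alus: int
    agus: int

    @property
    def units(self) -> int:
        return self.alus + self.agus


@dataclass(frozen=True)
class Design:
    """Hypothetical summary record; not an AMD-provided description."""
    name: str
    decode_width: int      # instructions decoded per cycle
    threads: int           # hardware threads sharing the front end
    clusters: int          # number of integer clusters
    cluster: IntegerCluster

    def int_units_total(self) -> int:
        return self.clusters * self.cluster.units

    def int_units_per_thread(self) -> int:
        # Each thread owns one integer cluster in both designs.
        return self.int_units_total() // self.threads


BULLDOZER_MODULE = Design("Bulldozer module", decode_width=4, threads=2,
                          clusters=2, cluster=IntegerCluster(alus=2, agus=2))
PHENOM_II_CORE = Design("Phenom II core", decode_width=3, threads=1,
                        clusters=1, cluster=IntegerCluster(alus=3, agus=3))

if __name__ == "__main__":
    for d in (BULLDOZER_MODULE, PHENOM_II_CORE):
        print(f"{d.name}: decode {d.decode_width}-wide, "
              f"{d.int_units_per_thread()} integer units per thread, "
              f"{d.int_units_total()} in total")
```

Per thread, Bulldozer comes out at four units against Phenom II's six, which is the reduction AMD says the wider front end and fusion make up for.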

Each integer core has its own 16KB L1 data cache. The L1 caches are segmented by thread so the shared FP core chooses which L1 cache to pull from depending on what thread it’s working on.

I asked AMD if the small L1 data cache was going to be a problem for performance, but it mentioned that in modern out-of-order machines it’s quite easy to hide the latency to L2 and thus this isn’t as big of an issue as you’d think. Given how aggressive AMD has been in the past with ramping up L1 cache sizes, this is a definite change of pace, which further indicates how significant a departure Bulldozer is from the norm at AMD.

While there are two integer schedulers in a single Bulldozer module (one for each thread), there’s only one FP scheduler. There’s some hardware duplication at the FP scheduler to allow two threads to share the execution resources behind it. While each integer core behaves like an independent core, the FP resources work as they would in an SMT (Hyper-Threading) system.

The FP scheduler has four ports to its FPUs. There are two 128-bit FMAC pipes and two 128-bit packed integer pipes. Like Sandy Bridge, AMD’s Bulldozer will support SSE all the way up to 4.2 as well as Intel’s new AVX instructions. The 256-bit AVX ops will be handled by the two 128-bit FMAC units in each Bulldozer module.
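As a rough illustration of how a 256-bit operation can be covered by two 128-bit pipes, the sketch below splits eight single-precision lanes into two groups of four and runs a multiply-add on each half. The function names are hypothetical, plain Python floats stand in for 32-bit lanes, and the arithmetic is not a true fused (single-rounding) multiply-add; it is a conceptual model, not a description of how the hardware actually schedules the halves.

```python
from typing import List

LANES_PER_128 = 4  # four 32-bit floats fit in one 128-bit pipe


def fmac_128(a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Model one 128-bit FMAC pipe: a*b + c on four lanes."""
    assert len(a) == len(b) == len(c) == LANES_PER_128
    return [x * y + z for x, y, z in zip(a, b, c)]


def multiply_add_256(a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Model a 256-bit op as two 128-bit halves, one per FMAC pipe."""
    assert len(a) == len(b) == len(c) == 2 * LANES_PER_128
    low = fmac_128(a[:LANES_PER_128], b[:LANES_PER_128], c[:LANES_PER_128])
    high = fmac_128(a[LANES_PER_128:], b[LANES_PER_128:], c[LANES_PER_128:])
    return low + high  # concatenate the two halves back into eight lanes


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    b = [2.0] * 8
    c = [1.0] * 8
    print(multiply_add_256(a, b, c))
```

In this model a 256-bit op occupies both FMAC pipes at once, whereas two independent 128-bit ops could each take one pipe.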

Each Bulldozer module has its own private L2 cache shared by both integer cores and the FP execution hardware.


76 Comments


  • mino - Tuesday, August 24, 2010 - link

    From the HW design POW, those pipes are "MMX/3Dnow" class stuff.
    They run SSE3, but they are still MMX-class.

    There is a reason Bulldozer has "FMAC" written there ...
  • Kiijibari - Tuesday, August 24, 2010 - link

    ... it is stupid to name a circuit after a deprecated ISA extension and not after its function.
    If its doing stuff like 3dnow and mmx then call it Shuffel / permutation pipeline but not MMX ...

    The FMAC is the best example .. why is it written FMAC in that case and not SSE5/AVX/XOP ?
  • KonradK - Thursday, August 26, 2010 - link

    Depracated does not mean prohibited. Also there are existing MMX programs and other than Windows 64bit operating systems and compilers other than MSVSC.

    MMX and x87 is prohibited in 64bit kernel code.

    http://msdn.microsoft.com/en-us/library/ff545910%2...
  • iwod - Tuesday, August 24, 2010 - link

    From the design of Bulldozer's FPU it is cleared that AMD want Multi Threaded FPU to run on OpenCL. While the dual Integer looks interesting now. It is up against the SandyBridge, the architecture that is suppose to leap again like Pentium 4 to C2D. And if Bulldozer comes any later, it will be up against the die shrink of SandyBridge, Ivy Bridge. Things dont look so good in here.

    It is mainstream / low end that looks very interesting. I am currently using a Pentium M 1.8Ghz Dothan with 2GB DDR Ram. With a Radeon 1600 Graphics. I dont get hardware acceleration from GPU, 720P is just barely playable with some very fast software decoder. It is fast enough to watch some 460p youtube and most of my day web serving.

    Now if Bobcat have similar or higher IPC then Dothan. A Quad Core Bobcat with Radeon 5000 64 SP will still be within reasonable die size on 40nm, It will be cheap when it drops to 32nm or lower. Most of us dont need SUPER FAST computer. And Bobcat with Radeon 5 Series or Higher Plus a Fast SSD are all we need.
  • aegisofrime - Tuesday, August 24, 2010 - link

    I don't recall Sandy Bridge being a revolutionary leap. Everyone has been saying that it's more of evolutionary, the main difference being the addition of AVX.

    I REALLY REALLY REALLY hope that AMD announces later today what socket Bulldozer will be on... I desperately need more video encoding performance. I have a AM2+ motherboard and that bloody 1055T is singing it's siren song to me every night. If Bulldozer is on AM3 I can get an AM3 board and the 1055T and do a quick upgrade to Bulldozer.

    Come on AMD. Your customers need more information to make an informed decision!
  • mino - Tuesday, August 24, 2010 - link

    Buldozer gen1 == primarily servers
    => 16/12-core (MCM) Socket G34 (current platfrom)
    => 8/6/4-core Socket G32 (current platfrom)

    Bulldozer Desktop (hopefully before X-mas 2011)
    => 8?/6/4-core Socket AM3R2(or AM3+, whatever they call it)
  • Pirks - Tuesday, August 24, 2010 - link

    Huh? You want more video encoding perfomance and you think about upgrading CPU? What kind of idiocy is that? Use 480GTX with Badaboom and your video encoding speed won't be matched by CPUs of year 2020 or maybe even 2030 :P
  • aegisofrime - Tuesday, August 24, 2010 - link

    Don't talk if you don't know what you are talking about. No GPU encoder out there is able to match x264 quality or SPEED wise. And the huge flaw in your statement is that Badaboom doesn't even support Fermi GPUs right now.

    Have you done any serious video encoding before, or are you just trolling as usual?
  • ChronoReverse - Tuesday, August 24, 2010 - link

    Indeed. I would try out CUDA encoders every once in a while in hopes that I could at least get the quality of x264 at MINIMUM quality but they can't even match that.

    Since x264 at minimum quality encodes slightly quicker (on my quad core) a CUDA encoder does (on my GTX260) and still yields better quality, I really appreciate faster CPU's.
  • mapesdhs - Tuesday, August 24, 2010 - link


    Hate to say it but unless GPU acceleration is available, the i7 is a far better
    choice for video encoding. I still use a 6000+ for most tasks, but numerous
    article reviews made it quite clear that AMD was not the best choice for
    video encoding, so I went with an i7 860 4GHz. Pricing was surprisingly good,
    speed is excellent.

    Ian.
