The Bulldozer Aftermath: Delving Even Deeper
by Johan De Gelas on May 30, 2012 1:15 AM EST
Integer Crunching Power
Each core has two integer execution units (EX0 and EX1) and two AGUs (Address Generation Units). For comparison, the K10 core inside Magny-Cours and Istanbul had three ports, each leading to a “Fully featured ALU + AGU” couple. AMD marketing cleverly drew four pipeline blocks inside the Bulldozer integer core, but those PowerPoint blocks cannot hide the fact that each Bulldozer integer core has fewer execution resources.
In practice, AG0 and AG1 are little more than assistants with limited capabilities to EX0 and EX1. The Software Optimization Guide for AMD Family 15h Processors lists only a few instructions (page 248 in the January 2012 version) that can be processed by the AG0 and AG1 execution units, and each time the remark "First op to AG0 | AG1, Second to EX0 | EX1" is made. The AG0 and AG1 execution units reduce the latency of the CALL and LEA instructions, but the maximum throughput of each integer core inside the Bulldozer module is only two integer instructions per clock cycle. It's only when a fused branch (a compare and a conditional jump that count as two instructions) enters EX0 and another integer instruction can enter EX1 that we have a slightly higher throughput of three integer instructions per cycle.
So the Bulldozer integer core can execute one integer instruction less per cycle (2 vs 3). That doesn’t mean that the Bulldozer integer core is 1/3 slower, however. The integer core of Bulldozer is smaller but also more flexible. The dedicated per-lane 8-entry schedulers are gone, and a single, much larger 40-entry scheduler has replaced them. This means that Bulldozer should be better at extracting ILP (Instruction Level Parallelism) out of code that has low IPC (Instructions Per Clock).
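To make the "2 vs 3 does not mean 1/3 slower" argument a bit more tangible, here is a minimal toy sketch in Python. It is not a model of either K10 or Bulldozer: every instruction is assumed to take one cycle, the scheduler is simplified to a window over the oldest unissued instructions, and the function names, workloads, and the window sizes of 8 and 40 are purely illustrative assumptions. It only shows that peak issue width matters when the code actually offers that much parallelism, and that a larger scheduling window can find independent work hiding behind a dependency chain.

```python
def simulate(deps, width, window):
    """Count cycles for a toy out-of-order core (hypothetical model).

    deps[i]  -- indices of earlier instructions that instruction i waits on
    width    -- maximum number of instructions issued per cycle
    window   -- how many of the oldest unissued instructions the scheduler sees
    Every instruction is assumed to have a latency of one cycle.
    """
    n = len(deps)
    done = [None] * n  # cycle in which each instruction was issued
    cycle = 0
    issued_count = 0
    while issued_count < n:
        cycle += 1
        # The scheduler only looks at the oldest `window` unissued instructions.
        candidates = [i for i in range(n) if done[i] is None][:window]
        # Ready means every producer was issued in an earlier cycle.
        ready = [i for i in candidates
                 if all(done[d] is not None and done[d] < cycle for d in deps[i])]
        # Issue the oldest ready instructions, up to the issue width.
        for i in ready[:width]:
            done[i] = cycle
            issued_count += 1
    return cycle


def serial_chain(n):
    """Each instruction depends on the previous one (ILP of 1)."""
    return [[] if i == 0 else [i - 1] for i in range(n)]


def two_chains(n):
    """Two interleaved dependency chains (ILP of 2)."""
    return [[] if i < 2 else [i - 2] for i in range(n)]


def independent(n):
    """No dependencies at all (ILP limited only by the core)."""
    return [[] for _ in range(n)]


def chain_then_independent(chain_len, indep_len):
    """A dependency chain followed in program order by independent work."""
    return serial_chain(chain_len) + independent(indep_len)


if __name__ == "__main__":
    workloads = [
        ("serial chain", serial_chain(12)),
        ("two chains", two_chains(12)),
        ("independent", independent(12)),
    ]
    for name, deps in workloads:
        for width in (2, 3):
            cycles = simulate(deps, width, window=40)
            print(f"{name:>12}  width={width}  cycles={cycles}")

    mixed = chain_then_independent(20, 20)
    for window in (8, 40):
        cycles = simulate(mixed, width=2, window=window)
        print(f"chain + independent  width=2  window={window}  cycles={cycles}")
```

In this toy model the serial chain and the two interleaved chains need the same number of cycles at width 2 and width 3; only the fully independent stream benefits from the third issue slot. The last workload runs faster with the larger window, because the scheduler can already see the independent instructions while the chain is still in progress.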
In some integer-intensive applications, the lower maximum throughput of integer instructions might slow things down. That is the not very useful "it depends" answer, so let's clarify: what kind of applications are we talking about?