Cache Is Not the Only, Or Even the Main, Culprit

Most people pointed to high latency caches as a reason for subpar Bulldozer performance, but the real explanation of why Bulldozer's performance was underwhelming is a lot more complex. First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache. Intel introduced a 4-cycle latency cache three years ago with their Nehalem architecture, and Intel's engineers claim that simulations show that a 3-cycle L1 would only boost performance by 2-3% (at the same clock), which is peanuts compared to the performance boost that is the result of the higher clock speed headroom.

Secondly, a dedicated 4-way 16KB cache, although relatively small, is hardly worse than Intel's 8-way 32KB data cache that is shared by two threads. The cache is also predicted lowering the power to search, so the Bulldozer data cache organisation does have its advantages.

Considering that SAP and Libquantum tell us that Bulldozer's prefetching works quite well, the 20-cycle L2 cache latency might not be a showstopper after all in server and HPC applications. We noticed also that the large 2MB cache offers (much) higher hit rates than the 512KB L2 cache of the older Istanbul/Magny-Cours cores. So while the L2 cache latency is not an advantage, we definitely have doubts that it is a major factor.

We do agree that it is a serious problem for desktop applications as most of our profiling shows that games and other consumer applications are much more sensitive to L2 cache latency. It was after all one of the reasons why Nehalem was not much faster than the older Penryn based CPUs. Lowly threaded desktop applications run best in a large, low latency L2 cache. But for server applications, we found worse problems than the L2 cache.

The Real Shortcomings: Branch Misprediction Penalty and Instruction Cache Hit Rate

Bulldozer is a deeply pipelined CPU, just like Sandy Bridge, but the latter has a µop cache that can cut the fetching and decoding cycles out of the branch misprediction penalty. The lower than expected performance in SAP and SQL Server, plus the fact that the worst performing subbenches in SPEC CPU2006 int are the ones with hard to predict branches, all points to there being a serious problem with branch misprediction.

Our Code Analyst profiling shows that AMD engineers did a good job on the branch prediction unit: the BPU definitely predicts better than the previous AMD designs. The problem is that Bulldozer cannot hide its long misprediction penalty, which Intel does manage with Sandy Bridge. That also explains why AMD states that branch prediction improvements in "Piledriver" ("Trinity") are only modest (1% performance improvements). As branch predictors get more advanced, a few tweaks here and there cannot do much.

It will be interesting to see if AMD will adopt a µop cache in the near future, as it would lower the branch prediction penalty, save power, and lower the pressure on the decoding part. It looks like a perfect match for this architecture.

Another significant problem is that the L1 instruction cache does not seem to cope well with 2-threads. We have measured significantly higher miss rates once we run two threads on the 2-way 64KB L1 instruction cache. It looks like the associativity of that cache is simply too low. There is a reason why Intel has an 8-way associative cache to run two threads.

Desktop Performance Was Not the Priority

No matter how rough the current implementation of Bulldozer is, if you look a bit deeper, this is not the architecture that is made for high-IPC, branch intensive, lightly-threaded applications. Higher clock speeds and Turbo Core should have made Zambezi a decent chip for enthusiasts. The CPU was supposed to offer 20 to 30% higher clock speeds at roughly the same power consumption, but in the end it could only offer a 10% boost at slightly higher power consumption.

Server Workloads: There Is Hope

If there is one thing this article should have made clear, it's that server applications have completely different demands than SPEC CPU or workstation software. They are much more limited by MLP, come with lower IPC, and are more scalable. They also come with a much larger memory footprint and punish small, low latency caches with high miss rates. Therefore a higher latency but larger L2 cache assisted by good prefetchers can perform adequately.

We strongly believe the concepts behind Bulldozer are sound ones for the professional IT world. The trade-offs are well made for these workloads, but there seem to be four show stoppers. So far we found out that the instruction cache, the branch misprediction penalty, and the lack of clock speed are the main reasons why Bulldozer underperforms in the server world.

The lack of clock speed seems to be addressed in Piledriver with the use of hard edge flops and the resonant clock edge, which is especially useful for clock speeds beyond 3GHz. That means "Abu Dhabi" might be a pleasant surprise. AMD has done it before: in 2007, "Barcelona" (K10 architecture) started at a very dissapointing 2GHz and with worse single-threaded performance than expected. At the end of 2008, a slightly improved version of this architecture (Shanghai) was running at 2.7GHz and had a cache that was three times larger with slightly lower latency. So let's hope that "Abu Dhabi" can repeat the "Shanghai stunt".

But what about the fourth show stopper? That is probably one of the most interesting ones because it seems to show up (in a lesser degree) in Sandy Bridge too. However, we're not quite ready with our final investigations into this area, so you'll have to wait a bit longer. To be continued....

Branch Prediction Analysis
Comments Locked

84 Comments

View All Comments

  • Zoomer - Thursday, May 31, 2012 - link

    True. It's probably better out way back then, but synthesized, than to come out maybe next year with all their lovingly fully customized, hand placed transistors. That's if they don't go bankrupt first.

    wolfman3k5's probably going to call nVidia, 3dfx, ATi (then), most FPGA program design houses, etc, lazy, too.
  • misiu_mp - Monday, June 11, 2012 - link

    A large margin of error means that you have a lot of space to make errors with little consequence.
    You meant of course that engineers have small margins of error in their work.
  • 500MM - Wednesday, May 30, 2012 - link

    http://images.anandtech.com/graphs/graph5057/42770...

    If lower was better, AMD would have one kickass CPU. The caption is wrong.
  • JohanAnandtech - Friday, June 1, 2012 - link

    Fixed, thx!
  • weebnuts - Wednesday, May 30, 2012 - link

    The problem with all these benchmarks is that most organizations are going to be using this is Xen or Vmware uses. The idea is that with more cores, you can run more VM's especially if you are trying to implement Virtual Desktops. How do the processors compare when you are loading the server to 80-90% capacity with lots of VM's? That's a real world comparison I want to see.
  • Iketh - Wednesday, May 30, 2012 - link

    I was dying for information like this. Thank you!

    And as for that quote on the first page by Iketh, that guy is a genious!! :D
  • Aone - Thursday, May 31, 2012 - link

    1) Maybe i missed something but, Should "Higher is better" be for "Data Cache hitrate", i.e. opposite to cache misses?

    2) And on the chart "L2 Cache hitrate", is it correct that "Opteron 6276" tag is shown on first line while "Opteron 6174" on the last line? I thought Opteron 6174 was faster in MS SQL than Opteron 6276.
  • mrdcook - Thursday, May 31, 2012 - link

    There are a few new instructions in Bulldozer's architecture that, for certain specific computations, can make it 10X faster than Intel. For example, FMA. An FMA does a multiply and then an add as one instruction, rounding only once. Combining the multiply and the add isn't such a big deal (and in many cases can even be counter-productive), but rounding only once is very important in some cases.

    For example, assume you have 3 digits of accuracy and want to calculate (1.23 * 2.31 - 2.84). Without FMA, you calculate Round(1.23 * 2.31) = 2.84, then you calculate Round(2.84 - 2.84) = 0. With FMA, you calculate 1.23 * 2.31 = 2.8413, then you calculate Round(2.8413 - 2.84) = 0.0013. While that may seem contrived (it was!), the difference is significant in certain simulations and calculations.

    When doing math, computers have a very specific level of accuracy -- a certain number of digits of precision. If you want your simulations to come out right, you have to take these limits into account. Learning how to account for the computer's rounding errors is a bit of a black art.

    Mathematicians design algorithms in terms of matrix multiplications and dot products, and if you translate those algorithms directly into computer multiplications and additions, you tend to end up with a lot of cancellation errors like the example given above. You can hire a computer science grad student to rework your algorithm to not lose accuracy, but that is expensive and has to be done for every new algorithm. Or you can use an FMA for the dot products and the matrix multiplications (the high-accuracy dot product and matrix multiplication libraries already do this).

    FMA in software is slow. Single-precision emulated FMA isn't too bad since you can use double-precision to help with the hardest bits of the emulation. The result is that you can do one fmaf in about 4X the amount of time it would take to do a single a*b+c. However, SSE2 allows you to do 4 a*b+c at a time, so emulated single-precision FMA ends up being about 15X slower than optimized SSE2 non-fused multiply-add. Double-precision is harder, taking about 10 times longer than a single a*b+c, so it ends up being 20X slower than non-fused multiply-add.

    Admittedly, the target market for FMA is probably smaller than a breadbox, but those who need it really need it. And as it becomes more common, it'll only become more important. For now, since only Bulldozer has it, nobody is going to care.
  • BaronMatrix - Thursday, May 31, 2012 - link

    There are admittedly only two viable X86 licensees in America and one of them sucks...
  • shodanshok - Thursday, May 31, 2012 - link

    Hi Johan,
    first of all, let me thank you for your wonderful analysis on Bulldozer architecture. I read it with great interest.

    However, I think that you left out a very important thing to mention: L1/L2 cache read/write bandwidth. Especially for L2, while latency is an important thing, throughput can be an even more crucial one.

    The key point is that Bulldozer has an write-through L1 cache, so all L1 writes are more or less immediately broadcasted to L2 cache. Some small writes can be effectively cached inside a write-back combining buffer called Write Combining Cache (WCC), but this cache is only 4KB in size per the entire module. So, streaming writes will immediatly fill the WCC and bring down L1 cache speed to L2 levels.

    This can really hamper CPU performance. Obviously, AMD went this road for some understandable reasons, however, the WCC is really too small to cache much data and the L2 is way too slow to efficiently serve L1 write requests.

    This bring us to another point: L2 cache is slow. Comparing this with the super-fast (but much smaller) L2 Intel cache, it has no hope; it is more or less at Intel's L3 level.

    Here you can find my analysis of AMD Bulldozer architecture: http://www.ilsistemista.net/index.php/hardware-ana...">AMD Bulldozer analysis
    Note that, while I collected and normalized data from multiple web site, I left very clear what was the original reference (so that you can easily verify my data).

    Thanks.

Log in

Don't have an account? Sign up now