Cache Is Not the Only, Or Even the Main, Culprit

Most people pointed to high latency caches as a reason for subpar Bulldozer performance, but the real explanation of why Bulldozer's performance was underwhelming is a lot more complex. First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache. Intel introduced a 4-cycle latency cache three years ago with their Nehalem architecture, and Intel's engineers claim that simulations show that a 3-cycle L1 would only boost performance by 2-3% (at the same clock), which is peanuts compared to the performance boost that is the result of the higher clock speed headroom.

Secondly, a dedicated 4-way 16KB cache, although relatively small, is hardly worse than Intel's 8-way 32KB data cache that is shared by two threads. The cache is also predicted lowering the power to search, so the Bulldozer data cache organisation does have its advantages.

Considering that SAP and Libquantum tell us that Bulldozer's prefetching works quite well, the 20-cycle L2 cache latency might not be a showstopper after all in server and HPC applications. We noticed also that the large 2MB cache offers (much) higher hit rates than the 512KB L2 cache of the older Istanbul/Magny-Cours cores. So while the L2 cache latency is not an advantage, we definitely have doubts that it is a major factor.

We do agree that it is a serious problem for desktop applications as most of our profiling shows that games and other consumer applications are much more sensitive to L2 cache latency. It was after all one of the reasons why Nehalem was not much faster than the older Penryn based CPUs. Lowly threaded desktop applications run best in a large, low latency L2 cache. But for server applications, we found worse problems than the L2 cache.

The Real Shortcomings: Branch Misprediction Penalty and Instruction Cache Hit Rate

Bulldozer is a deeply pipelined CPU, just like Sandy Bridge, but the latter has a µop cache that can cut the fetching and decoding cycles out of the branch misprediction penalty. The lower than expected performance in SAP and SQL Server, plus the fact that the worst performing subbenches in SPEC CPU2006 int are the ones with hard to predict branches, all points to there being a serious problem with branch misprediction.

Our Code Analyst profiling shows that AMD engineers did a good job on the branch prediction unit: the BPU definitely predicts better than the previous AMD designs. The problem is that Bulldozer cannot hide its long misprediction penalty, which Intel does manage with Sandy Bridge. That also explains why AMD states that branch prediction improvements in "Piledriver" ("Trinity") are only modest (1% performance improvements). As branch predictors get more advanced, a few tweaks here and there cannot do much.

It will be interesting to see if AMD will adopt a µop cache in the near future, as it would lower the branch prediction penalty, save power, and lower the pressure on the decoding part. It looks like a perfect match for this architecture.

Another significant problem is that the L1 instruction cache does not seem to cope well with 2-threads. We have measured significantly higher miss rates once we run two threads on the 2-way 64KB L1 instruction cache. It looks like the associativity of that cache is simply too low. There is a reason why Intel has an 8-way associative cache to run two threads.

Desktop Performance Was Not the Priority

No matter how rough the current implementation of Bulldozer is, if you look a bit deeper, this is not the architecture that is made for high-IPC, branch intensive, lightly-threaded applications. Higher clock speeds and Turbo Core should have made Zambezi a decent chip for enthusiasts. The CPU was supposed to offer 20 to 30% higher clock speeds at roughly the same power consumption, but in the end it could only offer a 10% boost at slightly higher power consumption.

Server Workloads: There Is Hope

If there is one thing this article should have made clear, it's that server applications have completely different demands than SPEC CPU or workstation software. They are much more limited by MLP, come with lower IPC, and are more scalable. They also come with a much larger memory footprint and punish small, low latency caches with high miss rates. Therefore a higher latency but larger L2 cache assisted by good prefetchers can perform adequately.

We strongly believe the concepts behind Bulldozer are sound ones for the professional IT world. The trade-offs are well made for these workloads, but there seem to be four show stoppers. So far we found out that the instruction cache, the branch misprediction penalty, and the lack of clock speed are the main reasons why Bulldozer underperforms in the server world.

The lack of clock speed seems to be addressed in Piledriver with the use of hard edge flops and the resonant clock edge, which is especially useful for clock speeds beyond 3GHz. That means "Abu Dhabi" might be a pleasant surprise. AMD has done it before: in 2007, "Barcelona" (K10 architecture) started at a very dissapointing 2GHz and with worse single-threaded performance than expected. At the end of 2008, a slightly improved version of this architecture (Shanghai) was running at 2.7GHz and had a cache that was three times larger with slightly lower latency. So let's hope that "Abu Dhabi" can repeat the "Shanghai stunt".

But what about the fourth show stopper? That is probably one of the most interesting ones because it seems to show up (in a lesser degree) in Sandy Bridge too. However, we're not quite ready with our final investigations into this area, so you'll have to wait a bit longer. To be continued....

Branch Prediction Analysis


View All Comments

  • shodanshok - Thursday, May 31, 2012 - link

    Mmm... the link was malformed in the previous message.

    The correct one is:

  • name99 - Thursday, May 31, 2012 - link

    "First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache."

    I know you guys are interested in the question --- why does Bulldozer frequently suck? --- rather than the question ---why is Sandy Bridge so much better? --- but it is this latter question that interests me the most.

    What strikes me, on going through all this data (including information that is NOT in the article, and on my experience back in the day when I was writing assembly and counting cycles) is that the "eventual cost" of misses that go all the way to RAM is not covered in the article, and I suspect this is a large part of the issue.

    What I mean here is the following: consider an extremely simple model --- an L1 hit takes 1 cycle, an L1 miss that goes to RAM takes 100 cycles. Then a 97% L1 hit rate takes a total of basically 400 cycles; a 99% L1 hit rate takes 200 cycles --- apparently minor differences have a huge effect! But that's not the point I want to focus on.
    Let's make the model more complicated. First let's make L1 hit cost more realistic --- 4 cycles. As the article says, this is, for the most part, trivially hidden by the OoO engine. But then why can't the OoO engine also hide all or most of the cost of all those cycles to RAM?
    And that, I think, is where the Intel advantage is. They do such a good job with their OoO engine.

    At a gross level, OoO engines all look kinda the same --- look at a PPC 750 and an IB and, at a superficial level, they look similar. But firstly the IB has just so much larger buffers (what, 168 or so, compared to the 750s what, 6 or so) that, of course, it has a vastly larger stock of instructions it can keep chewing through as it waits for the RAM.

    But, you say, AMD also has large buffers now. Yes, but it's not only the raw buffers. Whenever you start looking at these chips, you discover all sorts of weird limitations on what they can actually do to use all those buffers. I've no idea what the current exact limitations are, but the sort of thing you would have in the past is that maybe all the buffers are flushed on an interrupt or system call; or there'd be strange conditions that could occur where, although in theory the integer engine could keep going past a blocking FP instruction, it turned out to be easier to prevent some race condition by freezing the integer engine under these conditions.

    Secondly while you're executing other instructions, waiting on your RAM, you may well execute a few more load/store instructions that again miss in RAM. How well do you handle these? Can you just keep firing out these load/stores, or do you block at the second (or third, or fourth)? Frequently these load-stores refer to the same cache-line that's already in play from the first L1 miss, and how do you handle that? the truly dumb thing, of course, is to send out ANOTHER memory request. Smarter is to suppress that, but you're still using load-store entries in the main "miss to RAM" data structures. Smarter still is to be aware that this line will be coming eventually, and use auxiliary data structures to hold info about this load/store.

    It's these sorts of technical details, which don't appear in the gross specs (and sometimes not even in the detailed CPU descriptions) that make so much difference. They are obviously astonishingly difficult to get right. Intel has the manpower to worry about every one of them, AMD does not.

    Point is --- if I had to look for a single difference between the the two, that's what I'd be looking at --- how much time is REALLY wasted waiting on DRAM in SB vs on Bulldozer.
  • misiu_mp - Monday, June 11, 2012 - link

    It is the compiler's and Out-Of-Order engine's job to order loads, stores and other instructions to minimize the total execution time.
    So making sure no stupid and unnecessary loads are being committed is what the OOO mechanism normally does.
    There is no reason to suspect it is fundamentally broken in Bulldozer.
  • IceDread - Friday, June 1, 2012 - link

    It really is simple, Amd did a Huge mistake.

    The product is a bust, simple as that.

    The next generation or the generations after that might be a whole different matter, but guess what? No one cares. It wont help the poor souls that bought this busted product.

    It's annoying that Amd could not do better because now Intel reigns supreme and competes with itself .
  • mikato - Friday, June 1, 2012 - link

    I know I shouldn't feed the trolls but...
    You say next generations might be a whole different matter - well what do you think is the point of learning about the Bulldozer architecture? The next generations are based on it.
  • IceDread - Monday, June 4, 2012 - link

    What is the point of releasing a product that does not outperform it's predecessor?
    Hope that people will purchase the product anyway and learn it?
    Which companies would be interested in this, how many? Why would they invest money into this?
  • _vor_ - Saturday, June 2, 2012 - link

    Yes. I too would be interested in exactly what aspects you think Bulldozer failed and your design ideas and approach on how you would fix them. Do tell. Reply
  • wiyosaya - Friday, June 1, 2012 - link

    Personally, I think it is always nice to see in-depth articles like this that explain the details of the structure of a processor.

    To me, it sounds like AMD has a foundation that with a few well-directed tweaks, may put them in contention with Intel again in the CPU arena. Though AMD has said that they are through competing with Intel, I truly hope this is not the case. Perhaps this is a marketing tactic remove focus from themselves after the enthusiast arena panned BD and its siblings.

    I've built my systems with AMD for a long time; however, this time I went with Intel because I thought they had the better value. Perhaps the future will bring me back to AMD, however, I cannot see doing so right now simply because Intel has become the "value" line over AMD.

    With an i7-3820 in my most recent rig, I think I picked the SB-E value processor. I run more than games, and some of what I run takes advantage of quad-channel memory.

    In any event, I'm set for a while. Perhaps AMD will once again produce a superior product by the time I am ready for my next build.
  • jamyryals - Friday, June 1, 2012 - link

    What a great read, thanks! Reply
  • SocketF - Friday, June 1, 2012 - link

    Hi Johan,

    thanks for the test, it is great.

    However, on page 9 you have some trouble with percentage calculations. You wrote:

    We get a 65% speed up (2x 0.71 vs 0.86), which is somewhat lower than the 80% predicted by the AMD slides discussing CMT.
    This numbers are totally correct and within AMD's predictions. AMD promised 80% performance for the CMT-Bulldozer module, compared to an hypothetical Bulldozer CMP core, i.e. 2 (single) cores.

    So you have to double your single-thread results, to get the score of 2 (single) Bulldozer cores (2 CMP cores). That gives: 0.86 x 2 = 1.72

    Now compare that to the real performance of 2 CMT cores of one module, which is 0.71 x 2 = 1.42

    1.42 are 82.6% of 1.72, which is better than AMD's 80% claim. Thus their claim holds. Everything's fine, don't worry.

    Source of AMD's claim is e.g. here:
    (sorry, didn't find it on anandtech)

    Please update your article accordingly.

    Oh and one last question, why did you add up the SMT scores but not the CMT scores? Seems odd, an IPC of "two threads", This is just weired. Furthermore it is somehow useless, because you cannot compare it directly with the CMT scores. A diagram should visualize the results not force the reader to do some re-calculations.

    Thanks again


Log in

Don't have an account? Sign up now