Cache Is Not the Only, Or Even the Main, Culprit

Most people pointed to high latency caches as a reason for subpar Bulldozer performance, but the real explanation of why Bulldozer's performance was underwhelming is a lot more complex. First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache. Intel introduced a 4-cycle latency cache three years ago with their Nehalem architecture, and Intel's engineers claim that simulations show that a 3-cycle L1 would only boost performance by 2-3% (at the same clock), which is peanuts compared to the performance boost that is the result of the higher clock speed headroom.

Secondly, a dedicated 4-way 16KB cache, although relatively small, is hardly worse than Intel's 8-way 32KB data cache that is shared by two threads. The cache is also predicted lowering the power to search, so the Bulldozer data cache organisation does have its advantages.

Considering that SAP and Libquantum tell us that Bulldozer's prefetching works quite well, the 20-cycle L2 cache latency might not be a showstopper after all in server and HPC applications. We noticed also that the large 2MB cache offers (much) higher hit rates than the 512KB L2 cache of the older Istanbul/Magny-Cours cores. So while the L2 cache latency is not an advantage, we definitely have doubts that it is a major factor.

We do agree that it is a serious problem for desktop applications as most of our profiling shows that games and other consumer applications are much more sensitive to L2 cache latency. It was after all one of the reasons why Nehalem was not much faster than the older Penryn based CPUs. Lowly threaded desktop applications run best in a large, low latency L2 cache. But for server applications, we found worse problems than the L2 cache.

The Real Shortcomings: Branch Misprediction Penalty and Instruction Cache Hit Rate

Bulldozer is a deeply pipelined CPU, just like Sandy Bridge, but the latter has a µop cache that can cut the fetching and decoding cycles out of the branch misprediction penalty. The lower than expected performance in SAP and SQL Server, plus the fact that the worst performing subbenches in SPEC CPU2006 int are the ones with hard to predict branches, all points to there being a serious problem with branch misprediction.

Our Code Analyst profiling shows that AMD engineers did a good job on the branch prediction unit: the BPU definitely predicts better than the previous AMD designs. The problem is that Bulldozer cannot hide its long misprediction penalty, which Intel does manage with Sandy Bridge. That also explains why AMD states that branch prediction improvements in "Piledriver" ("Trinity") are only modest (1% performance improvements). As branch predictors get more advanced, a few tweaks here and there cannot do much.

It will be interesting to see if AMD will adopt a µop cache in the near future, as it would lower the branch prediction penalty, save power, and lower the pressure on the decoding part. It looks like a perfect match for this architecture.

Another significant problem is that the L1 instruction cache does not seem to cope well with 2-threads. We have measured significantly higher miss rates once we run two threads on the 2-way 64KB L1 instruction cache. It looks like the associativity of that cache is simply too low. There is a reason why Intel has an 8-way associative cache to run two threads.

Desktop Performance Was Not the Priority

No matter how rough the current implementation of Bulldozer is, if you look a bit deeper, this is not the architecture that is made for high-IPC, branch intensive, lightly-threaded applications. Higher clock speeds and Turbo Core should have made Zambezi a decent chip for enthusiasts. The CPU was supposed to offer 20 to 30% higher clock speeds at roughly the same power consumption, but in the end it could only offer a 10% boost at slightly higher power consumption.

Server Workloads: There Is Hope

If there is one thing this article should have made clear, it's that server applications have completely different demands than SPEC CPU or workstation software. They are much more limited by MLP, come with lower IPC, and are more scalable. They also come with a much larger memory footprint and punish small, low latency caches with high miss rates. Therefore a higher latency but larger L2 cache assisted by good prefetchers can perform adequately.

We strongly believe the concepts behind Bulldozer are sound ones for the professional IT world. The trade-offs are well made for these workloads, but there seem to be four show stoppers. So far we found out that the instruction cache, the branch misprediction penalty, and the lack of clock speed are the main reasons why Bulldozer underperforms in the server world.

The lack of clock speed seems to be addressed in Piledriver with the use of hard edge flops and the resonant clock edge, which is especially useful for clock speeds beyond 3GHz. That means "Abu Dhabi" might be a pleasant surprise. AMD has done it before: in 2007, "Barcelona" (K10 architecture) started at a very dissapointing 2GHz and with worse single-threaded performance than expected. At the end of 2008, a slightly improved version of this architecture (Shanghai) was running at 2.7GHz and had a cache that was three times larger with slightly lower latency. So let's hope that "Abu Dhabi" can repeat the "Shanghai stunt".

But what about the fourth show stopper? That is probably one of the most interesting ones because it seems to show up (in a lesser degree) in Sandy Bridge too. However, we're not quite ready with our final investigations into this area, so you'll have to wait a bit longer. To be continued....

Branch Prediction Analysis
Comments Locked

84 Comments

View All Comments

  • Schmide - Wednesday, May 30, 2012 - link

    I do remember from some analysis that the L2 cache reads were as slow as main memory. That's great if you hit a L2 cache, but it's not going to buy you anything if it's that slow.
  • SocketF - Wednesday, May 30, 2012 - link

    Impossible, you probably mix some things up, maybe latency and bandwidth?
  • Schmide - Wednesday, May 30, 2012 - link

    Yup. It was late at night, I was thinking writes. the L1 write through basically makes L1 writes the same as L2 writes.
  • Homeles - Wednesday, May 30, 2012 - link

    Not even close. L2 is about 10 times faster than main memory.

    http://www.anandtech.com/show/4955/the-bulldozer-r...
  • jcollake - Wednesday, May 30, 2012 - link

    Through research here at Bitsum on the AMD Bulldozer platform (specifically the 9150), I found a couple things of interest.

    First, disabling CPU core parking seems to make a big difference in performance. I believe that by default the CPU core parking is just too aggressive. I wrote a tool to let you enable or disable CPU parking in *real time* without a reboot, so you can test this yourself. It is called ParkControl, http://bitsum.com/about_cpu_core_parking.php . For *me*, it seemed to make a night and day difference.

    Second, I am working on a neat little benchmarking tool called ThreadRacer, currently only in alpha prototype. It allows you to really see the effects of these paired cores, and how much it matters that the scheduler is properly aware of them. Take this 1 second or so sample, as seen in the screenshot here (downloads available, but it is an early prototype that I'll quickly be finishing up): http://bitsum.com/forum/index.php/topic,1434.0.htm...

    The scheduler update that Microsoft issued of course treats these paired cores as it would a hyper-threaded core. Indeed, the concept is very similar, except perhaps to avoid patents, AMD took the 'share a little' instead of 'share a lot' approach when it comes to shared computational resources. This was the proper way to *quickly* address the issue, but I believe the scheduler is still suboptimal on these processors (likely to be resolved in Windows 8 or a later update to Windows 7/Vista).

    For Bulldozer, as you know, they are two real processors, but because they have shared dependencies, the performance can really be drained if the other processor in the 'pair' is busy. You can see the effects from ThreadRacer, the core without its pair busy quickly out-paced the paired cores that were both busy.
  • jcollake - Wednesday, May 30, 2012 - link

    I should have also mentioned that ThreadRacer also allows you to see how a single CPU consuming thread gets swapped around to different cores (the multi-core thread in the utility). This is its other use. The less the thread gets swapped from core to core, the greater the performance will be. It is interesting to compare and contrast the behavior of the scheduler. I fully believe that most the problems with Bulldozer are due to the Windows scheduler, something that could be tested by using linux and replacing the scheduler with a custom one, or an off the shelf alternative that may behave substantially differently than the Windows scheduler.
  • SocketF - Wednesday, May 30, 2012 - link

    Some people running BOINC programs have reported that Windows-applications run faster when they use a Linux and WINE or a VM.

    The Win-scheduler especially hurts AMD chips, because of the huge exclusive caches. If a thread on an intel CPU is switched to another core, it can load the warmed up L2 portion from the L2 inclusive L3.

    I did some google-search and it seems that under Linux, each core has its own run-queue, whereas on Windows, there is only one run queue for all cores.

    But i didn't delve into it deeply, there are so many different schedulers for Linux, seems to be a complex issue ;-)

    Btw. your link to download is off limits for non-members of your discussion board:
    -------------------------
    Warning!

    The topic or board you are looking for appears to be either missing or off limits to you.
    Please login below or register an account with Bitsum Forums.
    ----------------------------

    Maybe you can upload it somewhere else?
  • jcollake - Saturday, September 1, 2012 - link

    Sorry for the late reply. First, the forum permissions were fixed. Second, the utility (still in early stages) is included in Process Lasso *and* available here: http://bitsum.com/threadracer.php
  • eoerl - Wednesday, May 30, 2012 - link

    Very interesting article, together with the hardware.fr report there's a lot of information. One question though, if you read commentaries : you didn't speak much about the influence of compilers. This proved to change a lot of things on Linux (see phoronix extensive tests on both ivy bridge and bulldozer depending on compiler used and compiler options, for example
    http://www.phoronix.com/scan.php?page=article&...
    http://www.phoronix.com/scan.php?page=article&...
    Benchmark results really change a lot with bulldozer, much more than with ivy or sandy bridge. Do you think AMD lost being oversensitive to compiler optimisations, due to a very original architecture ?
  • JohanAnandtech - Thursday, May 31, 2012 - link

    I deliberately avoided the compiler issues as this would make the article too convoluted. But notice that what we found is not influenced by compiler choice: we find the same indications in SAP and SQL server (compiled by "conservative" compilers and compiler settings) as in CPU CPU 2006, which uses the best optimized settings and compiler as possible.

Log in

Don't have an account? Sign up now