Zooming in on SPEC CPU2006: the Good

We singled out the benchmarks that showed at least a 30% improvement over Magny-Cours (based on the K10 core). Remember that the Bulldozer architecture was designed to deliver 33% more cores in the same power envelope while keeping IPC at roughly 95% of the K10's. The rest of the performance gain was supposed to come from a clock speed increase. That clock speed increase did not materialize in the real world, and we also kept the clock speed the same to focus on the architecture. So while a 30-35% performance increase is good, anything over 35% indicates that the Bulldozer architecture handles that particular sort of software better than Magny-Cours.
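
As a back-of-the-envelope check on that design target, here is a minimal sketch that assumes throughput scales as cores x IPC x clock. That multiplicative model is a simplifying assumption for illustration, not an AMD figure, and the function names are hypothetical. Under it, 33% more cores at 95% of the IPC gives roughly a 26% gain at equal clocks, so reaching 35% would need about 7% more clock speed.

```python
def expected_speedup(core_ratio: float, ipc_ratio: float, clock_ratio: float = 1.0) -> float:
    """Throughput relative to the older chip under a simple multiplicative model."""
    return core_ratio * ipc_ratio * clock_ratio


def clock_ratio_needed(target_speedup: float, core_ratio: float, ipc_ratio: float) -> float:
    """Clock ratio required to reach target_speedup under the same model."""
    return target_speedup / (core_ratio * ipc_ratio)


if __name__ == "__main__":
    # 33% more cores, IPC at 95% of K10, clock speed unchanged.
    same_clock = expected_speedup(core_ratio=1.33, ipc_ratio=0.95)
    print(f"Expected gain at equal clocks: {100 * (same_clock - 1):.1f}%")

    # How much extra clock speed a 35% overall gain would require.
    needed = clock_ratio_needed(1.35, core_ratio=1.33, ipc_ratio=0.95)
    print(f"Clock increase needed for a 35% gain: {100 * (needed - 1):.1f}%")
```

Seen this way, a benchmark that gains 30% or more at equal clocks is already doing better than the core count and IPC targets alone would predict.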

SPEC Int CPU2006: the Bulldozer friendly

The Libquantum score is the most spectacular. Bulldozer performs over twice as fast as Magny-Cours, and its score of 2750 is not that far from the almighty Xeon E5-2660 at 2.2GHz (3310): Bulldozer is only 17% slower here.
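
The 17% figure follows directly from the two scores. A minimal sketch of the arithmetic (the helper name is hypothetical):

```python
def percent_slower(score: float, reference: float) -> float:
    """How far `score` falls short of `reference`, as a percentage of `reference`."""
    return 100.0 * (1.0 - score / reference)


if __name__ == "__main__":
    bulldozer_score = 2750  # libquantum score quoted above
    xeon_score = 3310       # Xeon at 2.2GHz, quoted above
    print(f"Bulldozer is {percent_slower(bulldozer_score, xeon_score):.0f}% slower")
```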

At first sight, there is nothing that should make Libquantum run very fast on Bulldozer. Libquantum contains a high share of branches (27%), and we have seen before that although Bulldozer has a somewhat improved branch predictor, its deeper pipeline and higher branch misprediction penalty can cause a lot of trouble. In fact, Perlbench (23%), Sjeng Chess (21%), and Gobmk (AI, 21%) are branchy software and are among the worst performing tests on Bulldozer. Luckily, Libquantum's branches are much easier to predict: it has one of the lowest branch misprediction rates of all the benchmarks (fewer than six per 1000 instructions).
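
To see why the misprediction rate matters more than the raw branch percentage, a common first-order estimate multiplies mispredictions per instruction by the pipeline flush penalty. The sketch below uses that estimate; the two penalty values and the 15-per-1000 case are hypothetical numbers chosen only to contrast a shallower and a deeper pipeline, not measurements of K10 or Bulldozer.

```python
def mispredict_cpi(mispredicts_per_kilo_instr: float, penalty_cycles: float) -> float:
    """Extra cycles per instruction lost to branch mispredictions (first-order estimate)."""
    return mispredicts_per_kilo_instr / 1000.0 * penalty_cycles


# Hypothetical flush penalties in cycles, for illustration only.
SHALLOW_PIPELINE_PENALTY = 12
DEEP_PIPELINE_PENALTY = 20

if __name__ == "__main__":
    # 6 per 1000 is the upper bound quoted for libquantum; 15 is a made-up
    # value standing in for a benchmark with hard-to-predict branches.
    for mpki in (6, 15):
        shallow = mispredict_cpi(mpki, SHALLOW_PIPELINE_PENALTY)
        deep = mispredict_cpi(mpki, DEEP_PIPELINE_PENALTY)
        print(f"{mpki:2d} mispredicts/1000 instr: "
              f"+{shallow:.3f} CPI (shallow) vs +{deep:.3f} CPI (deep)")
```

The extra cost of the deeper pipeline grows in proportion to the misprediction rate, so a branch-heavy program with well-predicted branches pays comparatively little for it.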

We all know that Bulldozer can deal much better with loads and stores than Magny-Cours. However, libquantum has the lowest (!) share of loads and stores (19% = 14% loads, 5% stores), so the improved Memory Level Parallelism of Bulldozer is not the answer. The table below gives an idea of the instruction mix of SPEC CPU2006int.

SPEC Int 2006 Application   IPC*   Branches (%)   Stores (%)   Loads (%)   Total Loads/Stores (%)
perlbench                   1.67   23             12           24          36
Bzip compression            1.43   15             9            26          35
Gcc                         0.83   22             13           26          39
mcf                         0.28   19             9            31          40
Go AI                       1.00   21             14           28          42
hmmer                       1.67   8              16           41          57
Chess                       1.25   21             8            21          29
libquantum                  0.43   27             5            14          19
h264 encoding               2.00   8              12           35          47
omnetpp                     0.38   21             18           34          52
astar                       0.56   17             5            27          32
XML processing              0.66   26             9            32          41

* IPC as measured on Core 2 Duo.
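
The claims about libquantum's instruction mix can be checked mechanically against the table above. The sketch below copies the branch, store, and load percentages into a dictionary (the variable and function names are hypothetical) and recomputes the load/store total for each benchmark.

```python
# Each entry is (branches %, stores %, loads %), copied from the table above.
INSTRUCTION_MIX = {
    "perlbench": (23, 12, 24),
    "Bzip compression": (15, 9, 26),
    "Gcc": (22, 13, 26),
    "mcf": (19, 9, 31),
    "Go AI": (21, 14, 28),
    "hmmer": (8, 16, 41),
    "Chess": (21, 8, 21),
    "libquantum": (27, 5, 14),
    "h264 encoding": (8, 12, 35),
    "omnetpp": (21, 18, 34),
    "astar": (17, 5, 27),
    "XML processing": (26, 9, 32),
}


def load_store_share(name: str) -> int:
    """Loads plus stores as a percentage of all instructions."""
    _branches, stores, loads = INSTRUCTION_MIX[name]
    return stores + loads


def branch_share(name: str) -> int:
    """Branches as a percentage of all instructions."""
    return INSTRUCTION_MIX[name][0]


if __name__ == "__main__":
    # List benchmarks from the least to the most load/store heavy.
    for name in sorted(INSTRUCTION_MIX, key=load_store_share):
        print(f"{name:18s} branches {branch_share(name):2d}%  "
              f"loads+stores {load_store_share(name):2d}%")
```

Sorting this way puts libquantum first, with the lowest load/store share, while it also has the highest branch share of the twelve benchmarks.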

Libquantum has a relatively high number of cache misses on most CPUs as it works with a 32MB data set, so it benefits from a larger cache. The 8MB L3 vs 6MB L3 might have boosted performance a bit, but the main reason is the vastly improved prefetching inside Bulldozer. According to researchers at the University of Texas at Austin and Microsoft, the prefetch requests in libquantum are very accurate. If you check AMD's own publications, you'll notice that there were two major changes meant to improve the single-threaded performance of the Bulldozer architecture (compared to the previous ones): an improved Turbo Core and vastly improved prefetching.

Next, let's look at the excellent mcf result. mcf is by far the most memory intensive SPEC CPU Int benchmark out there. mcf misses the L1 data cache about five times more than all the other benchmarks on average. The hit rate is lower than 70%! mcf also misses the last level cache up to eight times more than all other benchmarks. Clearly mcf is a prime candidate to benefit from the vastly improved L/S units of Bulldozer.
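
To get a feel for what an L1 hit rate below 70% means, the textbook average memory access time formula (hit time plus miss rate times miss penalty) is a useful first approximation. In the sketch below, the 4-cycle hit time, the 20-cycle miss penalty, and the 95% comparison hit rate are hypothetical values picked for illustration, not measured Bulldozer or Magny-Cours latencies.

```python
def average_access_time(hit_rate: float, hit_cycles: float, miss_penalty_cycles: float) -> float:
    """Average cycles per memory access: hit time + miss rate * miss penalty."""
    miss_rate = 1.0 - hit_rate
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical latencies, for illustration only.
L1_HIT_CYCLES = 4
L1_MISS_PENALTY_CYCLES = 20

if __name__ == "__main__":
    # 0.70 is the upper bound quoted for mcf; 0.95 is a made-up
    # "well-behaved" benchmark used as a point of comparison.
    for label, hit_rate in (("mcf-like", 0.70), ("cache-friendly", 0.95)):
        cycles = average_access_time(hit_rate, L1_HIT_CYCLES, L1_MISS_PENALTY_CYCLES)
        print(f"{label:15s} hit rate {hit_rate:.0%}: {cycles:.1f} cycles per access")
```

With most of the access time coming from misses rather than hits, anything that lets the core keep more misses in flight or service them faster pays off disproportionately on a benchmark like mcf.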

Omnetpp is not that extreme, but the instruction mix has 52% loads and stores, and the L2 and last level cache misses are twice as high as the rest of the pack. In contrast to mcf, the number of branch mispredictions is much lower, despite the fact that it has a similar, relatively high percentage of branches (20%). So the somewhat lower reliance on the memory subsystem is largely compensated for by a much lower number of branch mispredictions. To be more precise: the number of branch mispredictions is about three times lower! This most likely explains why Bulldozer makes a slightly larger step forward in omnetpp compared to the previous AMD architecture than it does in mcf.


84 Comments


  • ArteTetra - Wednesday, May 30, 2012 - link

    "A core this complex in my opinion has not been optimized to its fullest potential. Expect better performance when AMD introduces later steppings of this core with regard to power consumption and higher clock frequencies."

    You don't say?
  • JohanAnandtech - Thursday, May 31, 2012 - link

    A quote by a reader, not ours :-). The idea is probably that Bulldozer was AMD's very first implementation of their new architecture.
  • haplo602 - Wednesday, May 30, 2012 - link

    now this was a great read. finally something interesting (the consumer benchmarks are NOT interesting anymore for me).

    I hope there will be a differential analysis once you have Piledriver CPUs available.
  • JohanAnandtech - Thursday, May 31, 2012 - link

    Piledriver analysis: definitely. Thanks for the encouraging words :-)
  • mikato - Friday, June 1, 2012 - link

    I agree - great critical thinking in this article! This subject definitely needed more research.
  • Spunjji - Wednesday, June 6, 2012 - link

    +1. This is the sort of thing I come here for!
  • Beenthere - Wednesday, May 30, 2012 - link

    Expecting Vishera to be an Intel killer is foolish as it's not going to happen and there is no need for it to happen. Ivy Bridge is very much like FX in that it's only 5% faster than SB and runs hot. At least FX chips OC and scale well unlike Ivy Bridge.

    If AMD can use some of the techniques employed in Trinity they should be able to get a 15+% improvement over the FX CPUs. This combined with higher clockspeeds now that GloFo has sorted 32nm production should provide a nice performance bump in Vishera.

    95% of consumers do not buy the fastest, most over-hyped and over-priced CPU on the planet for their PC or server apps. Mainstream use is what AMD is shooting for at the moment and doing pretty well at it. Eventually they will release APUs for all PC market segments that perform well, use less power and cost less than discrete CPU/GPU combo. THAT is what 95% of the X86 world will be using.
  • Homeles - Wednesday, May 30, 2012 - link

    "Ivy Bridge is very much like FX in that it's only 5% faster than SB and runs hot"

    I think you need to go read about Intel's tick-tock strategy.

    Also, unlike Bulldozer, Ivy Bridge was a step forward. A small one, but performance per watt went up, while with Bulldozer it often went backwards.

    Process maturity from GloFo will help, but probably not as much as you would think.

    Finally, "95% of users" aren't going to benefit best from a processor built with server workloads in mind. Even with server workloads, Bulldozer fails to deliver. APUs are definitely the future, but keep in mind that Intel's had an APU out for as long as AMD has. If you think that AMD's somehow going to pull a fast one on Intel, you're delusional. Intel and Nvidia as well are very, very well aware of heterogeneous computing.
  • The_Countess - Wednesday, May 30, 2012 - link

    looking at how much the performance per watt went up with piledriver compared with llano, I think they'll have a lot more headroom on the desktop and server space to increase the clock frequencies to where they were supposed to be with the bulldozer launch.
  • Homeles - Wednesday, May 30, 2012 - link

    Yeah, Piledriver will likely perform the way AMD had intended Bulldozer to perform.
