The Bulldozer Aftermath: Delving Even Deeper

Name: The Bulldozer Aftermath: Delving Even Deeper
Item: The Bulldozer Aftermath: Delving Even Deeper
Author: Johan De Gelas

by Johan De Gelas on May 30, 2012 1:15 AM EST

84 Comments | Add A Comment

84 Comments

Zooming in on SPEC CPU2006: the Good

We filtered out those benchmarks that showed a 30% improvement over Magny-Cours (based on the K10 core). Remember the Bulldozer architecture has been designed to deliver 33% more cores in the same power envelope while keeping the IPC more or less at 95% of the K10. The rest of the performance should have come from a clock speed increase. The clock speed increases did not materialize in the real world, and we also kept the clock speed the same to focus on the architecture. Where a 30-35% performance increase is good, anything over 35% indicates that the Bulldozer architecture handles that particular sort of software better than Magny-Cours.

SPEC Int CPU2006: the Bulldozer friendly

The Libquantum score is the most spectacular. Bulldozer performs over twice as fast and the score of 2750 is not that far from the all mighty Xeon 2660 at 2.2GHz (3310). Bulldozer here is only 17% slower.

At first sight, there is nothing that should make Libquantum run very fast on Bulldozer. Libquantum contains a high amount of branches (27%) and we have seen before that although Bulldozer has a somewhat improved branch predictor, the deeper pipeline and higher branch misprediction penalty can cause a lot of trouble. In fact, Perlbench (23%), Sjeng Chess (21%), and Gobmk (AI, 21%) are branchy software and are among the worst performing tests on Bulldozer. Luckily, Libquantum has a much easier to predict branches: libquantum is among the software pieces that has the lowest branch misprediction rates (less than six per 1000 instructions).

We all know that Bulldozer can deal much better with loads and stores than Magny-Cours. However, libquantum has the lowest (!) amount of load/stores (19%=14% Loads, 5% Stores). The improved Memory Level Parallelism of Bulldozer is not the answer. The table below gives an idea of the instruction mix of SPEC CPU2006int.

SPEC Int 2006 Application	IPC*	Branches	Stores	Loads	Total Loads/ Stores
perlbench	1.67	23	12	24	36
Bzip compression	1.43	15	9	26	35
Gcc	0.83	22	13	26	39
mcf	0.28	19	9	31	40
Go AI	1.00	21	14	28	42
hmmer	1.67	8	16	41	57
Chess	1.25	21	8	21	29
libquantum	0.43	27	5	14	1
h264 encoding	2.00	8	12	35	47
omnetppp	0.38	21	18	34	52
astar	0.56	17	5	27	32
XML processing	0.66	26	9	32	41

* IPC as measured on Core 2 Duo.

Libquantum has a relatively high amount of cache misses on most CPUs as it works with a 32MB data set, so it benefits from a larger cache. The 8MB L3 vs 6MB L3 might have boosted performance a bit, but the main reason is vastly improved prefetching inside Bulldozer. According to the researchers of the university of Austin and Microsoft, the prefetch requests in libquantum are very accurate. If you check AMD's own publications you'll notice that there were two major improvements to improve the single-threaded performance of the Bulldozer architecture (compared to the previous ones): an improved Turbo Core and vastly improved prefetching.

Next, let's look at the excellent mcf result. mcf is by far the most memory intensive SPEC CPU Int benchmark out there. mcf misses the L1 data cache about five times more than all the other benchmarks on average. The hit rate is lower than 70%! mcf also misses the last level cache up to eight times more than all other benchmarks. Clearly mcf is a prime candidate to benefit from the vastly improved L/S units of Bulldozer.

Omnetpp is not that extreme, but the instruction mix has 52% loads and stores, and the L2 and last level cache misses are twice as high as the rest of the pack. In contrast to mcf, the amount of branch mispredictions is much lower, despite the fact that it has a similar, relatively high percentage of branches (20%). So the somewhat lower reliance on the memory subsystem is largely compensated for by a much lower amount of branch mispredictions. To be more precise: the amount of branch predictions is about three times lower! This most likely explains why Bulldozer makes a slightly larger step forward in omnetpp compared to the previous AMD architecture than in it does in mcf.

SPEC CPU 2006 Integer Zooming in on SPEC CPU 2006: the Bad

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

84 Comments

View All Comments

Taft12 - Wednesday, May 30, 2012 - link
Johan, this is the best article I've read on Anandtech in quite some time, even better than Jarred, Ryan and Anand have come up with lately.

The level of analysis goes far, far beyond just what the benchmarks show.

Bravo!
JohanAnandtech - Thursday, May 31, 2012 - link
Great! Good to read there are still people that like these kinds of analysis!

:-)
ct760ster - Wednesday, May 30, 2012 - link
Would be interesting if they could test the aforementioned benchmark in an OS with a customizable kernel like GNU-Linux since code optimization is not possible in most of the proprietary format benchmark.
alpha754293 - Wednesday, May 30, 2012 - link
What about the lacklustre FPU performance?

The very fact that the FP has to be shared between two integer cores and as far as I know, it cannot run two FP threads at the same time, so for a lot of HPC/computationally heavy workloads - Bulldozer takes a HUGE performance hit. (almost regardless of anything/everything else; although yes, it counts, but remembering that CPUs are glorified calculators, when you take out one of the lanes of the highway and two-lane traffic is now squeezed down to one lane, it's bound to get slower.)
The_Countess - Wednesday, May 30, 2012 - link
except the FP CAN run 2 threads at the same time.
only for the as yet pretty much unused 256bit instructions does it need the whole FP unit per clock.

in fact the FP can run 2 threads of 128bit, or 4 even of 64bit.
and a single CPU can use 2x128bit or both can use 1x128.
intel and AMD previously had only 1x128bit capability per core.
so there is no regression in FP performance per core. its just much more flexible.
Zoomer - Wednesday, May 30, 2012 - link
FPU throughput is much more irrelevant nowadays, as many FP intensive HPC computations have already been ported to GPUs. Yes, there may be instances where there might be FP heavy and branchy, not easily parallelization or otherwise unsuitable, but such beasts are few and far between. I can't think of any, to be honest.
Iger - Wednesday, May 30, 2012 - link
Thanks a lot, that was a very interesting read!
Rael - Wednesday, May 30, 2012 - link
AMD should fire all its marketing department, because these guys accustomed to lie at every announcement they make. The performance gains are multiplied by five or ten, and the per-core advancement, which is close to zero, is presented as 'significant'.
I don't believe these announcements anymore.
jabber - Wednesday, May 30, 2012 - link
What the whole of the AMD Marketing team?

Thats Tim the caretaker and Trisha on the front desk isnt it?

I thought AMD's marketing budget was around $42.
kyuu - Wednesday, May 30, 2012 - link
Oh hai. You must be new to the human race. Marketing and "stretching the truth" have been synonymous since... forever. AMD is hardly exceptional in this regard. Stop believing anything any marketing department sells you, period.

The Bulldozer Aftermath: Delving Even Deeper

Post Your Comment

84 Comments

View All Comments

Taft12 - Wednesday, May 30, 2012 - link

JohanAnandtech - Thursday, May 31, 2012 - link

ct760ster - Wednesday, May 30, 2012 - link

alpha754293 - Wednesday, May 30, 2012 - link

The_Countess - Wednesday, May 30, 2012 - link

Zoomer - Wednesday, May 30, 2012 - link

Iger - Wednesday, May 30, 2012 - link

Rael - Wednesday, May 30, 2012 - link

jabber - Wednesday, May 30, 2012 - link

kyuu - Wednesday, May 30, 2012 - link

Log in

Don't have an account? Sign up now