The Bulldozer Aftermath: Delving Even Deeper

Author: Johan De Gelas

by Johan De Gelas on May 30, 2012 1:15 AM EST

Zooming in on SPEC CPU2006: the Good

We singled out the benchmarks that showed at least a 30% improvement over Magny-Cours (based on the K10 core). Remember that the Bulldozer architecture was designed to deliver 33% more cores in the same power envelope while keeping the IPC at more or less 95% of the K10's. The rest of the performance gain should have come from higher clock speeds. Those clock speed increases did not materialize in the real world, and we also kept the clock speed the same to focus on the architecture. A 30-35% performance increase is good; anything over 35% indicates that the Bulldozer architecture handles that particular sort of software better than Magny-Cours.
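As a rough sanity check on those design targets, the two ratios can simply be multiplied. This is a back-of-the-envelope sketch that assumes throughput scales linearly with core count and with IPC, which real software rarely does; the variable names are ours.

```python
# Design targets quoted in the text; the multiplicative model is an assumption.
core_ratio = 1.33  # 33% more cores in the same power envelope
ipc_ratio = 0.95   # per-core IPC relative to K10

expected_speedup = core_ratio * ipc_ratio
expected_gain_pct = (expected_speedup - 1.0) * 100.0

print(f"Expected gain at equal clocks: {expected_gain_pct:.1f}%")
```

Under that simple model the design targets alone account for a gain in the mid-twenties at equal clocks, so a benchmark that clears 30% is already doing somewhat better than the targets imply.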

SPEC Int CPU2006: the Bulldozer friendly

The libquantum score is the most spectacular. Bulldozer is over twice as fast, and its score of 2750 is not that far from the almighty Xeon E5-2660 at 2.2GHz (3310): Bulldozer is only 17% slower here.

At first sight, there is nothing that should make libquantum run very fast on Bulldozer. Libquantum contains a high percentage of branches (27%), and we have seen before that although Bulldozer has a somewhat improved branch predictor, the deeper pipeline and higher branch misprediction penalty can cause a lot of trouble. In fact, Perlbench (23%), Sjeng Chess (21%), and Gobmk (AI, 21%) are branchy software and are among the worst performing tests on Bulldozer. Luckily, libquantum's branches are much easier to predict: it has one of the lowest branch misprediction rates of all these benchmarks (fewer than six per 1000 instructions).
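To see why the misprediction rate matters more than the raw share of branches, here is a minimal sketch of the usual first-order cost model: extra cycles per instruction equal mispredictions per instruction times the penalty. The 20-cycle penalty and the 15 mispredictions per 1000 instructions for the "branchy" case are hypothetical values for illustration, not measured Bulldozer figures.

```python
def mispredict_cpi(mpki: float, penalty_cycles: float) -> float:
    """Extra cycles per instruction caused by branch mispredictions.

    mpki: mispredictions per 1000 instructions.
    penalty_cycles: cycles lost per misprediction.
    """
    return mpki / 1000.0 * penalty_cycles


# Hypothetical penalty, chosen only for illustration; not a measured value.
ASSUMED_PENALTY_CYCLES = 20

# 6 MPKI is the upper bound quoted for libquantum; 15 MPKI is a made-up case.
for label, mpki in [("libquantum (at most)", 6.0),
                    ("hypothetical branchy code", 15.0)]:
    extra = mispredict_cpi(mpki, ASSUMED_PENALTY_CYCLES)
    print(f"{label}: {extra:.2f} extra cycles per instruction")
```

The point of the sketch is only that the cost scales with mispredictions, not with the number of branches: a benchmark with 27% branches that are nearly always predicted correctly pays less than one with fewer but erratic branches.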

We all know that Bulldozer can deal much better with loads and stores than Magny-Cours. However, libquantum has the lowest (!) share of loads and stores (19% = 14% loads, 5% stores), so the improved Memory Level Parallelism of Bulldozer is not the answer. The table below gives an idea of the instruction mix of SPEC CPU2006int.

SPEC Int 2006 Application	IPC*	Branches (%)	Stores (%)	Loads (%)	Total Loads/Stores (%)
perlbench	1.67	23	12	24	36
Bzip compression	1.43	15	9	26	35
Gcc	0.83	22	13	26	39
mcf	0.28	19	9	31	40
Go AI	1.00	21	14	28	42
hmmer	1.67	8	16	41	57
Chess	1.25	21	8	21	29
libquantum	0.43	27	5	14	19
h264 encoding	2.00	8	12	35	47
omnetpp	0.38	21	18	34	52
astar	0.56	17	5	27	32
XML processing	0.66	26	9	32	41

* IPC as measured on Core 2 Duo.
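The claims about the instruction mix are easy to verify by summing the Stores and Loads columns. The short sketch below copies the percentages from the table; the dictionary layout and variable names are ours.

```python
# (branches %, stores %, loads %) per benchmark, copied from the table.
mix = {
    "perlbench": (23, 12, 24),
    "Bzip compression": (15, 9, 26),
    "Gcc": (22, 13, 26),
    "mcf": (19, 9, 31),
    "Go AI": (21, 14, 28),
    "hmmer": (8, 16, 41),
    "Chess": (21, 8, 21),
    "libquantum": (27, 5, 14),
    "h264 encoding": (8, 12, 35),
    "omnetpp": (21, 18, 34),
    "astar": (17, 5, 27),
    "XML processing": (26, 9, 32),
}

# Total memory operations = stores + loads.
totals = {name: stores + loads for name, (_, stores, loads) in mix.items()}

lowest_mem = min(totals, key=totals.get)
most_branchy = max(mix, key=lambda name: mix[name][0])

print(f"Fewest loads/stores: {lowest_mem} ({totals[lowest_mem]}%)")
print(f"Most branches: {most_branchy} ({mix[most_branchy][0]}%)")
```

Libquantum comes out at both extremes: the smallest share of loads and stores and the largest share of branches.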

Libquantum has a relatively high number of cache misses on most CPUs as it works with a 32MB data set, so it benefits from a larger cache. The 8MB L3 vs 6MB L3 might have boosted performance a bit, but the main reason is the vastly improved prefetching inside Bulldozer. According to researchers from the University of Texas at Austin and Microsoft, the prefetch requests in libquantum are very accurate. If you check AMD's own publications, you'll notice that there were two major changes aimed at the single-threaded performance of the Bulldozer architecture (compared to the previous ones): an improved Turbo Core and vastly improved prefetching.

Next, let's look at the excellent mcf result. mcf is by far the most memory intensive SPEC CPU Int benchmark out there. mcf misses the L1 data cache about five times more than all the other benchmarks on average. The hit rate is lower than 70%! mcf also misses the last level cache up to eight times more than all other benchmarks. Clearly mcf is a prime candidate to benefit from the vastly improved L/S units of Bulldozer.

Omnetpp is not that extreme, but the instruction mix has 52% loads and stores, and the L2 and last level cache misses are twice as high as the rest of the pack. In contrast to mcf, the number of branch mispredictions is much lower, despite the fact that it has a similar, relatively high percentage of branches (20%). So the somewhat lower reliance on the memory subsystem is largely compensated for by a much lower number of branch mispredictions. To be more precise: the number of branch mispredictions is about three times lower! This most likely explains why Bulldozer makes a slightly larger step forward in omnetpp compared to the previous AMD architecture than it does in mcf.

84 Comments

Zoomer - Thursday, May 31, 2012 - link
True. It's probably better to have come out way back then, but synthesized, than to come out maybe next year with all their lovingly fully customized, hand placed transistors. That's if they don't go bankrupt first.

wolfman3k5's probably going to call nVidia, 3dfx, ATi (then), most FPGA program design houses, etc, lazy, too.
misiu_mp - Monday, June 11, 2012 - link
A large margin of error means that you have a lot of space to make errors with little consequence.
You meant of course that engineers have small margins of error in their work.
500MM - Wednesday, May 30, 2012 - link
http://images.anandtech.com/graphs/graph5057/42770...

If lower was better, AMD would have one kickass CPU. The caption is wrong.
JohanAnandtech - Friday, June 1, 2012 - link
Fixed, thx!
weebnuts - Wednesday, May 30, 2012 - link
The problem with all these benchmarks is that most organizations are going to be using this in Xen or VMware setups. The idea is that with more cores, you can run more VMs, especially if you are trying to implement Virtual Desktops. How do the processors compare when you are loading the server to 80-90% capacity with lots of VMs? That's a real world comparison I want to see.
Iketh - Wednesday, May 30, 2012 - link
I was dying for information like this. Thank you!

And as for that quote on the first page by Iketh, that guy is a genious!! :D
Aone - Thursday, May 31, 2012 - link
1) Maybe I missed something, but should "Higher is better" be for "Data Cache hitrate", i.e. opposite to cache misses?

2) And on the chart "L2 Cache hitrate", is it correct that "Opteron 6276" tag is shown on first line while "Opteron 6174" on the last line? I thought Opteron 6174 was faster in MS SQL than Opteron 6276.
mrdcook - Thursday, May 31, 2012 - link
There are a few new instructions in Bulldozer's architecture that, for certain specific computations, can make it 10X faster than Intel. For example, FMA. An FMA does a multiply and then an add as one instruction, rounding only once. Combining the multiply and the add isn't such a big deal (and in many cases can even be counter-productive), but rounding only once is very important in some cases.

For example, assume you have 3 digits of accuracy and want to calculate (1.23 * 2.31 - 2.84). Without FMA, you calculate Round(1.23 * 2.31) = 2.84, then you calculate Round(2.84 - 2.84) = 0. With FMA, you calculate 1.23 * 2.31 = 2.8413, then you calculate Round(2.8413 - 2.84) = 0.0013. While that may seem contrived (it was!), the difference is significant in certain simulations and calculations.
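The three-digit example can be reproduced with Python's `decimal` module, which lets you set the working precision and has a fused `Decimal.fma` method. This is a software illustration of the rounding behavior only, not of the hardware FMA instruction.

```python
from decimal import Decimal, localcontext

a, b, c = Decimal("1.23"), Decimal("2.31"), Decimal("-2.84")

with localcontext() as ctx:
    ctx.prec = 3  # three significant digits, as in the example

    # Unfused: the product is rounded to 2.84 before the add, so the
    # difference cancels completely.
    unfused = a * b + c

    # Fused: the exact product 2.8413 is kept and the result is rounded
    # only once, after the add.
    fused = a.fma(b, c)

print("unfused:", unfused)
print("fused:  ", fused)
```

The unfused result is zero, while the fused result keeps the 0.0013 that the intermediate rounding threw away.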

When doing math, computers have a very specific level of accuracy -- a certain number of digits of precision. If you want your simulations to come out right, you have to take these limits into account. Learning how to account for the computer's rounding errors is a bit of a black art.

Mathematicians design algorithms in terms of matrix multiplications and dot products, and if you translate those algorithms directly into computer multiplications and additions, you tend to end up with a lot of cancellation errors like the example given above. You can hire a computer science grad student to rework your algorithm to not lose accuracy, but that is expensive and has to be done for every new algorithm. Or you can use an FMA for the dot products and the matrix multiplications (the high-accuracy dot product and matrix multiplication libraries already do this).

FMA in software is slow. Single-precision emulated FMA isn't too bad since you can use double-precision to help with the hardest bits of the emulation. The result is that you can do one fmaf in about 4X the amount of time it would take to do a single a*b+c. However, SSE2 allows you to do 4 a*b+c at a time, so emulated single-precision FMA ends up being about 15X slower than optimized SSE2 non-fused multiply-add. Double-precision is harder, taking about 10 times longer than a single a*b+c, so it ends up being 20X slower than non-fused multiply-add.
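The arithmetic behind those slowdown figures is the scalar emulation cost multiplied by the number of values a 128-bit SSE2 register holds (four single-precision or two double-precision). A small sketch, with a hypothetical helper name:

```python
def emulated_slowdown(scalar_cost: float, simd_lanes: int) -> float:
    """Slowdown of scalar emulated FMA versus SIMD non-fused multiply-add.

    scalar_cost: emulated FMA time relative to one scalar a*b+c.
    simd_lanes: number of a*b+c operations one SSE2 instruction pair handles.
    """
    return scalar_cost * simd_lanes


single = emulated_slowdown(4, 4)   # 4X cost, 4 floats per 128-bit register
double = emulated_slowdown(10, 2)  # 10X cost, 2 doubles per 128-bit register

print(f"single precision: about {single:.0f}X slower")
print(f"double precision: about {double:.0f}X slower")
```

This simple product gives 16X for single precision, in line with the "about 15X" quoted above, and 20X for double precision.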

Admittedly, the target market for FMA is probably smaller than a breadbox, but those who need it really need it. And as it becomes more common, it'll only become more important. For now, since only Bulldozer has it, nobody is going to care.
BaronMatrix - Thursday, May 31, 2012 - link
There are admittedly only two viable X86 licensees in America and one of them sucks...
shodanshok - Thursday, May 31, 2012 - link
Hi Johan,
first of all, let me thank you for your wonderful analysis on Bulldozer architecture. I read it with great interest.

However, I think that you left out a very important thing to mention: L1/L2 cache read/write bandwidth. Especially for L2, while latency is an important thing, throughput can be an even more crucial one.

The key point is that Bulldozer has a write-through L1 cache, so all L1 writes are more or less immediately passed on to the L2 cache. Some small writes can be effectively cached inside a write-back combining buffer called the Write Combining Cache (WCC), but this cache is only 4KB in size for the entire module. So, streaming writes will immediately fill the WCC and bring down L1 cache speed to L2 levels.

This can really hamper CPU performance. Obviously, AMD went down this road for some understandable reasons. However, the WCC is really too small to cache much data and the L2 is way too slow to efficiently serve L1 write requests.

This brings us to another point: the L2 cache is slow. Compared with Intel's super-fast (but much smaller) L2 cache, it has no hope; it is more or less at Intel's L3 level.

Here you can find my analysis of AMD Bulldozer architecture: http://www.ilsistemista.net/index.php/hardware-ana... (AMD Bulldozer analysis)
Note that, while I collected and normalized data from multiple web sites, I made it very clear what the original reference was (so that you can easily verify my data).

Thanks.
