The Bulldozer Aftermath: Delving Even Deeper

Name: The Bulldozer Aftermath: Delving Even Deeper
Item: The Bulldozer Aftermath: Delving Even Deeper
Author: Johan De Gelas

by Johan De Gelas on May 30, 2012 1:15 AM EST

84 Comments | Add A Comment

84 Comments

Zooming in on SPEC CPU2006: the Good

We filtered out those benchmarks that showed a 30% improvement over Magny-Cours (based on the K10 core). Remember the Bulldozer architecture has been designed to deliver 33% more cores in the same power envelope while keeping the IPC more or less at 95% of the K10. The rest of the performance should have come from a clock speed increase. The clock speed increases did not materialize in the real world, and we also kept the clock speed the same to focus on the architecture. Where a 30-35% performance increase is good, anything over 35% indicates that the Bulldozer architecture handles that particular sort of software better than Magny-Cours.

SPEC Int CPU2006: the Bulldozer friendly

The Libquantum score is the most spectacular. Bulldozer performs over twice as fast and the score of 2750 is not that far from the all mighty Xeon 2660 at 2.2GHz (3310). Bulldozer here is only 17% slower.

At first sight, there is nothing that should make Libquantum run very fast on Bulldozer. Libquantum contains a high amount of branches (27%) and we have seen before that although Bulldozer has a somewhat improved branch predictor, the deeper pipeline and higher branch misprediction penalty can cause a lot of trouble. In fact, Perlbench (23%), Sjeng Chess (21%), and Gobmk (AI, 21%) are branchy software and are among the worst performing tests on Bulldozer. Luckily, Libquantum has a much easier to predict branches: libquantum is among the software pieces that has the lowest branch misprediction rates (less than six per 1000 instructions).

We all know that Bulldozer can deal much better with loads and stores than Magny-Cours. However, libquantum has the lowest (!) amount of load/stores (19%=14% Loads, 5% Stores). The improved Memory Level Parallelism of Bulldozer is not the answer. The table below gives an idea of the instruction mix of SPEC CPU2006int.

SPEC Int 2006 Application	IPC*	Branches	Stores	Loads	Total Loads/ Stores
perlbench	1.67	23	12	24	36
Bzip compression	1.43	15	9	26	35
Gcc	0.83	22	13	26	39
mcf	0.28	19	9	31	40
Go AI	1.00	21	14	28	42
hmmer	1.67	8	16	41	57
Chess	1.25	21	8	21	29
libquantum	0.43	27	5	14	1
h264 encoding	2.00	8	12	35	47
omnetppp	0.38	21	18	34	52
astar	0.56	17	5	27	32
XML processing	0.66	26	9	32	41

* IPC as measured on Core 2 Duo.

Libquantum has a relatively high amount of cache misses on most CPUs as it works with a 32MB data set, so it benefits from a larger cache. The 8MB L3 vs 6MB L3 might have boosted performance a bit, but the main reason is vastly improved prefetching inside Bulldozer. According to the researchers of the university of Austin and Microsoft, the prefetch requests in libquantum are very accurate. If you check AMD's own publications you'll notice that there were two major improvements to improve the single-threaded performance of the Bulldozer architecture (compared to the previous ones): an improved Turbo Core and vastly improved prefetching.

Next, let's look at the excellent mcf result. mcf is by far the most memory intensive SPEC CPU Int benchmark out there. mcf misses the L1 data cache about five times more than all the other benchmarks on average. The hit rate is lower than 70%! mcf also misses the last level cache up to eight times more than all other benchmarks. Clearly mcf is a prime candidate to benefit from the vastly improved L/S units of Bulldozer.

Omnetpp is not that extreme, but the instruction mix has 52% loads and stores, and the L2 and last level cache misses are twice as high as the rest of the pack. In contrast to mcf, the amount of branch mispredictions is much lower, despite the fact that it has a similar, relatively high percentage of branches (20%). So the somewhat lower reliance on the memory subsystem is largely compensated for by a much lower amount of branch mispredictions. To be more precise: the amount of branch predictions is about three times lower! This most likely explains why Bulldozer makes a slightly larger step forward in omnetpp compared to the previous AMD architecture than in it does in mcf.

SPEC CPU 2006 Integer Zooming in on SPEC CPU 2006: the Bad

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

84 Comments

View All Comments

shodanshok - Thursday, May 31, 2012 - link
Mmm... the link was malformed in the previous message.

The correct one is: http://www.ilsistemista.net/index.php/hardware-ana...

Thanks.
name99 - Thursday, May 31, 2012 - link
"First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache."

I know you guys are interested in the question --- why does Bulldozer frequently suck? --- rather than the question ---why is Sandy Bridge so much better? --- but it is this latter question that interests me the most.

What strikes me, on going through all this data (including information that is NOT in the article, and on my experience back in the day when I was writing assembly and counting cycles) is that the "eventual cost" of misses that go all the way to RAM is not covered in the article, and I suspect this is a large part of the issue.

What I mean here is the following: consider an extremely simple model --- an L1 hit takes 1 cycle, an L1 miss that goes to RAM takes 100 cycles. Then a 97% L1 hit rate takes a total of basically 400 cycles; a 99% L1 hit rate takes 200 cycles --- apparently minor differences have a huge effect! But that's not the point I want to focus on.
Let's make the model more complicated. First let's make L1 hit cost more realistic --- 4 cycles. As the article says, this is, for the most part, trivially hidden by the OoO engine. But then why can't the OoO engine also hide all or most of the cost of all those cycles to RAM?
And that, I think, is where the Intel advantage is. They do such a good job with their OoO engine.

At a gross level, OoO engines all look kinda the same --- look at a PPC 750 and an IB and, at a superficial level, they look similar. But firstly the IB has just so much larger buffers (what, 168 or so, compared to the 750s what, 6 or so) that, of course, it has a vastly larger stock of instructions it can keep chewing through as it waits for the RAM.

But, you say, AMD also has large buffers now. Yes, but it's not only the raw buffers. Whenever you start looking at these chips, you discover all sorts of weird limitations on what they can actually do to use all those buffers. I've no idea what the current exact limitations are, but the sort of thing you would have in the past is that maybe all the buffers are flushed on an interrupt or system call; or there'd be strange conditions that could occur where, although in theory the integer engine could keep going past a blocking FP instruction, it turned out to be easier to prevent some race condition by freezing the integer engine under these conditions.

Secondly while you're executing other instructions, waiting on your RAM, you may well execute a few more load/store instructions that again miss in RAM. How well do you handle these? Can you just keep firing out these load/stores, or do you block at the second (or third, or fourth)? Frequently these load-stores refer to the same cache-line that's already in play from the first L1 miss, and how do you handle that? the truly dumb thing, of course, is to send out ANOTHER memory request. Smarter is to suppress that, but you're still using load-store entries in the main "miss to RAM" data structures. Smarter still is to be aware that this line will be coming eventually, and use auxiliary data structures to hold info about this load/store.

It's these sorts of technical details, which don't appear in the gross specs (and sometimes not even in the detailed CPU descriptions) that make so much difference. They are obviously astonishingly difficult to get right. Intel has the manpower to worry about every one of them, AMD does not.

Point is --- if I had to look for a single difference between the the two, that's what I'd be looking at --- how much time is REALLY wasted waiting on DRAM in SB vs on Bulldozer.
misiu_mp - Monday, June 11, 2012 - link
It is the compiler's and Out-Of-Order engine's job to order loads, stores and other instructions to minimize the total execution time.
So making sure no stupid and unnecessary loads are being committed is what the OOO mechanism normally does.
There is no reason to suspect it is fundamentally broken in Bulldozer.
IceDread - Friday, June 1, 2012 - link
It really is simple, Amd did a Huge mistake.

The product is a bust, simple as that.

The next generation or the generations after that might be a whole different matter, but guess what? No one cares. It wont help the poor souls that bought this busted product.

It's annoying that Amd could not do better because now Intel reigns supreme and competes with itself .
mikato - Friday, June 1, 2012 - link
I know I shouldn't feed the trolls but...
You say next generations might be a whole different matter - well what do you think is the point of learning about the Bulldozer architecture? The next generations are based on it.
IceDread - Monday, June 4, 2012 - link
What is the point of releasing a product that does not outperform it's predecessor?
Hope that people will purchase the product anyway and learn it?
Which companies would be interested in this, how many? Why would they invest money into this?
_vor_ - Saturday, June 2, 2012 - link
Yes. I too would be interested in exactly what aspects you think Bulldozer failed and your design ideas and approach on how you would fix them. Do tell.
wiyosaya - Friday, June 1, 2012 - link
Personally, I think it is always nice to see in-depth articles like this that explain the details of the structure of a processor.

To me, it sounds like AMD has a foundation that with a few well-directed tweaks, may put them in contention with Intel again in the CPU arena. Though AMD has said that they are through competing with Intel, I truly hope this is not the case. Perhaps this is a marketing tactic remove focus from themselves after the enthusiast arena panned BD and its siblings.

I've built my systems with AMD for a long time; however, this time I went with Intel because I thought they had the better value. Perhaps the future will bring me back to AMD, however, I cannot see doing so right now simply because Intel has become the "value" line over AMD.

With an i7-3820 in my most recent rig, I think I picked the SB-E value processor. I run more than games, and some of what I run takes advantage of quad-channel memory.

In any event, I'm set for a while. Perhaps AMD will once again produce a superior product by the time I am ready for my next build.
jamyryals - Friday, June 1, 2012 - link
What a great read, thanks!
SocketF - Friday, June 1, 2012 - link
Hi Johan,

thanks for the test, it is great.

However, on page 9 you have some trouble with percentage calculations. You wrote:

quote:
-------------------
We get a 65% speed up (2x 0.71 vs 0.86), which is somewhat lower than the 80% predicted by the AMD slides discussing CMT.
-------------------
This numbers are totally correct and within AMD's predictions. AMD promised 80% performance for the CMT-Bulldozer module, compared to an hypothetical Bulldozer CMP core, i.e. 2 (single) cores.

So you have to double your single-thread results, to get the score of 2 (single) Bulldozer cores (2 CMP cores). That gives: 0.86 x 2 = 1.72

Now compare that to the real performance of 2 CMT cores of one module, which is 0.71 x 2 = 1.42

1.42 are 82.6% of 1.72, which is better than AMD's 80% claim. Thus their claim holds. Everything's fine, don't worry.

Source of AMD's claim is e.g. here:
http://techreport.com/r.x/bulldozer-uarch/bulldoze...
(sorry, didn't find it on anandtech)

Please update your article accordingly.

Oh and one last question, why did you add up the SMT scores but not the CMT scores? Seems odd, an IPC of "two threads", This is just weired. Furthermore it is somehow useless, because you cannot compare it directly with the CMT scores. A diagram should visualize the results not force the reader to do some re-calculations.

Thanks again

Erik

The Bulldozer Aftermath: Delving Even Deeper

Post Your Comment

84 Comments

View All Comments

shodanshok - Thursday, May 31, 2012 - link

name99 - Thursday, May 31, 2012 - link

misiu_mp - Monday, June 11, 2012 - link

IceDread - Friday, June 1, 2012 - link

mikato - Friday, June 1, 2012 - link

IceDread - Monday, June 4, 2012 - link

_vor_ - Saturday, June 2, 2012 - link

wiyosaya - Friday, June 1, 2012 - link

jamyryals - Friday, June 1, 2012 - link

SocketF - Friday, June 1, 2012 - link

Log in

Don't have an account? Sign up now