The Bulldozer Aftermath: Delving Even Deeper

Name: The Bulldozer Aftermath: Delving Even Deeper
Item: The Bulldozer Aftermath: Delving Even Deeper
Author: Johan De Gelas

by Johan De Gelas on May 30, 2012 1:15 AM EST

84 Comments | Add A Comment

84 Comments

The Current Situation

It's not hard to explain why an 8-thread processor with slightly lower single-threaded performance does not do well in many desktop applications. If you compare for example the hex-core Core i7-3960X with a quad-core i7-3820, four games did not benefit from the extra two cores: Civilization V, Crysis, Dirt 3 and Metro 2033. In Starcraft 2, World of Warcraft, and Dawn of War 2, the 50% higher core count was good for a 10% performance boost at best. In other words, the situation has improved, but most games don't scale well beyond four cores. There are also other factors at play, though, as it's already known that StarCraft II doesn't use more than two cores; instead, it's likely the 15MB (vs. 10MB in i7-3820) L3 cache that helps improve performance.

The situation in the server space is a lot harder to explain. The Opteron 6100 was able to keep up—more or less—with the Xeon 5600 performancewise. However, the Xeon 5600 was equipped with much better power management and the Xeon won the performance/watt race in most applications, with the exception of HPC applications.

The Opteron 6200 added a bit of performance but sips much less power at low and medium load, so it was capable of offering a better performance per Watt ratio than its older brother. However, since the Xeon E5 came out, the situation became pretty dramatic for the Opteron. One telling example is the fact that only one VMmark 2.0 result on the Opteron 6200 exists, but it has been withdrawn. Even if the reported 12.77 score is close to truth, we need four AMD Opteron 6726 (2.3GHz) to beat the best dual Xeon E5 (2690 at 2.9GHz) by 15%.

We have shown already quite a few benchmarks in two Opteron 6276 articles and one Xeon E5 review. We summarized the relevant numbers of both articles in the table below. The benchmarks below are real world and very relevant to the professional in our opinion.

Software: Importance in the market	Opteron 6276 vs. Opteron 6174	Xeon E5-2660 vs. Opteron 6276
Virtualization: 20-50%
ESXi + Linux (vApusMark FOS)	+1%	+40%
OLAP Databases: 10-15%
MS SQL Server 2008 R2 (OLAP throughput)	-9%	+34%
HPC: 5-7%
LS-Dyna (Neon-Refined)	+21%	+26%
Rendering software: 2-3%
Cinebench	+2%	+37%
ERP
SAP	+18%	+13%

Now consider that all these applications are highly-threaded and scale well. Despite the 33% higher integer core count, the Opteron 6276 is not able to outperform the older Magny-Cours in the OLAP, virtualization and rendering benchmarks. However, the architecture is showing its promise by offering about 20% better performance in SAP and HPC applications.

What makes the Bulldozer cores fail in the OLAP benchmark and succeed in SAP? We now have some interesting profiling details on SAP as well as our OLAP benchmark, so we can delve deeper.

Setting Expectations: the Back End SAP S&D Benchmark in Depth

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

84 Comments

View All Comments

shodanshok - Thursday, May 31, 2012 - link
Mmm... the link was malformed in the previous message.

The correct one is: http://www.ilsistemista.net/index.php/hardware-ana...

Thanks.
name99 - Thursday, May 31, 2012 - link
"First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache."

I know you guys are interested in the question --- why does Bulldozer frequently suck? --- rather than the question ---why is Sandy Bridge so much better? --- but it is this latter question that interests me the most.

What strikes me, on going through all this data (including information that is NOT in the article, and on my experience back in the day when I was writing assembly and counting cycles) is that the "eventual cost" of misses that go all the way to RAM is not covered in the article, and I suspect this is a large part of the issue.

What I mean here is the following: consider an extremely simple model --- an L1 hit takes 1 cycle, an L1 miss that goes to RAM takes 100 cycles. Then a 97% L1 hit rate takes a total of basically 400 cycles; a 99% L1 hit rate takes 200 cycles --- apparently minor differences have a huge effect! But that's not the point I want to focus on.
Let's make the model more complicated. First let's make L1 hit cost more realistic --- 4 cycles. As the article says, this is, for the most part, trivially hidden by the OoO engine. But then why can't the OoO engine also hide all or most of the cost of all those cycles to RAM?
And that, I think, is where the Intel advantage is. They do such a good job with their OoO engine.

At a gross level, OoO engines all look kinda the same --- look at a PPC 750 and an IB and, at a superficial level, they look similar. But firstly the IB has just so much larger buffers (what, 168 or so, compared to the 750s what, 6 or so) that, of course, it has a vastly larger stock of instructions it can keep chewing through as it waits for the RAM.

But, you say, AMD also has large buffers now. Yes, but it's not only the raw buffers. Whenever you start looking at these chips, you discover all sorts of weird limitations on what they can actually do to use all those buffers. I've no idea what the current exact limitations are, but the sort of thing you would have in the past is that maybe all the buffers are flushed on an interrupt or system call; or there'd be strange conditions that could occur where, although in theory the integer engine could keep going past a blocking FP instruction, it turned out to be easier to prevent some race condition by freezing the integer engine under these conditions.

Secondly while you're executing other instructions, waiting on your RAM, you may well execute a few more load/store instructions that again miss in RAM. How well do you handle these? Can you just keep firing out these load/stores, or do you block at the second (or third, or fourth)? Frequently these load-stores refer to the same cache-line that's already in play from the first L1 miss, and how do you handle that? the truly dumb thing, of course, is to send out ANOTHER memory request. Smarter is to suppress that, but you're still using load-store entries in the main "miss to RAM" data structures. Smarter still is to be aware that this line will be coming eventually, and use auxiliary data structures to hold info about this load/store.

It's these sorts of technical details, which don't appear in the gross specs (and sometimes not even in the detailed CPU descriptions) that make so much difference. They are obviously astonishingly difficult to get right. Intel has the manpower to worry about every one of them, AMD does not.

Point is --- if I had to look for a single difference between the the two, that's what I'd be looking at --- how much time is REALLY wasted waiting on DRAM in SB vs on Bulldozer.
misiu_mp - Monday, June 11, 2012 - link
It is the compiler's and Out-Of-Order engine's job to order loads, stores and other instructions to minimize the total execution time.
So making sure no stupid and unnecessary loads are being committed is what the OOO mechanism normally does.
There is no reason to suspect it is fundamentally broken in Bulldozer.
IceDread - Friday, June 1, 2012 - link
It really is simple, Amd did a Huge mistake.

The product is a bust, simple as that.

The next generation or the generations after that might be a whole different matter, but guess what? No one cares. It wont help the poor souls that bought this busted product.

It's annoying that Amd could not do better because now Intel reigns supreme and competes with itself .
mikato - Friday, June 1, 2012 - link
I know I shouldn't feed the trolls but...
You say next generations might be a whole different matter - well what do you think is the point of learning about the Bulldozer architecture? The next generations are based on it.
IceDread - Monday, June 4, 2012 - link
What is the point of releasing a product that does not outperform it's predecessor?
Hope that people will purchase the product anyway and learn it?
Which companies would be interested in this, how many? Why would they invest money into this?
_vor_ - Saturday, June 2, 2012 - link
Yes. I too would be interested in exactly what aspects you think Bulldozer failed and your design ideas and approach on how you would fix them. Do tell.
wiyosaya - Friday, June 1, 2012 - link
Personally, I think it is always nice to see in-depth articles like this that explain the details of the structure of a processor.

To me, it sounds like AMD has a foundation that with a few well-directed tweaks, may put them in contention with Intel again in the CPU arena. Though AMD has said that they are through competing with Intel, I truly hope this is not the case. Perhaps this is a marketing tactic remove focus from themselves after the enthusiast arena panned BD and its siblings.

I've built my systems with AMD for a long time; however, this time I went with Intel because I thought they had the better value. Perhaps the future will bring me back to AMD, however, I cannot see doing so right now simply because Intel has become the "value" line over AMD.

With an i7-3820 in my most recent rig, I think I picked the SB-E value processor. I run more than games, and some of what I run takes advantage of quad-channel memory.

In any event, I'm set for a while. Perhaps AMD will once again produce a superior product by the time I am ready for my next build.
jamyryals - Friday, June 1, 2012 - link
What a great read, thanks!
SocketF - Friday, June 1, 2012 - link
Hi Johan,

thanks for the test, it is great.

However, on page 9 you have some trouble with percentage calculations. You wrote:

quote:
-------------------
We get a 65% speed up (2x 0.71 vs 0.86), which is somewhat lower than the 80% predicted by the AMD slides discussing CMT.
-------------------
This numbers are totally correct and within AMD's predictions. AMD promised 80% performance for the CMT-Bulldozer module, compared to an hypothetical Bulldozer CMP core, i.e. 2 (single) cores.

So you have to double your single-thread results, to get the score of 2 (single) Bulldozer cores (2 CMP cores). That gives: 0.86 x 2 = 1.72

Now compare that to the real performance of 2 CMT cores of one module, which is 0.71 x 2 = 1.42

1.42 are 82.6% of 1.72, which is better than AMD's 80% claim. Thus their claim holds. Everything's fine, don't worry.

Source of AMD's claim is e.g. here:
http://techreport.com/r.x/bulldozer-uarch/bulldoze...
(sorry, didn't find it on anandtech)

Please update your article accordingly.

Oh and one last question, why did you add up the SMT scores but not the CMT scores? Seems odd, an IPC of "two threads", This is just weired. Furthermore it is somehow useless, because you cannot compare it directly with the CMT scores. A diagram should visualize the results not force the reader to do some re-calculations.

Thanks again

Erik

The Bulldozer Aftermath: Delving Even Deeper

Virtualization: 20-50%

+1%

+40%

OLAP Databases: 10-15%

-9%

+34%

HPC: 5-7%

+21%

+26%

Rendering software: 2-3%

+2%

+37%

ERP

+18%

+13%

Post Your Comment

84 Comments

View All Comments

shodanshok - Thursday, May 31, 2012 - link

name99 - Thursday, May 31, 2012 - link

misiu_mp - Monday, June 11, 2012 - link

IceDread - Friday, June 1, 2012 - link

mikato - Friday, June 1, 2012 - link

IceDread - Monday, June 4, 2012 - link

_vor_ - Saturday, June 2, 2012 - link

wiyosaya - Friday, June 1, 2012 - link

jamyryals - Friday, June 1, 2012 - link

SocketF - Friday, June 1, 2012 - link

Log in

Don't have an account? Sign up now