The Bulldozer Aftermath: Delving Even Deeper


by Johan De Gelas on May 30, 2012 1:15 AM EST


SAP S&D profiled

The SAP S&D 2-Tier benchmark has always been one of my favorites. It is probably the most real-world of all the server benchmarks run by the vendors: a full-blown application living on top of a heavy relational database. And don't forget that SAP is one of the most successful software companies out there, the undisputed market leader in Enterprise Resource Planning.

Profiling this benchmark is beyond the capabilities of our lab, but Intel shared some of its profiling data from a comparison of the Xeon E5 with the Xeon 5600. This gives us very interesting insights into how the SAP application behaves.

Metric                                     SAP S&D   SPEC Int 2006
Typical IPC (on Intel Westmere)            0.5       1.1
Typical IPC (on Intel Sandy Bridge)        0.55      1.29
Branches                                   18%       19%
Mispredictions                             0.9%      1.1%
Loads (percentage of instruction mix)      32%       28%
Stores (percentage of instruction mix)     16%       11%

Besides the high-level profiling numbers, quite a few details surfaced. For example, increasing the ROB (ReOrder Buffer) from 128 entries (Westmere) to 168 (Sandy Bridge) reduced the ROB stalls from 10% to almost nothing. Increasing the load buffers from 48 to 64 reduced the load buffer stalls to one fifth of what they were before! This clearly shows that SAP puts quite a bit of pressure on both the ROB and the load units. The application finds ample integer processing power in most modern processors, but it is limited by how fast data can be loaded and by how well the Out of Order engine (of which the ROB is the primary buffer) is able to hide the load latency.

Further data confirms this. It was my understanding that the hardware prefetchers of Sandy Bridge were improved only a bit compared to Westmere/Nehalem, but in fact the smarter prefetchers are able to reduce the L2 cache misses by no less than 40%! Now, consider that in most SPEC CPU2006 int benchmarks only 1 to 10 instructions out of 1000 typically miss the L2 cache. In contrast, in SAP, about 40 out of 1000 instructions miss the small 256KB L2 cache of the Westmere Xeon 5600, which is in the same range as the most memory-intensive application in the SPEC CPU2006 int suite (mcf).
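To put the miss figures above side by side, here is a minimal Python sketch of the "misses per 1000 instructions" (MPKI) arithmetic. The function and variable names are made up for illustration, and the sketch assumes that the 40% reduction applies directly to the 40-per-1000 Westmere figure, which the profiling data does not state explicitly.

```python
def misses_per_kilo_instructions(misses: int, instructions: int) -> float:
    """Cache misses per 1000 retired instructions (MPKI)."""
    return 1000.0 * misses / instructions


# Figures quoted in the text above
sap_westmere_mpki = 40.0        # SAP S&D on the 256KB L2 of the Xeon 5600
spec_int_range = (1.0, 10.0)    # most SPEC CPU2006 int benchmarks
prefetch_reduction = 0.40       # L2 miss reduction credited to the prefetchers

# Assumption: the 40% reduction is applied to the 40 MPKI baseline
sap_after_prefetch = sap_westmere_mpki * (1.0 - prefetch_reduction)

# How much more often SAP misses the L2 than a typical SPEC int benchmark
ratio_low = sap_westmere_mpki / spec_int_range[1]
ratio_high = sap_westmere_mpki / spec_int_range[0]

print(f"SAP MPKI after the assumed 40% reduction: {sap_after_prefetch:.0f}")
print(f"SAP misses the L2 {ratio_low:.0f}x to {ratio_high:.0f}x more often")
```

Under that assumption, even the improved prefetchers would leave SAP well above the 1 to 10 range quoted for most SPEC int workloads.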

SAP is thus an application that misses the L2 cache much more than most applications out there, with the exception of some exotic HPC apps. The better prefetchers inside Sandy Bridge make much better use of the extra bandwidth available and reduce the L2 and L1 misses. Hence, these improved prefetchers are probably one of the main reasons why Sandy Bridge performs better.

Interestingly, the L1 instruction cache misses were halved, and most of the L2 cache miss reduction came from instruction prefetching (less than half the cache misses). Data requests could not be prefetched.

So the end conclusion about SAP is:

The application has very low instruction level parallelism (ILP) and as a result is not taxing the integer units much.
The application has a relatively large but "prefetchable" instruction footprint, which allows the prefetchers to reduce the instruction-related cache misses.
The application has a massive and random data footprint, putting great pressure on the load subsystem. As a result the out of order engine has to hide the latency the best it can, and large ROB and load buffers help a lot. The latency of the memory subsystem matters.

Combine this with the fact that the SAP application has a high amount of TLP (Thread Level Parallelism) and you'll understand that this is an application ideally suited for Hyper-Threading and Clustered Multi-Threading. Hyper-Threading, for example, is good for a 30% performance boost. The SAP S&D benchmark is a prime example of how a CPU architecture can be more server or more consumer oriented. The characteristics of server applications are vastly different from those of the software that we run on our laptops and desktops.

SAP will hardly be limited by the lower integer execution resources of the individual Bulldozer integer cores. Bulldozer has vastly improved prefetching capabilities and larger OOO buffers. Add to this the 33% higher core count, and we should expect Bulldozer to outperform Magny-Cours chips by at least 33%, as the SAP benchmark emphasizes the strong points of the individual Bulldozer core without stressing the weak points (lower integer throughput). However, we are nowhere near 33% better performance, let alone the 50% higher throughput once promised by AMD. Why?

We have uncovered some additional understanding with the above information, but our job is not done yet.

84 Comments

shodanshok - Thursday, May 31, 2012 - link
Mmm... the link was malformed in the previous message.

The correct one is: http://www.ilsistemista.net/index.php/hardware-ana...

Thanks.
name99 - Thursday, May 31, 2012 - link
"First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache."

I know you guys are interested in the question --- why does Bulldozer frequently suck? --- rather than the question ---why is Sandy Bridge so much better? --- but it is this latter question that interests me the most.

What strikes me, on going through all this data (including information that is NOT in the article, and on my experience back in the day when I was writing assembly and counting cycles) is that the "eventual cost" of misses that go all the way to RAM is not covered in the article, and I suspect this is a large part of the issue.

What I mean here is the following: consider an extremely simple model --- an L1 hit takes 1 cycle, an L1 miss that goes to RAM takes 100 cycles. Then, per 100 accesses, a 97% L1 hit rate takes a total of basically 400 cycles; a 99% L1 hit rate takes 200 cycles --- apparently minor differences have a huge effect! But that's not the point I want to focus on.
Let's make the model more complicated. First let's make L1 hit cost more realistic --- 4 cycles. As the article says, this is, for the most part, trivially hidden by the OoO engine. But then why can't the OoO engine also hide all or most of the cost of all those cycles to RAM?
And that, I think, is where the Intel advantage is. They do such a good job with their OoO engine.
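The simple model in this comment can be written out in a few lines of Python. This is only a sketch: it assumes the totals are counted per 100 accesses and that no latency is overlapped at all, and the function name is invented for illustration.

```python
def cycles_per_100_accesses(hit_pct: int, hit_cost: int, miss_cost: int) -> int:
    """Total cycles for 100 accesses when no latency is hidden."""
    miss_pct = 100 - hit_pct
    return hit_pct * hit_cost + miss_pct * miss_cost


# First model: 1-cycle L1 hit, 100-cycle miss to RAM
simple_97 = cycles_per_100_accesses(97, hit_cost=1, miss_cost=100)   # 397
simple_99 = cycles_per_100_accesses(99, hit_cost=1, miss_cost=100)   # 199

# Second model: a more realistic 4-cycle L1 hit
real_97 = cycles_per_100_accesses(97, hit_cost=4, miss_cost=100)     # 688
real_99 = cycles_per_100_accesses(99, hit_cost=4, miss_cost=100)     # 496

print(simple_97, simple_99, real_97, real_99)
```

Even with the 4-cycle hit cost, the three misses at a 97% hit rate still account for 300 of the 688 cycles, which is why hiding the trips to RAM matters more than hiding the L1 latency.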

At a gross level, OoO engines all look kinda the same --- look at a PPC 750 and an IB and, at a superficial level, they look similar. But firstly the IB has just so much larger buffers (what, 168 or so, compared to the 750's what, 6 or so) that, of course, it has a vastly larger stock of instructions it can keep chewing through as it waits for the RAM.

But, you say, AMD also has large buffers now. Yes, but it's not only the raw buffers. Whenever you start looking at these chips, you discover all sorts of weird limitations on what they can actually do to use all those buffers. I've no idea what the current exact limitations are, but the sort of thing you would have in the past is that maybe all the buffers are flushed on an interrupt or system call; or there'd be strange conditions that could occur where, although in theory the integer engine could keep going past a blocking FP instruction, it turned out to be easier to prevent some race condition by freezing the integer engine under these conditions.

Secondly, while you're executing other instructions, waiting on your RAM, you may well execute a few more load/store instructions that again miss to RAM. How well do you handle these? Can you just keep firing out these load/stores, or do you block at the second (or third, or fourth)? Frequently these load/stores refer to the same cache line that's already in play from the first L1 miss, and how do you handle that? The truly dumb thing, of course, is to send out ANOTHER memory request. Smarter is to suppress that, but you're still using load/store entries in the main "miss to RAM" data structures. Smarter still is to be aware that this line will be coming eventually, and use auxiliary data structures to hold info about this load/store.
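The idea of merging later misses into a request that is already in flight can be sketched as a toy Python model. Everything here is hypothetical (the class name, the 64-byte line size, the two-entry table); it illustrates the general technique and is not a description of how Sandy Bridge or Bulldozer actually track misses.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)


class MissTracker:
    """Toy miss-status table: one entry per outstanding cache line."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.pending = {}          # line number -> ids of loads waiting on it
        self.memory_requests = 0   # requests actually sent to RAM

    def miss(self, load_id: int, address: int) -> str:
        line = address // LINE_SIZE
        if line in self.pending:
            # Line already in flight: attach the load, send nothing new
            self.pending[line].append(load_id)
            return "merged"
        if len(self.pending) >= self.max_entries:
            # No free entry: this load has to wait
            return "stall"
        self.pending[line] = [load_id]
        self.memory_requests += 1
        return "issued"

    def fill(self, address: int) -> list:
        """The line arrived from RAM: release every load waiting on it."""
        return self.pending.pop(address // LINE_SIZE, [])


tracker = MissTracker(max_entries=2)
results = [
    tracker.miss(0, 0x1000),   # first miss to this line
    tracker.miss(1, 0x1008),   # same 64-byte line as load 0
    tracker.miss(2, 0x2000),   # a different line
    tracker.miss(3, 0x3000),   # table is full
]
woken = tracker.fill(0x1000)
print(results, woken, tracker.memory_requests)
```

In this sketch the second load costs no extra memory request and no extra table entry, while the fourth load stalls because only two lines can be outstanding at once.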

It's these sorts of technical details, which don't appear in the gross specs (and sometimes not even in the detailed CPU descriptions) that make so much difference. They are obviously astonishingly difficult to get right. Intel has the manpower to worry about every one of them, AMD does not.

Point is --- if I had to look for a single difference between the two, that's what I'd be looking at --- how much time is REALLY wasted waiting on DRAM in SB vs on Bulldozer.
misiu_mp - Monday, June 11, 2012 - link
It is the compiler's and Out-Of-Order engine's job to order loads, stores and other instructions to minimize the total execution time.
So making sure no stupid and unnecessary loads are being committed is what the OOO mechanism normally does.
There is no reason to suspect it is fundamentally broken in Bulldozer.
IceDread - Friday, June 1, 2012 - link
It really is simple: AMD made a huge mistake.

The product is a bust, simple as that.

The next generation or the generations after that might be a whole different matter, but guess what? No one cares. It won't help the poor souls that bought this busted product.

It's annoying that AMD could not do better because now Intel reigns supreme and competes with itself.
mikato - Friday, June 1, 2012 - link
I know I shouldn't feed the trolls but...
You say next generations might be a whole different matter - well what do you think is the point of learning about the Bulldozer architecture? The next generations are based on it.
IceDread - Monday, June 4, 2012 - link
What is the point of releasing a product that does not outperform its predecessor?
Hope that people will purchase the product anyway and learn it?
Which companies would be interested in this, how many? Why would they invest money into this?
_vor_ - Saturday, June 2, 2012 - link
Yes. I too would be interested in exactly which aspects you think Bulldozer failed in, and in your design ideas and approach for fixing them. Do tell.
wiyosaya - Friday, June 1, 2012 - link
Personally, I think it is always nice to see in-depth articles like this that explain the details of the structure of a processor.

To me, it sounds like AMD has a foundation that, with a few well-directed tweaks, may put them in contention with Intel again in the CPU arena. Though AMD has said that they are through competing with Intel, I truly hope this is not the case. Perhaps this is a marketing tactic to remove focus from themselves after the enthusiast arena panned BD and its siblings.

I've built my systems with AMD for a long time; however, this time I went with Intel because I thought they had the better value. Perhaps the future will bring me back to AMD, however, I cannot see doing so right now simply because Intel has become the "value" line over AMD.

With an i7-3820 in my most recent rig, I think I picked the SB-E value processor. I run more than games, and some of what I run takes advantage of quad-channel memory.

In any event, I'm set for a while. Perhaps AMD will once again produce a superior product by the time I am ready for my next build.
jamyryals - Friday, June 1, 2012 - link
What a great read, thanks!
SocketF - Friday, June 1, 2012 - link
Hi Johan,

thanks for the test, it is great.

However, on page 9 you have some trouble with percentage calculations. You wrote:

quote:
-------------------
We get a 65% speed up (2x 0.71 vs 0.86), which is somewhat lower than the 80% predicted by the AMD slides discussing CMT.
-------------------
These numbers are totally correct and within AMD's predictions. AMD promised 80% performance for the CMT-Bulldozer module, compared to a hypothetical Bulldozer CMP core, i.e. 2 (single) cores.

So you have to double your single-thread results, to get the score of 2 (single) Bulldozer cores (2 CMP cores). That gives: 0.86 x 2 = 1.72

Now compare that to the real performance of 2 CMT cores of one module, which is 0.71 x 2 = 1.42

1.42 are 82.6% of 1.72, which is better than AMD's 80% claim. Thus their claim holds. Everything's fine, don't worry.
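Both readings of the same two numbers can be checked with a short Python sketch. The variable names are invented for illustration; the 0.86 and 0.71 values are the ones quoted from the article above.

```python
single_thread_ipc = 0.86     # one thread running alone on a module
per_thread_ipc_cmt = 0.71    # each of two threads sharing a module

# Throughput of one module running two threads
module_throughput = 2 * per_thread_ipc_cmt                 # 1.42

# Reading 1 (article): speedup of two threads over one thread
speedup_vs_one_thread = module_throughput / single_thread_ipc - 1

# Reading 2 (this comment): module vs. two hypothetical independent cores
hypothetical_two_cores = 2 * single_thread_ipc             # 1.72
cmt_efficiency = module_throughput / hypothetical_two_cores

print(f"Speedup over one thread: {speedup_vs_one_thread:.1%}")
print(f"Share of two full cores: {cmt_efficiency:.1%}")
```

The first figure comes out at roughly 65% and the second at roughly 82.6%, so the two statements describe the same measurement against different baselines.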

Source of AMD's claim is e.g. here:
http://techreport.com/r.x/bulldozer-uarch/bulldoze...
(sorry, didn't find it on anandtech)

Please update your article accordingly.

Oh, and one last question: why did you add up the SMT scores but not the CMT scores? An IPC of "two threads" seems odd; this is just weird. Furthermore, it is somewhat useless, because you cannot compare it directly with the CMT scores. A diagram should visualize the results, not force the reader to do some recalculations.

Thanks again

Erik
