The Bulldozer Aftermath: Delving Even Deeper

Author: Johan De Gelas

by Johan De Gelas on May 30, 2012 1:15 AM EST

Zooming in on SPEC CPU2006: the Bad

The optimized SPEC CPU2006 int binaries allow gains in the range of 30% to 117%. Unfortunately, the complete benchmark suite only shows a gain of 21% when we compare the Opteron 6276 with the 6176. Closer inspection shows that four benchmarks regress. The regression appears to be small in most benchmarks (7 to 14%), but remember that we have 33% more cores. Spread over a third more cores, even a small regression of 7% means that we are losing roughly 30% of the previous architecture's per-core (single-threaded) performance!
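The per-core arithmetic behind that claim can be sketched as follows. This is a minimal illustration, not part of the original measurements: the function name `per_core_change` is hypothetical, the 16 versus 12 core counts are assumed from the "33% more cores" statement, and the sketch assumes throughput scales linearly with core count, which real SPEC rate runs only approximate.

```python
def per_core_change(total_change: float, core_ratio: float) -> float:
    """Return the relative change in per-core throughput.

    total_change: change in total score, e.g. -0.07 for a 7% regression.
    core_ratio:   new core count divided by old core count.
    Assumes throughput scales linearly with core count (an idealization).
    """
    return (1.0 + total_change) / core_ratio - 1.0


# Assumption: 16 cores (Opteron 6276) versus 12 cores (Opteron 6176),
# which matches the "33% more cores" figure in the text.
CORE_RATIO = 16 / 12

for label, change in [
    ("7% regression", -0.07),
    ("14% regression", -0.14),
    ("21% suite-wide gain", 0.21),
]:
    print(f"{label}: per-core change {per_core_change(change, CORE_RATIO):+.1%}")
```

Under these assumptions, a 7% drop in total score works out to about a 30% drop per core, and even the 21% suite-wide gain is a per-core loss of roughly 9%.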

SPEC Int CPU2006: the Bulldozer unfriendly

Perlbench has high locality in the L1 and L2 caches and rarely accesses the Last Level Cache, let alone main memory. The result is a benchmark that delivers high IPC: 1.67 on a five-year-old Core 2 Duo ("Merom"), and roughly 1.9 on the latest Intel CPUs. It is worth noting that h264ref and Perlbench are among the top IPC performers in the SPEC CPU2006 suite.

Sjeng (chess) and Gobmk (Go) are both Artificial Intelligence benchmarks. Again, the IPC is relatively high (>1), but their most important performance characteristic is that they contain a very high percentage of hard-to-predict branches: twice the average of the SPEC CPU2006 integer suite.

Granted, the evidence we've presented is still circumstantial. It would take an extremely long and intensive profiling session on all new processors to really determine what is going on, and that is beyond our time budget: one SPEC CPU run alone consumes a whole day. However, we did get our hands dirty. A short profiling session on three different benchmarks gives us some very interesting results that we want to discuss next.

84 Comments

Spunjji - Wednesday, June 6, 2012 - link
Agreed. That will be nice!
haukionkannel - Wednesday, May 30, 2012 - link
Very nice article! Can we get a more thorough explanation of the µop cache? It seems to be an important part of Sandy Bridge, and you predict that it would help Bulldozer...
How complex is it to implement, and how heavily has it been licensed?
JohanAnandtech - Thursday, May 31, 2012 - link
Don't think there is a license involved. AMD has their own "macro ops", so they can do a macro ops cache. Unfortunately, I cannot answer your question off the top of my head on how easy it is to do; I would have to do some research first.
name99 - Thursday, May 31, 2012 - link
Oh for fsck's sake.
The stupid spam filter won't let me post a URL.

Do a google search for
sandy bridge Real World Technologies
and look at the main article that comes up.
SocketF - Friday, June 1, 2012 - link
It is already planned. AMD has a patent for something like that; google for "Redirect Recovery Cache". Dresdenboy found it back in 2009:

http://citavia.blog.de/2009/10/02/return-of-the-tr...

The BIG Question is:
Why did AMD not implement it yet?

My guess is that they were already very busy with the whole CMT approach. Maybe Steamroller will bring it; there are some credible rumors in that direction.
yuri69 - Wednesday, May 30, 2012 - link
Howdy,
First of all, thanks for the effort to investigate the shortcomings of this µarch :)

Quoting M. Butler (BD's chief architect): 'The pipeline within our latest "Bulldozer" microarchitecture is approximately 25 percent deeper than that of the previous generation architectures.' With 12 stages on K8/K10, this gives us 12 * 1.25 = 15 stages.

Btw, all the major architectural improvements and features for the upcoming BD successor line were set in stone a long time ago. Remember, it takes 4-5 years for a general purpose CPU to go from the initial draft to mass availability. The stage when you can still move and bend stuff seems to be around the halfway point of this period.
BenchPress - Wednesday, May 30, 2012 - link
"This means that Bulldozer should be better at extracting ILP (Instruction Level Parallelism) out of code that has low IPC (Instructions Per Clock)."

This should be reversed. ILP is inherent to the code, and it's the hardware's job to extract it and achieve a high IPC.
Arnulf - Wednesday, May 30, 2012 - link
Ugh, so much crap in a single article ... this should never have been posted on AT.

You weren't promised anything. You came across a website put up by some "fanboy" dumbass and you're actually using it as a reference. Why not quote some actual references (such as transcripts of the conference where T. Seifert clearly stated that gains are expected to be in line with core number increase, i.e. ~33%) instead of rehashing this Fruehe nonsense ?
erikvanvelzen - Wednesday, May 30, 2012 - link
Yes AMD totally set out to make a completely new architecture with a massive increase in transistors per core but 0 gains in IPC.

Don't fool yourself.
Homeles - Wednesday, May 30, 2012 - link
It's a more intelligent analysis than your sorry ass could ever produce. Getting hung up on one quote... really?
