IPC Analysis: What Is Going On?

We decided to focus our attention on our MS SQL Server benchmark and profile it on the most important hardware events (IPC, cache, and branch prediction). We used Intel's VTune Amplifier XE 2011 and AMD's Code Analyst 3.4.1037.88 to get a better understand of this benchmark. To put things into perspective, we compared the results with the extremely popular Cinebench benchmark and the 7-Zip compression benchmark.

Note that VTune has a rather steep learning curve and the numbers presented are more detailed but also harder to interprete than those of Code Analyst. In some cases we had doubts about our measurements on the brand spanking new Xeon E5. That is why we are refraining from publishing those numbers until we are absolutely sure they are accurate, so some of the Xeon E5 numbers are missing.

IPC Workloads

* We add the IPC of the two threads up

It is interesting to note the high Instruction Level Parallelism (ILP) that the Xeon E5 "Sandy Bridge" is able to extract out of these server benchmarks. Almost 1.4 instructions per clock cycle are retired and if you add SMT (Simultaneous Multi Threading), another 0.4 IPC flows through a single core. That is pretty remarkable for a benchmark that consists mostly of SQL statements that result in many branches and loads.

Our Opteron 6200 reveals a bit more about its internal working. Using the extra integer cluster inside the Opteron module causes the separate threads to slow down somewhat. In the case of Cinebench, this is not a real surprise since it contains a lot of SSE floating point commands; a single thread can have the out of order FP cluster all to itself while two threads have to share the SMT capable floating point engine.

But in case of our datamining benchmarks, something else is going on. Single-threaded performance regresses by 18% once you enable the second cluster. We get a 65% speed up (2x 0.71 vs 0.86), which is somewhat lower than the 80% predicted by the AMD slides discussing CMT. So some of the shared resources are slowing down the total performance of our module. We will find out more on the next page.

Zooming in on SPEC CPU 2006: the Bad Caching Analysis
Comments Locked

84 Comments

View All Comments

  • Spunjji - Wednesday, June 6, 2012 - link

    Agreed. That will be nice!
  • haukionkannel - Wednesday, May 30, 2012 - link

    Very nice article! Can we get more thorough explanation about µop cache? It seems to be important part of Sandy bridge and you predict that it would help bulldoser...
    How complex it is to do and how heavily it has been lisensed?
  • JohanAnandtech - Thursday, May 31, 2012 - link

    Don't think there is a license involved. AMD has their own "macro ops" so they can do a macro ops cache. Unfortunately I can not answer your question of the top of head on how easy it is to do, I would have to some research first.
  • name99 - Thursday, May 31, 2012 - link

    Oh for fsck's sake.
    The stupid spam filter won't let me post a URL.

    Do a google search for
    sandy bridge Real World Technologies
    and look at the main article that comes up.
  • SocketF - Friday, June 1, 2012 - link

    It is already planned, AMD has a patent for sth like that, google for "Redirect Recovery Cache". Dresdenboy found it already back in 2009:

    http://citavia.blog.de/2009/10/02/return-of-the-tr...

    The BIG Question is:
    Why did AMD not implement it yet?

    My guess is that they were already very busy with the whole CMT approach. Maybe Streamroller will bring it, there are some credible rumors in that direction.
  • yuri69 - Wednesday, May 30, 2012 - link

    Howdy,
    FOA thanks for the effort to investigate the shortcomings of this march :)

    Quoting M. Butler (BD's chief architect): 'The pipeline within our latest "Bulldozer" microarchitecture is approximately 25 percent deeper than that of the previous generation architectures. ' This gives us 12 stages on K8/K10 => 12 * 1.25 = 15.

    Btw all the major and significant architectural improvements & features for the upcoming BD successor line were set in stone long time ago. Remember, it takes 4-5 years for a general purpose CPU from the initial draft to mass availability. The stage when you can move and bend stuff seems to be around half of this period.
  • BenchPress - Wednesday, May 30, 2012 - link

    "This means that Bulldozer should be better at extracting ILP (Instruction Level Parallelism) out of code that has low IPC (Instructions Per Clock)."

    This should be reversed. ILP is inherent to the code, and it's the hardware's job to extract it and achieve a high IPC.
  • Arnulf - Wednesday, May 30, 2012 - link

    Ugh, so much crap in a single article ... this should never have been posted on AT.

    You weren't promised anything. You came across a website put up by some "fanboy" dumbass and you're actually using it as a reference. Why not quote some actual references (such as transcripts of the conference where T. Seifert clearly stated that gains are expected to be in line with core number increase, i.e. ~33%) instead of rehashing this Fruehe nonsense ?
  • erikvanvelzen - Wednesday, May 30, 2012 - link

    Yes AMD totally set out to make a completely new architecture with a massive increase in transistors per core but 0 gains in IPC.

    Don't fool yourself.
  • Homeles - Wednesday, May 30, 2012 - link

    It's a more intelligent analysis than your sorry ass could ever produce. Getting hung up on one quote... really?

Log in

Don't have an account? Sign up now