SAP S&D profiled

The SAP S&D 2-Tier benchmark has always been one of my favorites. This is probably the most real world benchmark of all server benchmarks done by the vendors. It is a full blown application living on top of a heavy relational database. And don't forget that SAP is one of the most successful software companies out there, the undisputed market leader of Enterprise Resource Planning.

Profiling this benchmark is beyond the capabilities of our lab but Intel shared some of their profiling data when they compared the Xeon E5 with the Xeon 5600. This gives us very interesting insights in how the SAP application behaves.

  SAP S&D SPEC Int 2006
Typical IPC (on Intel Westmere) 0.5 1.1
Typical IPC (on Intel Sandy Bridge) 0.55 1.29
Branches 18% 19%
Mispredictions 0.9% 1.1%
Loads (percentage of instruction mix) 32% 28%
Stores (percentage of instruction mix) 16% 11%

Besides the high level profiling numbers, quite a few details surfaced. For example, increasing the ROB (ReOrder Buffer) from 128 (Westmere) to 168 (Sandy Bridge) reduced the ROB stalls from 10% to almost nothing. Increasing the load buffers from 48 to 64 reduced the load buffers stalls to one fifth of what they were before! This clearly shows that SAP puts quite a bit of pressure on both the ROB and the load units. The application finds ample integer processing power in most modern processors, but it is limited by how fast data can be loaded and how well the Out of Order engine (of which the ROB is the primary buffer) is able to hide the load latency.

Further data confirms this. It is was my understanding that the hardware prefetchers of Sandy Bridge were improved a bit compared to Westmere/Nehalem, but in fact the smarter prefetchers are able to reduce the L2 cache misses by no less than 40%! Now, consider that in most SPEC CPU int 2006 benchmarks only 1 to 10 instructions out of 1000 typically miss the L2 cache. In contrast, in SAP, about 40 out of 1000 instructions miss the small 256KB L2 cache of the Westmere Xeon 5600, which is in the same range as the most memory intensive application in the SPEC CPU2006 int CPU suite (mcf).

SAP is thus an application that misses the L2 cache much more than most applications out there, with the exception of some exotic HPC apps. The better prefetchers inside Sandy Bridge make much better use of the extra bandwidth available and reduce the L2 and L1 misses. Hence, these improved prefetchers are probably one of the main reasons why Sandy Bridge performs better.

Interestingly, the L1 instruction cache misses were halved, and most of the L2 cache miss reduction came from instruction prefetching (less than half the cache misses). Data requests could not be prefetched.

So the end conclusion about SAP is:

  1. The application has very low instruction level parallelism (ILP) and as a result is not taxing the integer units much.
  2. The application has a relatively large but "prefetcheable" instruction footprint, which allows the prefetchers to reduce the instruction related cache misses
  3. The application has a massive and random data footprint, putting great pressure on the load subsystem. As a result the out of order engine has to hide the latency the best it can, and large ROB and load buffers help a lot. The latency of the memory subsystem matters.

Combine this with the fact that the SAP application has a high amount of TLP (Thread Level Parallism) and you'll understand that this is an application ideally suited for Hyper-Threading and Clustered Multi-Threading. Hyper-Threading for example is good for a 30% performance boost. The SAP S&D benchmark is a prime example on how a CPU architecture can be more server or more consumer oriented. The charactheristics of server applications are vastly different from the software that we run on our laptops and desktops.

SAP will hardly be limited by the lower integer execution resources of the individual Bulldozer integer cores. Bulldozer has vastly improved prefetching capabilities and larger OOO buffers. Add to this the 33% higher core count, and we should expect Bulldozer to outperform Magny-Cours chips by at least 33%, as the SAP benchmark emphasizes the strong points of the individual Bulldozer core without stressing the weak points (lower integer throughput). However, we are nowhere near 33% better performance, let alone the 50% higher throughput once promised by AMD. Why?

We have uncovered some additional understanding with the above information, but our job is not done yet.

Reevaluating the Situation SPEC CPU 2006 Integer
Comments Locked

84 Comments

View All Comments

  • Spunjji - Wednesday, June 6, 2012 - link

    Agreed. That will be nice!
  • haukionkannel - Wednesday, May 30, 2012 - link

    Very nice article! Can we get more thorough explanation about µop cache? It seems to be important part of Sandy bridge and you predict that it would help bulldoser...
    How complex it is to do and how heavily it has been lisensed?
  • JohanAnandtech - Thursday, May 31, 2012 - link

    Don't think there is a license involved. AMD has their own "macro ops" so they can do a macro ops cache. Unfortunately I can not answer your question of the top of head on how easy it is to do, I would have to some research first.
  • name99 - Thursday, May 31, 2012 - link

    Oh for fsck's sake.
    The stupid spam filter won't let me post a URL.

    Do a google search for
    sandy bridge Real World Technologies
    and look at the main article that comes up.
  • SocketF - Friday, June 1, 2012 - link

    It is already planned, AMD has a patent for sth like that, google for "Redirect Recovery Cache". Dresdenboy found it already back in 2009:

    http://citavia.blog.de/2009/10/02/return-of-the-tr...

    The BIG Question is:
    Why did AMD not implement it yet?

    My guess is that they were already very busy with the whole CMT approach. Maybe Streamroller will bring it, there are some credible rumors in that direction.
  • yuri69 - Wednesday, May 30, 2012 - link

    Howdy,
    FOA thanks for the effort to investigate the shortcomings of this march :)

    Quoting M. Butler (BD's chief architect): 'The pipeline within our latest "Bulldozer" microarchitecture is approximately 25 percent deeper than that of the previous generation architectures. ' This gives us 12 stages on K8/K10 => 12 * 1.25 = 15.

    Btw all the major and significant architectural improvements & features for the upcoming BD successor line were set in stone long time ago. Remember, it takes 4-5 years for a general purpose CPU from the initial draft to mass availability. The stage when you can move and bend stuff seems to be around half of this period.
  • BenchPress - Wednesday, May 30, 2012 - link

    "This means that Bulldozer should be better at extracting ILP (Instruction Level Parallelism) out of code that has low IPC (Instructions Per Clock)."

    This should be reversed. ILP is inherent to the code, and it's the hardware's job to extract it and achieve a high IPC.
  • Arnulf - Wednesday, May 30, 2012 - link

    Ugh, so much crap in a single article ... this should never have been posted on AT.

    You weren't promised anything. You came across a website put up by some "fanboy" dumbass and you're actually using it as a reference. Why not quote some actual references (such as transcripts of the conference where T. Seifert clearly stated that gains are expected to be in line with core number increase, i.e. ~33%) instead of rehashing this Fruehe nonsense ?
  • erikvanvelzen - Wednesday, May 30, 2012 - link

    Yes AMD totally set out to make a completely new architecture with a massive increase in transistors per core but 0 gains in IPC.

    Don't fool yourself.
  • Homeles - Wednesday, May 30, 2012 - link

    It's a more intelligent analysis than your sorry ass could ever produce. Getting hung up on one quote... really?

Log in

Don't have an account? Sign up now