SAP S&D profiled

The SAP S&D 2-Tier benchmark has always been one of my favorites. This is probably the most real world benchmark of all server benchmarks done by the vendors. It is a full blown application living on top of a heavy relational database. And don't forget that SAP is one of the most successful software companies out there, the undisputed market leader of Enterprise Resource Planning.

Profiling this benchmark is beyond the capabilities of our lab but Intel shared some of their profiling data when they compared the Xeon E5 with the Xeon 5600. This gives us very interesting insights in how the SAP application behaves.

  SAP S&D SPEC Int 2006
Typical IPC (on Intel Westmere) 0.5 1.1
Typical IPC (on Intel Sandy Bridge) 0.55 1.29
Branches 18% 19%
Mispredictions 0.9% 1.1%
Loads (percentage of instruction mix) 32% 28%
Stores (percentage of instruction mix) 16% 11%

Besides the high level profiling numbers, quite a few details surfaced. For example, increasing the ROB (ReOrder Buffer) from 128 (Westmere) to 168 (Sandy Bridge) reduced the ROB stalls from 10% to almost nothing. Increasing the load buffers from 48 to 64 reduced the load buffers stalls to one fifth of what they were before! This clearly shows that SAP puts quite a bit of pressure on both the ROB and the load units. The application finds ample integer processing power in most modern processors, but it is limited by how fast data can be loaded and how well the Out of Order engine (of which the ROB is the primary buffer) is able to hide the load latency.

Further data confirms this. It is was my understanding that the hardware prefetchers of Sandy Bridge were improved a bit compared to Westmere/Nehalem, but in fact the smarter prefetchers are able to reduce the L2 cache misses by no less than 40%! Now, consider that in most SPEC CPU int 2006 benchmarks only 1 to 10 instructions out of 1000 typically miss the L2 cache. In contrast, in SAP, about 40 out of 1000 instructions miss the small 256KB L2 cache of the Westmere Xeon 5600, which is in the same range as the most memory intensive application in the SPEC CPU2006 int CPU suite (mcf).

SAP is thus an application that misses the L2 cache much more than most applications out there, with the exception of some exotic HPC apps. The better prefetchers inside Sandy Bridge make much better use of the extra bandwidth available and reduce the L2 and L1 misses. Hence, these improved prefetchers are probably one of the main reasons why Sandy Bridge performs better.

Interestingly, the L1 instruction cache misses were halved, and most of the L2 cache miss reduction came from instruction prefetching (less than half the cache misses). Data requests could not be prefetched.

So the end conclusion about SAP is:

  1. The application has very low instruction level parallelism (ILP) and as a result is not taxing the integer units much.
  2. The application has a relatively large but "prefetcheable" instruction footprint, which allows the prefetchers to reduce the instruction related cache misses
  3. The application has a massive and random data footprint, putting great pressure on the load subsystem. As a result the out of order engine has to hide the latency the best it can, and large ROB and load buffers help a lot. The latency of the memory subsystem matters.

Combine this with the fact that the SAP application has a high amount of TLP (Thread Level Parallism) and you'll understand that this is an application ideally suited for Hyper-Threading and Clustered Multi-Threading. Hyper-Threading for example is good for a 30% performance boost. The SAP S&D benchmark is a prime example on how a CPU architecture can be more server or more consumer oriented. The charactheristics of server applications are vastly different from the software that we run on our laptops and desktops.

SAP will hardly be limited by the lower integer execution resources of the individual Bulldozer integer cores. Bulldozer has vastly improved prefetching capabilities and larger OOO buffers. Add to this the 33% higher core count, and we should expect Bulldozer to outperform Magny-Cours chips by at least 33%, as the SAP benchmark emphasizes the strong points of the individual Bulldozer core without stressing the weak points (lower integer throughput). However, we are nowhere near 33% better performance, let alone the 50% higher throughput once promised by AMD. Why?

We have uncovered some additional understanding with the above information, but our job is not done yet.

Reevaluating the Situation SPEC CPU 2006 Integer
Comments Locked

84 Comments

View All Comments

  • Schmide - Wednesday, May 30, 2012 - link

    I do remember from some analysis that the L2 cache reads were as slow as main memory. That's great if you hit a L2 cache, but it's not going to buy you anything if it's that slow.
  • SocketF - Wednesday, May 30, 2012 - link

    Impossible, you probably mix some things up, maybe latency and bandwidth?
  • Schmide - Wednesday, May 30, 2012 - link

    Yup. It was late at night, I was thinking writes. the L1 write through basically makes L1 writes the same as L2 writes.
  • Homeles - Wednesday, May 30, 2012 - link

    Not even close. L2 is about 10 times faster than main memory.

    http://www.anandtech.com/show/4955/the-bulldozer-r...
  • jcollake - Wednesday, May 30, 2012 - link

    Through research here at Bitsum on the AMD Bulldozer platform (specifically the 9150), I found a couple things of interest.

    First, disabling CPU core parking seems to make a big difference in performance. I believe that by default the CPU core parking is just too aggressive. I wrote a tool to let you enable or disable CPU parking in *real time* without a reboot, so you can test this yourself. It is called ParkControl, http://bitsum.com/about_cpu_core_parking.php . For *me*, it seemed to make a night and day difference.

    Second, I am working on a neat little benchmarking tool called ThreadRacer, currently only in alpha prototype. It allows you to really see the effects of these paired cores, and how much it matters that the scheduler is properly aware of them. Take this 1 second or so sample, as seen in the screenshot here (downloads available, but it is an early prototype that I'll quickly be finishing up): http://bitsum.com/forum/index.php/topic,1434.0.htm...

    The scheduler update that Microsoft issued of course treats these paired cores as it would a hyper-threaded core. Indeed, the concept is very similar, except perhaps to avoid patents, AMD took the 'share a little' instead of 'share a lot' approach when it comes to shared computational resources. This was the proper way to *quickly* address the issue, but I believe the scheduler is still suboptimal on these processors (likely to be resolved in Windows 8 or a later update to Windows 7/Vista).

    For Bulldozer, as you know, they are two real processors, but because they have shared dependencies, the performance can really be drained if the other processor in the 'pair' is busy. You can see the effects from ThreadRacer, the core without its pair busy quickly out-paced the paired cores that were both busy.
  • jcollake - Wednesday, May 30, 2012 - link

    I should have also mentioned that ThreadRacer also allows you to see how a single CPU consuming thread gets swapped around to different cores (the multi-core thread in the utility). This is its other use. The less the thread gets swapped from core to core, the greater the performance will be. It is interesting to compare and contrast the behavior of the scheduler. I fully believe that most the problems with Bulldozer are due to the Windows scheduler, something that could be tested by using linux and replacing the scheduler with a custom one, or an off the shelf alternative that may behave substantially differently than the Windows scheduler.
  • SocketF - Wednesday, May 30, 2012 - link

    Some people running BOINC programs have reported that Windows-applications run faster when they use a Linux and WINE or a VM.

    The Win-scheduler especially hurts AMD chips, because of the huge exclusive caches. If a thread on an intel CPU is switched to another core, it can load the warmed up L2 portion from the L2 inclusive L3.

    I did some google-search and it seems that under Linux, each core has its own run-queue, whereas on Windows, there is only one run queue for all cores.

    But i didn't delve into it deeply, there are so many different schedulers for Linux, seems to be a complex issue ;-)

    Btw. your link to download is off limits for non-members of your discussion board:
    -------------------------
    Warning!

    The topic or board you are looking for appears to be either missing or off limits to you.
    Please login below or register an account with Bitsum Forums.
    ----------------------------

    Maybe you can upload it somewhere else?
  • jcollake - Saturday, September 1, 2012 - link

    Sorry for the late reply. First, the forum permissions were fixed. Second, the utility (still in early stages) is included in Process Lasso *and* available here: http://bitsum.com/threadracer.php
  • eoerl - Wednesday, May 30, 2012 - link

    Very interesting article, together with the hardware.fr report there's a lot of information. One question though, if you read commentaries : you didn't speak much about the influence of compilers. This proved to change a lot of things on Linux (see phoronix extensive tests on both ivy bridge and bulldozer depending on compiler used and compiler options, for example
    http://www.phoronix.com/scan.php?page=article&...
    http://www.phoronix.com/scan.php?page=article&...
    Benchmark results really change a lot with bulldozer, much more than with ivy or sandy bridge. Do you think AMD lost being oversensitive to compiler optimisations, due to a very original architecture ?
  • JohanAnandtech - Thursday, May 31, 2012 - link

    I deliberately avoided the compiler issues as this would make the article too convoluted. But notice that what we found is not influenced by compiler choice: we find the same indications in SAP and SQL server (compiled by "conservative" compilers and compiler settings) as in CPU CPU 2006, which uses the best optimized settings and compiler as possible.

Log in

Don't have an account? Sign up now