SAP S&D Benchmark

The SAP SD (Sales and Distribution, 2-tier internet configuration) benchmark is interesting because it is a real-world client-server application, in contrast to many server benchmarks (such as SpecJBB, SpecIntRate, etc.). We looked at SAP's benchmark database for these results. The results below were all obtained on Windows 2008 with an MS SQL Server 2008 database (both 64-bit).

Every 2-tier Sales & Distribution benchmark was performed with SAP's latest ERP 6 enhancement package 4. These results are NOT comparable with any benchmark performed before 2009. We analyzed the SAP benchmark in depth in one of our earlier articles. So far, our profile of the benchmark shows:

  • Very parallel resulting in excellent scaling
  • Likes large caches (memory latency)
  • Very sensitive to sync ("cache coherency") latency
  • Low IPC
  • Branch- and memory-intensive code

We managed to get even better profiling of the benchmark. IPC is as low as 0.5 (!) on the most modern Intel CPU architectures. About 48% of the instructions are loads and stores and 18% are branches. Mispredicted branches make up about one percent of all instructions, so the branch misprediction ratio is slightly higher than 5% on modern Intel cores.
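The misprediction ratio follows from the instruction mix quoted above. The sketch below assumes the 1% figure is a share of all instructions rather than of branches, since that is the only reading consistent with a ratio just above 5%. The variable names are ours, not part of any profiling tool.

```python
# Instruction-mix figures quoted in the text, as percentages of all instructions.
branch_share = 18.0        # branches
mispredicted_share = 1.0   # mispredicted branches (assumed: share of ALL instructions)

# Misprediction ratio = mispredicted branches / all branches.
misprediction_ratio = mispredicted_share / branch_share

print(f"Branch misprediction ratio: {misprediction_ratio:.1%}")
```

The result is 1/18, which sits between 5% and 6%, matching the "slightly higher than 5%" in the text.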

The instruction cache in particular is hit hard, and its hit rate is typically a lot lower than in other applications (probably 10% misses or lower). Even the large L3 caches are not capable of satisfying all requests. The SAP SD benchmark needs between 10 and 30GB/s of memory bandwidth, depending on how aggressive the prefetchers are.

SAP Sales & Distribution 2 Tier benchmark

SAP is one of the benchmarks that scales very well, and it shows: the server CPUs with the highest thread count are on top. We remember from older benchmarks that enabling Hyper-Threading (on Nehalem and later) boosts SAP's performance by 35%. As the IPC of a single SAP thread is relatively low (0.5 and lower), the decoding front end of the Bulldozer core should be able to handle this easily. Therefore, the extra integer cluster on the Opteron can really do its magic.

We don't have any Xeon X5650 benchmarks, but a quick calculation tells us that the new Opteron 6276 should be about 20% faster than the X5650. It is also about 18% faster, clock for clock, than the older Opteron 6176. The new Opteron does well here.


46 Comments

  • sonofgodfrey - Thursday, February 9, 2012 - link

    Have you explicitly tested one socket vs. two sockets? We've found an immense increase in contention once a cache-line has to be shared between sockets on some systems.
  • JohanAnandtech - Friday, February 10, 2012 - link

    That is one suggestion I will try out next week. Thanks!
  • Klimax - Thursday, February 9, 2012 - link

    Hello.

    Nice tests.

    However I would like to see MySQL tested on Windows Server 2008 R2
    Would be an interesting comparison.

    (Especially due to http://channel9.msdn.com/shows/Going+Deep/Arun-Kis... )
  • Klimax - Thursday, February 9, 2012 - link

    Title of post is wrong... (I have deleted second thing and forgot to fix title)
  • Scali - Thursday, February 9, 2012 - link

    Unless I'm mistaken, the Xeon 5650 is a 1.17B transistor chip, where the Interlagos 6276 is a 2.4B transistor chip.
    In that light, doesn't that make Intel's SMT implementation a lot better than CMT?
    I mean, yes CMT may give more of a performance boost when you increase the threadcount. But considering the fact that AMD spends more than twice the number of transistors on the chip... well, that's pretty obvious.
    AMD might as well just have used conventional cores.
    The true strength of SMT is not so much that it improves performance in multithreaded scenarios, but that it does so at virtually no extra cost in terms of transistors (and with little or no impact on the single-threaded performance either).
  • JohanAnandtech - Friday, February 10, 2012 - link

    Interlagos is a 1.2 billion transistor chip (maybe 1.3, but anyway). Most of those transistors are spent on the L3 cache: about 0.5 billion. Only 213 million transistors are in a module, and each module contains a 2 MB L2 cache, probably good for 120 million transistors. That leaves 90 million transistors for the core, and it has been stated that the second cluster added 12%. So that second cluster costs about 12 million transistors, or 48 million on the total 4-module die. That is less than 5% of the total transistor count, but you get a 30-90% performance boost!

    So for AMD, this was clearly a great choice.

    SMT is perfect for Intel, as the Intel architecture puts all instructions in one big ROB.

    For very low IPC server workloads, I think the CMT approach gives better results. Unfortunately, AMD lowered some of the CMT benefits by keeping the data cache so small and the associativity of the Icache so low.
  • Scali - Friday, February 10, 2012 - link

    Uhhh, I think you're wrong here... the 4-module Bulldozer is a 1.2B chip (Zambezi). But you tested the 8-module Interlagos (16 threads), which is TWO Zambezi dies in one package.
    Hence 2*1.2 = 2.4B transistors.
  • JohanAnandtech - Friday, February 10, 2012 - link

    Ok, it is two chips of 1.2 billion. That doesn't change anything about our analysis of CMT.
  • Scali - Friday, February 10, 2012 - link

    Not in the article, because you did not factor in transistor count (which is the flaw I tried to point out in the first place... comparing two chips, where one has twice the transistor count of the other, is quite the apples-to-oranges comparison. One would expect a chip with twice the transistor count to be considerably better in multithreading scenarios, not 'catching up' to the smaller chip).

    But in your above post, I think it changes everything about your analysis. All your figures have to be multiplied by two.
    Which makes it a very poor comparison, not only to Intel, but also to AMD's own previous line of CPUs.
    The 6174 Magny Cours is actually beating Interlagos, with 'only' 12 threads, no kind of CMT/SMT, and 'only' 1.8B transistors.

    How does that make CMT look like a great choice for AMD?
  • slycer.tech - Friday, February 10, 2012 - link

    From what I read on the benchmark configuration page, Anand used 2x Intel Xeon X5650. So 2x 1.17B = 2.34B. I think that is comparable to the AMD CPU used in this test. Am I right?
