ERP benchmark 1: SAP SD

The SAP SD (Sales and Distribution, 2-tier internet configuration) benchmark is an interesting benchmark as it is a real-world client-server application. We decided to take a look at SAP's benchmark database. The results below are 2-tier benchmarks, so the database and the underlying OS can make a difference. It is best to keep those parameters the same, although the type of database (Oracle, MS SQL Server, MaxDB, or DB2) makes only a small difference. The results below all run on Windows 2003 Enterprise Edition with an MS SQL Server 2005 database (both 64-bit). Every "2-tier Sales & Distribution" benchmark was performed on SAP's "ERP release 2005".

In our previous server-oriented article, we summed up a rough profile of SAP S&D:

  • Very parallel, resulting in excellent scaling
  • Low to medium IPC, mostly due to "branchy" code
  • Somewhat limited by memory bandwidth
  • Likes large caches (memory latency!)
  • Very sensitive to sync ("cache coherency") latency
[Chart: SAP Sales & Distribution 2-Tier benchmark results]
(*) Estimate based on Intel's internal testing

If you focus on the cores only, the differences between the Xeon 55xx "Nehalem" and the previous generation Xeon 54xx "Harpertown" and Xeon 53xx "Clovertown" are relatively small. The enormous differences in SAP scores are solely a result of Hyper-Threading, the "uncore", and the NUMA platform. According to SAP benchmark specialist Tuan Bui (Intel), enabling Hyper-Threading accounts for a 31% performance boost. Using somewhat higher clocked DDR3 (1066 instead of 800, or 1333 instead of 1066) is good for another 2-3%. Enabling the hardware prefetcher provides another 3%, and Turbo mode adds almost 5%. As this SAP benchmark scales almost perfectly with clock speed, that means the Xeon X5570 at 2.93GHz was in fact running at 3.07GHz on average.
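These individual gains compound multiplicatively. A quick sanity check of the numbers quoted above (a sketch; the factors are rounded estimates, and the 4.8% Turbo figure is inferred from the 3.07GHz/2.93GHz ratio rather than stated directly):

```python
# Rough compounding of the speedup factors quoted above
# (rounded estimates, not exact measurements).
ht = 1.31        # Hyper-Threading: +31%
memory = 1.025   # faster DDR3: +2-3% (midpoint)
prefetch = 1.03  # hardware prefetcher: +3%
turbo = 1.048    # Turbo mode: "almost 5%"

combined = ht * memory * prefetch * turbo
print(f"Combined uplift: {combined:.2f}x")  # roughly 1.45x

# Since SAP SD scales almost linearly with clock speed, the Turbo
# gain can be read as an effective clock speed increase:
base_clock = 2.93                      # GHz, Xeon X5570
effective_clock = base_clock * turbo
print(f"Effective clock: {effective_clock:.2f} GHz")  # ~3.07 GHz
```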

Consider the following facts:

  • The quad-core AMD Opteron 8384 at 2.7GHz has no problem beating the higher clocked Xeon 5470 at 3.3GHz.
  • It is well known that the raw integer power of the Xeon 54xx is a lot higher than that of any Opteron (just take a look at SPECint2006).
  • Faster memory, and thus bandwidth, plays only a minor role in the SAP benchmark.
  • SAP threads share a lot of data (as is typical for these kinds of database driven applications).

It is clear that fast synchronization between the L2 caches (which happens in the shared L3 cache) and fast inter-CPU synchronization (which happens via dedicated interconnects) are what made AMD's "native quad-cores" the winners in this benchmark. Slow cache synchronization is probably the main reason why the integer crunching power hidden deep inside the "Harpertown" cores did not result in better performance.

Take the same (slightly improved) core and give it the right cache architecture (an L3 cache as a quick syncing point for the L2s) and a NUMA platform with fast CPU interconnects, and all that integer power is unleashed. The result: clock for clock, the Nehalem Xeon X5570 is about 66% faster than its predecessor (19000 vs. 11420). Add SMT (Simultaneous Multi-Threading) and you allow the integer core to process a second thread when the first is bogged down by one of those pesky branches. With that, the last hurdle to supreme SAP performance is cleared: the eight-core "Nehalem" server is just as fast as a 24-core "Dunnington" and 80% faster than the competition.
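The clock-for-clock claim is easy to verify from the two scores quoted above (a quick check, nothing more):

```python
# Clock-for-clock speedup of the Nehalem Xeon X5570 over its
# predecessor, from the SAP SD scores quoted in the text.
nehalem_score = 19000
harpertown_score = 11420

speedup = nehalem_score / harpertown_score - 1
print(f"Clock-for-clock gain: {speedup:.0%}")  # about 66%
```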

AMD has just launched the Opteron 2389 at 2.9GHz. We estimate that this will bring AMD's best SAP score to about 14800, which would lower Nehalem's advantage to roughly 70%. Unfortunately for AMD, that is still a very large lead!

Comments

  • snakeoil - Monday, March 30, 2009 - link

    oops, it seems that hyperthreading is not scaling very well. Too bad for Intel.
  • eva2000 - Tuesday, March 31, 2009 - link

    Bloody awesome results for the new 55xx series. Can't wait to see some of the larger vBulletin forums online benefiting from these monsters :)
  • ssj4Gogeta - Monday, March 30, 2009 - link

    huh?
  • ltcommanderdata - Monday, March 30, 2009 - link

    I was wondering if you got any feeling for whether Hyper-Threading scales better on Nehalem than on Netburst? And if so, do you think this is due to improvements made to HT itself in Nehalem, just due to Nehalem's 4+1 instruction decoders and extra execution units, or because software is better optimized for multithreading/hyperthreading now? Maybe I'm thinking mostly desktop, but HT had kind of a hit or miss reputation in Netburst, and it'd be interesting to see if it just came before its time.
  • TA152H - Monday, March 30, 2009 - link

    Well, for one, the Nehalem is wider than the Pentium 4, so that's a big issue there. On the negative side (with respect to HT increase, but really a positive) you have better scheduling with Nehalem, in particular, memory disambiguation. The weaker the scheduler, the better the performance increase from HT, in general.

    I'd say it's both. Clearly, the width of Nehalem would help a lot more than the minor tweaks. Also, you have better memory bandwidth, and in particular, a large L1 cache. I have to believe it was fairly difficult for the Pentium 4 to keep feeding two threads with such a small L1 cache, and then you have the additional L2 latency vis-a-vis the Nehalem.

    So, clearly the Nehalem is much better designed for it, and I think it's equally clear software has adjusted to the reality of more computers having multiple processors.

    On top of this, these are server applications they are running, not mainstream desktop apps, which might show a different profile with regards to Hyper-threading improvements.

    It would have to be a combination.
  • JohanAnandtech - Monday, March 30, 2009 - link

    The L1 cache and the way the Pentium 4 decoded instructions were important (maybe even the most important) factors in its mediocre SMT performance. Whenever the trace cache missed (and it was quite small, something like the equivalent of 16 KB), the Pentium 4 had only one real decoder. This means that you have to feed two threads with one decoder. In other words, whenever you got a miss in the trace cache, HT did more harm than good on the Pentium 4. That is clearly not the case in Nehalem, with its excellent decoding capabilities and larger L1.

    And I fully agree with your comments, although I don't think memory disambiguation has a huge impact on the "usefulness" of SMT. After all, there are lots of reasons why the ample execution resources are not fully used: branches, L2 cache misses, etc.
  • IntelUser2000 - Tuesday, March 31, 2009 - link

    Not only that, the Pentium 4 had the Replay feature to try to make up for its very long pipeline. When Replay went wrong, it would eat up resources that the second thread could have used.

    Core uarch has no such weaknesses.
  • SilentSin - Monday, March 30, 2009 - link

    Wow...that's just ridiculous how much improvement was made, gg Intel. Can't wait to see how the 8-core EX's do, if this launch is any indication that will change the server landscape overnight.

    However, one thing I would like to see compared, or slightly modified, is the power consumption figures. Instead of an average amount of power used at idle or load, how about a total consumption figure over the length of a fixed benchmark (i.e. how much energy was used while running SPECint)? I think that would be a good metric to illustrate very plainly how much power is saved by finishing a given load faster. I saw the chart in the power/performance improvement on the Bottom Line page, but it's not quite as digestible or as easy to compare as a straight kWh-per-benchmark figure would be. Perhaps give it the same time range as the slowest competing part completes the benchmark in. This would give you the ability to make a conclusion like "In the same amount of time the Opteron 8384 used to complete this benchmark, the 5570 used x watts less, and spent x seconds in idle". Since servers are rarely at 100% load at all times, it would be nice to see how much faster it is and how much power it is using once it does get something to chew on.

    Anyway, as usual that was an extremely well done write-up; it covered mostly everything I wanted to see.
  • 7Enigma - Wednesday, April 1, 2009 - link

    I think that is a very good method for determining total power consumption. Obviously this doesn't show CPU power consumption, but more importantly the overall consumption for a given unit of work.

    Nice thinking.
  • JohanAnandtech - Wednesday, April 1, 2009 - link

    I am trying hard, but I do not see the difference from our power numbers. This is the average power consumption of one CPU during 10 minutes of DVD Store OLTP activity. As readers have the performance numbers, you can perfectly well calculate performance per watt or per kWh. Per server would be even better (instead of per CPU), but our servers were too different.

    Or am I missing something?
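As an aside, the performance-per-watt arithmetic discussed in this thread is straightforward to sketch. The numbers below are hypothetical placeholders, not measurements from the article:

```python
# Sketch of the performance-per-watt / energy-per-run calculation
# discussed above, with hypothetical numbers (not measured values).
avg_power_w = 95        # assumed average CPU power during the run, watts
run_minutes = 10        # length of the DVD Store OLTP run
orders_per_min = 5000   # assumed OLTP throughput

energy_kwh = avg_power_w * (run_minutes / 60) / 1000
perf_per_watt = orders_per_min / avg_power_w
orders_per_kwh = orders_per_min * run_minutes / energy_kwh

print(f"Energy for the run: {energy_kwh:.4f} kWh")
print(f"Throughput per watt: {perf_per_watt:.1f} orders/min/W")
print(f"Work per kWh: {orders_per_kwh:,.0f} orders/kWh")
```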
