We have emphasized it more than once: the Nehalem architecture is all about regaining the performance crown in servers and HPC. Desktop and mobile use were sometimes a bonus, sometimes an afterthought. Today that becomes almost painfully obvious. Just read Anand's thoughts about the Core i7:
 
"The Core i7's general purpose performance is solid, you're looking at a 5 - 10% increase in general application performance at the same clock speeds as Penryn"
and now look at the graph below.

 
Intel has apparently allowed HP and Fujitsu-Siemens to break the NDA on the Xeon 5570 processor for PR reasons, as both companies have published SAP numbers on a dual Xeon 5570. The Xeon 5570 is based on the same architecture as the Core i7. It is a 2.93 GHz quad-core CPU with four 256 KB L2 caches (one per core) and one huge shared 8 MB L3 cache.
 
 
SAP Sales & Distribution 2 Tier benchmark
 
The SAP numbers are absolutely astonishing: Intel's dual-socket server is able to outperform quad-socket Opteron machines. Based on the scaling of Barcelona, we speculate that a quad Shanghai at 2.7 GHz would match the performance of the dual Xeon 5570 without HT. The new Xeon 5570 outperforms the "old" 5450 by 119%!
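To put that percentage in perspective, here is a minimal sketch of the arithmetic. Only the 119% figure comes from the published results; the function names are ours and purely illustrative.

```python
def gain_to_ratio(gain_percent: float) -> float:
    """Convert an 'X% faster' claim into a throughput ratio (new / old)."""
    return 1.0 + gain_percent / 100.0


def ratio_to_gain(ratio: float) -> float:
    """Convert a throughput ratio (new / old) back into a percentage gain."""
    return (ratio - 1.0) * 100.0


# 119% is the reported gain of the Xeon 5570 over the Xeon 5450.
ratio = gain_to_ratio(119.0)
print(f"A 119% gain means {ratio:.2f}x the throughput of the older system")
```

In other words, "119% faster" means the new dual-socket machine delivers more than twice the SAP throughput of its predecessor.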
 
These numbers are so high that we checked and checked again. The database used is the same (SQL Server 2005), so unless HP and Fujitsu-Siemens have discovered some incredible tuning parameter that we have yet to hear about, the database does not explain the difference.
 
At this point we have no idea how it is possible that a 3 GHz Nehalem outperforms the latest Opteron by a margin of 80% or more. But we can give it a try. In a previous server-oriented article, we summed up a rough profile of SAP S&D:

• Very parallel resulting in excellent scaling
• Low to medium IPC, mostly due to “branchy” code
• Not really limited by memory bandwidth
• Likes large caches
• Sensitive to Sync (“cache coherency”) latency
 
One of the biggest bottlenecks for Intel has been the sync latency. It is possible that once the "sync" bottleneck was removed, the Intel architecture was able to show its real integer-crunching power thanks to out-of-order loads (memory disambiguation) and better branch prediction. Those are two areas where the Opteron architecture is still weak.
 
The slightly lower latency of Nehalem's L3 cache helps too. This kind of software also fills up the out-of-order buffers because of its long dependency chains. Those OOO buffers have been enlarged, and the dependency chains resolve more quickly thanks to a very low-latency L2 cache and a relatively fast L3.
 
Still, we are absolutely amazed that the difference is this large; we would have expected Nehalem to outperform Shanghai by a smaller margin. Although we remain a bit skeptical ("too good to be true" syndrome), we do not see how you could artificially inflate an SAP benchmark. It is certainly not as easy as with SPECjbb or SPECfp/int.
 
 
Update (a few hours later): It seems that the SAP page was wrong about HT. It reported 8 threads on 8 cores on the Fujitsu-Siemens Primergy server. The certification page says otherwise: 16 threads on 8 cores. So Hyper-Threading (SMT) probably plays an important role in this benchmark, as the SAP application has very low IPC and is very parallel. This crushing performance comes from combining a wide superscalar CPU with an excellent Simultaneous Multithreading implementation. Hats off to the Intel engineers...
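With HT enabled, part of the 119% gain over the Xeon 5450 belongs to SMT rather than to the core itself. The sketch below backs out the remaining gain under the simplifying assumption that the SMT benefit is a plain multiplier on throughput. The SMT percentages in the loop are guesses for illustration, not measurements, and the function name is ours.

```python
def gain_without_smt(total_gain_percent: float, smt_gain_percent: float) -> float:
    """Estimate the gain left over once an assumed SMT speedup is factored out.

    Assumes total_ratio = non_smt_ratio * smt_ratio, which is a simplification.
    """
    total_ratio = 1.0 + total_gain_percent / 100.0
    smt_ratio = 1.0 + smt_gain_percent / 100.0
    return (total_ratio / smt_ratio - 1.0) * 100.0


# 119% is the reported gain; the SMT gains below are hypothetical inputs.
for assumed_smt_gain in (20.0, 30.0, 40.0):
    rest = gain_without_smt(119.0, assumed_smt_gain)
    print(f"assumed SMT gain {assumed_smt_gain:.0f}% -> remaining gain {rest:.1f}%")
```

Even with a generous 40% attributed to SMT, this model leaves a gain of roughly 56% for the rest of the platform (cores, caches, integrated memory controller), which is still a very large generational jump.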
 
 
 
28 Comments

  • Pablitus - Tuesday, December 16, 2008

    I think that the Core architecture starts to shine with the addition of the on-die memory controller. Having the memory controller outside gives you flexibility in the mainboard/chipset selection, but you pay for this with latency. Now the improvements in Nehalem (wider execution units, HT, blah blah blah) plus the on-die memory controller give the CPU all the bandwidth necessary to keep it very busy crunching integers.

    It was well documented that adding the memory controller on die to any CPU boosts performance, so I think that this record was expected by Intel engineers... but not with this huge margin.
  • wpapolis - Tuesday, December 16, 2008

    Yes, indeed, "Head's off those Intel Engineers!"

    How dare they?

    Bill
  • icrf - Tuesday, December 16, 2008

    Heads off sounds more like they're on the chopping block.
  • zsdersw - Tuesday, December 16, 2008

    Don't you mean "hats off"? I don't think the Intel engineers should have their heads taken off for this stellar result :)
  • JohanAnandtech - Tuesday, December 16, 2008

    Ouch. Fixed :-).
  • Trisagion - Tuesday, December 16, 2008

    If it's too good to be true, it probably is...
  • BSMonitor - Tuesday, December 16, 2008

    Tell that to the guy who won the $207 million lotto this past weekend...

    Not surprising really. Wolfdale dual-cores were always competitive against quad-core Phenoms. Now you have removed the one thing keeping Core processors from scaling as well as K10, i.e. the FSB, especially in a highly threaded application, as the writer mentions. Shows how data-starved Penryn really was!
  • JohanAnandtech - Tuesday, December 16, 2008

    I agree. Still, this is an SAP-certified benchmark, and one that is mostly CPU limited. I don't see how you can "cheat" on this one. It is not like you can recompile the SAP code.
  • duploxxx - Tuesday, December 16, 2008

    Both systems have HT on; check the detailed scores.

    Too good to be true??? No, it's just obvious that HT is working fine on this SAP benchmark; count 70-80% off when you shut it down. Whether or not that is required in real-life SAP environments is yet to be shown.
  • JohanAnandtech - Tuesday, December 16, 2008

    Good point, I updated the blog post. Well, when we see a +100% boost over the previous generation, we have to be prudent.

    I don't think 70% is a result of HT. Doubling the cores gives you a 70% increase, and there is no way that HT can be as good as doubling the cores. I expect 40% to be more realistic. Still, it is incredible how a dual machine is capable of defeating a quad server which is only a few months older.
