The SAP SD (Sales and Distribution, two-tier internet configuration) benchmark is an extremely interesting benchmark because it is a real-world client/server application, so we decided to take a look at SAP's benchmark database. The results below are two-tier benchmarks, so the database and the underlying OS can make a big difference; unless we keep those parameters the same, we cannot compare the results. All results below were obtained on Windows 2003 Enterprise Edition and MS SQL Server 2005 (both 64-bit). All Xeon 73xx and Opteron 83xx systems were equipped with 64GB of RAM; a few older systems (Opteron 22xx) had only 32GB, but the impact is not significant. All of these benchmarks use the SAP "ERP release 2005" two-tier Sales and Distribution benchmark.
The graph above contains a flood of numbers, but it definitely deserves a deeper analysis. The quad Opteron 2.3GHz manages to outperform a 2.4GHz Xeon 7440, and is less than 6% slower than an X7350, which has a 27% higher clock. This excellent performance of the AMD chip is not only the result of the ample bandwidth that the AMD cores have at their disposal. The Xeon E5345 "Clovertown" and Xeon E7340 "Tigerton" are the same CPUs, running on very similar platforms. The former has about 6GB/s for 8 cores, the latter 9GB/s for 16 cores. Even so, SAP performance scales almost as perfectly as you can expect from such a complex piece of software: we are seeing up to 80% performance gain from doubling the number of cores.
If bandwidth really were the bottleneck, the quad Opteron 8222SE (3GHz, up to 12GB/s for 8 cores) should be able to outperform the Xeon 5365 (3GHz, 6GB/s for 8 cores). Likewise, the dual Xeon 5365 (3GHz, quad-core) would not be able to scale so well (+86%!) compared to the dual Xeon 5160 (3GHz, dual-core), as both systems have the same bandwidth.
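To make the bandwidth argument concrete, here is a minimal sketch that divides the quoted platform bandwidth by the core count. The figures are the ones given in the text (the Opteron's "12GB" is read here as 12GB/s); the dictionary layout and the `bandwidth_per_core` helper are only illustrative.

```python
# Platform memory bandwidth (GB/s) and total core count, as quoted in the text.
systems = {
    "Xeon E5345 (Clovertown)": (6.0, 8),
    "Xeon E7340 (Tigerton)": (9.0, 16),
    "Xeon 5365 (dual socket)": (6.0, 8),
    "Opteron 8222SE (quad socket)": (12.0, 8),  # "up to" figure, read as GB/s
}


def bandwidth_per_core(total_gb_s, cores):
    """Return the share of platform bandwidth available to each core."""
    return total_gb_s / cores


per_core = {
    name: bandwidth_per_core(bw, cores) for name, (bw, cores) in systems.items()
}

for name, value in per_core.items():
    print(f"{name}: {value:.2f} GB/s per core")
```

By this rough measure the Opteron 8222SE has twice the per-core bandwidth of the Xeon 5365, and the Tigerton platform has less per core than Clovertown, yet neither difference shows up in the SAP results the way a bandwidth-bound workload would suggest.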
It would require a lengthy profiling of the SAP application and the underlying database to fully understand the results, but there are a few hints available. First of all, analyses by Intel and Sun, among others, show that the underlying code of SAP SD is rather branch intensive, latency sensitive, and running at low IPC. Now look at the dual (8220 results extrapolated to the 8222), quad, and octal-socket Opteron 8222. We know that the software scales very well. From dual to quad we again get an estimated 80-85% scaling, but from quad to octal only 41%. There is another clue: the problem of the octal-socket Opteron 8222 is that some nodes are three hops away from each other (from CPU0 to CPU15, for example) and that the synchronization latency between those CPUs can be quite high. So let us sum this all up in a rough profile of SAP SD.
- Parallel; excellent scaling
- Low to medium IPC, mostly due to "branchy" code
- Not really limited by memory bandwidth
- Likes large caches
- Sensitive to sync latency; the octal-socket Opteron scales rather badly
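The scaling figures quoted above can be put on a common footing with a quick calculation. The sketch below converts the gain from each doubling step into a parallel efficiency, defined here as speedup divided by the ideal speedup of 2. That definition and the `doubling_efficiency` helper are our own illustration, not an SAP metric, and the dual-to-quad entry uses the low end of the 80-85% estimate.

```python
def doubling_efficiency(gain):
    """Parallel efficiency of one doubling step: (1 + gain) / ideal speedup of 2."""
    return (1.0 + gain) / 2.0


# Performance gain from doubling the core or socket count, as quoted in the text.
steps = {
    "Xeon 5160 -> Xeon 5365 (dual- to quad-core)": 0.86,
    "Opteron 8222, dual -> quad socket (low estimate)": 0.80,
    "Opteron 8222, quad -> octal socket": 0.41,
}

efficiency = {name: doubling_efficiency(gain) for name, gain in steps.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.1%} of ideal")
```

The first two steps land at roughly 90% of ideal or better, while the quad-to-octal step drops to about 70%, which is consistent with the sync latency penalty of the three-hop topology.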
Add to the above clues that eight cores of 2.3GHz "Barcelona" are a bit faster than eight cores of the previous Opteron generation at 3GHz. The small improvements in integer IPC cannot explain this, as SAP SD is hard to speed up through IPC improvements. We think it is safe to conclude that the SAP SD benchmark is one of the examples where being a "native quad-core" pays off. The cache coherency traffic of the third generation of Opterons scales with the number of sockets and not the number of caches. The four L2 caches might be small, but their latency is good, and they make sure that the four different threads running on a CPU do not interfere too much with each other. In the case of SAP, this pays off in excellent performance.
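As a back-of-the-envelope check on the per-clock claim, the sketch below normalizes the quoted performance ratios by clock speed. The inputs come from the text: Barcelona at 2.3GHz at least matches the older Opteron at 3GHz, and it is less than 6% slower than an X7350 with a 27% higher clock. The `per_clock_ratio` helper is hypothetical, and the results are lower bounds rather than measured values.

```python
def per_clock_ratio(perf_ratio, clock_a, clock_b):
    """Throughput per clock of system A relative to system B.

    perf_ratio is A's performance divided by B's; the clocks may be in GHz
    or any other common unit, since only their ratio matters.
    """
    return perf_ratio * clock_b / clock_a


# Barcelona (2.3GHz) vs. previous-generation Opteron (3GHz), equal core counts.
# "A bit faster" means perf_ratio >= 1.0, so this is a lower bound.
barcelona_vs_previous = per_clock_ratio(1.0, 2.3, 3.0)

# Barcelona vs. X7350: less than 6% slower, with the Xeon clocked 27% higher.
# Clocks are expressed relative to Barcelona (1.0 vs. 1.27).
barcelona_vs_x7350 = per_clock_ratio(0.94, 1.0, 1.27)

print(f"vs. previous Opteron: at least {barcelona_vs_previous:.2f}x per clock")
print(f"vs. Xeon X7350:       at least {barcelona_vs_x7350:.2f}x per clock")
```

A per-clock advantage of about 30% over the previous Opteron generation is far more than the small integer IPC improvements would account for on their own.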
The 45nm Xeons are about 10-15% faster than their older siblings. If AMD can finally produce higher-clocked versions of Barcelona, the latest quad-core Opteron should be able to keep up with the fastest 45nm CPUs… until Intel's next-generation "Nehalem" comes out, at least, but we'll save that analysis for a future article.