Single Core Integer Performance With SPEC CPU2006

In past server reviews, I used LZMA (7-zip) compression and decompression to evaluate single-threaded performance. I was well aware that, while it is a decent integer test, it gives a very narrow view of a CPU's integer performance. After noticing that my colleagues used SPEC CPU2006, and after discussing the matter with several people, I realized that running SPEC CPU2006 is a much better way to evaluate single core performance. Even though SPEC CPU2006 is more HPC and workstation oriented, it contains a good variety of integer workloads.

I also wanted to keep the settings as "normal" as possible. So I used:

  • 64 bit gcc: the most used compiler on Linux, a good all-round compiler that does not try to "break" benchmarks (libquantum...)
  • gcc version 4.8.4: 4.8.x has been around for a long time and is a very mature version
  • -O2 -fno-strict-aliasing: standard compiler settings that many developers use
  • Run 2 copies and bind them to the first core

The ultimate objective is to measure performance in non-"aggressively optimized" applications where for some reason - as is frequently the case - a "multi thread unfriendly" task keeps us waiting. As we want to be able to compare these numbers to other architectures such as the IBM POWER8, we decided to use all threads available on a single core. In the case of Intel, this means one physical core with two simultaneous threads (Hyper-Threading) running on top of it.
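To make "two copies bound to the first core" concrete, here is a minimal sketch in Python of how the two hardware threads of core 0 could be located and used for pinning on Linux. It is an illustration under assumptions, not the exact procedure used for these runs: the helper names (parse_cpu_list, first_core_threads, pin_to) are hypothetical, and it assumes the standard Linux sysfs topology file and os.sched_setaffinity are available.

```python
import os

# Standard Linux sysfs file listing the logical CPUs that share core 0.
# Assumption: the test machine exposes this file (typical on Linux).
SIBLINGS_PATH = "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,36" or "0-1" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def first_core_threads(path=SIBLINGS_PATH):
    """Return the logical CPUs (hardware threads) of the first physical core."""
    with open(path) as f:
        return parse_cpu_list(f.read())


def pin_to(cpu):
    """Bind the calling process to a single logical CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})
```

Each benchmark copy would call pin_to() with one of the ids returned by first_core_threads() before starting its work, so that both copies share one physical core. On a Xeon with Hyper-Threading the list holds two ids.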

We included the Opteron 6376 for nostalgic reasons. We are showing the results of 2 threads running on top of one module with 2 "integer cores".

Subtest         Xeon E5-2690  Opteron 6376  Xeon E5-2697 v2  Xeon E5-2667 v3  Xeon E5-2699 v3  Xeon E5-2699 v4
400.perlbench   41.1          29.3          37.6             42.6             39.9             36.6
401.bzip2       33.4          24.1          30.1             33.1             29.9             25.3
403.gcc         40.2          26.7          38.9             42.4             36.4             33.3
429.mcf         45.1          31.7          46.8             46.4             41.6             43.9
445.gobmk       36.4          25.5          33.2             34.9             31.7             27.7
456.hmmer       30.4          26.1          27.6             31               27.1             28.4
458.sjeng       35.2          24.7          32.8             35.2             30.5             28.3
462.libquantum  74.9          39.9          79.3             84.4             62.2             67.3
464.h264ref     51.7          34.2          48.1             52.1             45.2             40.7
471.omnetpp     24.5          25.3          26.8             29.4             26.6             29.9
473.astar       28.2          20.7          26.1             27.9             24               23.6
483.xalancbmk   41.5          28.2          41.4             48.2             42.4             41.8

Unless you are used to seeing these numbers, this does not tell you too much. As Sandy Bridge EP (Xeon E5 v1) is about 4 years old, the servers based upon this CPU are going to get replaced by newer ones. So the Sandy Bridge based Xeon E5-2690 is our reference, and its performance is considered to be 100%.

Subtest         Application type       Xeon E5-2690  Opteron 6376  Xeon E5-2697 v2  Xeon E5-2667 v3  Xeon E5-2699 v3  Xeon E5-2699 v4
400.perlbench   Spam filter            100%          71%           91%              104%             97%              89%
401.bzip2       Compression            100%          72%           90%              99%              90%              76%
403.gcc         Compiling              100%          66%           97%              105%             91%              83%
429.mcf         Vehicle scheduling     100%          70%           104%             103%             92%              97%
445.gobmk       Game AI                100%          70%           91%              96%              87%              76%
456.hmmer       Protein seq. analyses  100%          86%           91%              102%             89%              93%
458.sjeng       Chess                  100%          70%           93%              100%             87%              80%
462.libquantum  Quantum sim            100%          53%           106%             113%             83%              90%
464.h264ref     Video encoding         100%          66%           93%              101%             87%              79%
471.omnetpp     Network sim            100%          103%          109%             120%             110%             122%
473.astar       Pathfinding            100%          73%           93%              99%              85%              84%
483.xalancbmk   XML processing         100%          68%           100%             116%             102%             101%
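The normalization behind these percentages is a simple division by the reference score. A minimal sketch in Python, using three rows and three CPUs copied from the raw-score table; the dictionary layout and the function name relative_to are illustrative only and not part of any SPEC tooling:

```python
# Raw scores copied from the first table (a subset, for brevity).
SCORES = {
    "400.perlbench": {"Xeon E5-2690": 41.1, "Opteron 6376": 29.3, "Xeon E5-2699 v4": 36.6},
    "462.libquantum": {"Xeon E5-2690": 74.9, "Opteron 6376": 39.9, "Xeon E5-2699 v4": 67.3},
    "471.omnetpp": {"Xeon E5-2690": 24.5, "Opteron 6376": 25.3, "Xeon E5-2699 v4": 29.9},
}


def relative_to(scores, reference):
    """Express every score as a whole-number percentage of the reference CPU."""
    return {
        subtest: {
            cpu: round(100 * value / row[reference])
            for cpu, value in row.items()
        }
        for subtest, row in scores.items()
    }


if __name__ == "__main__":
    for subtest, row in relative_to(SCORES, "Xeon E5-2690").items():
        print(subtest, row)
```

With the Xeon E5-2690 as reference, the reference column is 100% by construction and every other entry reads as "percent of Sandy Bridge performance".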

Many smart people have spent weeks - if not months - on SPEC CPU2006 analysis, so we will not pretend we can offer you a complete picture in a few days. If you want a detailed analysis of compilers and CPU2006, I recommend the very detailed article by SPEC CPU2006 meister Andreas Stiller in the February issue of c't (a German computer magazine).

We need much more profiling data than we could gather in the past weeks. But for what we can do, we'll start with the most important parameter: clockspeed.

One of the most important things to realize is that - especially with badly threaded workloads - these massive multi-core CPUs almost never work at their advertised clockspeed.

  • The Xeon E5-2690 can run at 3.3 GHz with all cores busy, and is capable of boosting up to 3.8 GHz
  • The Xeon E5-2697 v2 can run at 3 GHz with all cores busy, and is capable of boosting up to 3.5 GHz
  • The Xeon E5-2699 v3 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
  • The Xeon E5-2667 v3 3.2 GHz is a specialized high frequency model. It can run at 3.4 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
  • The Xeon E5-2699 v4 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz

So that already explains a lot. In contrast to many benchmark applications, SPEC CPU2006 runs for a long time (5 to 15 minutes per test), and our first impression is that the high core count (HCC) parts are not able to keep all of their cores at their maximum turbo boost. Otherwise there is no reason why a Xeon E5-2699 v3 or v4 would perform worse than a Xeon E5-2667 v3: both can run at 3.6 GHz when one core is active.
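One way to check an impression like this is to sample the clock the core actually reports while a subtest is running. The sketch below is a hedged illustration in Python, not the method used for this article: it assumes a Linux system whose cpufreq driver exposes scaling_cur_freq (in kHz) under sysfs, and the function names are hypothetical.

```python
import time

# Assumption: Linux cpufreq sysfs file for logical CPU 0, value in kHz.
CUR_FREQ_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"


def read_khz(path=CUR_FREQ_PATH):
    """Read the current frequency in kHz from a cpufreq sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def sample_ghz(path=CUR_FREQ_PATH, samples=5, interval=1.0):
    """Take several readings, `interval` seconds apart, converted to GHz."""
    readings = []
    for _ in range(samples):
        readings.append(read_khz(path) / 1e6)
        time.sleep(interval)
    return readings


def summarize(readings):
    """Return the minimum, maximum and mean of a list of GHz readings."""
    return {
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
    }
```

Run alongside a long subtest, a mean that sits well below the 3.6 GHz single-core turbo would support the explanation above, while a mean close to 3.6 GHz would point to another cause.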

The low IPC, memory intensive network simulator omnetpp seems to be the only test that runs significantly better on the newer cores (Haswell, Broadwell) compared to Sandy Bridge. That also seems to be the only benchmark where the high core count chips (E5-2699 v4, E5-2699 v3) continue to outperform Sandy Bridge. We could pinpoint the reason by testing with different memory speeds and channels. The E5-2699 v4 can offer the highest performance thanks to the larger L3 cache (55 MB) and the higher DIMM speed (DDR4-2400) compared to Sandy Bridge (20 MB, DDR3-1600). Otherwise, when we keep the clockspeed more or less constant by comparing the Xeon E5-2667 v3 with the Xeon E5-2690, we get a 1-5% speed difference, and only the memory intensive subtests (omnetpp, libquantum) and xalancbmk (low IPC, branch intensive) show higher improvements.

Once we test both top SKUs with "-Ofast" (a more aggressive compiler setting), the results change quite a bit:

Subtest         Application type       Xeon E5-2699 v4 vs Xeon E5-2690 (-Ofast)  Xeon E5-2699 v4 vs Xeon E5-2690 (-O2)
400.perlbench   Spam filter            111%                                      89%
401.bzip2       Compression            94%                                       76%
403.gcc         Compiling              95%                                       83%
429.mcf         Vehicle scheduling     114%                                      97%
445.gobmk       Game AI                90%                                       76%
456.hmmer       Protein seq. analyses  106%                                      93%
458.sjeng       Chess                  93%                                       80%
462.libquantum  Quantum sim            101%                                      90%
464.h264ref     Video encoding         89%                                       79%
471.omnetpp     Network sim            132%                                      122%
473.astar       Pathfinding            98%                                       84%
483.xalancbmk   XML processing         105%                                      101%

Switching from -O2 to -Ofast improves Broadwell-EP's absolute performance by over 19%. Meanwhile, the relative performance advantage over the Xeon E5-2690 averages 3%. As a result, the clockspeed disadvantage of the latest Xeon is negated by the increase in IPC. Clearly the latest generation of Xeons benefits more from aggressive optimizations than the previous ones. That is unsurprising of course, but it is interesting that the newest Xeons need more optimization to "hold the line" in single core performance.
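The average relative performance can be sanity-checked from the rounded percentages in the table above. The Python sketch below computes both the arithmetic and the geometric mean; which kind of average the text uses is not stated, so treating either as "the" figure is an assumption (SPEC itself aggregates ratios with a geometric mean). Because the inputs are rounded table values, the result can differ by a point or so from an average taken over the raw scores.

```python
import math

# Xeon E5-2699 v4 relative to Xeon E5-2690 (= 100), copied from the table above.
OFAST = [111, 94, 95, 114, 90, 106, 93, 101, 89, 132, 98, 105]
O2 = [89, 76, 83, 97, 76, 93, 80, 90, 79, 122, 84, 101]


def arithmetic_mean(values):
    """Plain average of the percentages."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean, the aggregation SPEC uses for ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    for label, column in (("-Ofast", OFAST), ("-O2", O2)):
        print(label,
              "arithmetic: %.1f%%" % arithmetic_mean(column),
              "geometric: %.1f%%" % geometric_mean(column))
```

On these rounded values the -Ofast column averages slightly above 100% with either mean, while the -O2 column averages around 89%, which matches the direction of the conclusion: only with the more aggressive setting does the newest Xeon pull level with Sandy Bridge.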

So far we can conclude that if you were to upgrade from a Xeon E5-2xxx v1 to a similar v4 model, your single threaded integer code will not run faster without recompiling and optimizing. The process improvements have been used mostly to add more cores in the same power envelope, while at the same time Intel also traded in a few speed bins to add even more cores to the top models. As a result, single core integer performance basically holds the line, nothing more. The only exceptions are memory intensive applications, which benefit from the ever-growing L3 cache and the faster DRAM technology.

112 Comments

  • iwod - Thursday, March 31, 2016

    Maximum memory still 768GB?
    What happened to the 5.1 GHz Xeon E5?
  • Ian Cutress - Thursday, March 31, 2016

    I never saw anyone with a confirmed source for that, making me think it's a faked rumor. I'll happily be proved wrong, but nothing like a 5.1 GHz part was announced today.
  • Brutalizer - Saturday, April 2, 2016

    It would have been interesting to bench to the best cpu today, the SPARC M7. For instance:

    -SAP: two M7 cpus score 169,000 SAPS vs 109,000 SAPS for two of these Broadwell-EP cpus

    -Hadoop, sort 10TB data: one SPARC M7 server with four cpus, finishes the sort in 4,260 seconds. Whereas a cluster of 32 PCs equipped with dual E5-2680v2 finishes in 1,054 seconds, i.e. 64 Intel Xeon cpus vs four SPARC M7 cpus.

    -TPC-C: one SPARC M7 server with one cpu gets 5,000,000 tpm, whereas one server with two E5-2699v3 cpus gets 3,600,000 tpm

    -Memory bandwidth, Stream triad: one SPARC M7 reaches 145 GB/sec, whereas two of these Broadwell-EP cpus reaches 119GB/sec

    -etc. All these benchmarks can be found here, and another 25ish benchmarks where SPARC M7 is 2-3x faster than E5-2699v3 or POWER8 (all the way up to 11x faster):
    https://blogs.oracle.com/BestPerf/entry/20151025_s...
  • Brutalizer - Saturday, April 2, 2016

    BTW, all these SPARC M7 benchmarks are almost unaffected if encryption is turned on, maybe 2-5% slower. Whereas if you turn on encryption for x86 and POWER8, expect performance to halve or even less. Just check the benchmarks on the link above, and you will see that SPARC M7 benchmarks are almost unaffected encrypted or not.
  • JohanAnandtech - Saturday, April 2, 2016

    "if you turn on encryption for x86 and POWER8, expect performance to halve or even less". And this is based upon what measurement? From my measurements, both x86 and POWER8 lose like 1-3% when AES encryption is on. RSA might be a bit worse (2-10%), but asymmetric encryption is mostly used to open connections.
  • Brutalizer - Wednesday, April 6, 2016

    If we talk about how encryption affects performance, let's look at this benchmark below. Never mind that the x86 is slower than the SPARC M7, let us instead look at how encryption affects the cpus. What performance hit does encryption have?
    https://blogs.oracle.com/BestPerf/entry/20160315_t...

    -For x86 we see that two E5-2699v3 cpus utilization goes from 40% without crypto, up to 80% with crypto. This leaves the x86 server with very little headroom to do anything else than executing one query. At the same time, the x86 server took 25-30% longer time to process the query. This shows that encryption has a huge impact on x86. You can not do useful work with two x86 cpus, except executing a query. If you need to do additional work, get four x86 xeons instead.

    -If we look at how SPARC M7 gets affected by encryption, we see that cpu utilization went up from 30% up to 40%. So you have lot of headroom to do additional work while processing the query. At the same time, the SPARC cpu took 2% longer time to process the query.

    It is not really interesting that this single SPARC M7 cpu is 30% faster than two E5-2699v3 in absolute numbers. No, we are looking at how much worse the performance gets affected when we turn on encryption. In case of x86, we see that the cpus gets twice the load, so they are almost fully loaded, only by turning on encryption. At the same time taking longer time to process the work. Ergo, you can not do any additional work with x86 with crypto. With SPARC, it ends up with 40% cpu utilization so you can do additional work on SPARC, and process time does not increase at all (2%). This proves that x86 encryption halves performance or worse.

    For your own AES encryption benchmark, you should also see how much cpu utilization goes up. If it gets fully loaded, you can not do any useful work except handling encryption. So you need an additional cpu to do the actual work.
  • JohanAnandtech - Saturday, April 2, 2016

    Two M7 machines start at 90k, while a dual Xeon is around 20k. And most of those Oracle benchmarks are very intellectually dishonest: complicated configurations to get the best out of the M7 machines, midrange older x86 configurations (10-core E5 v2, really???)
  • Brutalizer - Wednesday, April 6, 2016

    The "dishonest" benchmarks from Oracle are often (always?) using what is published. If, for instance, IBM only has one published benchmark, then Oracle has no other choice than to use it, right? Of course when there are faster IBM benchmarks out there, Oracle uses those. Same with x86. In all these 25ish cases we see that SPARC M7 is 2-3x faster, all the way up to 11x faster. The benchmarks vary very much: raw compute power, databases, deep learning, SAP, etc.
  • Phil_Oracle - Thursday, May 12, 2016

    I disagree Johan! You don't appear to know much about the new SPARC M7 systems and suggest you do a full evaluation before making such remarks. A SPARC T7-1 with 32-cores has a list price of about $39K outperforms a 2-socket 36-core E5-2699v3 anywhere from 38% (OLTP HammerDB) to over 8x faster (OLTP w/ in-memory analytics). A similarly configured *enterprise* class 2-socket 36-core E5-2699v3 from HPE or Cisco lists for $25K+, so in terms of price/performance, the SPARC T7-1 beats the 2-socket E5-2699v3. And if you take into account SW that’s licensed per core, the SPARC M7 is 60% to 2.6x faster/core, dramatically lowering licensing costs. With the new E5-2699v4, providing ~20% more cores at roughly the same price, gets closer, but with performance/core not changing much with E5 v4, SPARC M7 still has a huge lead. And the difference is while the E5 v3/v4 chips don't scale beyond 2-socket, you can get an SPARC M7 system up to 16-sockets with the almost identical price/performance of the 1-socket system.
  • adamod - Friday, June 3, 2016

    BUT CAN IT PLAY CRYSIS?????
