Single Core Integer Performance With SPEC CPU2006

In past server reviews, I used LZMA (7-zip) compression and decompression to evaluate single threaded performance. But I was well aware that while it was a decent integer test, it also gave a very myopic view in the process. After noticing that my colleagues used SPEC CPU2006, and after discussing the matter with several people, I realized that running SPEC CPU2006 was a much better way to evaluate single core performance. Even though SPEC CPU2006 is more HPC and workstation oriented, it contains a good variety of integer workloads.

I also wanted to keep the settings as "normal" as possible. So I used:

  • 64 bit gcc : most used compiler on linux, good all round compiler that does not try to "break" benchmarks (libquantum...)
  • gcc version 4.8.4: 4.8.x has been around for a long time, very mature version
  • -O2 -fno-strict-aliasing: standard compiler settings that many developers use
  • Run 2 copies and bind them to the first core

The ultimate objective is to measure performance in non-"aggressively optimized" applications where for some reason - as is frequently the case - a "multi thread unfriendly" task keeps us waiting. As we want to be able to compare these numbers to other architectures such as the IBM POWER 8, we decided to use all threads available on a single core. In case of Intel, this means one physical and two simultaneous threads running on top of it.

We included the Opteron 6376 for nostalgic reasons. We are showing the results of 2 threads running on top of one module with 2 "integer cores".

Subtest Xeon E5-2690 Opteron 6376 Xeon E5-2697v2 Xeon E5-2667 v3 Xeon E5-2699 v3 Xeon E5-2699 v4
400.perlbench 41.1 29.3 37.6 42.6 39.9 36.6
401.bzip2 33.4 24.1 30.1 33.1 29.9 25.3
403.gcc 40.2 26.7 38.9 42.4 36.4 33.3
429.mcf 45.1 31.7 46.8 46.4 41.6 43.9
445.gobmk 36.4 25.5 33.2 34.9 31.7 27.7
456.hmmer 30.4 26.1 27.6 31 27.1 28.4
458.sjeng 35.2 24.7 32.8 35.2 30.5 28.3
462.libquantum 74.9 39.9 79.3 84.4 62.2 67.3
464.h264ref 51.7 34.2 48.1 52.1 45.2 40.7
471.omnetpp 24.5 25.3 26.8 29.4 26.6 29.9
473.astar 28.2 20.7 26.1 27.9 24 23.6
483.xalancbmk 41.5 28.2 41.4 48.2 42.4 41.8

Unless you are used to seeing these numbers, this does not tell you too much. As Sandy Bridge EP (Xeon E5 v1) is about 4 years old, the servers based upon this CPU are going to get replaced by newer ones. So Sandy Bridge is our reference, and Sandy Bridge performance is considered to be 100%.

Subtest Application type Xeon E5-2690 Opteron 6376 Xeon E5-2697v2 Xeon E5-2667 v3 Xeon E5-2699 v3 Xeon E5-2699 v4
400.perlbench Spam filter 100% 71% 91% 104% 97% 89%
401.bzip2 Compression 100% 72% 90% 99% 90% 76%
403.gcc Compiling 100% 66% 97% 105% 91% 83%
429.mcf Vehicle scheduling 100% 70% 104% 103% 92% 97%
445.gobmk Game AI 100% 70% 91% 96% 87% 76%
456.hmmer Protein seq. analyses 100% 86% 91% 102% 89% 93%
458.sjeng Chess 100% 70% 93% 100% 87% 80%
462.libquantum Quantum sim 100% 53% 106% 113% 83% 90%
464.h264ref Video encoding 100% 66% 93% 101% 87% 79%
471.omnetpp Network sim 100% 103% 109% 120% 110% 122%
473.astar Pathfinding 100% 73% 93% 99% 85% 84%
483.xalancbmk XML processing 100% 68% 100% 116% 102% 101%

Many smart people have spent weeks - if not months - on SPEC CPU2006 analysis, so we will not pretend we can offer you a complete picture in a few days. If you want a detailed analysis of compilers and CPU 2006, I recommend the very detailed article of SPEC CPU 2006 meister Andreas Stiller in the February issue of C'T (German computer magazine). 

We need much more profiling data than we could gather in the past weeks. But for what we can do, we'll start with the most important parameter: clockspeed.

One of the most important things to realize is that - especially with badly threaded workloads - these massive multi-core CPUs almost never work at their advertised clockspeed.

  • The Xeon E5-2690 can run at 3.3 GHz with all cores busy, and is capable of boosting up to 3.8 GHz
  • The Xeon E5-2697 v2 can run at 3 GHz with all cores busy, and is capable of boosting up to 3.5 GHz
  • The Xeon E5-2699 v3 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
  • The Xeon E5-2667 v3 3.2 GHz is a specialized high frequency model. It can run at 3.4 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
  • The Xeon E5-2699 v4 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz

So that already explains a lot. In contrast to the many benchmark applications, SPEC CPU2006 runs for a long time (5 to 15 minutes per test), and our first impression is that the HCC parts are not able to keep all of their cores at their maximum turbo boost. Otherwise there is no reason why a Xeon E5-2699 v3 or v4 would perform worse than a Xeon E5-2667 v3: both can run at 3.6 GHz when one core is active.

The low IPC, memory intensive network simulator omnetppp seems to be the only test that runs significantly better on the newer cores (Haswell, Broadwell) compared to Sandy Bridge. That also seems to be the only benchmark where the high core count chips (E5-2699 v4, E5-2699 v3) continue to outperform Sandy Bridge. We could pinpoint the reason by testing with different memory speeds and channels. The E5-2699 v4 can offer the highest performance thanks to the larger L3-cache (55 MB) and the higher DIMM speed (DDR4-2400) compared to Sandy Bridge (20 MB, DDR3-1600). Otherwise when we keep the clockspeed more or less constant, by looking at the Xeon E5-2667v3 and the Xeon E5-2690, we get a 1-5% speed difference, and only the memory intensive subtests (omnetpp, Libquantum) and xalancbmk (low IPC, branch intensive) show higher improvements.

Once we test both top SKUs with "-Ofast" (a more aggressive compiler setting), the results change quite a bit:.

Subtest Application type Xeon E5-2699 v4 vs Xeon E5-2690 (-Ofast) Xeon E5-2699 v4 vs Xeon E5-2690 (-O2)
400.perlbench Spam filter 111% 89%
401.bzip2 Compression 94% 76%
403.gcc Compiling 95% 83%
429.mcf Vehicle scheduling 114% 97%
445.gobmk Game AI 90% 76%
456.hmmer Protein seq. analyses 106% 93%
458.sjeng Chess 93% 80%
462.libquantum Quantum sim 101% 90%
464.h264ref Video encoding 89% 79%
471.omnetpp Network sim 132% 122%
473.astar Pathfinding 98% 84%
483.xalancbmk XML processing 105% 101%

Switching from -O2 to -Ofast improves Broadwell-EP's absolute performance by over 19%. Meanwhile the relative performance advantage versus the Xeon E5-2690 averages 3%. As a result, the clockspeed disadvantage of the latest Xeon is negated by the increase in IPC. Clearly the latest generation of Xeons benefit more from aggressive optimizations than the previous ones. That is unsurprising of course, but it is interesting that the newest Xeons need more optimization to "hold the line" in single core performance.

So far we can conclude that if you were to upgrade from a Xeon E5-2xxx v1 to a similar v4 model, your single threaded integer code will not run faster without recompiling and optimizing. The process improvements have been used mostly to add more cores in the same power envelope, while at same time Intel also traded a few speed bins in to add even more cores in the top models. As a result single core integer performance basically holds the line, nothing more. The only exception are memory intensive applications who benefit from every growing L3-cache and the faster DRAM technology.

Benchmark Configuration and Methodology Memory Subsystem
Comments Locked

112 Comments

View All Comments

  • isrv - Sunday, April 3, 2016 - link

    i will belive that only after one by one comparison E5-1630v3 vs any of E5v4 composing wordpress front page for example.
    and so far, that's only a words about better caching etc...
  • simplyfabio - Monday, April 4, 2016 - link

    Could I ask one thing here? For a Workstation 3D, both for rendering and graphic/cad, (like illustrator, photoshop, autocad, 3dsmax), could be better have more core like the E5 2690 (considering all the turbo clock speed for each core active) ore better frequency, like the 1680? Thanks a lot to everyone, I can't find a nice review on this side of this CPUs...
  • grantdesrosiers - Monday, April 4, 2016 - link

    Not sure if anyone has pointed it out yet, but I think there is an error on the "Multi-Threaded Integer Performance" page, first graph. The 2695v4 says 22 cores, I believe it should be 18.
  • SanX - Monday, April 4, 2016 - link

    Poor Moore's law for workstations... 10-20% gain per 2-years generation.

    Think about it: there is no reason to upgrade for the next *** 5-10 generations *** or the next 10-20 years (!!!) when the processors will be only e-fold (2.71x) faster.
  • dragonsqrrl - Monday, April 4, 2016 - link

    The problem is your first assumption is already false.
  • Khenglish - Monday, April 4, 2016 - link

    I can't understand why the 4C and under turbo speeds are so slow on the v4 2699. A Broadwell with 55MB of cache being outperformed by a stock clocked Sandy Bridge is ridiculous. Why would this CPU not clock up to at least 4.2GHz with a 4 core workload, and say 4.4GHz for a 1 core workload? Hell it costs over $4000 and a massive TDP. You'd think Intel could take a minute to make the low core count speeds not terribly low.

    My workstation in my lab has a 1650 v3. My workloads peak between 4-8 cores. There is not a single CPU in the v4 lineup that would be an upgrade over the 1650 v3 despite the major power savings of 14nm and the cache size increase due to Intel's inability to set reasonable 8C and under frequencies.
  • Romulous - Monday, April 4, 2016 - link

    People who are serious about recompiling the same software often would probably use ccache and maybe even distcc. So your Linux kernel compile test is really only there for to show potential cpu performance.
  • LHL2500 - Tuesday, April 5, 2016 - link

    "It finds a home in the same LGA 2011-3 socket."
    Not according to Intel's website.
    http://ark.intel.com/compare/91754,81908
    In this comparison between a v3 and a v4 version of a E5-2680, the socket support for the two chips are different. The older version using the the FCLGA2011-3 and the newer version using FCLGA2011.
    So who is right? Anandtech or Intel?
    And it not just this chip. It's all the v4s.
    While I hope it's a typo on Intel's behalf, for now it doesn't look like the v4s are direct upgrades to the v3s. You will apparently need new motherboards.
  • xrror - Tuesday, April 5, 2016 - link

    That... is a bit disconcerting. I also like how "VID Voltage Range" for the v4 parts is simply listed as "0" ...
  • SeanJ76 - Tuesday, April 5, 2016 - link

    My School had the 3rd Generation Xeon's in their Workstations, they were slow as fuck@3.3ghz!! The consumer i7 4790K/6700K would run laps around these Xeon crap cpus!

Log in

Don't have an account? Sign up now