The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads
by Johan De Gelas on March 31, 2016 12:30 PM EST
Single Core Integer Performance With SPEC CPU2006
In past server reviews, I used LZMA (7-zip) compression and decompression to evaluate single-threaded performance. While it was a decent integer test, I was well aware that it gave a very narrow view. After noticing that my colleagues used SPEC CPU2006, and after discussing the matter with several people, I realized that running SPEC CPU2006 was a much better way to evaluate single core performance. Even though SPEC CPU2006 is more HPC and workstation oriented, it contains a good variety of integer workloads.
I also wanted to keep the settings as "normal" as possible. So I used:
- 64 bit gcc: the most used compiler on Linux, and a good all-round compiler that does not try to "break" benchmarks (libquantum...)
- gcc version 4.8.4: 4.8.x has been around for a long time, very mature version
- -O2 -fno-strict-aliasing: standard compiler settings that many developers use
- Run 2 copies and bind them to the first core
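Binding both copies to the first core means restricting them to the logical CPUs (hardware threads) that share that core. The sketch below shows one way this could be done on Linux; it is an illustration only, not the script used for this review. It assumes the sysfs file named in `SIBLINGS_PATH` is present and that `os.sched_setaffinity` is available (Linux only). `parse_cpu_list` and `bind_to_first_core` are hypothetical helper names.

```python
import os

# Assumption: standard Linux sysfs location listing the hardware threads of core 0.
SIBLINGS_PATH = "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,36" or "0-1" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def bind_to_first_core(pid=0):
    """Pin a process (0 = the calling process) to the hardware threads of core 0."""
    with open(SIBLINGS_PATH) as f:
        siblings = parse_cpu_list(f.read())
    os.sched_setaffinity(pid, siblings)  # Linux-only call
    return siblings
```

Each benchmark copy would call `bind_to_first_core()` before starting its work, or be launched with an equivalent affinity mask, so that the two copies share one physical core.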
The ultimate objective is to measure performance in non-"aggressively optimized" applications where for some reason - as is frequently the case - a "multi thread unfriendly" task keeps us waiting. As we want to be able to compare these numbers to other architectures such as the IBM POWER8, we decided to use all threads available on a single core. In Intel's case, this means one physical core with two simultaneous threads running on top of it.
We included the Opteron 6376 for nostalgic reasons. We are showing the results of 2 threads running on top of one module with 2 "integer cores".
Subtest | Xeon E5-2690 | Opteron 6376 | Xeon E5-2697 v2 | Xeon E5-2667 v3 | Xeon E5-2699 v3 | Xeon E5-2699 v4 |
400.perlbench | 41.1 | 29.3 | 37.6 | 42.6 | 39.9 | 36.6 |
401.bzip2 | 33.4 | 24.1 | 30.1 | 33.1 | 29.9 | 25.3 |
403.gcc | 40.2 | 26.7 | 38.9 | 42.4 | 36.4 | 33.3 |
429.mcf | 45.1 | 31.7 | 46.8 | 46.4 | 41.6 | 43.9 |
445.gobmk | 36.4 | 25.5 | 33.2 | 34.9 | 31.7 | 27.7 |
456.hmmer | 30.4 | 26.1 | 27.6 | 31 | 27.1 | 28.4 |
458.sjeng | 35.2 | 24.7 | 32.8 | 35.2 | 30.5 | 28.3 |
462.libquantum | 74.9 | 39.9 | 79.3 | 84.4 | 62.2 | 67.3 |
464.h264ref | 51.7 | 34.2 | 48.1 | 52.1 | 45.2 | 40.7 |
471.omnetpp | 24.5 | 25.3 | 26.8 | 29.4 | 26.6 | 29.9 |
473.astar | 28.2 | 20.7 | 26.1 | 27.9 | 24 | 23.6 |
483.xalancbmk | 41.5 | 28.2 | 41.4 | 48.2 | 42.4 | 41.8 |
Unless you are used to seeing these numbers, they do not tell you much. As Sandy Bridge EP (Xeon E5 v1) is about 4 years old, the servers based upon this CPU are going to get replaced by newer ones. So Sandy Bridge is our reference, and Sandy Bridge performance is considered to be 100%.
Subtest | Application type | Xeon E5-2690 | Opteron 6376 | Xeon E5-2697 v2 | Xeon E5-2667 v3 | Xeon E5-2699 v3 | Xeon E5-2699 v4 |
400.perlbench | Spam filter | 100% | 71% | 91% | 104% | 97% | 89% |
401.bzip2 | Compression | 100% | 72% | 90% | 99% | 90% | 76% |
403.gcc | Compiling | 100% | 66% | 97% | 105% | 91% | 83% |
429.mcf | Vehicle scheduling | 100% | 70% | 104% | 103% | 92% | 97% |
445.gobmk | Game AI | 100% | 70% | 91% | 96% | 87% | 76% |
456.hmmer | Protein seq. analysis | 100% | 86% | 91% | 102% | 89% | 93% |
458.sjeng | Chess | 100% | 70% | 93% | 100% | 87% | 80% |
462.libquantum | Quantum sim | 100% | 53% | 106% | 113% | 83% | 90% |
464.h264ref | Video encoding | 100% | 66% | 93% | 101% | 87% | 79% |
471.omnetpp | Network sim | 100% | 103% | 109% | 120% | 110% | 122% |
473.astar | Pathfinding | 100% | 73% | 93% | 99% | 85% | 84% |
483.xalancbmk | XML processing | 100% | 68% | 100% | 116% | 102% | 101% |
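The percentages in the table above follow from dividing each raw score by the Xeon E5-2690 score for the same subtest. A minimal sketch of that normalization, using three rows copied from the raw results table; the function name `normalize` and the dictionary layout are illustrative choices, not part of the original test setup:

```python
# Raw scores copied from the first results table (three subtests, three CPUs).
RAW_SCORES = {
    "400.perlbench": {"Xeon E5-2690": 41.1, "Opteron 6376": 29.3, "Xeon E5-2699 v4": 36.6},
    "462.libquantum": {"Xeon E5-2690": 74.9, "Opteron 6376": 39.9, "Xeon E5-2699 v4": 67.3},
    "471.omnetpp": {"Xeon E5-2690": 24.5, "Opteron 6376": 25.3, "Xeon E5-2699 v4": 29.9},
}


def normalize(scores, baseline="Xeon E5-2690"):
    """Return each score as a whole-number percentage of the baseline CPU's score."""
    result = {}
    for subtest, by_cpu in scores.items():
        base = by_cpu[baseline]
        result[subtest] = {cpu: round(100 * s / base) for cpu, s in by_cpu.items()}
    return result


if __name__ == "__main__":
    for subtest, by_cpu in normalize(RAW_SCORES).items():
        print(subtest, by_cpu)
```

For example, 400.perlbench on the Xeon E5-2699 v4 gives 36.6 / 41.1, which rounds to 89%.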
Many smart people have spent weeks - if not months - on SPEC CPU2006 analysis, so we will not pretend we can offer you a complete picture in a few days. If you want a detailed analysis of compilers and CPU2006, I recommend the very detailed article by SPEC CPU2006 meister Andreas Stiller in the February issue of c't (German computer magazine).
We need much more profiling data than we could gather in the past weeks. But for what we can do, we'll start with the most important parameter: clockspeed.
One of the most important things to realize is that - especially with badly threaded workloads - these massive multi-core CPUs almost never work at their advertised clockspeed.
- The Xeon E5-2690 can run at 3.3 GHz with all cores busy, and is capable of boosting up to 3.8 GHz
- The Xeon E5-2697 v2 can run at 3 GHz with all cores busy, and is capable of boosting up to 3.5 GHz
- The Xeon E5-2699 v3 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
- The Xeon E5-2667 v3 3.2 GHz is a specialized high frequency model. It can run at 3.4 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
- The Xeon E5-2699 v4 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
So that already explains a lot. In contrast to many benchmark applications, SPEC CPU2006 runs for a long time (5 to 15 minutes per test), and our first impression is that the HCC (high core count) parts are not able to keep all of their cores at their maximum turbo boost. Otherwise there is no reason why a Xeon E5-2699 v3 or v4 would perform worse than a Xeon E5-2667 v3: both can run at 3.6 GHz when one core is active.
The low IPC, memory intensive network simulator omnetpp seems to be the only test that runs significantly better on the newer cores (Haswell, Broadwell) compared to Sandy Bridge. That also seems to be the only benchmark where the high core count chips (E5-2699 v4, E5-2699 v3) continue to outperform Sandy Bridge. We could pinpoint the reason by testing with different memory speeds and channels. The E5-2699 v4 can offer the highest performance thanks to the larger L3 cache (55 MB) and the higher DIMM speed (DDR4-2400) compared to Sandy Bridge (20 MB, DDR3-1600). Otherwise, when we keep the clockspeed more or less constant by looking at the Xeon E5-2667 v3 and the Xeon E5-2690, we get a 1-5% speed difference, and only the memory intensive subtests (omnetpp, libquantum) and xalancbmk (low IPC, branch intensive) show higher improvements.
Once we test both top SKUs with "-Ofast" (a more aggressive compiler setting), the results change quite a bit:
Subtest | Application type | Xeon E5-2699 v4 vs Xeon E5-2690 (-Ofast) | Xeon E5-2699 v4 vs Xeon E5-2690 (-O2) |
400.perlbench | Spam filter | 111% | 89% |
401.bzip2 | Compression | 94% | 76% |
403.gcc | Compiling | 95% | 83% |
429.mcf | Vehicle scheduling | 114% | 97% |
445.gobmk | Game AI | 90% | 76% |
456.hmmer | Protein seq. analysis | 106% | 93% |
458.sjeng | Chess | 93% | 80% |
462.libquantum | Quantum sim | 101% | 90% |
464.h264ref | Video encoding | 89% | 79% |
471.omnetpp | Network sim | 132% | 122% |
473.astar | Pathfinding | 98% | 84% |
483.xalancbmk | XML processing | 105% | 101% |
Switching from -O2 to -Ofast improves Broadwell-EP's absolute performance by over 19%. Meanwhile the relative performance advantage versus the Xeon E5-2690 averages 3%. As a result, the clockspeed disadvantage of the latest Xeon is negated by the increase in IPC. Clearly the latest generation of Xeons benefits more from aggressive optimizations than the previous ones. That is unsurprising of course, but it is interesting that the newest Xeons need more optimization to "hold the line" in single core performance.
So far we can conclude that if you were to upgrade from a Xeon E5-2xxx v1 to a similar v4 model, your single threaded integer code will not run faster without recompiling and optimizing. The process improvements have been used mostly to add more cores in the same power envelope, while at the same time Intel also traded in a few speed bins to add even more cores in the top models. As a result, single core integer performance basically holds the line, nothing more. The only exceptions are memory intensive applications, which benefit from the ever-growing L3 cache and the faster DRAM technology.
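To get a feel for how the per-subtest ratios in the table above combine into a single figure, the sketch below averages both columns. The averaging method behind the quoted average is not stated, so both the arithmetic and the geometric mean are shown; the function names are illustrative, and the inputs are the already rounded table percentages, so small deviations from the quoted figure are to be expected.

```python
import math

# Xeon E5-2699 v4 relative to the Xeon E5-2690 (= 100), copied from the table above.
OFAST = [111, 94, 95, 114, 90, 106, 93, 101, 89, 132, 98, 105]
O2 = [89, 76, 83, 97, 76, 93, 80, 90, 79, 122, 84, 101]


def arithmetic_mean(values):
    """Plain average of the values."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Average of ratios computed via logarithms, a common choice for speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    for name, column in (("-Ofast", OFAST), ("-O2", O2)):
        print(name, round(arithmetic_mean(column), 1), round(geometric_mean(column), 1))
```

With these rounded inputs, the arithmetic mean comes out at roughly 102% for -Ofast and roughly 89% for -O2, which matches the picture of a small lead with aggressive optimization and a clear deficit without it.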
112 Comments
ltcommanderdata - Friday, April 1, 2016
Does anyone know the Windows support situation for Broadwell-EP for workstation use? Microsoft said Broadwell is the last fully supported processor for Windows 7/8.1 with Skylake getting transitional support and Kaby Lake will not be supported. So how does Broadwell-EP fit in? Is it lumped in with Broadwell and is fully supported or will it be treated like Skylake with temporary support until 2018 and only critical security updates after that? And following on will Skylake-EP see any Windows 7/8.1 support at all or will it not be supported since it'll presumably be released after Kaby Lake?
extide - Friday, April 1, 2016
When MS says they are not supporting Skylake on Windows 7, that DOES NOT MEAN it won't work. It just means they are not going to add any specific support for that processor in the older OS's. They are not adding in the speed shift support, essentially.
For some reason the press has not made this very clear, and many people are freaking out thinking that there will be a hard break here where stuff will straight up not work. That is not the case.
Broadwell has no new OS level features over Haswell (unlike Skylake with speed shift) so there is nothing special about Broadwell to the OS. As the poster above mentions, they are all x86 cpu's and will all still work with x86 OS's.
The difference here is between "Fully Supported" and Compatible. Skylake and even Kaby Lake will be compatible with Windows 7/8/8.1.
aryonoco - Friday, April 1, 2016
Johan, this is yet again by far the best Enterprise CPU benchmark that's available anywhere on the net.
Thank you for your detailed, scientific and well documented work. Works like this are not easy, I can only imagine how many man hours (weeks?) compiling this article must have taken. I just want you to know that it's hugely appreciated.
JohanAnandtech - Friday, April 1, 2016
Great to read this after weeks of hard work! :-D
fsdjmellisse - Friday, April 1, 2016
hello, i want to buy E5-2630L v4
any one can give me website for buy it ?
Best regards
HrD - Friday, April 1, 2016
I'm confused by the following:
"The following compiler switches were used on icc:
-fast -openmp -parallel
The results are expressed in GB per second. The following compiler switches were used on icc:
-O3 -fopenmp -static"
Shouldn't one of these refer to icc and the other to gcc?
JohanAnandtech - Friday, April 1, 2016
Pretty sure I did not mix them up. "-fast" does not work on gcc neither does -fopenmp work on icc.
patrickjp93 - Friday, April 1, 2016
Um, wrong and wrong. -Ofast works with GCC 4.9 and later for sure. And -fopenmp is a valid ICC flag post-ICC 13.
JohanAnandtech - Saturday, April 2, 2016
"-fast" is a typical icc flag. (I did not write "-Ofast" that works on gcc 4.8 too)
extide - Friday, April 1, 2016
Johan, if you read the comment, you can see that you mention icc for BOTH.