The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads
by Johan De Gelas on March 31, 2016 12:30 PM EST
Single Core Integer Performance With SPEC CPU2006
In past server reviews, I used LZMA (7-zip) compression and decompression to evaluate single threaded performance. But I was well aware that, while it was a decent integer test, it also gave a very myopic view. After noticing that my colleagues used SPEC CPU2006, and after discussing the matter with several people, I realized that running SPEC CPU2006 was a much better way to evaluate single core performance. Even though SPEC CPU2006 is more HPC and workstation oriented, it contains a good variety of integer workloads.
I also wanted to keep the settings as "normal" as possible. So I used:
- 64 bit gcc: the most used compiler on Linux, a good all-round compiler that does not try to "break" benchmarks (libquantum...)
- gcc version 4.8.4: 4.8.x has been around for a long time, very mature version
- -O2 -fno-strict-aliasing: standard compiler settings that many developers use
- Run 2 copies and bind them to the first core
The ultimate objective is to measure performance in non-"aggressively optimized" applications where for some reason - as is frequently the case - a "multi thread unfriendly" task keeps us waiting. As we want to be able to compare these numbers to other architectures such as the IBM POWER8, we decided to use all threads available on a single core. In the case of Intel, this means one physical core with two simultaneous threads (Hyper-Threading) running on top of it.
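The "run 2 copies and bind them to the first core" step can be sketched in Python. This is a minimal, Linux-only sketch under stated assumptions, not the procedure actually used for this review: it assumes the kernel lists the hardware threads of core 0 in `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`, and `parse_cpu_list`, `first_core_threads` and `run_copies_on_first_core` are hypothetical helper names introduced for illustration.

```python
import os
import subprocess

# Assumption: standard Linux sysfs location for the threads sharing core 0.
SIBLINGS_PATH = "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,36" or "0-1" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def first_core_threads(path=SIBLINGS_PATH):
    """Return the logical CPUs that share the first physical core."""
    with open(path) as handle:
        return parse_cpu_list(handle.read())


def run_copies_on_first_core(command, copies=2):
    """Start `copies` instances of `command` and pin each to core 0's threads."""
    cpus = set(first_core_threads())
    processes = []
    for _ in range(copies):
        proc = subprocess.Popen(command)
        # Restrict the child to the first core right after it has been launched.
        os.sched_setaffinity(proc.pid, cpus)
        processes.append(proc)
    # Wait for every copy and return the exit codes.
    return [proc.wait() for proc in processes]
```

Calling `run_copies_on_first_core([...])` with the benchmark command line of your choice (a placeholder here) would keep both copies on the two hardware threads of one Intel core, or on the two integer cores of one Opteron module if the kernel reports them as siblings.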
We included the Opteron 6376 for nostalgic reasons. We are showing the results of 2 threads running on top of one module with 2 "integer cores".
Subtest | Xeon E5-2690 | Opteron 6376 | Xeon E5-2697 v2 | Xeon E5-2667 v3 | Xeon E5-2699 v3 | Xeon E5-2699 v4 |
400.perlbench | 41.1 | 29.3 | 37.6 | 42.6 | 39.9 | 36.6 |
401.bzip2 | 33.4 | 24.1 | 30.1 | 33.1 | 29.9 | 25.3 |
403.gcc | 40.2 | 26.7 | 38.9 | 42.4 | 36.4 | 33.3 |
429.mcf | 45.1 | 31.7 | 46.8 | 46.4 | 41.6 | 43.9 |
445.gobmk | 36.4 | 25.5 | 33.2 | 34.9 | 31.7 | 27.7 |
456.hmmer | 30.4 | 26.1 | 27.6 | 31 | 27.1 | 28.4 |
458.sjeng | 35.2 | 24.7 | 32.8 | 35.2 | 30.5 | 28.3 |
462.libquantum | 74.9 | 39.9 | 79.3 | 84.4 | 62.2 | 67.3 |
464.h264ref | 51.7 | 34.2 | 48.1 | 52.1 | 45.2 | 40.7 |
471.omnetpp | 24.5 | 25.3 | 26.8 | 29.4 | 26.6 | 29.9 |
473.astar | 28.2 | 20.7 | 26.1 | 27.9 | 24 | 23.6 |
483.xalancbmk | 41.5 | 28.2 | 41.4 | 48.2 | 42.4 | 41.8 |
Unless you are used to seeing these numbers, this does not tell you too much. As Sandy Bridge EP (Xeon E5 v1) is about 4 years old, the servers based upon this CPU are going to get replaced by newer ones. So Sandy Bridge is our reference, and Sandy Bridge performance is considered to be 100%.
Subtest | Application type | Xeon E5-2690 | Opteron 6376 | Xeon E5-2697 v2 | Xeon E5-2667 v3 | Xeon E5-2699 v3 | Xeon E5-2699 v4 |
400.perlbench | Spam filter | 100% | 71% | 91% | 104% | 97% | 89% |
401.bzip2 | Compression | 100% | 72% | 90% | 99% | 90% | 76% |
403.gcc | Compiling | 100% | 66% | 97% | 105% | 91% | 83% |
429.mcf | Vehicle scheduling | 100% | 70% | 104% | 103% | 92% | 97% |
445.gobmk | Game AI | 100% | 70% | 91% | 96% | 87% | 76% |
456.hmmer | Protein seq. analyses | 100% | 86% | 91% | 102% | 89% | 93% |
458.sjeng | Chess | 100% | 70% | 93% | 100% | 87% | 80% |
462.libquantum | Quantum sim | 100% | 53% | 106% | 113% | 83% | 90% |
464.h264ref | Video encoding | 100% | 66% | 93% | 101% | 87% | 79% |
471.omnetpp | Network sim | 100% | 103% | 109% | 120% | 110% | 122% |
473.astar | Pathfinding | 100% | 73% | 93% | 99% | 85% | 84% |
483.xalancbmk | XML processing | 100% | 68% | 100% | 116% | 102% | 101% |
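The percentages in this table follow from the raw scores by dividing each score by the Xeon E5-2690 result. A minimal sketch of that normalization, using the 400.perlbench row as input (`normalize_to_baseline` is a hypothetical helper name, and rounding to whole percentages is an assumption that happens to match the values shown):

```python
def normalize_to_baseline(scores, baseline):
    """Express each score as a whole-number percentage of the baseline score."""
    reference = scores[baseline]
    return {cpu: round(100 * value / reference) for cpu, value in scores.items()}


# 400.perlbench row from the raw-score table.
perlbench = {
    "Xeon E5-2690": 41.1,
    "Opteron 6376": 29.3,
    "Xeon E5-2697 v2": 37.6,
    "Xeon E5-2667 v3": 42.6,
    "Xeon E5-2699 v3": 39.9,
    "Xeon E5-2699 v4": 36.6,
}

# Sandy Bridge-EP (Xeon E5-2690) is the 100% reference.
perlbench_relative = normalize_to_baseline(perlbench, "Xeon E5-2690")
```

For this row the result is 100, 71, 91, 104, 97 and 89, the same figures as in the 400.perlbench line of the table.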
Many smart people have spent weeks - if not months - on SPEC CPU2006 analysis, so we will not pretend we can offer you a complete picture in a few days. If you want a detailed analysis of compilers and CPU2006, I recommend the very detailed article by SPEC CPU2006 meister Andreas Stiller in the February issue of c't (German computer magazine).
We need much more profiling data than we could gather in the past weeks. But with what we have, we'll start with the most important parameter: clockspeed.
One of the most important things to realize is that - especially with badly threaded workloads - these massive multi-core CPUs almost never work at their advertised clockspeed.
- The Xeon E5-2690 can run at 3.3 GHz with all cores busy, and is capable of boosting up to 3.8 GHz
- The Xeon E5-2697 v2 can run at 3 GHz with all cores busy, and is capable of boosting up to 3.5 GHz
- The Xeon E5-2699 v3 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
- The Xeon E5-2667 v3 3.2 GHz is a specialized high frequency model. It can run at 3.4 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
- The Xeon E5-2699 v4 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
So that already explains a lot. In contrast to many benchmark applications, SPEC CPU2006 runs for a long time (5 to 15 minutes per test), and our first impression is that the HCC parts are not able to keep all of their cores at their maximum turbo boost. Otherwise there is no reason why a Xeon E5-2699 v3 or v4 would perform worse than a Xeon E5-2667 v3: both can run at 3.6 GHz when one core is active.
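One way to check what clock a core actually sustains during such a long run is to sample the reported frequency while the test is running. The following is a rough sketch, assuming an x86 Linux system where `/proc/cpuinfo` contains one `cpu MHz` line per logical CPU; it is not how the numbers in this review were gathered, and `parse_cpu_mhz` and `sample_clocks` are hypothetical helper names.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Return the "cpu MHz" value of each logical CPU, in order of appearance."""
    clocks = []
    for line in cpuinfo_text.splitlines():
        if line.startswith("cpu MHz"):
            # The line looks like "cpu MHz : 1200.000"; keep the number only.
            clocks.append(float(line.split(":", 1)[1]))
    return clocks


def sample_clocks(path="/proc/cpuinfo"):
    """Read one snapshot of the per-CPU clocks from the kernel."""
    with open(path) as handle:
        return parse_cpu_mhz(handle.read())
```

Each call to `sample_clocks()` is only a snapshot, so it would have to be called repeatedly over the 5 to 15 minutes of a subtest to see whether the busy core stays near its maximum turbo clock.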
The low IPC, memory intensive network simulator omnetpp seems to be the only test that runs significantly better on the newer cores (Haswell, Broadwell) compared to Sandy Bridge. That also seems to be the only benchmark where the high core count chips (E5-2699 v4, E5-2699 v3) continue to outperform Sandy Bridge. We were able to pinpoint the reason by testing with different memory speeds and channels: the E5-2699 v4 offers the highest performance thanks to the larger L3 cache (55 MB) and the higher DIMM speed (DDR4-2400) compared to Sandy Bridge (20 MB, DDR3-1600). Otherwise, when we keep the clockspeed more or less constant by looking at the Xeon E5-2667 v3 and the Xeon E5-2690, we get a 1-5% speed difference, and only the memory intensive subtests (omnetpp, libquantum) and xalancbmk (low IPC, branch intensive) show higher improvements.
Once we test both top SKUs with "-Ofast" (a more aggressive compiler setting), the results change quite a bit:
Subtest | Application type | Xeon E5-2699 v4 vs Xeon E5-2690 (-Ofast) | Xeon E5-2699 v4 vs Xeon E5-2690 (-O2) |
400.perlbench | Spam filter | 111% | 89% |
401.bzip2 | Compression | 94% | 76% |
403.gcc | Compiling | 95% | 83% |
429.mcf | Vehicle scheduling | 114% | 97% |
445.gobmk | Game AI | 90% | 76% |
456.hmmer | Protein seq. analyses | 106% | 93% |
458.sjeng | Chess | 93% | 80% |
462.libquantum | Quantum sim | 101% | 90% |
464.h264ref | Video encoding | 89% | 79% |
471.omnetpp | Network sim | 132% | 122% |
473.astar | Pathfinding | 98% | 84% |
483.xalancbmk | XML processing | 105% | 101% |
Switching from -O2 to -Ofast improves Broadwell-EP's absolute performance by over 19%. Meanwhile the relative performance advantage versus the Xeon E5-2690 averages 3%. As a result, the clockspeed disadvantage of the latest Xeon is negated by the increase in IPC. Clearly the latest generation of Xeons benefits more from aggressive optimizations than the previous ones. That is unsurprising of course, but it is interesting that the newest Xeons need more optimization to "hold the line" in single core performance.
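The averages over the two columns of the -Ofast table can be recomputed from the rounded percentages. The sketch below is only an illustration: the article does not say which kind of average it used, so both the arithmetic and the geometric mean are shown, and `summarize` is a hypothetical helper name.

```python
from statistics import geometric_mean, mean

# Xeon E5-2699 v4 relative to the Xeon E5-2690 (= 100%), per subtest,
# in table order from 400.perlbench to 483.xalancbmk.
ofast = [111, 94, 95, 114, 90, 106, 93, 101, 89, 132, 98, 105]
o2 = [89, 76, 83, 97, 76, 93, 80, 90, 79, 122, 84, 101]


def summarize(percentages):
    """Return the arithmetic and geometric mean of a column of percentages."""
    return {
        "arithmetic": mean(percentages),
        "geometric": geometric_mean(percentages),
    }


ofast_summary = summarize(ofast)
o2_summary = summarize(o2)
```

With these rounded inputs the arithmetic mean comes out at roughly 102% for -Ofast and roughly 89% for -O2, which is in the same ballpark as the small average advantage quoted above; the exact figure depends on rounding and on the averaging method. The 19% absolute gain cannot be derived from this table, since it only lists results relative to the Xeon E5-2690.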
So far we can conclude that if you were to upgrade from a Xeon E5-2xxx v1 to a similar v4 model, your single threaded integer code will not run faster without recompiling and optimizing. The process improvements have been used mostly to add more cores in the same power envelope, while at the same time Intel also traded in a few speed bins to add even more cores in the top models. As a result, single core integer performance basically holds the line, nothing more. The only exceptions are memory intensive applications, which benefit from the ever-growing L3 cache and the faster DRAM technology.
112 Comments
jhh - Thursday, March 31, 2016
The article says TSX-NI is supported on the E5, but if one looks at Intel ARK, it say it's not. Do the processors say they support TSX-NI? Or is this another one of the things which will be left for the E7?
JohanAnandtech - Friday, April 1, 2016
Intel's official slides say: "supports TSX". All SKUs, no exceptions.
Oxford Guy - Thursday, March 31, 2016
Bigger, badder, still obsolete cores.
patrickjp93 - Friday, April 1, 2016
Obsolete? Troll.
Oxford Guy - Tuesday, April 5, 2016
Unlike you, propagandist, I know what Skylake is.
benzosaurus - Thursday, March 31, 2016
"You can replace a dual Xeon 5680 with one Xeon E5-2699 v4 and almost double your performance while halving the CPU power consumption."
I mean you can, but you can buy 4 X5680s for a quarter the price of a single E5-2699v4. It takes a lot of power savings to make that worthwhile. The pricing in the server market's always seemed weirdly non-linear to me.
warreo - Friday, April 1, 2016
Presumably, it's not just about TCO. Space is at a premium in a datacenter, and so being able to fit more performance per sq ft also warrants a higher price, just like how notebook parts have historically been more expensive than their desktop equivalents.
ShieTar - Friday, April 1, 2016
But you don't get 4 1366-Systems for the price of one 2011-3 System. Depending on your Memory, Storage and Interconnect Needs, even two full Systems based on the Xeon 5680 may cost you more than one system based on the E5-2699 v4. One less Infiniband-Adapter can easily save you 500$ in Hardware.
And you are not only halving the CPU power consumption, but also the power consumption of the rest of the system that you no longer use, so instead of 140W you are saving probably at least 200W per System, which can already add up to more than 1k$ in electricity and cooling bills for a 24/7 machine running for 3 years.
And last, but by no means least, less parts means less space, less chance for failure, less maintenance effort. If you happily waste a few hours here or there to maintain your own workstation, you don't do the math, but if you have to pay somebody to do it, salaries matter quickly. With an MTBF for an entire server rarely being much higher than 40.000, and recovery/repair easily taking you a person-day of work, each system generates about 1.7 hours of work per year. Cost of work (it's more than salaries, of course) probably comes up to 100$ for a skilled technical administrator, thus producing another 500$ over 3 years of added operational cost.
And of course, space matters as well. If your data center is filled, it can be more cost effective to replace the old CPUs with new expensive ones, rather than build a new facility to fill with more old Systems.
If you add it all up, I doubt you can get a System with an Xeon 5680 and operate it over 3 years for anything below 20.000$. So going from two 20.000$-Systems to a single 24.000$ Dollar System (because of an extra 4000$ for the big CPU) should save you a lot of money in the long run.
JohanAnandtech - Friday, April 1, 2016
Where do you get your pricing info from? I can not imagine that server vendors still sell X5680s.
extide - Friday, April 1, 2016
Yeah, if you go used. No enterprise sysadmin worth his salt is ever going to put used gear that is not in warranty, and in support into production.