The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads

Author: Johan De Gelas

by Johan De Gelas on March 31, 2016 12:30 PM EST

Single Core Integer Performance With SPEC CPU2006

In past server reviews, I used LZMA (7-zip) compression and decompression to evaluate single-threaded performance. I was well aware that, while it is a decent integer test, it gives a very narrow view of a CPU. After noticing that my colleagues used SPEC CPU2006, and after discussing the matter with several people, I realized that SPEC CPU2006 is a much better way to evaluate single core performance. Even though SPEC CPU2006 is more HPC and workstation oriented, it contains a good variety of integer workloads.

I also wanted to keep the settings as "normal" as possible. So I used:

64 bit gcc: the most used compiler on Linux, a good all-round compiler that does not try to "break" benchmarks (libquantum...)
gcc version 4.8.4: 4.8.x has been around for a long time, very mature version
-O2 -fno-strict-aliasing: standard compiler settings that many developers use
Run 2 copies and bind them to the first core

The ultimate objective is to measure performance in non-"aggressively optimized" applications where for some reason - as is frequently the case - a "multi thread unfriendly" task keeps us waiting. As we want to be able to compare these numbers to other architectures such as the IBM POWER8, we decided to use all threads available on a single core. In the case of Intel, this means one physical core with two simultaneous threads running on top of it.
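As a rough illustration of the "two copies on the first core" setup, the sketch below starts one copy of a command on each hardware thread of core 0 on Linux. It assumes that each copy is pinned to its own logical CPU; the review only states that both copies were bound to the first core. The command passed to run_copies_on_first_core is a hypothetical placeholder, not the actual SPEC invocation.

```python
import os
import subprocess

# Linux sysfs file that lists the logical CPUs sharing physical core 0.
SIBLINGS_PATH = "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,28' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_of_cpu0():
    """Return the logical CPUs (hardware threads) that belong to the first core."""
    with open(SIBLINGS_PATH) as handle:
        return parse_cpu_list(handle.read())


def run_copies_on_first_core(cmd):
    """Start one copy of cmd per hardware thread of core 0 and wait for all of them.

    Assumption: one copy per logical CPU. Linux only (os.sched_setaffinity).
    """
    procs = []
    for cpu in siblings_of_cpu0():
        # The child pins itself to a single logical CPU just before exec.
        proc = subprocess.Popen(
            cmd, preexec_fn=lambda c=cpu: os.sched_setaffinity(0, {c})
        )
        procs.append(proc)
    return [proc.wait() for proc in procs]
```

On a Hyper-Threading system, siblings_of_cpu0() would typically return two logical CPUs, so a call such as run_copies_on_first_core(["./my_benchmark"]) (hypothetical binary) would run two copies that compete for the same physical core.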

We included the Opteron 6376 for nostalgic reasons. We are showing the results of 2 threads running on top of one module with 2 "integer cores".

Subtest	Xeon E5-2690	Opteron 6376	Xeon E5-2697 v2	Xeon E5-2667 v3	Xeon E5-2699 v3	Xeon E5-2699 v4
400.perlbench	41.1	29.3	37.6	42.6	39.9	36.6
401.bzip2	33.4	24.1	30.1	33.1	29.9	25.3
403.gcc	40.2	26.7	38.9	42.4	36.4	33.3
429.mcf	45.1	31.7	46.8	46.4	41.6	43.9
445.gobmk	36.4	25.5	33.2	34.9	31.7	27.7
456.hmmer	30.4	26.1	27.6	31	27.1	28.4
458.sjeng	35.2	24.7	32.8	35.2	30.5	28.3
462.libquantum	74.9	39.9	79.3	84.4	62.2	67.3
464.h264ref	51.7	34.2	48.1	52.1	45.2	40.7
471.omnetpp	24.5	25.3	26.8	29.4	26.6	29.9
473.astar	28.2	20.7	26.1	27.9	24	23.6
483.xalancbmk	41.5	28.2	41.4	48.2	42.4	41.8

Unless you are used to seeing these numbers, this does not tell you too much. As Sandy Bridge EP (Xeon E5 v1) is about 4 years old, the servers based upon this CPU are going to get replaced by newer ones. So Sandy Bridge is our reference, and Sandy Bridge performance is considered to be 100%.

Subtest	Application type	Xeon E5-2690	Opteron 6376	Xeon E5-2697 v2	Xeon E5-2667 v3	Xeon E5-2699 v3	Xeon E5-2699 v4
400.perlbench	Spam filter	100%	71%	91%	104%	97%	89%
401.bzip2	Compression	100%	72%	90%	99%	90%	76%
403.gcc	Compiling	100%	66%	97%	105%	91%	83%
429.mcf	Vehicle scheduling	100%	70%	104%	103%	92%	97%
445.gobmk	Game AI	100%	70%	91%	96%	87%	76%
456.hmmer	Protein seq. analyses	100%	86%	91%	102%	89%	93%
458.sjeng	Chess	100%	70%	93%	100%	87%	80%
462.libquantum	Quantum sim	100%	53%	106%	113%	83%	90%
464.h264ref	Video encoding	100%	66%	93%	101%	87%	79%
471.omnetpp	Network sim	100%	103%	109%	120%	110%	122%
473.astar	Pathfinding	100%	73%	93%	99%	85%	84%
483.xalancbmk	XML processing	100%	68%	100%	116%	102%	101%
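The percentages are simply each score divided by the Xeon E5-2690 score for the same subtest. A minimal sketch of that normalization, using three rows copied from the first table (the function and variable names are ours, chosen for illustration):

```python
# Column order matches the tables above; the first column is the baseline.
SYSTEMS = [
    "Xeon E5-2690", "Opteron 6376", "Xeon E5-2697 v2",
    "Xeon E5-2667 v3", "Xeon E5-2699 v3", "Xeon E5-2699 v4",
]

# Raw scores copied from the first table (three subtests shown for brevity).
SCORES = {
    "400.perlbench": [41.1, 29.3, 37.6, 42.6, 39.9, 36.6],
    "429.mcf": [45.1, 31.7, 46.8, 46.4, 41.6, 43.9],
    "462.libquantum": [74.9, 39.9, 79.3, 84.4, 62.2, 67.3],
}


def relative_to_baseline(scores, baseline_index=0):
    """Express every score as a whole-number percentage of the baseline score."""
    base = scores[baseline_index]
    return [round(100 * score / base) for score in scores]


if __name__ == "__main__":
    for subtest, scores in SCORES.items():
        row = relative_to_baseline(scores)
        print(subtest, dict(zip(SYSTEMS, row)))
```

For these three subtests the computed values match the rows in the table above.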

Many smart people have spent weeks - if not months - on SPEC CPU2006 analysis, so we will not pretend we can offer you a complete picture in a few days. If you want a detailed analysis of compilers and SPEC CPU2006, I recommend the very detailed article by SPEC CPU2006 meister Andreas Stiller in the February issue of c't (German computer magazine).

We need much more profiling data than we could gather in the past weeks. But for what we can do, we'll start with the most important parameter: clockspeed.

One of the most important things to realize is that - especially with badly threaded workloads - these massive multi-core CPUs almost never work at their advertised clockspeed.

The Xeon E5-2690 can run at 3.3 GHz with all cores busy, and is capable of boosting up to 3.8 GHz
The Xeon E5-2697 v2 can run at 3 GHz with all cores busy, and is capable of boosting up to 3.5 GHz
The Xeon E5-2699 v3 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
The Xeon E5-2667 v3 3.2 GHz is a specialized high frequency model. It can run at 3.4 GHz with all cores busy, and is capable of boosting up to 3.6 GHz
The Xeon E5-2699 v4 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz

So that already explains a lot. In contrast to many benchmark applications, SPEC CPU2006 runs for a long time (5 to 15 minutes per test), and our first impression is that the HCC (high core count) parts are not able to keep their cores at maximum turbo boost for that long. Otherwise there is no reason why a Xeon E5-2699 v3 or v4 would perform worse than a Xeon E5-2667 v3: both can run at 3.6 GHz when one core is active.
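To put the clock speeds above into perspective, the sketch below computes each CPU's clock ratio versus the Xeon E5-2690 for two bounding cases: all-core turbo and maximum turbo. This is only arithmetic on the listed frequencies. Which clock a core actually sustained during the runs was not measured here, and reading the ratio as "expected performance" assumes equal IPC, which is a simplification.

```python
# Clock speeds in GHz as listed above: (all-core turbo, maximum turbo).
CLOCKS = {
    "Xeon E5-2690": (3.3, 3.8),
    "Xeon E5-2697 v2": (3.0, 3.5),
    "Xeon E5-2699 v3": (2.8, 3.6),
    "Xeon E5-2667 v3": (3.4, 3.6),
    "Xeon E5-2699 v4": (2.8, 3.6),
}


def clock_ratio_percent(cpu, baseline="Xeon E5-2690"):
    """Return (all-core ratio, max-turbo ratio) versus the baseline, in percent."""
    all_core, max_turbo = CLOCKS[cpu]
    base_all_core, base_max_turbo = CLOCKS[baseline]
    return (
        round(100 * all_core / base_all_core),
        round(100 * max_turbo / base_max_turbo),
    )


if __name__ == "__main__":
    for cpu in CLOCKS:
        print(cpu, clock_ratio_percent(cpu))
```

For the Xeon E5-2699 v4 this gives 85% at all-core turbo and 95% at maximum turbo. Several of its -O2 results in the table are below 95%, which is in line with the impression that maximum turbo is not sustained.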

The low IPC, memory intensive network simulator omnetpp seems to be the only test that runs significantly better on the newer cores (Haswell, Broadwell) compared to Sandy Bridge. That also seems to be the only benchmark where the high core count chips (E5-2699 v4, E5-2699 v3) continue to clearly outperform Sandy Bridge. We could pinpoint the reason by testing with different memory speeds and channels. The E5-2699 v4 can offer the highest performance thanks to the larger L3 cache (55 MB) and the higher DIMM speed (DDR4-2400) compared to Sandy Bridge (20 MB, DDR3-1600). Otherwise, when we keep the clockspeed more or less constant by looking at the Xeon E5-2667 v3 and the Xeon E5-2690, we get a 1-5% speed difference, and only the memory intensive subtests (omnetpp, libquantum) and xalancbmk (low IPC, branch intensive) show higher improvements.

Once we test both top SKUs with "-Ofast" (a more aggressive compiler setting), the results change quite a bit:

Subtest	Application type	Xeon E5-2699 v4 vs Xeon E5-2690 (-Ofast)	Xeon E5-2699 v4 vs Xeon E5-2690 (-O2)
400.perlbench	Spam filter	111%	89%
401.bzip2	Compression	94%	76%
403.gcc	Compiling	95%	83%
429.mcf	Vehicle scheduling	114%	97%
445.gobmk	Game AI	90%	76%
456.hmmer	Protein seq. analyses	106%	93%
458.sjeng	Chess	93%	80%
462.libquantum	Quantum sim	101%	90%
464.h264ref	Video encoding	89%	79%
471.omnetpp	Network sim	132%	122%
473.astar	Pathfinding	98%	84%
483.xalancbmk	XML processing	105%	101%

Switching from -O2 to -Ofast improves Broadwell-EP's absolute performance by over 19%. Meanwhile, its relative performance advantage versus the Xeon E5-2690 averages 3%. In other words, with -Ofast the clockspeed disadvantage of the latest Xeon is negated by the increase in IPC. Clearly the latest generation of Xeons benefits more from aggressive optimizations than the previous ones. That is unsurprising of course, but it is interesting that the newest Xeons need more optimization to "hold the line" in single core performance.
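The averages can be rechecked from the two columns of the table above. The review does not say which kind of average was used, so the sketch below computes both the arithmetic and the geometric mean; treating either one as the method behind the quoted figure is an assumption.

```python
from statistics import geometric_mean, mean

# Xeon E5-2699 v4 relative to the Xeon E5-2690 (= 100), copied from the table above.
OFAST = [111, 94, 95, 114, 90, 106, 93, 101, 89, 132, 98, 105]
O2 = [89, 76, 83, 97, 76, 93, 80, 90, 79, 122, 84, 101]


def summarize(percentages):
    """Return the arithmetic and geometric mean of a column of percentages."""
    return {
        "arithmetic": mean(percentages),
        "geometric": geometric_mean(percentages),
    }


if __name__ == "__main__":
    for label, column in (("-Ofast", OFAST), ("-O2", O2)):
        stats = summarize(column)
        print(f"{label}: arithmetic {stats['arithmetic']:.1f}%, "
              f"geometric {stats['geometric']:.1f}%")
```

The arithmetic mean comes out at about 102.3% for -Ofast and about 89.2% for -O2, so the -Ofast advantage of roughly 2 to 3% is in the same range as the figure quoted above.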

So far we can conclude that if you were to upgrade from a Xeon E5-2xxx v1 to a similar v4 model, your single threaded integer code will not run faster without recompiling and optimizing. The process improvements have been used mostly to add more cores in the same power envelope, while at the same time Intel also traded in a few speed bins to add even more cores in the top models. As a result, single core integer performance basically holds the line, nothing more. The only exceptions are memory intensive applications, which benefit from the ever-growing L3 cache and the faster DRAM technology.

112 Comments

JohanAnandtech - Saturday, April 2, 2016 - link
Ok, thanks, time to sleep a little longer. I have fixed the error.
xrror - Friday, April 1, 2016 - link
It's depressing to see the mobile-first design philosophy really gutting into the last bastion of x86 performance.

I mean I get it - a 22 (20) core xeon wouldn't even exist without the aggressive power management tech needed to keep it from melting or needing exotic cooling. But it's still depressing to see ALL of the arch improvements immediately negated with lowered clock speeds, or worse "turbo speeds" you will never actually see once the machine is running production loads.

The engineering behind these big core count chips though is always very impressive. Also did Intel ever say how they "fixed" TSX?
FunBunny2 - Friday, April 1, 2016 - link
"It's depressing to see the mobile-first design philosophy really gutting into the last bastion of x86 performance."

welcome to the world of laissez faire capitalism: do what makes the most money today, irregardless of future consequences. used to be, Intel could rely on M$ making the next versions of Windoze and Office impossible to run on existing Pentiums, thus driving sales of the next Pentium (a whole machine, at that). these days it's up to gamers and data centres. not taking any bets on which turns out to be in the driver's seat.
xrror - Friday, April 1, 2016 - link
Well, considering that "computer gaming" has degraded to whatever the kids are running on their smartphones, or the parent's tablet I'm not hopeful for any new resurgence in demand for high performance PC's in the mass market.

So the future consequences for Intel prioritizing power efficiency over performance, or possibly developing a separate fabrication tech for performance is... likely not very much. So there really is no "future consequence" for Intel. Sure they could go out and actually try and make a 10Ghz 9nm part possible, but nobody in 2020 would buy it because... it probably would go into whatever iDevice they care about. And HPC market I dunno. Maybe if it datamines marketing data faster or can microtrade on the stock market faster or something. meh.

The general public really doesn't care about performance anymore (honestly, they may never have), only how portable it is and if a device is good enough to run their stuff on the go.

The high end market like these multi-core xeons though, is strange because you'd think this is where Intel would go all in, but I guess when your only competitors are IBM Power and (currently non-competitive) AMD I dunno...

I mean it's sad, even Intel has to beg to justify its R&D expenses to shareholders - which is stupid because Intel's R&D is one of its biggest strengths. But such as it is. Apr 1 rant over ;)
abufrejoval - Friday, April 1, 2016 - link
Johan, you keep bemoaning the fact that lack of competition seems to stop "real progress" and I wonder where you expect that progress to happen.

More specifically you seem to desire more GHz and I can understand that desire, which may originate from that crazy 40MHz to 4GHz rush we all experienced somewhere in the decade starting in the mid nineties.

I understand the emotion, but I wonder how it fits the scientific mind I see everywhere else in your work, because 8, 16 or 32 GHz is simply not going to happen, competition or not.

Sure 8GHz are possible, you can even purchase 5GHz off the shelves. But it simply doesn't deliver in terms of Oomp/$. And Web Scale is all about value/€ and the main driver of server evolution today.

We'll still see radical speedups where it counts, but it will have to be via special purpose function blocks either on SoCs, or by adding a couple of extra instructions or by doing something as radical as Micron's Automata Processor.

But general purpose von Neumann has hit the Gigahertz wall years ago and nothing can change that except a different model of compute.

I liked the reference to Andreas Stiller, but I'm not sure everybody here has a subscription to c't like I do since the early 1990's. There could also be the tiny issue that not everyone outside Belgium is quadrilingual.

Make no mistake: I love your work! It's a pleasure to read for form, style and the content!
The Von Matrices - Saturday, April 2, 2016 - link
Any indication of the QPI speed of these chips? Did Intel increase it from the 9.6 GT/s in Haswell-EP?
Ian Cutress - Saturday, April 2, 2016 - link
Most of the high end are 9.6 GT/s. https://twitter.com/IanCutress/status/715582714099...
watersb - Saturday, April 2, 2016 - link
Johan, this is fantastic work. Thanks very much.

Any way to address RAS features?
isrv - Saturday, April 2, 2016 - link
well, i'm completely disappointed.
web servers want higher clock speed.
single-thread load (like PHP) become even slower on those E5v4 due to drop in GHz's.
still, the best CPU's for that is E3-1290v2, E3-1281v3 (and 1286v3), E3-1280v5, E5-1630v3, E5-1620v2 and the only one 6-core E5-1660v2
all those are 3.7Ghz (pointless to look at turbo speed since we're under constant 24/7 load).

i was hoping to at least one 3.8GHz or even higher.

so no changes here, E5-1660v2 is still the fastest web-server CPU.
or E5-1630v3 by sacrificing 2 cores for a bit faster memory.
patrickjp93 - Sunday, April 3, 2016 - link
For those 4-8 core chips, the turbo boost is maintainable for 24/7 workloads if your cooling is sufficient. You seem to know far less about this environment than you let on. And who the hell still uses single-threaded PHP? And you're not taking into account better caching algorithms and other architectural improvements that make the 200MHz slower V4 run faster than your V2.
