OpenFOAM

Computational Fluid Dynamics is a very important part of the HPC world. Several readers told us that we should look into OpenFOAM, and my lab was able to work with the professionals of Actiflow. Actiflow specializes in combining aerodynamics and product design. Calculating aerodynamics involves the use of CFD software, and Actiflow uses OpenFOAM to accomplish this. To give you an idea of what these skilled engineers can do, they worked with Ferrari to improve the underbody airflow of the Ferrari 599 and increase its downforce.

We were allowed to use one of their test cases as a benchmark, but we are not allowed to discuss the specific solver. All tests were done on OpenFOAM 2.2.1 and Open MPI 1.6.3.

Many CFD calculations do not scale well on clusters unless you use InfiniBand, and InfiniBand switches are quite expensive; even then there are limits to scaling. Unfortunately, we do not have an InfiniBand switch in the lab. We do have a good 10G Ethernet infrastructure, which, while not as low-latency as InfiniBand, performs rather well. So we can compare our newest Xeon server with a basic cluster.

We also found AVX code inside OpenFOAM 2.2.1, so we assume this is one of the cases where AVX improves FP performance. To understand this real-world test case better, we'll start with a single-threaded benchmark.

Actiflow OpenFOAM – One Thread

As this is AVX code, the clock speed "rules" change. A 2.3GHz Xeon E5 v3 can fall back to 1.9GHz if necessary, but it may also boost to 3.3GHz if the thermals allow it. The Xeon E5-2695 v3 has less TDP headroom and as a result performs slightly slower than the Xeon E5-2699 v3. Still, neither can beat the Xeon E5-2667 v3 in single-threaded HPC performance. The latter is the better chip for this workload, as it guarantees 2.7GHz under AVX load and can boost up to 3.5GHz. As the previous-generation Xeons also support AVX and run between 2.7 and 3.3GHz, they keep up with the Xeon E5-2667 v3.

Of course, most HPC code is now multi-threaded. We next ran OpenFOAM with one thread per physical core, which is about 5% faster than running with one thread per logical core (likely because two threads of heavy AVX code end up competing for the same FP units of a core).
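
For reference, this is roughly how such a run is launched. Consider it a minimal sketch: the actual case setup and solver are Actiflow's and cannot be shown, so the solver name below is just a placeholder.

    # decompose the mesh into 16 subdomains (numberOfSubdomains in
    # system/decomposeParDict), one per physical core
    decomposePar

    # Open MPI 1.6: pin one MPI rank to each physical core so the
    # Hyper-Threading siblings stay idle
    mpirun -np 16 --bind-to-core <solver> -parallel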

Actiflow OpenFOAM

If you work professionally with OpenFOAM, it is clear that it pays off to understand what a certain CPU offers. If money does not matter much, the Xeon E5-2699 v3 does what it has to do, which is to beat everybody else, despite the fact that OpenFOAM does not scale that well beyond a certain point.

To give you an idea of what we're seeing: with 16 threads on the Xeon E5-2699 v3 we were already running at 30 runs per hour. Even though our workload is already a pretty heavy one (>600k cells), it is clear that you need a larger mesh to really use the best Xeons of today to their full potential.

A less expensive option is the Xeon E5-2667 v3, but the real winner here is the Xeon E5-2650L v3, which costs a full $1000 per CPU less than the Xeon E5-2695 v3 and, as we will see on the next page, consumes quite a bit less power.

Comments

  • LostAlone - Saturday, September 20, 2014 - link

    Given the difference in size between the two companies, it's not really all that surprising though. Intel are ten times AMD's size, and I have to imagine that Intel's chip R&D budget alone is bigger than the whole of AMD. And that is sad really, because I'm sure most of us were learning our computer science when AMD were setting the world on fire, so it's tough to see our young loves go off the rails. But Intel have the money to spend, and can pursue so many more potential avenues for improvement than AMD, and that's what makes the difference.
  • Kevin G - Monday, September 8, 2014 - link

    I'm actually surprised they released the 18-core chip for the EP line. In the Ivy Bridge generation, it was the 15-core EX die that was harvested for the 12-core models. I was expecting the same thing here with the 14-core models, though more to do with power binning than raw yields.

    I guess with the recent TSX errata, Intel is just dumping all of the existing EX dies into the EP socket. That is a good means of clearing inventory of a notably buggy chip. When Haswell-EX formally launches, it'll be of a stepping with the TSX bug resolved.
  • SanX - Monday, September 8, 2014 - link

    You have teased us with the claim that the added FMA instructions double floating point performance. Wow! Is it really possible to do that with FP units that are already close to the limit of one operation per clock cycle? This was a good review of integer-related performance, but please team up with Ian to continue with the FP one.
  • JohanAnandtech - Monday, September 8, 2014 - link

    Ian is working on his workstation-oriented review of the latest Xeons.
  • Kevin G - Monday, September 8, 2014 - link

    FMA is commonplace in many RISC architectures. The reason why we're only seeing it now on x86 is that until recently, the ISA only permitted two operands per instruction, while a fused multiply-add needs three sources.

    Improvements in this area may be coming down the line even for legacy code. Intel's micro-op fusion has the potential to take an ordinary multiply and add and fuse them into one FMA operation internally. This type of optimization is something I'd like to see in a future architecture (Skylake?).
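
    For what it's worth, here's a rough sketch of the three-operand form with FMA3 intrinsics (my own illustration, not anything from the article's benchmark; compile with -mfma):

    #include <immintrin.h>

    /* Computes a*b + c in a single vfmadd instruction: three source
       operands, which the old two-operand x86 encodings could not
       express. */
    __m256d fmadd_example(__m256d a, __m256d b, __m256d c) {
        return _mm256_fmadd_pd(a, b, c);
    }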
  • valarauca - Monday, September 8, 2014 - link

    I believe the Intel compiler suite already converts

    x *= y;
    x += z;

    into an FMA operation when confronted with them.
  • Kevin G - Monday, September 8, 2014 - link

    That's with source that is going to be compiled. (And don't get me wrong, that's what a compiler should do!)

    Micro-op fusion works on existing binaries that are years old, so no recompile is necessary. However, micro-op fusion may not work in all situations depending on the actual instruction stream. (Hypothetically, the multiply and the add might have to be adjacent in the instruction stream for fusion to work, but an ancient compiler could have slipped other instructions in between them to hide execution latencies, so it would never fire in that binary.)
  • DIYEyal - Monday, September 8, 2014 - link

    Very interesting read.
    And I think I found a typo: page 5 (power optimization). It is well known that THE (not needed) Haswell HAS (is/ has been) optimized for low idle power.
  • vLsL2VnDmWjoTByaVLxb - Monday, September 8, 2014 - link

    Colors or labeling for your HPC Power Consumption graph don't seem right.
  • JohanAnandtech - Monday, September 8, 2014 - link

    Fixed, thanks for pointing it out.
