Dissecting Intel's EPYC Benchmarks: Performance Through the Lens of Competitive Analysis

Name: Dissecting Intel's EPYC Benchmarks: Performance Through the Lens of Competitive Analysis
Item: Dissecting Intel's EPYC Benchmarks: Performance Through the Lens of Competitive Analysis

by Johan De Gelas & Ian Cutress on November 28, 2017 9:00 AM EST

105 Comments | Add A Comment

105 Comments

HPC Benchmarks

Discussing HPC benchmarks feels always like opening a can of worms to me. Each benchmark requires a thorough understanding of the software and performance can be tuned massively by using the right compiler settings. And to make matters worse: in many cases, these workloads can be run much faster on a GPU or MIC, making CPU benchmarking in some situations irrelevant.

NAMD (NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. It is rather memory bandwidth limited, as even with the advantage of an AVX-512 binary, the Xeon 8160 does not defeat the AVX2-equipped AMD EPYC 7601.

LAMMPS is classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. GROMACS (for GROningen MAchine for Chemical Simulations) primarily does simulations for biochemical molecules (bonded interactions). Intel compiled the AMD version with the Intel compiler and AVX2. The Intel machines were running AVX-512 binaries.

For these three tests, the CPU benchmarks results do not really matter. NAMD runs about 8 times faster on an NVIDIA P100. LAMMPS and GROMACS run about 3 times faster on a GPU, and also scale out with multiple GPUs.

Monte Carlo is a numerical method that uses statistical sampling techniques to approximate solutions to quantitative problems. In finance, Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. This is a compute bound, double precision workload that does not run faster on a GPU than on Intel's AVX-512 capable Xeons. In fact, as far as we know the best dual socket Xeons are quite a bit faster than the P100 based Tesla. Some of these tests are also FP latency sensitive.

Black-Scholes is another popular mathematical model used in finance. As this benchmark is also double precision, the dual socket Xeons should be quite competitive compared to GPUs.

So only the Monte Carlo and Black Scholes are really relevant, showing that AVX-512 binaries give the Intel Xeons the edge in a limited number of HPC applications. In most HPC cases, it is probably better to buy a much more affordable CPU and to add a GPU or even a MIC.

The Caveats

Intel drops three big caveats when reporting these numbers, as shown in the bullet points at the bottom of the slide.

Firstly is that these are single node measurements: One 32-core EPYC vs 20/24-core Intel processors. Both of these CPUs, the Gold 6148 and the Platinum 8160, are in the ball-park pricing of the EPYC. This is different to the 8160/8180 numbers that Intel has provided throughout the rest of the benchmarking numbers.

The second is the compiler situation: in each benchmark, Intel used the Intel compiler for Intel CPUs, but compiled the AMD code on GCC, LLVM and the Intel compiler, choosing the best result. Because Intel is going for peak hardware performance, there is no obvious need for Intel to ensure compiler parity here. Compiler choice, as always, can have a substantial effect on a real-world HPC can of worms.

The third caveat is that Intel even admits that in some of these tests, they have different products oriented to these workloads because they offer faster memory. But as we point out on most tests, GPUs also work well here.

Database Performance & Variability Conclusion: Competition Is Good

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

105 Comments

View All Comments

lefty2 - Tuesday, November 28, 2017 - link
When Skylake runs AVX 512 and AVX2 instructions it causes both the clock frequency to go down *and* the voltage to go up. (https://www.intel.com/content/dam/www/public/us/en... However, it can only bring the voltage back down within 1ms. If you get a mix of AVX2 and regular instructions, like you do in the POV ray test, then it's going to be running on higher voltage the whole time. That probably explains why the Xeon 8176 drawed so much more power than the EPYC in your Energy consumption test.
The guys at cloudflare also observed a similar effect (although they only notice the performance degrade): https://blog.cloudflare.com/on-the-dangers-of-inte...
Kevin G - Tuesday, November 28, 2017 - link
In the HPC section, the article indicates that NAMD is faster on the Epyc system but the accompanying graphic points toward a draw with the Xeon Gold 6148 and a win for the Xeon Platinum 8160. Epyc does win a few benchmarks in the list prior to NAMD though.
Frank_han - Tuesday, November 28, 2017 - link
When you run those tests, have you bind CPU threads, how did you take care of different layers of numa domains.
UpSpin - Tuesday, November 28, 2017 - link
HIghly questionable article:
"A lot of time the software within a system will only vaguely know what system it is being run on, especially if that system is virtualised". Why do you say this if you publish HPC results? There the software knows exactly whay type of processor in what kind of configuration it is running.
"The second is the compiler situation: in each benchmark, Intel used the Intel compiler for Intel CPUs, but compiled the AMD code on GCC, LLVM and the Intel compiler, choosing the best result" More important, what type of math library did they use? The Intel MKL has an unmatched optimization, have they used the same for the AMD system?
"Firstly is that these are single node measurements: One 32-core EPYC vs 20/24-core Intel processors." Why don't you make it clear, that by doing this, the benchmark became useless!!! Performance doesn't scale linearly with core count: http://www.gromacs.org/@api/deki/files/240/=gromac...
So it makes a huge difference if I compare a simulation which runs on 32 cores with a simulation which runs on 20 cores. If I calculate the performance per core then, I always see that the lower core CPU is much much faster, because of scaling issues of the simulation software. You haven't disclosed how Intel got their 'relative performance' value.
Elstar - Tuesday, November 28, 2017 - link
Do we know for sure that the Omni-Path Skylake CPUs actually use PCIe internally for the fabric port? If you look at Intel's "ark" database, all of the "F" parts have one fewer UPI links, which seems weird.
HStewart - Tuesday, November 28, 2017 - link
I think this was a realistic article on analysis of the two systems. And it does point to important that Intel system is more mature system than AMD EPYC system. My personally feeling is that AMD is thrown together so that claim core count without realistically thinking about the designed.

But it does give Intel a good shot in ARM with completion and I expect Intel's next revision to have significantly leap in technology.

I did like the systems for similarly configured - as the cost, I build myself 10 years a dual Xeon 5160 that was about $8000 - but it was serious machine at the time and significantly faster than normal desktop and last much longer. It was also from Supermicro and find machine - for the longest time it was still faster than a lot machine you can get at BestBuy - it has Windows 10 on it now and still runs today - but I rarely used it because I like the portability of laptops
gescom - Tuesday, November 28, 2017 - link
https://www.servethehome.com/wp-content/uploads/20...

And suddenly - 8 core 6134 Skylake-SP - equals - 32 core Epyc 7601.
Amazing. Really amazing.
gescom - Tuesday, November 28, 2017 - link
Huh, I forgot - and that is Skylake at 130W vs Epyc at 180W.
ddriver - Tuesday, November 28, 2017 - link
Gromacs is a very narrow niche product and also very biased - they heavily optimize for intel and nvidia and push amd products to take an inefficient code path.
HStewart - Tuesday, November 28, 2017 - link
This is comparison with AVX2 / AVX512

AVX512 is twice as wide as AVX2 and significant more power than the AVX2 - so yes it very possible in this this test that CPU with 1/4 the normal CPU cores can have more power because AVX512.

Also I heard AMD's implementation of AVX2 is actually two 128 bits together - these results could show that is true.

Dissecting Intel's EPYC Benchmarks: Performance Through the Lens of Competitive Analysis

HPC Benchmarks

The Caveats

Post Your Comment

105 Comments

View All Comments

lefty2 - Tuesday, November 28, 2017 - link

Kevin G - Tuesday, November 28, 2017 - link

Frank_han - Tuesday, November 28, 2017 - link

UpSpin - Tuesday, November 28, 2017 - link

Elstar - Tuesday, November 28, 2017 - link

HStewart - Tuesday, November 28, 2017 - link

gescom - Tuesday, November 28, 2017 - link

gescom - Tuesday, November 28, 2017 - link

ddriver - Tuesday, November 28, 2017 - link

HStewart - Tuesday, November 28, 2017 - link

Log in

Don't have an account? Sign up now