AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balance

Name: AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balance
Item: AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balance

by Dr. Ian Cutress & Andrei Frumusanu on March 15, 2021 11:00 AM EST

120 Comments | Add A Comment

120 Comments

Disclaimer June 25^th: The benchmark figures in this review have been superseded by our second follow-up Milan review article, where we observe improved performance figures on a production platform compared to AMD’s reference system in this piece.

Compiling LLVM, NAMD Performance

As we’re trying to rebuild our server test suite piece by piece – and there’s still a lot of work go ahead to get a good representative “real world” set of workloads, one more highly desired benchmark amongst readers was a more realistic compilation suite. Chrome and LLVM codebases being the most requested, I landed on LLVM as it’s fairly easy to set up and straightforward.

git clone https://github.com/llvm/llvm-project.git

cd llvm-project

git checkout release/11.x

mkdir ./build

cd ..

mkdir llvm-project-tmpfs

sudo mount -t tmpfs -o size=10G,mode=1777 tmpfs ./llvm-project-tmpfs

cp -r llvm-project/* llvm-project-tmpfs

cd ./llvm-project-tmpfs/build

cmake -G Ninja \

-DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi;lldb;compiler-rt;lld" \

-DCMAKE_BUILD_TYPE=Release ../llvm

time cmake --build .

We’re using the LLVM 11.0.0 release as the build target version, and we’re compiling Clang, libc++abi, LLDB, Compiler-RT and LLD using GCC 10.2 (self-compiled). To avoid any concerns about I/O we’re building things on a ramdisk. We’re measuring the actual build time and don’t include the configuration phase as usually in the real world that doesn’t happen repeatedly.

LLVM Suite Compile Time

For the new Milan chips, the results are a bit mixed. The higher-power 7763 takes a lead with a +10.5% improvement over the 7742, however the 7713 doesn’t manage to keep up with that predecessor.

The 1S vs 2S scores are interesting as the 2S figures showcase the new Milan chips in a better light due to the higher single-threaded performance of the Zen3 cores. The compilation here also has linking phases which are single-thread performance bottle-necked. This results in scenarios such as the 7713 losing to the 7662 in 1S comparisons, however winning out against the same chip in the 2S comparison, as it’s able to make that advantage count for more.

It’s also great to see the 75F3 keep up with the 64-core counterparts at around 72% of the top-SKU performance.

NAMD (Git-2020-12-09) - Apolipoprotein A1

Finally, in NAMD, this is more of a core-local compute workload. We see the 7763 outperform the 7742 by +11.8%, however the Milan chip is still outperformed by the higher core compute capacity of the 80-core Altra chip.

Generally, I have my reservations about NAMD as a benchmark due to its multicore vs MPI variants and scaling anomalies, on top of the whole topic of the benchmark having a completely different algorithm for AVX512 processors.

SPECjbb MultiJVM - Java Performance Conclusion & End Remarks

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

120 Comments

View All Comments

aryonoco - Tuesday, March 16, 2021 - link
Thanks for the excellent article Andrei and Ian. Really appreciate your work.

Just wondering, is Johan no longer inlvolved in server reviews? I'll really miss him.
Andrei Frumusanu - Saturday, March 20, 2021 - link
Johan is no longer part of AT.
SanX - Tuesday, March 16, 2021 - link
In summary, the difference in performance 9 vs 8 for (Milan vs Rome) means they are EQUAL. Not a single specific application which shows more than that. So much for the many months of hype and blahblah.
tyger11 - Tuesday, March 16, 2021 - link
Okay, now give us the new Zen 3 Threadripper Pro!
AusMatt - Wednesday, March 17, 2021 - link
Page 4 text: "a 255 x 255 matrix" should read: "a 256 x 256 matrix".
hmw - Friday, March 19, 2021 - link
What was the stepping for the Milan CPUs? B0? or B1?
mkbosmans - Saturday, March 20, 2021 - link
These inter-core synchronisation latency plots are slightly misleading, or at least not representative of "real software". By fixing the cache line that is used to the first core in the system and then ping-ponging it between to other cores you do not measure core-core latency, but rather core-to-cacheline-to-core, as expressed in the article. This is not how inter-thread communication usually works (in well-designed software).
Allocating the cache line on the memory local to one of the ping-pong threads would make the plot more informative (although a bit more boring).
mode_13h - Saturday, March 20, 2021 - link
Are you saying a single memory address is used for all combinations of core x core?

Ultimately, I wonder if it makes any difference which NUMA domain the address is in, for a benchmark like this. Once it's in L1 cache, that's what you're measuring, no matter the physical memory address.

Also, I take issue with the suggestion that core-to-core communication necessarily involves memory in one of the core's NUMA domains. A lot of cases where real-world software is impacted by core-to-core latency involves global mutexes and atomic counters that won't necesarily be local to either core.
mkbosmans - Saturday, March 20, 2021 - link
Yes, otherwise the SE quadrant (socket 2 to socket 2 communication) would look identical to the NW quadrant, right?

It does matter on which NUMA node the address is in, this is exactly what is addressed later in the article about Xeon having a better cache coherency protocol where this is less of an issue.

From the software side, I was more thinking of HPC applications where a situation of threads exchanging data that is owned by one of them is the norm, e.g. using OpenMP or MPI. That is indeed a different situation from contention on global mutexes.
mode_13h - Saturday, March 20, 2021 - link
How often is MPI used for communication *within* a shared-memory domain? I tend to think of it almost exclusively as a solution for inter-node communication.

AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balance

Compiling LLVM, NAMD Performance

Post Your Comment

120 Comments

View All Comments

aryonoco - Tuesday, March 16, 2021 - link

Andrei Frumusanu - Saturday, March 20, 2021 - link

SanX - Tuesday, March 16, 2021 - link

tyger11 - Tuesday, March 16, 2021 - link

AusMatt - Wednesday, March 17, 2021 - link

hmw - Friday, March 19, 2021 - link

mkbosmans - Saturday, March 20, 2021 - link

mode_13h - Saturday, March 20, 2021 - link

mkbosmans - Saturday, March 20, 2021 - link

mode_13h - Saturday, March 20, 2021 - link

Log in

Don't have an account? Sign up now