AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balance

by Dr. Ian Cutress & Andrei Frumusanu on March 15, 2021 11:00 AM EST

Disclaimer June 25th: The benchmark figures in this review have been superseded by our second follow-up Milan review article, where we observe improved performance figures on a production platform compared to AMD’s reference system in this piece.

SPEC - Per-Core Win for "F"-Series 75F3

A metric that is more interesting than isolated single-thread performance is per-thread performance in a fully loaded system. This figure matters greatly to enterprises and customers running software that is licensed on a per-core basis, or workloads that must meet a per-thread performance service level agreement.

It’s precisely this market that AMD is targeting with its new “F”-series of processors, and this is where the new 75F3 comes into play. With 32 cores, 4 cores per chiplet with the full 256MB of L3 cache, and a base frequency of 2.95GHz boosting up to 4.0GHz at a default 280W TDP, the chip squeezes out the maximum per-core performance while still offering a massive amount of multi-threaded performance.

SPEC2017 Rate-N Estimated Per-Thread Performance (1S)

At full load, this ends up with a massive per-thread performance leadership on the part of the 75F3, landing 45% ahead of the 7763 and 51% ahead of the Intel Xeon 8280.

Note that limiting the thread count of the higher core-count SKUs also improves their per-thread performance. For example, running a 7713 with only 32 threads results in a SPECint2017 estimated score of 4.30. The 75F3 still has a 16% advantage there even though its peak boost clock is only 8.8% higher, meaning the 75F3 is achieving higher effective frequencies. Unfortunately, we didn’t have enough time to run the same experiment on the 7763, which has the same 280W TDP.
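As a rough check of the reasoning above, the quoted figures can be combined in a few lines. This is only a sketch using the numbers stated in the text (4.30, 16%, 8.8%); the helper names are illustrative, and attributing the whole remainder to effective frequency is an assumption, since cache capacity per active core may also contribute.

```python
# Figures quoted in the text above.
score_7713_32t = 4.30   # SPECint2017 estimated per-thread score, 7713 limited to 32 threads
advantage_75f3 = 0.16   # 75F3 per-thread lead over that configuration
peak_boost_gap = 0.088  # 75F3 peak boost clock lead over the 7713


def implied_score(baseline: float, advantage: float) -> float:
    """Score implied by a fractional advantage over a baseline."""
    return baseline * (1.0 + advantage)


def residual_gain(total: float, explained: float) -> float:
    """Fractional gain left after dividing out an already explained factor."""
    return (1.0 + total) / (1.0 + explained) - 1.0


score_75f3 = implied_score(score_7713_32t, advantage_75f3)
unexplained = residual_gain(advantage_75f3, peak_boost_gap)

print(f"Implied 75F3 per-thread score: {score_75f3:.2f}")
print(f"Gain beyond the peak boost gap: {unexplained:.1%}")
```

Under these figures, roughly 6 to 7 percentage points of the lead are not accounted for by the difference in peak boost clocks alone.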

AMD discloses that the biggest generational gains for the Milan stack are found in the lower core-count models, where for example the 7313 and the 7343 outperform the 7282 and 7302 by 25%. The reasons are that, for example, the new 7313 features double the L3 cache, and that all the new CPUs boost higher with respectively higher TDPs, increasing to 150/190W from 120/155W. They also land at 50% higher price points when comparing generation to generation.
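To put the 25% gain in context against the higher TDPs and prices, the ratios can be worked out directly. This is a sketch using only the figures quoted above; TDP is a rated limit rather than measured power, so the per-watt numbers are an approximation, and the helper name is illustrative.

```python
# Figures quoted in the text above; TDP is a rated limit, not measured power.
perf_gain = 0.25                      # new models vs. their predecessors
tdp_watts = [(120, 150), (155, 190)]  # (previous, new) TDP for the two pairs
price_gain = 0.50                     # generational price increase


def relative_change(gain: float, cost_growth: float) -> float:
    """Change in a gain-per-cost ratio when both gain and cost grow."""
    return (1.0 + gain) / (1.0 + cost_growth) - 1.0


perf_per_watt = [relative_change(perf_gain, new / old - 1.0) for old, new in tdp_watts]
perf_per_dollar = relative_change(perf_gain, price_gain)

for (old, new), change in zip(tdp_watts, perf_per_watt):
    print(f"{old}W -> {new}W: performance per TDP watt changes by {change:+.1%}")
print(f"Performance per dollar changes by {perf_per_dollar:+.1%}")
```

On these numbers, performance per TDP watt is roughly flat while performance per dollar drops by about a sixth.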

120 Comments

View All Comments

mkbosmans - Tuesday, March 23, 2021 - link
Even if you have a nice two-tiered approach implemented in your software, let's say MPI for the distributed memory parallelization on top of OpenMP for the shared memory parallelization, it often turns out to be faster to limit the shared memory threads to a single socket or NUMA domain. So in case of a 2P EPYC configured as NPS4 you would have 8 MPI ranks per compute node.

But of course there's plenty of software that has parallelization implemented using MPI only, so you would need a separate process for each core. This is often because of legacy reasons, with software that was originally targeting only a couple of cores. But with the MPI 3.0 shared memory extension, this can even today be a valid approach to great performing hybrid (shared/distributed mem) code.
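The rank count mentioned above (2 sockets at NPS4 giving 8 MPI ranks per node) can be sketched as follows. The 64 cores per socket and the contiguous core numbering per NUMA domain are assumptions for illustration only; on a real system the layout should be read from a tool such as hwloc or numactl rather than computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLayout:
    """Hypothetical description of one compute node (illustration only)."""
    sockets: int
    numa_per_socket: int   # the NPS setting, e.g. 4 for NPS4
    cores_per_socket: int  # assumed value, not taken from the comment

    @property
    def mpi_ranks(self) -> int:
        """One MPI rank per NUMA domain."""
        return self.sockets * self.numa_per_socket

    @property
    def threads_per_rank(self) -> int:
        """OpenMP threads per rank, one per core in the domain."""
        return self.cores_per_socket // self.numa_per_socket

    def core_ranges(self) -> list[range]:
        """Core IDs per rank, assuming contiguous numbering per domain."""
        n = self.threads_per_rank
        return [range(r * n, (r + 1) * n) for r in range(self.mpi_ranks)]


layout = NodeLayout(sockets=2, numa_per_socket=4, cores_per_socket=64)
print(f"{layout.mpi_ranks} MPI ranks x {layout.threads_per_rank} OpenMP threads")
for rank, cores in enumerate(layout.core_ranges()):
    print(f"rank {rank}: cores {cores.start}-{cores.stop - 1}")
```

For an MPI-only code the same node would instead run one rank per core, which is 128 ranks under the assumed core count.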
mode_13h - Tuesday, March 23, 2021 - link
Nice explanation. Thanks for following up!
Andrei Frumusanu - Saturday, March 20, 2021 - link
This is vastly incorrect and misleading.

The fact that I'm using a cache line spawned on a third main thread which does nothing with it is irrelevant to the real-world comparison because from the hardware perspective the CPU doesn't know which thread owns it - in the test the hardware just sees two cores using that cache line, the third main thread becomes completely irrelevant in the discussion.

The thing that is guaranteed with the main starter thread allocating the synchronisation cache line is that it remains static across the measurements. One doesn't actually have control where this cache line ends up within the coherent domain of the whole CPU, it's going to end up in a specific L3 cache slice dependent on the CPU's address hash positioning. The method here simply maintains that positioning to be always the same.

There is no such thing as core-core latency because cores do not snoop each other directly, they go over the coherency domain which is the L3 or the interconnect. It's always core-to-cacheline-to-core, as anything else doesn't even exist from the hardware perspective.
mkbosmans - Saturday, March 20, 2021 - link
The original thread may have nothing to do with it, but the NUMA domain where the cache line was originally allocated certainly does. How would you otherwise explain the difference between the first quadrant for socket 1 to socket 1 communication and the fourth quadrant for socket 2 to socket 2 communication?

Your explanation about address hashing to determine the L3 cache slice may make sense when talking about fixing the initial thread within a L3 domain, but not why you want that L3 domain fixed to the first one in the system, regardless of the placement of the two threads doing the ping-ponging.

And about core-core latency, you are of course right, that is sloppy wording on my part. What I meant to convey is that roundtrip latency between core-cacheline-core and back is more relevant (at least for HPC applications) when the cacheline is local to one of the cores and not remote, possibly even on another socket than the two threads.
Andrei Frumusanu - Saturday, March 20, 2021 - link
I don't get your point - don't look at the intra-remote socket figures then if that doesn't interest you - these systems are still able to work in a single NUMA node across both sockets, so it's still pretty valid in terms of how things work.

I'm not fixing it to a given L3 in the system (except for that socket), binding a thread doesn't tell the hardware to somehow stick that cacheline there forever, software has zero say in that. As you see in the results it's able to move around between the different L3's and CCXs. Intel moves it (or mirrors it) around between sockets and NUMA domains, so your premise there also isn't correct in that case. AMD currently can't, probably because they don't have a way to decide most recent ownership between two remote CCXs.

People may want to just look at the local socket numbers if they prioritise that. The test method here merely exposes further, more complicated scenarios which I find interesting as they showcase fundamental cache coherency differences between the platforms.
mkbosmans - Tuesday, March 23, 2021 - link
For a quick overview of how cores are related to each other (with an allocation local to one of the cores), I like this way of visualizing it more:
http://bosmans.ch/share/naples-core-latency.png
Here you can for example clearly see how the four dies of the two sockets are connected pairwise.

The plots from the article are interesting in that they show the vast difference between the cc protocols of AMD and Intel. And the numbers from the Naples plot I've linked can be mostly gotten from the more elaborate plots from the article, although it is not entirely clear to me how to exactly extend the data to form my style of plots. That's why I prefer to measure the data I'm interested in directly and plot that.
imaskar - Monday, March 29, 2021 - link
Looking at the shares sinking, this pricing was a miss...
mode_13h - Tuesday, March 30, 2021 - link
Prices are a lot easier to lower than to raise. And as long as they can sell all their production allocation, the price won't have been too high.
Zone98 - Friday, April 23, 2021 - link
Great work! However I'm not getting why in the c2c matrix cores 62 and 74 wouldn't have a ~90ns latency as in the NW socket. Could you clarify how the test works?
node55 - Tuesday, April 27, 2021 - link
Why are the cpus not consistent?

Why do you switch between 7713 and 7763 on Milan and 7662 and 7742 on Rome?

Why do you not have results for all the server CPUs? This confuses the comparison of e.g. 7662 vs 7713. (My current buying decision)