Disclaimer June 25th: The benchmark figures in this review have been superseded by our second follow-up Milan review article, where we observe improved performance figures on a production platform compared to AMD’s reference system in this piece.

SPEC - Per-Core Win for "F"-Series 75F3

A metric that is actually more interesting than isolated single-thread performance, is actually per-thread performance in a fully loaded system. This actually is a measurement and benchmark figure that would greatly interest enterprises and customers which are running software or workloads that are possibly licensed on a per-core basis, or simply workloads that require a certain level of per-thread service level agreement in terms of performance.

It’s precisely this market that AMD is trying to target with its new “F”-series of processors, and this is where the new 75F3 comes into play. With 32 cores, 4 cores per chiplet with the full 256MB of L3 cache, and a base frequency of 2.95GHz, boosting up to 4.0GHz at a default 280W TDP, is the chip is squeezing out the maximum per-core performance while still offering a massive amount of multi-threaded performance.

SPEC2017 Rate-N Estimated Per-Thread Performance (1S) 

At full load, this ends up with a massive per-thread performance leadership on the part of the 75F3, landing 45% ahead of the 7763 and 51% ahead of the Intel Xeon 8280.

It’s to be noted that limiting the thread count of the higher core-count SKUs will also result in a better per-thread performance metric, for example running a 7713 with only 32 threads will result in a SPECint2017 estimated score of 4.30 – the 75F3 still has a 16% advantage there even though its boost clock is only 8.8% higher at the peak – meaning the 75F3 is achieving higher effective frequencies. Unfortunately, we didn’t have enough time to do the same experiment on the equal 280W 7763 part.

AMD discloses that the biggest generational gains for the Milan stack is found in the lower core-count models, where for example the 7313 and the 7343 outperforms the 7282 and 7302 by 25%. Reason for this is that for example the new 7313 features double the L3 cache, and all the new CPUs are boosting higher with respectively higher TDPs, increasing to 150/190W from 120/155W, as well as landing in at +50% higher price points when comparing generation to generation.

SPEC - Single-Threaded Performance SPECjbb MultiJVM - Java Performance
Comments Locked

120 Comments

View All Comments

  • lejeczek - Monday, March 15, 2021 - link

    But those Altra Q80-33 ... gee guys. I have been thinking for a while - next upgrade of the stack in the rack might as well be...
  • mode_13h - Monday, March 15, 2021 - link

    Well, if it does well on the benchmarks that align with your workload, then I'd certainly consider at least a single-CPU Altra. IIRC, the multi-CPU interconnect was one of its weak points. You could even go dual-CPU, if you're provisioning VMs that fit on a single CPU (or better yet, just one quadrant).
  • Pinn - Monday, March 15, 2021 - link

    When does this filter to the Threadrippers?
  • mode_13h - Monday, March 15, 2021 - link

    Probably either when demand for the 3000-series Threadrippers starts slipping or if/when the supply of top-binned Zen3 dies ever catches up.

    It would be interesting to see what performance could be extracted from these CPUs, if AMD would raise the power/thermal limit another 100 W. Maybe the 5000-series TR Pro will be our chance to find out!
  • mode_13h - Monday, March 15, 2021 - link

    Someone please remind me why Altra's memory performance is so much stronger. Is it simply down to avoiding the cache write-miss penalty? I'm pretty sure x86 CPUs long-ago added store buffers to fix that, but I can't think of any other explanation for that incredible stream benchmark discrepancy!
  • Andrei Frumusanu - Monday, March 15, 2021 - link

    It's due to the Neoverse N1 cores being able to dynamically transform arbitrary memory writes into non-temporal write streams instead of doing regular RFO before a write as the x86 systems are currently doing. I explain it more in the Altra review:

    https://www.anandtech.com/show/16315/the-ampere-al...
  • mode_13h - Monday, March 15, 2021 - link

    That's more or less what I recall, but do you know it's *truly* emitting non-temporal stores? Those partially-bypass some or all of the cache hierarchy (I seem to recall that the Pentium 4 actually just restricted them to one set of L2 cache). It would seem to me that implausibly deep analysis would be needed for the CPU to determine that the core in question wouldn't access the data before it was replaced. And that's not even to speak of determining whether code running on *other* cores might need it.

    On the other hand, if it simply has enough write-buffering, it could avoid fetching the target cacheline by accumulating enough adjacent stores to determine that the entire cacheline would be overwritten. Of course, the downside would be a tiny bit more write latency, and memory-ordering constraints (esp. for x86) might mean that it'd only work for groups of consecutive stores to consecutive addresses.

    I guess a way to eliminate some of those restrictions would be to observe through analysis of the instruction stream that a group of stores would overwrite the cacheline and then issue an allocation instead of a fetch. Maybe that's what Altra is doing?
  • Andrei Frumusanu - Tuesday, March 16, 2021 - link

    You're over-complicating things. The core simply sees a stream pattern and switches over to nontemporal writes. They can fully saturate the memory controller when doing just pure write patterns.
  • mode_13h - Wednesday, March 17, 2021 - link

    But, do you know they're truly non-temporal writes? As I've tried to explain, there are ways to avoid the write-miss penalty without using true non-temporal writes.

    And how much of that are you inferring vs. basing this on what you've been told from official or unofficial sources?
  • Andrei Frumusanu - Saturday, March 20, 2021 - link

    It's 100% non-temporal writes, confirmed by both hardware tests and architects.

Log in

Don't have an account? Sign up now