The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster

Name: The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster
Item: The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster
Author: Andrei Frumusanu

by Andrei Frumusanu on December 18, 2020 6:00 AM EST

148 Comments | Add A Comment

148 Comments

SPEC - Multi-Threaded Performance

While the single-threaded numbers were interesting, what we’re all looking after are the multi-core scores and what exactly 80 Neoverse-N1 cores can achieve within a single, and two sockets.

The performance measurements here were limited to quadrant and NPS4 configurations as that’s actually the default settings in which the Altra system came in, and what also AMD usually says customers want to deploy into production, achieving better performance by reducing cross-chip memory traffic.

The main comparison point here against the Q80-33 is AMD’s EPYC 7742 – 80 cores versus 64 cores with SMT, as well as similar TDPs. Intel’s Xeon 8280 with its 28 cores also comes into play but isn’t a credible competitor against the newer generation silicon designs on 7nm.

I’m keeping the detailed result sets limited to single-socket figures – we’ll check out the dual-socket numbers later on in the aggregate chart – essentially the 2S figures are simply 2x the performance.

SPECint2017 Rate-N Estimated Scores (1 Socket)

Starting off with SPECint2017, we’re seeing some absolutely smashing figures here on the part of the Altra Q80-33, with several workloads where the chip significantly outperforms the EPYC 7742, but also losing in some other workloads.

Starting off with the losing workloads, gcc, mcf, and omnetpp, these are all workloads with either high cache pressure or high memory requirements.

The Altra losing out in 502.gcc_r doesn’t come as much of a surprise as we also saw the Graviton2 suffering in this workload – the 1MB per core of L2 as well as 400KB per core of shared L3 really isn’t much and pales against the 4MB/core that’s available to the EPYC’s Zen2 cores. The Altra going from 2.5GHz to 3.3GHz and 64 cores to 80 cores only improves the score from 176.9 to 186.1 in comparison to the Graviton2. I’m not including the Graviton2 in the charts as it’s not quite the apples-to-apples comparisons due to compiler and run environments, but one can look up the scores in that review.

Where the Altra does shine however is in more core-local workloads that are more compute oriented and have less of a memory footprint, of which we see quite a few here, such as 525.x264.

What’s really interesting here is that even though the latter tests in the suite are extremely friendly to SMT scaling on the x86 systems, with 531, 541, 548 and 557 scaling up with SMT threads in MT performance by respectively 30, 43, 25 and 36%, AMD’s Rome CPU still manages to lose to the Altra system by considerable amounts – only being slightly favoured in 557.xz_r by a slight margin – so while SMT helps, it’s not enough to counteract the raw 25% core count advantage of the Altra system when comparing 80 vs 64 cores.

SPECfp2017 Rate-N Estimated Scores

In SPECfp2017, things are also looking favourably for the Altra, although the differences aren’t as favourable except for 511.povray where the raw core count again comes into play.

The Altra again showcases really bad performance in 507.cactuBSSN_r, mirroring the lacklustre single-threaded scores and showing worse performance than a Graviton2 by considerable amounts.

The Arm design does well in 503.bwaves which is fairly high IPC as well as bandwidth hungry, however falls behind in other bandwidth hungry workloads such as 554.roms_r which has more sparse memory stores.

SPEC2017 Rate-N Estimated Total

In the overall scores, both across single-socket and dual-socket systems, the new Altra Q80-33 performs outstandingly well, actually edging out the EPYC system by a small margin in SPECint, though it’s losing out in SPECfp and more cache-heavy workloads.

Beyond testing 1-socket and 2-socket scores, I’ve also taken the opportunity of this new round of testing across the various systems to test out 1 thread per core and 2 thread per core scores across the SMT systems.

While there are definitely workloads that scale well with SMT, overall, the technology has a smaller impact on the suite, averaging out at 15% for both EPYC and Xeon.

One thing we don’t usually do in the way we run SPEC is mixing together rate figures with different thread counts, however with such large core counts and threads it’s something I didn’t want to leave out for this piece. The “mixT” result set takes the best performing sub-score of either the 1T or 2T/core runs for a higher overall aggregate. Usually officially submitted SPEC scores do this by default in their _peak submissions while we usually run _base comparative scores. Even with this best-case methodology for the SMT systems, the Altra system still slightly edges out the performance of the EPYC 7742.

Intel’s Cascade Lake Xeon system here really isn’t of any consideration in the competitive landscape as a single-socket Altra system will outperform a dual-socket Xeon.

The Altra QuickSilver still has one weakness and that’s cache-heavy workloads – 32MB of L3 for 80 cores really isn’t near enough to keep up performance scaling across that many cores. In the end of the day however, it’s up to Ampere’s customers to give input what kind of workloads they use and if they stress the caches or not – given that both Amazon and Ampere chose the minimum cache configuration for their mesh implementations, maybe that’s not the case?

SPEC2017 Rate-N Estimated Per-Thread Performance

Finally, one last figure I wanted to showcase is the per-thread performance of the different designs. While scaling out multi-threaded performance across vast number of cores is a very important way to scale performance, it’s also important to not take a flock of chickens approach with too weak cores. Especially for customers Ampere is targeting, such as enterprise and cloud service providers, many times users will be running things on a subset of a given processor socket cores, so per-core and per-thread performance remains a very important metric.

Simply dividing the single-socket performance figures by the amount of threads run, we get to an average per-thread performance figure in the context of a fully loaded system, a figure that’s actually more realistic than the single-thread figures of the previous page where the rest of the CPU cores in the systems are doing nothing.

In this regard, Intel’s Xeon offering is still extremely competitive and actually takes the lead position here – although its low core count doesn’t favour it at all in the full throughput metrics of the socket, the per-thread performance is still the best amongst the current CPU offerings out there.

In SPECint, the Altra, EPYC and Xeon are all essentially tied in performance, whilst in SPECfp the Xeon takes the lead with the Altra falling notably behind – with the EPYC Rome chip falling in-between the two.

If per-thread performance is important to you, then obviously enough SMT isn’t an option as this vastly regresses performance in favour of a chance to get more aggregate performance across multiple threads. There’s many vendors or enterprise use-cases which for this reason just outright disable SMT.

SPEC - Single-Threaded Performance SPECjbb MultiJVM - Java Performance

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

148 Comments

View All Comments

mode_13h - Thursday, December 31, 2020 - link
Isn't Blender included in SPECfp2017 as 526.blender_r? Or is that something different?
Teckk - Friday, December 18, 2020 - link
Whoever decided on naming these products — fantastic job. Simple, clear and effective.
Maybe you can offer some free advice to Intel and Sony.
Calin - Friday, December 18, 2020 - link
The answer to the question of "how powerful it is" is clear - more than good enough.
The real question in fact is:
"How much can they produce?"
AMD has the crown in x86 processor performance, but this doesn't really matter very much as long as they can build enough processors only for a part of the market.
jwittich - Friday, December 18, 2020 - link
How many do you need? :)
Bigos - Friday, December 18, 2020 - link
64kB pages might significantly enhance performance on workload with large memory sets, as the TLB will be up to 16x less used. On the other hand, memory usage of the Linux file system cache will also increase a lot.

Would you be able to test the effect of 64kB vs 4kB page size on at least some workloads?
Andrei Frumusanu - Friday, December 18, 2020 - link
It's something that I wanted to test but it requires a OS reinstall / kernel recompile - I didn't want to get into that rabbit hole of a time sink as already spent a lot of time verifying a lot of data across the three platforms over a few weeks already.
arnd - Friday, December 18, 2020 - link
I'd love to see that as well. For workloads that use transparent huge pages, there should not be much difference since both would use 2MB huge pages (512*4KB or 32*64KB), plus one or more even larger page sizes, but it needs to be measured to be

The downsides of 64KB requiring larger disk I/O and more RAM are often harder to quantify, as most benchmarks try to avoid the interesting cases.

I've tried benchmarking kernel compiles on Graviton2 both ways and found 64kB pages to be a few percent faster when there is enough RAM, but forcing the system to swap by limiting the total RAM made the 64kB kernel easily 10x to 1000x slower than the 4kB one, depending on the how the available memory lined up with the working set.
abufrejoval - Friday, December 18, 2020 - link
Thank you for the incredible amount of information and the work you put into this: Anandtech's best!

Yet I wonder who would deploy this and where. The purchasing price of the CPU would seem to become a rather miniscule part of the total system cost, especially once you go into big RAM territory. And I wonder if it's not similar with the energy budget: I see my larger systems requiring more $ and Watts on RAM than on the CPUs. Are they doing, can they do anything there to reduce DRAM energy consumption vs. Intel/AMD?

The cost of the ecosystem change to ARM may be less relevant once you have the scale to pay for it, but where exactly would those scale benefits come from? And what scales are we talking about? Would you need 100k or 1m servers to break even?

And what sort of system load would you have to reach/maintain to have significant energy advantages vs. x86 iron?

Do they support special tricks like powering down quadrants and RAM banks for load management, do they enable quick standby/actvation modes so that servers can be take off and on for load management?

And how long would the benefits last? AMD has demonstrated rather well, that the ability to execute over at least three generations of hardware are required to shift attention even from the big guys and they have still all the scaling benefits the x86 installed base provides.

These guys are on a 2nd generation product, promise 3rd but essentially this would seem to have the same level of confidence as the 1st EPIC.
askar - Friday, December 18, 2020 - link
Would you mind testing ML performance, i.e. python's SKLearn library classes that can be multithreaded (random forest for example)?
mode_13h - Sunday, December 20, 2020 - link
MLPerf?

The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster

SPEC - Multi-Threaded Performance

Post Your Comment

148 Comments

View All Comments

mode_13h - Thursday, December 31, 2020 - link

Teckk - Friday, December 18, 2020 - link

Calin - Friday, December 18, 2020 - link

jwittich - Friday, December 18, 2020 - link

Bigos - Friday, December 18, 2020 - link

Andrei Frumusanu - Friday, December 18, 2020 - link

arnd - Friday, December 18, 2020 - link

abufrejoval - Friday, December 18, 2020 - link

askar - Friday, December 18, 2020 - link

mode_13h - Sunday, December 20, 2020 - link

Log in

Don't have an account? Sign up now