The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster

Name: The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster
Item: The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster
Author: Andrei Frumusanu

by Andrei Frumusanu on December 18, 2020 6:00 AM EST

148 Comments | Add A Comment

148 Comments

Conclusion & End Remarks

The server landscape is changing very quickly. While the promise of Arm servers for many years has been just that – this year’s introduction of the Graviton2 marked the tipping point where Arm server chips no longer represented a niche use-case, but rather a real – and competitive option. The only problem with Graviton2 was that this was an internal Amazon-only solution – so you couldn’t really say it was an option against AMD or Intel.

That’s where Ampere Computing steps in, positioning themselves as an open merchant silicon vendor, and the first to use and deploy Arm’s new Neoverse CPU line-up in such a way. The Altra QuickSilver being the very first attempt at this, truly hits it out of the park and matches the high expectations of the silicon.

Ampere’s approach is significantly more aggressive, with more performance, and more power, than what the Graviton2 aimed for – the new 80-core Q80-33 flagship SKU essentially has managed to match the performance of AMD’s flagship Rome chip – the 64-core EPYC 7742. While personally that didn’t surprise me much, I could imagine that for many readers out there this to come as an unexpected turn of events.

The Altra Q80-33 sometimes beats the EPYC 7742, and loses out sometimes – depending the workload. The Altra’s strengths lie in compute-bound workloads where having 25% more cores is an advantage. The Neoverse-N1 cores clocked at 3.3GHz can more than match the per-core performance of Zen2 inside the EPYC CPUs.

There are still workloads in which the Altra doesn’t do as well – anything that puts higher cache pressure on the cores will heavily favours the EPYC as while 1MB per core L2 is nice to have, 32MB of L3 shared amongst 80 cores isn’t very much cache to go around. Generally, I think the mesh interconnect remains a weak-point for this generation of Neoverse products and there’s improvements to be done in the next iteration of designs.

Today we’ve tested the Wiwynn based “Mount Jade” 2S Ampere Altra server – the Altra’s support for dual-socket platforms is functional, but relying on CCIX instead of a native coherency protocol between CPU cores in the two sockets means that performance isn’t nearly as good as the scaling we see from AMD or Intel. The single-socket “Mount Snow” Altra platforms as well as the platform solutions from GIGABYTE might be a better option for some deployments.

In terms of power-efficiency, the Q80-33 really operates at the far end of the frequency/voltage curves at 3.3GHz. While the TDP of 250W really isn’t comparable to the figures of AMD and Intel are publishing, as average power consumption of the Altra in many workloads is well below that figure – ranging from 180 to 220W – let’s say a 200W median across a variety of workloads, with few workloads actually hitting that peak 250W. I would say that yes, the Altra does have a power efficiency advantage over AMD’s EPYC platform, but not something that is overly significant enough to say that it completely changes the landscape.

Ampere 1st Gen Altra 'QuickSilver' Product List
AnandTech	Cores	Frequency	TDP	PCIe	DDR4	Price
Q80-33 (Tested)	80	3.3 GHz	250 W	128x G4	8 x 3200	$4050
Q80-30	80	3.0 GHz	210 W	128x G4	8 x 3200	$3950
Q80-26	80	2.6 GHz	175 W	128x G4	8 x 3200	$3810
Q80-23	80	2.3 GHz	150 W	128x G4	8 x 3200	$3700
Q72-30	72	3.0 GHz	195 W	128x G4	8 x 3200	$3590
Q64-33	64	3.3 GHz	220 W	128x G4	8 x 3200	$3810
Q64-30	64	3.0 GHz	180 W	128x G4	8 x 3200	$3480
Q64-26	64	2.6 GHz	125 W	128x G4	8 x 3200	$3260
Q64-24	64	2.4 GHz	95 W	128x G4	8 x 3200	$3090
Q48-22	48	2.2 GHz	85 W	128x G4	8 x 3200	$2200
Q32-17	32	1.7 GHz	45 W	128x G4	8 x 3200	$800

Where Ampere and the Altra definitely is beating AMD in is TCO, or total cost of ownership. Taking the flagship models as comparison points – the Q80-33 costs only $4050 which generally matching the performance of AMD’s EPYC 7742 which still comes in at $6950, essentially 42% cheaper. Of course, performance/$ will vary depending on workloads, but the Altra’s performance is so good that I don’t think it really changes the narrative of that large a cost difference. We’re really on basing this on both companies’ MSRP prices and we know for a fact many customers will be paying less than that for volume purchases and relying on discounts, but that can also apply to Ampere and the Altra.

One will note I didn’t make any mention of Intel yet - Intel’s current Xeon offering simply isn’t competitive in any way or form at this moment in time. Cascade Lake is twice as slow and half as efficient – so unless Intel is giving away the chips at a fraction of a price, they really make no sense. Ice Lake-SP is around the corner, but I don’t expect it to manage to bridge the performance or efficiency gap. Ampere and AMD here have free reign on the server market share – with Ampere having to cross the hurdle to convince customers to switch over from x86 to Arm.

Ampere is already shipping Altra systems to customers, with Oracle’s cloud business being the first big notable win for the company – signifying already very positive reactions in the market.

What we need to keep in mind though, is that today’s comparisons were against AMD’s EPYC 7742 which was launched almost 15 months ago. Rome’s successor, Milan, is already shipping to customers and has already started hitting the channel, and we expect to hear more about the Zen3-based EPYC chips in the coming weeks. I’m not expecting major leaps, but a 20% performance bump is pretty much a safe bet to make – it would beat the Q80-33 in more workloads and shift the balance a bit – but Ampere’s aggressive pricing would still be something for AMD to worry about.

What really excites me, is the potential of future Altra designs. Ampere has already announced that Altra-Max “Mystique” will be coming in 2021 – essentially a 128-core version of the same Neoverse-N1 platform used in the QuickSilver design today. We’ll have to see how that scales, but it’ll certainly be a compute monster. The real big deal will be the 5nm 2022 “Siryn” design – if Ampere adopts the Neoverse-V1 CPU core from Arm, and I hope they will, then that would signify at minimum a +50% performance uplift, which is massive.

The Altra overall is an astounding achievement – the company has managed to meet, and maybe even surpass all expectations out of this first-generation design. With one fell swoop Ampere managed to position itself as a top competitor in the server CPU market. The Arm server dream is no longer a dream, it’s here today, and it’s real.

Compiling LLVM, NAMD Performance

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

148 Comments

View All Comments

mostlyfishy - Friday, December 18, 2020 - link
Interesting article thanks. One thing I missed, what process is this on? 7nm?

It's also interesting that the M1 has demonstrated that with the right sizings, a very wide backend can give you significant single threaded performance. Not really that useful for a server processor where you're likely to be running many threads and want to trade for more cores though.
Josh128 - Friday, December 18, 2020 - link
Yes, 7nm and monolithic, which seems fairly incredible as this thing is huge. Dont have the die size numbers though. Wonder what the yield is on these...
Calin - Friday, December 18, 2020 - link
Maybe there are quite a few more than 80 cores on this beast - in which case you can "eat" some die errors by deactivating cores/complexes/...
Wilco1 - Friday, December 18, 2020 - link
Each Neoverse N1 core with 1MB L2 is just 1.4mm^2, so 80 of them add up to 112mm^2. The die size is estimated at about 350mm^2, so tiny compared to the total ~1100mm^2 in EPYC 7742.

So performance/area is >3x that of EPYC. Now that is efficiency!
andrewaggb - Friday, December 18, 2020 - link
Timing of this article is awkward. We're comparing to the 18 month old 7742 vs the soon to be released Zen 3 Milan parts which based on the already launched Zen 3 desktop parts (and Milan leaks) will be 9-27% faster in the same power envelope.

Cache is a big part of the die size for the AMD chip and the N1 has much less of it which makes the die size smaller. AMD's Desktop IGP parts with way less cache perform very similarly in many workloads to those with the extra cache and the same has been true for intel parts over the years. Some workloads don't benefit much at all from the extra cache and some do which makes choosing the benchmarks more important.

That's not to say the N1 isn't more efficient, but rather that it's hard to make a fair comparison, particularly around die size. They may have similar core counts but have made very different design decisions around cache.
Wilco1 - Friday, December 18, 2020 - link
I don't see how it matters, but Altra is about 9 months old and Neoverse N1 is a sibling of Cortex-A76 which has been available in phones for 2 years. As for Milan, I expect the gain on SPECrate to be about 10-15%. And Milan will be competing with the Altra Max which has 60% more cores which should give ~40% speedup.

Yes the design decisions are quite different, and it is interesting that they end up with similar performance despite the disparity in L3 cache. I suspect that 8 memory channels is becoming a limit, and a future generation with DDR5 will enable more throughput (and even more cores).
Gondalf - Friday, December 18, 2020 - link
I am sorry but looking carefully the heatsink and the application of the thermal paste, we are facing a limit of the reticle thing on 7nm.
We are in front of a 700/800 mm2 thing. On 7nm this means very few units sold and nearly zero market penetration. Same thing on 5nm given the higher core numbers.

In pratics we have nothing in our hands. Another failure in Server market
Andrei Frumusanu - Friday, December 18, 2020 - link
Ampere is doing Altra Max with 128 cores still on 7nm, so this one certainly isn't near hitting reticle limits.
Wilco1 - Friday, December 18, 2020 - link
No it is not anywhere near the reticle limit. You can't estimate the die size from the heatsink, but you estimate it based on similar designs for which we do have numbers. Graviton 2 is a similar design at 30B transistors. This has another 16 cores which adds another 16X1.4 = 22.4mm^2. So around 350mm^2 in N7.
milli - Monday, December 21, 2020 - link
This is just a ridiculous statement. 350mm^2 ... no way.
Firstly, the die size of Graviton 2 is not known.
A realistic comparison would be AMD's Zen2 chiplet which has 3.9b transistors and is 72mm^2.
One would deduce from that, that Graviton 2 is > 550mm^2. Also your napkin calculation to add 22mm2 is flawed. Firstly, you don't know if a N1 core is actually taking 1.4mm^2 in this CPU. Secondly, you're forgetting to add 64 PCI-E lanes.
Let's say, 25mm2 for the CPU and 25mm2 for the lanes. That would bring the total to 600mm^2. Quite a bit bigger to your 350mm^2.

The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster

Conclusion & End Remarks

Post Your Comment

148 Comments

View All Comments

mostlyfishy - Friday, December 18, 2020 - link

Josh128 - Friday, December 18, 2020 - link

Calin - Friday, December 18, 2020 - link

Wilco1 - Friday, December 18, 2020 - link

andrewaggb - Friday, December 18, 2020 - link

Wilco1 - Friday, December 18, 2020 - link

Gondalf - Friday, December 18, 2020 - link

Andrei Frumusanu - Friday, December 18, 2020 - link

Wilco1 - Friday, December 18, 2020 - link

milli - Monday, December 21, 2020 - link

Log in

Don't have an account? Sign up now