The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster
by Andrei Frumusanu on December 18, 2020 6:00 AM EST- Posted in
- Servers
- Neoverse N1
- Ampere
- Altra
Conclusion & End Remarks
The server landscape is changing very quickly. While the promise of Arm servers for many years has been just that – this year’s introduction of the Graviton2 marked the tipping point where Arm server chips no longer represented a niche use-case, but rather a real – and competitive option. The only problem with Graviton2 was that this was an internal Amazon-only solution – so you couldn’t really say it was an option against AMD or Intel.
That’s where Ampere Computing steps in, positioning themselves as an open merchant silicon vendor, and the first to use and deploy Arm’s new Neoverse CPU line-up in such a way. The Altra QuickSilver being the very first attempt at this, truly hits it out of the park and matches the high expectations of the silicon.
Ampere’s approach is significantly more aggressive, with more performance, and more power, than what the Graviton2 aimed for – the new 80-core Q80-33 flagship SKU essentially has managed to match the performance of AMD’s flagship Rome chip – the 64-core EPYC 7742. While personally that didn’t surprise me much, I could imagine that for many readers out there this to come as an unexpected turn of events.
The Altra Q80-33 sometimes beats the EPYC 7742, and loses out sometimes – depending the workload. The Altra’s strengths lie in compute-bound workloads where having 25% more cores is an advantage. The Neoverse-N1 cores clocked at 3.3GHz can more than match the per-core performance of Zen2 inside the EPYC CPUs.
There are still workloads in which the Altra doesn’t do as well – anything that puts higher cache pressure on the cores will heavily favours the EPYC as while 1MB per core L2 is nice to have, 32MB of L3 shared amongst 80 cores isn’t very much cache to go around. Generally, I think the mesh interconnect remains a weak-point for this generation of Neoverse products and there’s improvements to be done in the next iteration of designs.
Today we’ve tested the Wiwynn based “Mount Jade” 2S Ampere Altra server – the Altra’s support for dual-socket platforms is functional, but relying on CCIX instead of a native coherency protocol between CPU cores in the two sockets means that performance isn’t nearly as good as the scaling we see from AMD or Intel. The single-socket “Mount Snow” Altra platforms as well as the platform solutions from GIGABYTE might be a better option for some deployments.
In terms of power-efficiency, the Q80-33 really operates at the far end of the frequency/voltage curves at 3.3GHz. While the TDP of 250W really isn’t comparable to the figures of AMD and Intel are publishing, as average power consumption of the Altra in many workloads is well below that figure – ranging from 180 to 220W – let’s say a 200W median across a variety of workloads, with few workloads actually hitting that peak 250W. I would say that yes, the Altra does have a power efficiency advantage over AMD’s EPYC platform, but not something that is overly significant enough to say that it completely changes the landscape.
Ampere 1st Gen Altra 'QuickSilver' Product List |
||||||
AnandTech | Cores | Frequency | TDP | PCIe | DDR4 | Price |
Q80-33 (Tested) |
80 | 3.3 GHz | 250 W | 128x G4 | 8 x 3200 | $4050 |
Q80-30 | 80 | 3.0 GHz | 210 W | 128x G4 | 8 x 3200 | $3950 |
Q80-26 | 80 | 2.6 GHz | 175 W | 128x G4 | 8 x 3200 | $3810 |
Q80-23 | 80 | 2.3 GHz | 150 W | 128x G4 | 8 x 3200 | $3700 |
Q72-30 | 72 | 3.0 GHz | 195 W | 128x G4 | 8 x 3200 | $3590 |
Q64-33 | 64 | 3.3 GHz | 220 W | 128x G4 | 8 x 3200 | $3810 |
Q64-30 | 64 | 3.0 GHz | 180 W | 128x G4 | 8 x 3200 | $3480 |
Q64-26 | 64 | 2.6 GHz | 125 W | 128x G4 | 8 x 3200 | $3260 |
Q64-24 | 64 | 2.4 GHz | 95 W | 128x G4 | 8 x 3200 | $3090 |
Q48-22 | 48 | 2.2 GHz | 85 W | 128x G4 | 8 x 3200 | $2200 |
Q32-17 | 32 | 1.7 GHz | 45 W | 128x G4 | 8 x 3200 | $800 |
Where Ampere and the Altra definitely is beating AMD in is TCO, or total cost of ownership. Taking the flagship models as comparison points – the Q80-33 costs only $4050 which generally matching the performance of AMD’s EPYC 7742 which still comes in at $6950, essentially 42% cheaper. Of course, performance/$ will vary depending on workloads, but the Altra’s performance is so good that I don’t think it really changes the narrative of that large a cost difference. We’re really on basing this on both companies’ MSRP prices and we know for a fact many customers will be paying less than that for volume purchases and relying on discounts, but that can also apply to Ampere and the Altra.
One will note I didn’t make any mention of Intel yet - Intel’s current Xeon offering simply isn’t competitive in any way or form at this moment in time. Cascade Lake is twice as slow and half as efficient – so unless Intel is giving away the chips at a fraction of a price, they really make no sense. Ice Lake-SP is around the corner, but I don’t expect it to manage to bridge the performance or efficiency gap. Ampere and AMD here have free reign on the server market share – with Ampere having to cross the hurdle to convince customers to switch over from x86 to Arm.
Ampere is already shipping Altra systems to customers, with Oracle’s cloud business being the first big notable win for the company – signifying already very positive reactions in the market.
What we need to keep in mind though, is that today’s comparisons were against AMD’s EPYC 7742 which was launched almost 15 months ago. Rome’s successor, Milan, is already shipping to customers and has already started hitting the channel, and we expect to hear more about the Zen3-based EPYC chips in the coming weeks. I’m not expecting major leaps, but a 20% performance bump is pretty much a safe bet to make – it would beat the Q80-33 in more workloads and shift the balance a bit – but Ampere’s aggressive pricing would still be something for AMD to worry about.
What really excites me, is the potential of future Altra designs. Ampere has already announced that Altra-Max “Mystique” will be coming in 2021 – essentially a 128-core version of the same Neoverse-N1 platform used in the QuickSilver design today. We’ll have to see how that scales, but it’ll certainly be a compute monster. The real big deal will be the 5nm 2022 “Siryn” design – if Ampere adopts the Neoverse-V1 CPU core from Arm, and I hope they will, then that would signify at minimum a +50% performance uplift, which is massive.
The Altra overall is an astounding achievement – the company has managed to meet, and maybe even surpass all expectations out of this first-generation design. With one fell swoop Ampere managed to position itself as a top competitor in the server CPU market. The Arm server dream is no longer a dream, it’s here today, and it’s real.
148 Comments
View All Comments
Wilco1 - Monday, December 21, 2020 - link
Why would they introduce Graviton if it would run at a loss??? A significant percentage of AWS is already Graviton (probably 20% by now). If anything Graviton increases profitability due to vertical integration and other cost reduction.mode_13h - Monday, December 21, 2020 - link
First, there's a fundamental disparity between an in-house CPU and a 3rd Party one, where Amazon can cut out some overheads by building their own. So, that already skews the price-comparison.The other question is whether Amazon is partially-subsidizing the price of their Graviton2 instances as an incentive to get more people to switch. For a business, the least risky thing is to stay on x86, so Amazon needs to present an immediate and significant cost savings to get people to switch. After they've switched and ARM server cores have had more time to mature, Amazon can charge more and make back a good return on investment.
I obviously don't know if that's what they're doing, but we don't know that it's not. So, you really can't read much into their current pricing. That's all I'm saying.
mode_13h - Sunday, December 20, 2020 - link
Finally, I guess you missed this part, in the discussion of SPECjbb:> One thing that did come to mind immediately when I saw the results was SMT.
> Due to this being a transactional data-plane resident type of workload,
> SMT will undoubtedly help a lot in terms of performance,
> so I tested out the EPYC chip figures with SMT disabled,
> and indeed max-jOPS went down to 209.5k for the 2S THP enabled results,
> meaning that SMT accounts for a 29.7% performance benefit in this benchmark.
...
> It’s generally these kinds of workloads that SMT works best on,
> and that’s why IBM can deploy SMT4 or SMT8 processors,
> and the type of workloads Marvell’s ThunderX was trying to carve a niche or itself with SMT4.
mode_13h - Sunday, December 20, 2020 - link
As the article mentions, Marvell’s ThunderX did support SMT on ARMv8-A.Were SMT's reputation not bruised by all the recent side-channel exploits, perhaps it would be showing up in some of ARM's own cores. Maybe their V-series will get it, since that's a much larger core.
Wilco1 - Monday, December 21, 2020 - link
ThunderX2/X3 and Neoverse E1 have SMT, but neither has been hugely successful. SMT doesn't provide a significant benefit across a wide range of workloads, so adding another core remains simpler and cheaper. And yes, security is another nail in the coffin.EthiaW - Saturday, December 19, 2020 - link
The performance of Graviton2 meets our expectation for Neoverse N1 (or Cortex A76) better. How can Q80 manage to deliver so much higher IPC with the same architecture? Incredible.Brutalizer - Saturday, December 19, 2020 - link
One old Oracle SPARC T8 cpu does 153.500 Java max-JOPS SPECjbb2015. And the crit-JOPS value is 90.000. Easily smashing all cpus here.https://blogs.oracle.com/bestperf/specjbb2015:-spa...
satai - Saturday, December 19, 2020 - link
Benchmarked by Oracle... Definitely trustworthy.zepi - Saturday, December 19, 2020 - link
SPECJBB graphs kill me.For the love of god, please keep the axis scaling identical!
Same applies to every single metric always. If you provide separate graphs for different products, please make sure that axis-scaling is the same in all images!
Andrei Frumusanu - Sunday, December 20, 2020 - link
The graphs are generated by the benchmark itself.