Performance Targets: What Are The Numbers?

Naturally, all this talk about performance and efficiency needs to be substantiated with some concrete numbers. In the context of today's announcement, most performance figures disclosed by Arm were relative improvements compared to the A72 Cosmos platform, which might not be the most relevant data-point when trying to actually place the N1 in the competitive landscape. However, Arm also shared some more concrete absolute figures, which we'll try to put into context shortly.

Compared to an A72 running at the same frequency in a similarly configured system with an SLC, the new N1 outright smashes its predecessor platform and microarchitecture. The figures here represent single-threaded performance in SPEC. In integer workloads we see PPC (performance per clock) and absolute performance gains of 60 to 70%. The floating-point benchmarks are even more impressive, with gains ranging from 100 to 120%. These data-points represent modelled and emulated performance estimates; the actual real-life performance improvements will be higher due to other SoC-level improvements as well as software improvements that aren't available in existing A72 silicon products.

Arm again reiterates the very large compute performance improvement compared to existing solutions, achieving beyond 2x performance boosts in vector workloads. Naturally, the N1's ARMv8.2 ISA implementation also means that it supports 8-bit dot product as well as FP16 half-precision instructions, which are particularly well suited for machine-learning workloads, achieving performance boosts of near 5x over the predecessor platform.
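For readers curious what those instructions look like from a software perspective, below is a minimal sketch of our own (not anything Arm showed) using the 8-bit dot-product extension through the standard Arm C Language Extensions intrinsics in GCC; the build line in the comment is an assumption for illustration.

// Minimal sketch: 8-bit dot product via the Armv8.2 dot-product extension.
// Assumed build line: gcc -O2 -march=armv8.2-a+dotprod dot.c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

// Accumulates the dot product of two 16-byte u8 vectors and sums the lanes.
static uint32_t dot_u8(const uint8_t *a, const uint8_t *b)
{
    uint8x16_t va  = vld1q_u8(a);              // load 16 x uint8
    uint8x16_t vb  = vld1q_u8(b);
    uint32x4_t acc = vdupq_n_u32(0);           // zeroed accumulators
    acc = vdotq_u32(acc, va, vb);              // UDOT: 4 groups of 4 MACs each
    return vaddvq_u32(acc);                    // horizontal sum of the 4 lanes
}

int main(void)
{
    uint8_t a[16], b[16];
    for (int i = 0; i < 16; i++) { a[i] = (uint8_t)i; b[i] = 2; }
    printf("dot = %u\n", dot_u8(a, b));        // expect 2 * (0+1+...+15) = 240
    return 0;
}

A single UDOT instruction here does 16 multiply-accumulates, which is where the quoted machine-learning gains over the A72 largely come from.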

Overall, Arm's comparison to the A72 makes sense in the context that it is the N1's predecessor. However, we have to keep in mind that the Cortex A72 is a core that was first introduced back in 2015, with the first silicon products released late that year and into 2016, while the new Neoverse N1 in all likelihood isn't something we'll be seeing in products for another 12-18 months, resulting in a ~3-4 year time span between the two products.

Arm did also divulge absolute SPEC numbers, and here we can do some more interesting analysis against competing platforms:

For a Neoverse N1 64-core hyperscale reference design running at about 2.6GHz, Arm proclaims a single-threaded SPECint2006 Speed score of ~37, while reaching an estimated multi-threaded score of 1310. The figures here are achieved at a quite low whole-server TDP of only 105W. The figures weren't run on actual silicon, but rather estimated on Arm's server farm in an RTL emulation environment.

Arm made a big note that its many efforts to improve performance for the Arm ecosystem aren't limited to offering better hardware, but also include better software. Over the last few years Arm has put a lot of effort into improving open-source tools and compilers, such as GCC. Comparing the latest GCC9 version to GCC5, we're seeing improvements of 13 to 15% across integer and floating-point workloads. It's to be noted that the optimisations made here are real-world use-case improvements, and not targeted changes that are meant to improve SPEC scores.

In order to put context around Arm's numbers, I went ahead and compiled a set of binaries with GCC8 and had Ian run them on Intel's and AMD's latest and greatest, a Xeon W-3175X as well as an AMD Epyc 7601. It's to be noted that the compiler flags weren't exactly the same: both AnandTech's and Arm's builds were running under -Ofast, however Arm also added some minor flags which I hadn't had the chance and time to cross-check, as well as enabling LTO. I'm not too concerned about the flag variations, however LTO will give Arm a 2-3% performance advantage over our internal numbers. It's also to be noted that Arm's single-threaded figures are marked as "Peak" scores, meaning each individual workload was run with the best performing compiler flags, while our internal figures are "Base" scores, meaning we're running the same flags across all binaries and tests.

Edit: 25/02/2019: Arm have reached out to clarify that the performance scores were in fact Base runs and without LTO - the slides in question were mixing things up. Thus we have proper apples-to-apples comparisons in our numbers versus Arm's internal numbers.
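For anyone wanting to sanity-check this kind of flag parity themselves, the small sketch below queries a few of the standard GCC/ACLE predefined macros that reflect how a binary was built; the build line in the comment is our own assumed example, not the actual command line used by us or by Arm for the SPEC runs.

// Sanity-check sketch: report which compiler settings a SPEC-style binary was
// built with, so "Base vs Base" comparisons really are apples-to-apples.
// Assumed build line: gcc -Ofast -march=armv8.2-a+dotprod flags.c
#include <stdio.h>

int main(void)
{
    printf("compiler: %s\n", __VERSION__);        // GCC version string, e.g. "8.2.0"
#ifdef __OPTIMIZE__
    printf("optimisation: enabled\n");            // defined by GCC for -O1 and above
#endif
#ifdef __FAST_MATH__
    printf("fast-math: enabled\n");               // implied by -Ofast
#endif
#ifdef __ARM_FEATURE_DOTPROD
    printf("dot-product extension: enabled\n");   // Armv8.2 +dotprod target
#endif
    return 0;
}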

As always we have to disclose that the below figures are merely internal estimates as they’re not official SPEC submissions. SPEC CPU 2006 has also been deprecated in favour of SPEC CPU 2017. Arm stated that they’ve shared SPEC CPU 2006 figures as that’s still the industry standard at the moment which gives users and customers the best context, and in the coming year or so they’ll switch over to also sharing SPEC CPU 2017 numbers. As for us at AnandTech, I’ve prepared SPEC CPU 2017 and Ian and I will be adopting it in our benchmark suites for PC/server CPUs as well as mobile SoCs in the coming weeks and months.

SPECint2006 Speed Base - Estimates (GCC8) 

In terms of single-threaded performance, the N1 looks outright outstanding. With an estimated score of 37, it would beat the most recent and best-performing Arm server CPU, Cavium's ThunderX2, by a significant margin. It's to be noted that the real-world performance difference would be smaller than depicted in the above figures: GCC8 notably improved loop vectorisation in 456.hmmer, which will give it a 1-2% overall score boost, and of course we have to take into account a 2-3% difference due to Arm's different compiler flags.

Intel's W-3175X is hardly the most representative hyperscaler CPU; however, it gives context as to what Intel's top-end single-threaded performance is in their best multi-core CPUs. As a reminder, the W-3175X has a single-threaded boost clock of 4.5GHz, significantly above what we see in server SKUs such as the Xeon 8180. AMD's Epyc 7601 is a more representative data-point for what a hyperscale design such as the N1 would compete against; again as a reminder, this is a 3.2GHz single-threaded boost clock on the part of AMD's first-generation Zen core.

What surprised me the most about Arm's quoted ST score of ~37 is that it's significantly higher than what we measure on the Cortex A76, which scores in at about ~26 on actual hardware. Software and compiler considerations aside, one of the explanations for this huge 42% performance discrepancy could be the N1's much better memory and cache system. Here the full system bandwidth is 6x higher than on mobile SoCs, and naturally in a single-threaded workload the thread would have full access to the Neoverse N1's 64MB SLC, a whopping 16x bigger than the L3 in current mobile Cortex A76 designs. If the performance difference is indeed explained by the memory subsystem, it just goes to show how important it is to the performance scaling of a core.

SPECint2006 Rate Base - Estimates (GCC8)

Switching over to multi-threaded workloads, represented by SPECrate2006, we have to note that this is a best-case scaling scenario for all platforms: there is no serialisation or inter-thread communication, as the test suite simply runs multiple processes in parallel. Even with this in mind, Arm's projected results for an N1 64-core design are just outright impressive considering the fact that we're talking about TDPs much smaller than any of AMD's and Intel's solutions, creating a performance and efficiency gap that I have a hard time seeing the x86 solutions being able to close.

We have to remember that we're comparing a 64-core platform against AMD's and Intel's current 32- and 28-core platforms. A fairer comparison would be against AMD's upcoming Rome with 64 CPU cores; here, even if AMD manages to outright double multi-threaded performance and match Arm's projected MT numbers, I don't see them being able to simultaneously lower the TDP to match Arm's estimated 105W target (the Epyc 7601 has a TDP of 180W; Rome details haven't been announced yet).

SPEC's Rate benchmark scoring scales linearly with the instance count. In this case, if we divide Arm's 1310 figure by the 64 cores of the system, we get a per-instance score of around ~20.5, which seems much more realistic and in line with the Cortex A76 results we measure on current mobile devices.
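To make the rate methodology and that division concrete, here is a rough sketch of what "running multiple processes in parallel" means in practice, along with the per-copy arithmetic; it is emphatically not SPEC's actual harness, and the workload inside each copy is just a placeholder loop.

// Rough sketch of rate-style scaling: N independent copies, no inter-process
// communication, and an aggregate score divided back down to a per-copy figure.
// Not SPEC's harness - just an illustration of the arithmetic in the text.
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define COPIES 64

static void worker(void)
{
    volatile unsigned long x = 0;               // stand-in for a SPEC workload
    for (unsigned long i = 0; i < 100000000UL; i++)
        x += i;
    _exit(0);
}

int main(void)
{
    for (int i = 0; i < COPIES; i++)            // launch 64 independent copies
        if (fork() == 0)
            worker();
    while (wait(NULL) > 0)                      // wait for all copies to finish
        ;

    double aggregate = 1310.0;                  // Arm's projected multi-threaded score
    printf("per-copy score: %.1f\n", aggregate / COPIES);   // ~20.5
    return 0;
}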

Arm's performance predictions for the Cortex A76 were quite spot on compared to what we measured on actual devices. We are thus more inclined to give Arm credence and the benefit of the doubt in regard to today's projected Neoverse N1 scores. The figures do make sense, and are in line with what we saw the microarchitecture achieve in mobile.

Naturally we shouldn't come to any conclusions until we actually have the hardware in our hands, but the presented figures are certainly promising, provided they can be realised by vendors implementing Neoverse N1 systems.

Comments

  • surt - Thursday, February 21, 2019 - link

    That raw power comes at a .... power cost. And as soon as you try to start z-stacking your cpus that power is going to be the most important factor.
  • peevee - Tuesday, February 26, 2019 - link

    "The future is way more related to modularity than the chip architecture."

    Debatable. Both ARM and x64 are essentially the same in terms of efficiency if the same levels of performance are required. A breakthrough can only come from in-memory computing, which neither ARM nor x64 can sustain for many reasons.
  • rahvin - Thursday, February 21, 2019 - link

    ARM is not "more efficient at every level". That's just plain fanboi BS. The architecture is the least important aspect of any processor these days.

    ARM processors were traditionally designed for power efficiency above all else, now that Intel is designing down for efficiency and ARM is designing up for power there will likely be some real competition but so far ARM has not demonstrated that they can provide equivalent power for the same power budget at the high end and Intel has had difficulty matching the lower power budget and performance on the low end (though this is likely due to them wishing to avoid cannibalizing higher end products with performant low power versions).

    As ARM tries to enter the server market we'll finally see if they can provide something equivalent, but it's not been a hopeful showing given that all but one ARM server design has been canceled and it's not equivalent to an x86 server processor of the same character in either power or performance.
  • Wilco1 - Thursday, February 21, 2019 - link

    Today you can buy Arm-based servers like the Opteron A1100, Centriq, ThunderX, ThunderX2, eMAG and HiSilicon. The first Arm supercomputer entered the TOP500 list recently, and Fujitsu has prototypes of their Post-K computer. You can buy Arm compute time from several cloud vendors today, including AWS. That all adds up to one Arm server in your book?
  • rahvin - Thursday, February 21, 2019 - link

    ThunderX is gone, displaced by the ThunderX2, which is the Centriq processor after it was abandoned by its creator. eMAG, A1100 and the HiSilicon, last I saw, are all canceled.

    Commercially you can buy one ARM server, the ThunderX2. Go ahead, TRY to buy one.
  • Wilco1 - Thursday, February 21, 2019 - link

    How could you be so clueless? ThunderX2 is based on Vulcan made by Broadcom, no relation with Centriq at all. ThunderX is still being used and sold. Centriq is still being sold, a few months ago Gigabyte announced a brand new motherboard for it. eMAG is just announced. HiSilicon/Huawei has 2 generations of Arm servers already and is working on several more. That's the only one that isn't for sale outside of China according to AnandTech.

    What's next? Are you going to tell us that Arm servers did not beat Xeon and Skylake in various benchmarks, even though the evidence was published in an article on AnandTech?
  • rahvin - Thursday, February 21, 2019 - link

    You're right, I confused the Vulcan and the Centriq. The Centriq is dead, the design team's gone, and there is no plan to even spin the silicon from what I've seen. Qualcomm abandoned the product under threat from an activist investor. Yea, there was a motherboard at CES but that doesn't mean anything at all and there is literally no way to buy one.

    ThunderX is deprecated (show me where you can buy one, they deprecated the silicon over a year ago; there may still be some inventory out there but I seriously doubt it), ThunderX2 is available, and from everything I saw it's awful. The best use case was as an nginx master server because the compute capacity was so awful. Basically you need a workload with a lot of threads and no actual work to even make it worth anything at all, especially considering the price.

    The Huawei junk is a nonstarter, you can't buy it anywhere but China that I've seen and it's not exactly flying off the shelves either. I've seen more ARM servers announced and canceled a year later than any that made it off the shelves into an actual product. So there is an eMAG, that's great, show me where you can buy it.

    That's my point, you can't buy them, other than the ThunderX2 or the Huawei if you want to go to China to get it. The Arm server has been a flash in the pan and I have no doubt it will continue to be so.
  • FunBunny2 - Wednesday, February 20, 2019 - link

    one has to wonder: given the existence of C compilers for any ISA, and thus *nix OS for said ISA, when (or already?) will the maths dictate both the 'optimal' ISA and underlying microarch? both, after all, are just maths optimization problems. to some delta, there is a unique solution.
  • zmatt - Wednesday, February 20, 2019 - link

    Barring any major design flaws there shouldn't really be a difference in theoretical performance between ISAs. It's important to note that the ISA isn't the actual logic of the chip; it's better thought of as a paper standard a given chip needs to conform to if it wants to be binary compatible. The real determinant of performance is the microarchitecture. People conflate this with ISA a lot because they are both architectures, but the microarch is what describes the actual logic design in the circuit. That is what Intel and AMD apply codenames to. So things like Skylake, Thunderbird, Cortex A53 etc. are microarchitectures.
  • Wilco1 - Wednesday, February 20, 2019 - link

    There certainly are differences between ISAs which cannot be overcome with micro-architecture no matter how much money, power or transistors you throw at it. Given equal resources, the best possible implementations of various ISAs will exhibit major performance differences.
