Eventual Design Performance Projections

Alongside the ISO-process IPC, power, and area projections, Arm also made projections of possible eventual implementations of the V1 and N2. These would naturally no longer be ISO-process, but rather represent the company's expectations of what actual products might end up looking like in future designs.

The most important slide and disclosure in this regard is that a Neoverse N2 design on TSMC's 5nm is expected to achieve the same power as well as the same area as a TSMC 7nm Neoverse N1 design today.

In general, that's a relatively large presumption, but it could pan out if the vendors are able to achieve good implementations. We don't have too many details as to the 7nm node generation of Amazon's or Ampere's current N1 chips, but I would assume that they're baseline N7 – at least similar to what AMD uses in their EPYC 7002 and 7003 chips.

Still, a -40% power reduction from N7 to N5 is a very aggressive goal and assumption to make. The only N5 chips we've had in-house to date, the Kirin 9000 and Apple A14, showcased only a rough 10% efficiency advantage over their N7P predecessors. With N7P being roughly 15% better than N7, that compounds to only around 26% better efficiency over baseline N7, well short of what a 40% power reduction would imply.
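As a rough sanity check, here's a minimal back-of-the-envelope sketch of how those node gains compound, using the approximate mobile-derived figures above rather than any Arm or TSMC data:

```python
# Back-of-the-envelope check using the rough figures cited in the text
# (approximations from our mobile testing, not vendor data).
n7_to_n7p = 1.15   # N7 -> N7P: roughly 15% better efficiency
n7p_to_n5 = 1.10   # N7P -> N5: ~10% better efficiency seen on Kirin 9000 / A14

# Compounded efficiency gain from baseline N7 to N5:
n7_to_n5 = n7_to_n7p * n7p_to_n5
print(f"N7 -> N5 efficiency gain: {(n7_to_n5 - 1) * 100:.0f}%")              # ~26%

# A -40% power reduction at iso-performance would instead imply:
implied = 1 / (1 - 0.40)
print(f"Efficiency gain implied by -40% power: {(implied - 1) * 100:.0f}%")  # ~67%
```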

Arm notes that the current-generation N1 implementations haven't fully achieved their potential, as these were the vendors' first attempts with the IP. Arm expects that following generations, with more experience and better implementations (for example with more metal layers), will be able to squeeze out more performance and power efficiency on the N5 node.

In terms of socket performance, Arm is expecting some very large generational gains versus a 64C N1 product today – it’s to be noted that these are Arm pre-silicon figures and not the Graviton2.

The “Traditional 2020” chips are the 24C Xeon 8268 and the 64C EPYC 7742. I would ignore the “Traditional 2021” parts here – Arm was estimating the performance of Intel's newest 40C Ice Lake and AMD's 64C Milan, however the presentation and figures here were put together before AMD and Intel actually launched those systems – we have actual benchmark numbers in a custom graph below.

One metric Arm focused on is per-thread performance, where the “traditional” cores from AMD and Intel fall short of Arm's Neoverse cores.

Arm is being somewhat sneaky in its presentation here, as it focuses only on per-thread performance in cloud environments, where things typically operate on a vCPU basis; in that framing, SMT-enabled designs from AMD and Intel naturally fall behind quite a lot in per-thread performance.

I can’t really blame Arm for depicting the performance figures like this – the cloud vendors today don’t really differentiate between real cores and SMT cores in vCPU environments, even having pricing that’s arguably unfair to SMT-enabled designs, which is why we’ve deemed Amazon’s Graviton2 m6g instances to vastly outperform AMD and Intel instances in terms of perf/W and perf/thread.

I wasn't happy with Arm's slides not including 1-thread-per-core performance figures for the SMT systems, so I've included my own chart based on actual measured performance figures on the various platforms. The V1 and N2 figures use Arm's performance scaling versus the Neoverse N1 datapoint, and I've baselined that to the Graviton2 scores we measured last year. Arm uses the same compiler flags as we do, as well as GCC 10.2, so the scores should be comparable – with the only discrepancy being that Arm used 2MB page sizes.
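To be explicit about how the V1 and N2 bars in the chart are constructed, here's a minimal sketch of the scaling method; the numbers in it are hypothetical placeholders for illustration, not our actual measured Graviton2 scores or Arm's actual projection factors:

```python
# Sketch of the chart methodology: take our measured Graviton2 (Neoverse N1)
# score as the baseline and apply Arm's projected scaling versus its N1
# reference point. All values below are hypothetical placeholders.
measured_graviton2_score = 100.0      # placeholder SPEC estimate, not real data

arm_scaling_vs_n1 = {                 # placeholder projection factors
    "Neoverse V1 (96C @ 2.7GHz)": 1.5,
    "Neoverse N2 (128C @ 3.0GHz)": 1.4,
}

projected = {design: measured_graviton2_score * factor
             for design, factor in arm_scaling_vs_n1.items()}

for design, score in projected.items():
    print(f"{design}: projected score {score:.0f}")
```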

The Neoverse V1 system uses 96 cores at 2.7GHz with 1MB of L2 per core, on a 128MB 2GHz mesh, with 8 DDR5-4800 memory controllers. The N2 datapoint uses 128 cores at 3GHz with 1MB of L2 per core, a 96MB 2GHz mesh, and 10 DDR5-4800 memory controllers.

Arm's per-thread performance lead doesn't look that great here when looking at the 1T/core figures of AMD and Intel, but admittedly, in a vCPU scenario, Arm's designs would vastly outperform the SMT chips.

The per-socket performance figures look good, but that's largely to be expected given the new 5nm process node and the more advanced memory controller technology assumed in the projections.

AMD's next-generation Genoa should feature massive performance jumps through the adoption of N5, DDR5, and a transition away from the 14nm IO die. IPC and core count increases should also close the gap that's depicted today. Intel's next-generation Sapphire Rapids should also improve the situation – albeit how that ends up depends on how much the company is able to squeeze out of the 10nm SuperFin node relative to what we saw a few weeks ago with Ice Lake-SP.

Usually I'm more open to Arm's performance projections; however, this time around the V1 and N2 projections are extremely optimistic, especially since they're completely dependent on the vendors achieving good implementations on N5 and actually reaching the projected 40% power reduction from the process node and implementation. Based on what we've seen in the mobile space, I remain quite sceptical and will be adopting a wait-and-see approach.

Comments

  • yeeeeman - Tuesday, April 27, 2021 - link

    what about a cortex a55 successor?
  • SarahKerrigan - Tuesday, April 27, 2021 - link

    I'd expect to see one next month launching alongside Matterhorn.
  • eastcoast_pete - Tuesday, April 27, 2021 - link

    Hi Sarah, can you post any links (including rumors) about that? Given ARM's focus on bigger, performance-oriented designs, the LITTLE cores haven't gotten a lot of love in recent years. The persistence of in-order designs for ARM's LITTLE cores is one of the reasons I find ARM's dominance troubling; they clearly stood still because many had nowhere else to turn, i.e. ARM didn't have to change them. In x86, at least we have two larger players with their own, yet compatible, designs.
  • SarahKerrigan - Tuesday, April 27, 2021 - link

    I've seen it reported in a few places, including on RWT, which is a pain to search - but since task migration generally requires compatible instruction sets between big and little cores, it's pretty clear that Matterhorn will bring a small, low-power friend when it arrives.
  • Raqia - Tuesday, April 27, 2021 - link

    I wonder if they could simply repurpose a refresh of either the A73 or A75 as the little core. Surely with the new fabrication processes available, die area relative to a big Matterhorn core should be comparable to the A55 vs the A78/X1, but the question becomes performance/energy. Integer performance of the A75/A73 vs. Ice Storm is comparable, with the former winning by a bit in FP, but efficiency is light years apart:

    https://images.anandtech.com/doci/13614/SPEC-perf-...

    https://images.anandtech.com/doci/16192/spec2006_A...
  • SarahKerrigan - Tuesday, April 27, 2021 - link

    Use of a refreshed A65 without multithreading and with the new ops seems more plausible to me.
  • Raqia - Tuesday, April 27, 2021 - link

    That could make sense; there's fairly little information on the micro-architecture of the A65 or A65AE at present, except that it does do OoOE, and it's also unclear what clocks and efficiency it can achieve:

    https://developer.arm.com/documentation/100439/010...

    It does sport a bigger maximum L2 configuration than the A55. They do need to up their game here as the A55 makes a pretty poor showing for efficiency compared to Apple's small core (which got even worse in the A14 generation):

    https://images.anandtech.com/doci/14072/SPEC2006ef...

    At least wattage and hence current draw is low.
  • SarahKerrigan - Tuesday, April 27, 2021 - link

    The A65 is the E1, which has had a uarch deep dive on this site.
  • Raqia - Wednesday, April 28, 2021 - link

    Got it, thanks for that! The A65 is interesting: without SMT, they're quoting a pretty modest integer performance bump of <20% at a bit more than half the power of the A55 at 7nm:

    https://images.anandtech.com/doci/13959/07_Infra%2...

    https://images.anandtech.com/doci/13959/07_Infra%2...

    They could probably tune this to be better without SMT, but are you against having SMT for security reasons?

    It's still not close to Apple's small cores in performance, but efficiency might be in the same ballpark now. ARM's designs are quite good in terms of PPA, but even their performance-oriented X1 is likely only 70% the die area of a Firestorm core, and their cache hierarchies are more complex as the core designs pull double duty for server parts too.

    It probably made sense to have fewer transistors per CPU core while quite a few Android SoC vendors integrated modems on die, but this may change once Qualcomm digests its Nuvia purchase and moves to a smaller node. All parties may hit a wall for per-core improvements as slowing SRAM density improvements at new nodes bottleneck the gains from logic density improvements.
  • Kangal - Thursday, April 29, 2021 - link

    TL;DR - ARM needs to focus on a new product stack: a diverse ARMv9 lineup of small, medium, and large chipset options, with the small chipset scaling all the way down to the tiny IoT sensor level, the large chipset scaling up to large supercomputers and servers, and the medium chipset focusing on phones and tablets. As this covers the full SoC, it includes both CPUs and GPUs.

    Long version:
    I know making these architectures is a huge challenge, but ARM has been a little lazy in some areas. I know they're basically following the money in the industry, and that means chasing the "phablet" market for CPUs and GPUs. But they've been leaving themselves vulnerable to gaps, at either the lower-power or higher-power ends, that can be exploited by competitors such as RISC-V. If not, even x86 might poke some wins here and there.

    Ages ago, like 2013, they had the A7 (tiny), A15 (small), and A57 (medium) core designs, basically covering most bases, along with the Mali-400 iGPU and 1GB-2GB of shared RAM to do some compute tasks. To say ARM was innovative would be a disservice to the technology they brought forward. That's in contrast to the x86 side of the time: Intel's Atom (small) and Core i7 Y/M (large), as well as the Intel Iris Pro iGPU with 8GB of shared RAM. Then ARM made the leap into 64-bit processing around 2016. The lineup evolved into the A35 (tiny), A53 (small), and A73 (medium) core designs, running with 1GB-2GB-4GB of shared RAM and modest G31 (tiny), G51 (small), and G71 (medium) iGPU options. Again, this lineup was very innovative and impressive. Contrast that to the new x86 competition in AMD's 16nm Vega Large-iGPU and Zen1 Large-CPU.

    However... there haven't been any upgrades to the "tiny" portfolio, which has been stuck with the Cortex A35 CPU and G31 GPU offerings ever since. There has only been a slight refresh of the "small" portfolio, upgrading to the Cortex A55 CPU and the G52 and G57 iGPUs, to the point that they're a joke and easily surpassed by the competition. ARM really needs a revolutionary new design here, and it needs to be super-efficient: perhaps something that can scale between both the tiny and small categories, with performance ranging from the A55 (or more) at the "tiny" power level to the A73 (or more) at the "small" power level. Basically, catching up to Apple, if unable to surpass them.

    The "medium" portfolio, on the other hand, has seen very frequent upgrades: on the CPU side the Cortex A75, A76, A77, and A78, and on the GPU side the G72, G76, G77, and G78, which have been mostly competitive, surpassing some custom implementations (Samsung/MediaTek) and losing to others (Apple/Qualcomm). Not much needs to change here, to be honest. We've also seen the emergence of a new "large" category of ARM processors, first popularised by custom implementations from Apple (A10 and onwards) and then Samsung (Mongoose M3 and onwards), and now supported officially by ARM in the form of the Cortex A77+ and the Cortex A78X / X1. This has been mostly underwhelming and uncompetitive, with Apple being the only one implementing good designs. There hasn't been any new "large" category of iGPUs from ARM or its competitors, with the only Large-iGPU exception actually being inside the Apple Silicon M1. ARM (not counting Apple) needs to do better here, and it looks like it might already be focussing on this for the future with ARMv9. Again, contrast this to the x86 market offering 7nm Large-CPUs in Zen2 and Zen3, with RDNA-1 and RDNA-2 Large-GPUs.
