The SVE Factor: More Than Just Vector Size

We’ve talked a lot about SVE (the Scalable Vector Extension) over the past few years. The Arm ISA feature is best known for its first deployment in Fujitsu’s A64FX processor core, which now powers the world’s highest-performing supercomputer.

Traditionally, CPU microarchitectures with wider SIMD vector capabilities came with the caveat that software needed a new instruction set to make use of the wider vectors. In the x86 world, for example, each move from 128b (SSE-SSE4.2 & AVX) to 256b (AVX & AVX2) to 512b (AVX-512) vectors required software to be redesigned and recompiled to take advantage of the wider execution capabilities.

SVE, on the other hand, is agnostic to the hardware's vector execution unit width, meaning that from a software perspective, the programmer doesn’t actually know the vector length the code will end up running at. On the hardware side, CPU designers can implement execution units in 128b increments, from 128b up to 2048b in width. As noted earlier, the Neoverse N2 uses the smallest 128b implementation, while the Neoverse V1 uses 256b units.
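The vector-length-agnostic model can be sketched in plain C. This is only an illustrative scalar model: the `read_vector_length_bytes()` helper is a hypothetical stand-in for SVE's runtime vector-length query (the `cntb` instruction / `svcntb()` ACLE intrinsic), and real SVE code would use predicated vector instructions rather than an inner scalar loop.

```c
#include <stddef.h>

/* Hypothetical stand-in for SVE's runtime vector-length query
 * (cntb / svcntb()). A real SVE binary never hard-codes this
 * value; it is whatever the hardware reports at run time. */
static size_t read_vector_length_bytes(void) {
    return 32; /* e.g. a 256b implementation such as the V1 */
}

/* Vector-length-agnostic strip-mine loop: the same code runs
 * unchanged whether the hardware reports 16 bytes (128b, N2),
 * 32 bytes (256b, V1), or anything up to 256 bytes (2048b). */
void vla_add(const float *a, const float *b, float *c, size_t n) {
    size_t lanes = read_vector_length_bytes() / sizeof(float);
    for (size_t i = 0; i < n; i += lanes) {
        /* On SVE the tail iteration is handled by a predicate
         * generated with whilelt; here we clamp explicitly. */
        size_t end = (i + lanes < n) ? i + lanes : n;
        for (size_t j = i; j < end; j++)
            c[j] = a[j] + b[j];
    }
}
```

The key point is that `lanes` is a runtime value, so the loop structure never bakes the vector width into the binary.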

Generally speaking, the width of an individual vector unit matters less than a microarchitecture's total execution width; 2x256b isn’t necessarily faster than 4x128b. It does, however, play a larger role on the software side of things, where the same binary and code path can now be deployed to very different target products, which is also very important for Arm and its mobile processor designs.

More important than the scalable nature of the vectors themselves is SVE's addition of helper instructions and features such as gather-loads, scatter-stores, per-lane predication, and predicate-driven loop control (conditional execution depending on SIMD data), among many others.
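A scalar model makes these two features concrete. The sketch below is illustrative only: on SVE, the whole loop body would be a single gather-load governed by a predicate register, and the function and parameter names here are our own, not part of any Arm API.

```c
#include <stddef.h>

/* Scalar model of an SVE gather-load under a per-lane predicate:
 * each *active* lane loads base[index[lane]] from a non-contiguous
 * address; inactive lanes leave their destination untouched. */
void predicated_gather(const int *base, const size_t *index,
                       const int *pred, int *dst, size_t lanes) {
    for (size_t i = 0; i < lanes; i++) {
        if (pred[i])                  /* per-lane predication   */
            dst[i] = base[index[i]];  /* gather: indexed load   */
    }
}
```

A scatter-store is the mirror image: active lanes write `base[index[i]] = src[i]` instead.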

Where these features particularly come into play is in allowing compilers to generate better auto-vectorised code, meaning the compiler can now emit SIMD instructions on SVE where it previously couldn’t with NEON, regardless of the vector length changes.
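As a rough illustration of the kind of loop involved, consider the plain C below. The function is a made-up example, but the pattern is representative: the data-dependent branch and the indirect load typically defeated NEON auto-vectorisation, whereas on SVE the condition maps onto a predicate, the indexed read onto a gather, and the conditional write onto a predicated store.

```c
#include <stddef.h>

/* A loop pattern compilers generally could not auto-vectorise for
 * NEON but can for SVE: the store happens only when a data-dependent
 * condition holds, and the source access is indirect.
 *   src[idx[i]]  -> SVE gather-load
 *   v > limit    -> SVE predicate
 *   dst[i] = ... -> SVE predicated store */
void clamp_indirect(const float *src, const int *idx,
                    float *dst, size_t n, float limit) {
    for (size_t i = 0; i < n; i++) {
        float v = src[idx[i]];
        if (v > limit)
            dst[i] = limit;
    }
}
```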

Arm here discloses that the performance advantages on auto-vectorisable code can be quite significant. In a like-for-like 2x128b comparison between the N1 and the N2, we see gains of at least 20% at around the 40th percentile of tested code, with some code reaching much higher gains of up to +90%.

The V1's larger increase over the N1 comes naturally from the fact that the core has double the vector execution capabilities of the N1.

In general, both the N2, and particularly the V1, promise quite large increases in HPC workloads with vector-heavy compute characteristics. It’ll definitely be interesting to see how these future designs fare, and how SVE auto-vectorisation performs in more general-purpose workloads.

95 Comments
  • Dug - Tuesday, April 27, 2021 - link

    Now is when I wish ARM was publicly traded.
  • mode_13h - Tuesday, April 27, 2021 - link

    Well, you could buy NVDA, under the assumption the acquisition will go through.
  • dotjaz - Thursday, April 29, 2021 - link

SoftBank is already publicly traded on the Tokyo Stock Exchange. Why rely on an NVIDIA buyout, which in all likelihood won't happen any time soon, if at all?
  • mode_13h - Thursday, April 29, 2021 - link

    > SoftBank is already publicly traded on the Tokyo Stock Exchange.

    They also invested heavily in WeWork, when it was highly over-valued. I have no idea what other nutty positions they might've taken, but I think it's not a great proxy for ARM just due to its sheer size.
  • cjcoats - Tuesday, April 27, 2021 - link

    As an environmental modeling (HPCC) developer: what is the chance of putting a V1 machine on my desk in the foreseeable future?
  • Silver5urfer - Tuesday, April 27, 2021 - link

Never. There has to be an OEM to put these chips in DIY and consumer machines; so far, except for HPE's A64FX ARM system, there's no way any consumer can buy these ARM processors, and that one is also highly expensive, well over a five-digit figure. And then the driver / software ecosystem comes into play; there are passion projects like the Pi, as we all know, but they are nowhere near desktop-class performance.

ARM Graviton 2 was made because AWS wants to save money on their infrastructure; that's why their Annapurna design team is working on it. For that reason Amazon put more effort into it, AND the fact that ARM is customizable helps them tailor it to their workloads and spread their cost.

Altra is niche, and Marvell is nowhere near, as their plan was to make custom chips on order. And from the coverage above we see India, Korea, and the EU using custom design licensing for their HPC supercomputer designs.

Then there's a rumor that MS is also making their own chips, again custom-tailored for Azure; Google is also rumored to be, especially with their Whitechapel mobile processor (it won't beat any processor on the market, that's my guess) and maybe a GCP-oriented design of their own.

These projected numbers do look good vs x86 SMT machines, finally, after all these years, BUT how they will compete against 2021 HW once they are out is the big question. If these CPUs outperform EPYC Milan, technically AWS should replace all of them, right? Since you'd have perf / power improvements at a massive scale. Idk, gotta see. Then the upcoming AMD Genoa and Sapphire Rapids competition will also show how the landscape will be.
  • SarahKerrigan - Tuesday, April 27, 2021 - link

    If they don't replace all the x86 systems in AWS with ARM, that *must* mean Neoverse is somehow secretly inferior, right??

    Or, you know, it could mean that x86 compatibility matters for a fair chunk of the EC2 installed base, especially on the Windows Server side (which is not small) but on Linux too (Oracle DB, for instance, which does not yet run on ARM.)
  • Silver5urfer - Tuesday, April 27, 2021 - link

    That was a joke.
  • Spunjji - Friday, April 30, 2021 - link

    Was it, though? Schrodinger's Joke strikes again.
  • Raqia - Tuesday, April 27, 2021 - link

Maybe not a V1, but you could probably get a more open high-performance ARM core than the Apple MX series pretty soon:

    https://investor.qualcomm.com/news-events/press-re...

    "The first Qualcomm® Snapdragon™ platforms to feature Qualcomm Technologies' new internally designed CPUs are expected to sample in the second half of 2022 and will be designed for high performance ultraportable laptops."
