Arm Announces Neoverse V1, N2 Platforms & CPUs, CMN-700 Mesh: More Performance, More Cores, More Flexibility

Author: Andrei Frumusanu

by Andrei Frumusanu on April 27, 2021 9:00 AM EST


The SVE Factor - More Than Just Vector Size

We’ve talked a lot about SVE (Scalable Vector Extension) over the past few years, and the new Arm ISA feature is best known for having first been employed in Fujitsu’s A64FX processor core, which now powers the world’s highest-performing supercomputer.

Traditionally, CPU microarchitectures with wider SIMD vector capabilities always came with the caveat that you needed a new instruction set to make use of the wider vectors. For example, in the x86 world, each move from 128b (SSE-SSE4.2 & AVX) to 256b (AVX & AVX2) to 512b (AVX512) vectors has required software to be redesigned and recompiled to make use of the newer, wider execution capabilities.

SVE, on the other hand, is agnostic to the width of the hardware vector execution units, meaning that from a software perspective, the programmer doesn’t actually know the vector length that the software will end up running at. On the hardware side, CPU designers can implement execution units in 128b increments from 128b to 2048b in width. As noted earlier, the Neoverse N2 uses the smallest implementation with 128b units, while the Neoverse V1 uses 256b units.
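As a rough illustration of what "vector length agnostic" means for the code itself, here is a toy model in plain Python. It is not real SVE code: the function names (`lanes_for`, `whilelt`, `vla_add`) are hypothetical, 32-bit elements are an assumption, and the inner Python loop merely stands in for one vector instruction. The point is only that the loop body never hard-codes a width, so the same code gives the same result at 128b, 256b, or 2048b.

```python
# Toy model of a vector-length-agnostic loop. Plain Python, not real SVE.
# All function names are hypothetical and exist only for this sketch.

def lanes_for(vector_bits, element_bits=32):
    """Number of elements one vector holds at a given hardware width."""
    # Widths come in 128b increments from 128b to 2048b, as described above.
    assert vector_bits % 128 == 0 and 128 <= vector_bits <= 2048
    return vector_bits // element_bits

def whilelt(i, n, lanes):
    """Predicate: lane k is active while i + k < n (a 'while less than' mask)."""
    return [i + k < n for k in range(lanes)]

def vla_add(a, b, vector_bits):
    """Element-wise add; the vector width is only discovered at run time."""
    n = len(a)
    lanes = lanes_for(vector_bits)
    out = [0] * n
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)        # handles the loop tail, no scalar epilogue
        for k, active in enumerate(pred):  # stands in for one vector instruction
            if active:
                out[i + k] = a[i + k] + b[i + k]
        i += lanes                         # advance by however many lanes exist
    return out

a = list(range(10))
b = [10 * x for x in a]
results = {bits: vla_add(a, b, bits) for bits in (128, 256, 512, 2048)}
```

Every entry in `results` holds the same list, whichever width was "implemented"; only the number of loop iterations differs.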

Generally speaking, the execution width of an individual vector unit isn’t as important as the total execution width of a microarchitecture: 2x256b isn’t necessarily faster than 4x128b. It does, however, play a larger role on the software side of things, where the same binary and code path can now be deployed to very different target products, which is also very important for Arm and their mobile processor designs.
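The arithmetic behind that comparison is simple enough to write down. The sketch below assumes, purely for illustration, that every vector unit can issue one full-width operation per cycle and that elements are 32 bits wide; the helper name `total_vector_bits` is hypothetical, and real throughput also depends on factors this ignores.

```python
# Peak vector width per cycle under a simplifying assumption:
# every unit issues one full-width operation each cycle.

def total_vector_bits(units, width_bits):
    """Total bits of vector work per cycle across all units."""
    return units * width_bits

configs = {
    "2x256b": (2, 256),
    "4x128b": (4, 128),
    "2x128b": (2, 128),
}

totals = {name: total_vector_bits(*cfg) for name, cfg in configs.items()}

# 32-bit elements processed per cycle (the element size is an assumption).
fp32_lanes = {name: bits // 32 for name, bits in totals.items()}
```

Under these assumptions 2x256b and 4x128b both come to 512 bits per cycle, which is why neither is inherently faster, while 2x128b comes to half of that.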

More important than the scalable nature of the vectors in SVE is the addition of helper instructions and features such as gather-loads, scatter-stores, per-lane predication, predicate-driven loop control (conditional execution depending on SIMD data), and many others.
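To make three of these features concrete, here is a small behavioural model in plain Python. It is a sketch, not SVE code: the function names are hypothetical, a Python list stands in for memory, and the choices that inactive gather lanes read as zero and that inactive arithmetic lanes keep their old value are assumptions of this model.

```python
# Behavioural model of gather-load, scatter-store and per-lane predication.
# Plain Python, hypothetical names; a list stands in for memory.

def gather_load(memory, indices, pred):
    """Lane k loads memory[indices[k]] if pred[k] is set, else reads as 0."""
    return [memory[i] if p else 0 for i, p in zip(indices, pred)]

def scatter_store(memory, indices, values, pred):
    """Lane k stores values[k] to memory[indices[k]] only if pred[k] is set."""
    for i, v, p in zip(indices, values, pred):
        if p:
            memory[i] = v

def predicated_add(x, y, pred):
    """Active lanes compute x + y; inactive lanes keep x unchanged."""
    return [xv + yv if p else xv for xv, yv, p in zip(x, y, pred)]

table = [100, 101, 102, 103, 104, 105, 106, 107]
idx = [6, 0, 3, 3]                  # non-contiguous, per-lane addresses
pred = [True, True, False, True]    # lane 2 is switched off

loaded = gather_load(table, idx, pred)            # [106, 100, 0, 103]
bumped = predicated_add(loaded, [1, 1, 1, 1], pred)
scatter_store(table, idx, bumped, pred)
```

After this runs, only the three addressed elements (indices 6, 0 and 3) have been incremented; the switched-off lane touched nothing.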

These features particularly come into play in allowing compilers to generate better auto-vectorised code: the compiler is now capable of emitting SIMD instructions on SVE where it previously wasn’t possible with NEON, regardless of any change in vector length.
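One kind of loop that benefits is a loop with a data-dependent condition in its body. The sketch below shows, in plain Python, a scalar loop and a predicated equivalent of the sort a vectorising compiler could in principle produce. The function names are hypothetical, the lane count is a free parameter, and whether any particular compiler actually vectorises this exact loop is an assumption, not something stated by Arm.

```python
# A conditional loop and a predicated, chunked equivalent.
# Plain Python model; function names are hypothetical.

def scalar_loop(x, y):
    """Reference: update y[i] only where x[i] is positive."""
    out = list(y)
    for i in range(len(x)):
        if x[i] > 0:
            out[i] = x[i] * 2
    return out

def predicated_loop(x, y, lanes):
    """Same result, processed `lanes` elements at a time under a predicate."""
    out = list(y)
    n = len(x)
    for base in range(0, n, lanes):
        # Loop-control predicate: masks off lanes past the end of the array.
        in_bounds = [base + k < n for k in range(lanes)]
        # Data-dependent predicate: the branch becomes a per-lane compare.
        pred = [ok and x[base + k] > 0 for k, ok in enumerate(in_bounds)]
        for k, p in enumerate(pred):   # stands in for one predicated operation
            if p:
                out[base + k] = x[base + k] * 2
    return out

x = [3, -1, 0, 5, -2, 7, 1]
y = [9] * len(x)
reference = scalar_loop(x, y)
vectorised = {lanes: predicated_loop(x, y, lanes) for lanes in (4, 8, 16)}
```

The branch inside the loop has turned into a mask, and the leftover elements at the end are handled by the same mask rather than by a separate scalar tail.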

Arm discloses here that the performance advantages on auto-vectorisable code can be quite significant. In a 2x128b comparison between the N1 and the N2, we can see around 40th-percentile gains of at least 20% in performance, with some code reaching much higher gains of up to +90%.

That the V1’s increase over the N1 is higher comes naturally from the fact that the core has double the vector execution capabilities of the N1.

In general, both the N2 and, particularly, the V1 promise quite large increases in HPC workloads with vector-heavy compute characteristics. It’ll definitely be interesting to see how these future designs play out and how SVE auto-vectorisation fares in more general-purpose workloads.


95 Comments


nandnandnand - Tuesday, April 27, 2021 - link
Looking at Cortex-X-next. It seems like Arm can put out a new Cortex-X for every new Cortex-A78 successor, since the Cortex-X is very similar but bigger.
mode_13h - Tuesday, April 27, 2021 - link
From an earlier article:

> The Cortex-X1 was designed within the frame of a new program at Arm,
> which the company calls the “Cortex-X Custom Program”.
> The program is an evolution of what the company had previously
> already done with the “Built on Arm Cortex Technology” program
> released a few years ago. As a reminder, that license allowed
> customers to collaborate early in the design phase of a new
> microarchitecture, and request customizations to the configurations,
> such as a larger re-order buffer (ROB), differently tuned prefetchers,
> or interface customizations for better integrations into the SoC designs.
> Qualcomm was the predominant benefactor of this license,
Alistair - Tuesday, April 27, 2021 - link
I just want to be able to use ARM in standard DIY with an Asus motherboard and a socket, just like AMD and Intel.
mode_13h - Tuesday, April 27, 2021 - link
I wonder if Nvidia will put out a Jetson-style board in something like a mini-ITX form factor.
Alistair - Wednesday, April 28, 2021 - link
i sure hope so, and something not massively overpriced like right now
mode_13h - Thursday, April 29, 2021 - link
Yeah, because Nvidia is known for their bargain pricing!
; )

Although, if they wanted to create a whole new product segment, it's conceivable they might keep prices rather affordable for a couple generations.
nandnandnand - Wednesday, April 28, 2021 - link
I want it. You want it. Some people seem to want it. Maybe demand is forming? Get on it, China.

16-core Cortex-X2 please.
mode_13h - Wednesday, April 28, 2021 - link
They already did, sort of. See: https://e.huawei.com/us/products/servers/kunpeng/k...

Whoops! Had to get this out of Google cache, because the page 404'd:

Board Model: D920S10
Processors: 1 Kunpeng 920 processor, 4/8 cores, 2.6 GHz
Internal Storage: 6 SATA 3.0 hard drive interfaces, 2 M.2 SSD slots
Memory: 4 DDR4-2666 UDIMM slots, up to 64 GB
PCIe Expansion: 1 PCIe 3.0 x16, 1 PCIe 3.0 x4, and 1 PCIe 3.0 x1 slots
LOM Network Ports: 2 LOM NIC, supporting GE network ports or optical ports
USB: 4 USB 3.0 and 4 USB 2.0
mode_13h - Tuesday, April 27, 2021 - link
Do any of the current x86 cores pair up SSE operations for >= 4x throughput per cycle?

AVX2 has been around for long enough that a lot of the code which could benefit from it has already been written to do so, yet *most* people are still compiling to baseline x86-64 (or just above that), since Intel is still making low-power cores without any AVX. So, I'm sure there's still *some* code that could benefit from >= 4x SSEn execution.
AntonErtl - Wednesday, April 28, 2021 - link
Zen has 4 128-bit FP units (2 FMA and 2 FADD). Not sure if that's what you are interested in.
