NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems

Name: NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems
Item: NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems
Author: Ryan Smith

by Ryan Smith on April 12, 2021 12:20 PM EST

119 Comments | Add A Comment

119 Comments

Kicking off another busy Spring GPU Technology Conference for NVIDIA, this morning the graphics and accelerator designer is announcing that they are going to once again design their own Arm-based CPU/SoC. Dubbed Grace – after Grace Hopper, the computer programming pioneer and US Navy rear admiral – the CPU is NVIDIA’s latest stab at more fully vertically integrating their hardware stack by being able to offer a high-performance CPU alongside their regular GPU wares. According to NVIDIA, the chip is being designed specifically for large-scale neural network workloads, and is expected to become available in NVIDIA products in 2023.

With two years to go until the chip is ready, NVIDIA is playing things relatively coy at this time. The company is offering only limited details for the chip – it will be based on a future iteration of Arm’s Neoverse cores, for example – as today’s announcement is a bit more focused on NVIDIA’s future workflow model than it is speeds and feeds. If nothing else, the company is making it clear early on that, at least for now, Grace is an internal product for NVIDIA, to be offered as part of their larger server offerings. The company isn’t directly gunning for the Intel Xeon or AMD EPYC server market, but instead they are building their own chip to complement their GPU offerings, creating a specialized chip that can directly connect to their GPUs and help handle enormous, trillion parameter AI models.

NVIDIA SoC Specification Comparison
	Grace	Xavier	Parker (Tegra X2)
CPU Cores	?	8	2
CPU Architecture	Next-Gen Arm Neoverse (Arm v9?)	Carmel (Custom Arm v8.2)	Denver 2 (Custom Arm v8)
Memory Bandwidth	>500GB/sec LPDDR5X (ECC)	137GB/sec LPDDR4X	60GB/sec LPDDR4
GPU-to-CPU Interface	>900GB/sec NVLink 4	PCIe 3	PCIe 3
CPU-to-CPU Interface	>600GB/sec NVLink 4	N/A	N/A
Manufacturing Process	?	TSMC 12nm	TSMC 16nm
Release Year	2023	2018	2016

More broadly speaking, Grace is designed to fill the CPU-sized hole in NVIDIA’s AI server offerings. The company’s GPUs are incredibly well-suited for certain classes of deep learning workloads, but not all workloads are purely GPU-bound, if only because a CPU is needed to keep the GPUs fed. NVIDIA’s current server offerings, in turn, typically rely on AMD’s EPYC processors, which are very fast for general compute purposes, but lack the kind of high-speed I/O and deep learning optimizations that NVIDIA is looking for. In particular, NVIDIA is currently bottlenecked by the use of PCI Express for CPU-GPU connectivity; their GPUs can talk quickly amongst themselves via NVLink, but not back to the host CPU or system RAM.

The solution to the problem, as was the case even before Grace, is to use NVLink for CPU-GPU communications. Previously NVIDIA has worked with the OpenPOWER foundation to get NVLink into POWER9 for exactly this reason, however that relationship is seemingly on its way out, both as POWER’s popularity wanes and POWER10 is skipping NVLink. Instead, NVIDIA is going their own way by building an Arm server CPU with the necessary NVLink functionality.

The end result, according to NVIDIA, will be a high-performance and high-bandwidth CPU that is designed to work in tandem with a future generation of NVIDIA server GPUs. With NVIDIA talking about pairing each NVIDIA GPU with a Grace CPU on a single board – similar to today’s mezzanine cards – not only does CPU performance and system memory scale up with the number of GPUs, but in a roundabout way, Grace will serve as a co-processor of sorts to NVIDIA’s GPUs. This, if nothing else, is a very NVIDIA solution to the problem, not only improving their performance, but giving them a counter should the more traditionally integrated AMD or Intel try some sort of similar CPU+GPU fusion play.

By 2023 NVIDIA will be up to NVLink 4, which will offer at least 900GB/sec of cummulative (up + down) bandwidth between the SoC and GPU, and over 600GB/sec cummulative between Grace SoCs. Critically, this is greater than the memory bandwidth of the SoC, which means that NVIDIA’s GPUs will have a cache coherent link to the CPU that can access the system memory at full bandwidth, and also allowing the entire system to have a single shared memory address space. NVIDIA describes this as balancing the amount of bandwidth available in a system, and they’re not wrong, but there’s more to it. Having an on-package CPU is a major means towards increasing the amount of memory NVIDIA’s GPUs can effectively access and use, as memory capacity continues to be the primary constraining factors for large neural networks – you can only efficiently run a network as big as your local memory pool.

CPU & GPU Interconnect Bandwidth
	Grace	EPYC 2 + A100	EPYC 1 + V100
GPU-to-CPU Interface (Cummulative, Both Directions)	>900GB/sec NVLink 4	~64GB/sec PCIe 4 x16	~32GB/sec PCIe 3 x16
CPU-to-CPU Interface (Cummulative, Both Directions)	>600GB/sec NVLink 4	304GB/sec Infinity Fabric 2	152GB/sec Infinity Fabric

And this memory-focused strategy is reflected in the memory pool design of Grace, as well. Since NVIDIA is putting the CPU on a shared package with the GPU, they’re going to put the RAM down right next to it. Grace-equipped GPU modules will include a to-be-determined amount of LPDDR5x memory, with NVIDIA targeting at least 500GB/sec of memory bandwidth. Besides being what’s likely to be the highest-bandwidth non-graphics memory option in 2023, NVIDIA is touting the use of LPDDR5x as a gain for energy efficiency, owing to the technology’s mobile-focused roots and very short trace lengths. And, since this is a server part, Grace’s memory will be ECC-enabled, as well.

As for CPU performance, this is actually the part where NVIDIA has said the least. The company will be using a future generation of Arm’s Neoverse CPU cores, where the initial N1 design has already been turning heads. But other than that, all the company is saying is that the cores should break 300 points on the SPECrate2017_int_base throughput benchmark, which would be comparable to some of AMD’s second-generation 64 core EPYC CPUs. The company also isn’t saying much about how the CPUs are configured or what optimizations are being added specifically for neural network processing. But since Grace is meant to support NVIDIA’s GPUs, I would expect it to be stronger where GPUs in general are weaker.

Otherwise, as mentioned earlier, NVIDIA big vision goal for Grace is significantly cutting down the time required for the largest neural networking models. NVIDIA is gunning for 10x higher performance on 1 trillion parameter models, and their performance projections for a 64 module Grace+A100 system (with theoretical NVLink 4 support) would be to bring down training such a model from a month to three days. Or alternatively, being able to do real-time inference on a 500 billion parameter model on an 8 module system.

Overall, this is NVIDIA’s second real stab at the data center CPU market – and the first that is likely to succeed. NVIDIA’s Project Denver, which was originally announced just over a decade ago, never really panned out as NVIDIA expected. The family of custom Arm cores was never good enough, and never made it out of NVIDIA’s mobile SoCs. Grace, in contrast, is a much safer project for NVIDIA; they’re merely licensing Arm cores rather than building their own, and those cores will be in use by numerous other parties, as well. So NVIDIA’s risk is reduced to largely getting the I/O and memory plumbing right, as well as keeping the final design energy efficient.

If all goes according to plan, expect to see Grace in 2023. NVIDIA is already confirming that Grace modules will be available for use in HGX carrier boards, and by extension DGX and all the other systems that use those boards. So while we haven’t seen the full extent of NVIDIA’s Grace plans, it’s clear that they are planning to make it a core part of future server offerings.

First Two Supercomputer Customers: CSCS and LANL

And even though Grace isn’t shipping until 2023, NVIDIA has already lined up their first customers for the hardware – and they’re supercomputer customers, no less. Both the Swiss National Supercomputing Centre (CSCS) and Los Alamos National Laboratory are announcing today that they’ll be ordering supercomputers based on Grace. Both systems will be built by HPE’s Cray group, and are set to come online in 2023.

CSCS’s system, dubbed Alps, will be replacing their current Piz Daint system, a Xeon plus NVIDIA P100 cluster. According to the two companies, Alps will offer 20 ExaFLOPS of AI performance, which is presumably a combination of CPU, CUDA core, and tensor core throughput. When it’s launched, Alps should be the fastest AI-focused supercomputer in the world.

An artist's rendition of the expected Alps system

Interestingly, however, CSCS’s ambitions for the system go beyond just machine learning workloads. The institute says that they’ll be using Alps as a general purpose system, working on more traditional HPC-type tasks as well as AI-focused tasks. This includes CSCS’s traditional research into weather and the climate, which the pre-AI Piz Daint is already used for as well.

As previously mentioned, Alps will be built by HPE, who will be basing on their previously-announced Cray EX architecture. This would make NVIDIA’s Grace the second CPU option for Cray EX, along with AMD’s EPYC processors.

Meanwhile Los Alamos’ system is being developed as part of an ongoing collaboration between the lab and NVIDIA, with LANL set to be the first US-based customer to receive a Grace system. LANL is not discussing the expected performance of their system beyond the fact that it’s expected to be “leadership-class,” though the lab is planning on using it for 3D simulations, taking advantage of the largest data set sizes afforded by Grace. The LANL system is set to be delivered in early 2023.

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

119 Comments

View All Comments

mode_13h - Wednesday, April 14, 2021 - link
One word: China. I'm sure they don't like ARM being under USA ownership, not that they were thrilled about its former Japanese owner.

China can single-handedly build the market for RISC V (or resurrect MIPS, etc). And if you think that doesn't matter to outsiders, look at who's dominating the global electronics market and even things like wi fi routers. They could go top-to-bottom RISC V, establishing it as at least a presence in virtually all markets and tiers.
Santoval - Tuesday, April 13, 2021 - link
If Intel do not manage to take back the performance and efficiency lead in 2023/2024 (as they said they plan) then "peak Intel" was actually in 2015, when Skylake was released; btw, I am not referring to sales but to performance, efficiency and by and large advantages over the competition. I doubt they can recover after that. Since 2015 they have released for the desktop 5 or 6 (I lost count) minor variations of the exact same design fabbed on minor variations of the exact same process node, simply because they crapped the bed of their 10nm node and were too proud, stubborn or secretive to outsource production to third party fabs.

In the laptop space they arguably did a bit better but not with Ice Lake; that was a new design which traded ~18% higher IPC for ~18% lower clocks (lol), so it was roughly as fast as its predecessor. Its Gen11 iGPU was newer and better but still semi-crappy compared to the Vega iGPUs of AMD. Intel's Tiger Lake (released in late 2020) was the first real threat to AMD, since it sported both higher clocks and a spanking new Xe iGPU which -surprise- was *faster* than the equivalent ancient Vega iGPUs (because AMD turned complacent and pulled their own Gen9 UHD crap with Vega..).

Still, Intel as always played games (perhaps more due to low yields than deceptive marketing this time) with pricing and configurations, with the actually fast parts with big fat Xe iGPUs being very rare to find, and very pricey for those who did find them. The CPU cores of Tiger Lake normally beat the equivalent AMD APUs (I only mean the Zen 2 based ones, obviously; Cezanne can decimate them; so Intel's single thread lead lasted about an entire quarter..) but only in single thread performance, since Tiger Lake was capped to 4 cores.

Since Intel (according to their new CEO) plan to surpass AMD in 2023/2024 they obviously do not think Alder Lake will be competitive enough, which is crazy to think about. Maybe its successor Meteor Lake will be (against Zen 4 or 5?), maybe not. Either way I feel that Intel's peak is behind them, not in front of them; unless of course Meteor Lake or its successor deliver..
mode_13h - Wednesday, April 14, 2021 - link
It's a fantasy to think that Intel could ever just pick up the phone and order enough wafer capacity from TSMC to replace their own 10 nm fabs. Intel basically had no choice but to make their 10 nm work, to a reasonable extent.

Going beyond that, they are embracing new options, and they seem to have been quick to move some of their GPUs off their internal production path.
JayNor - Sunday, May 2, 2021 - link
Why would SPR not be considered a performance leader with DDR5 and PCIE5/CXL?
yetanotherhuman - Tuesday, April 13, 2021 - link
Nobody cared about the new Intel CPUs. They're trying to play catch-up with AMD. Not that exciting.
CiccioB - Tuesday, April 13, 2021 - link
Nobody, right, apart those few thousands customers that order some tens of millions of their chips.
The others do not care, and buy AMD. You are right.
Linustechtips12#6900xt - Tuesday, April 13, 2021 - link
I think other than in the sub 250$ basically anyone should get AMD, but for the sub 250$ they really only have the 3300x/3100 and 3600, unless you want something like a 3400g, I think that a 5600/3600 and a Navi GPU similar in size to the vega GPU on 3400g WOULD BE AMAZING at 250$
Linustechtips12#6900xt - Tuesday, April 13, 2021 - link
and there we have it boys the 5000 siries of AMD apus YESSIRRRR
CiccioB - Tuesday, April 13, 2021 - link
I haven't seen the 4000 series yet, and I doubt the 5000 will be more available than the previous one... AMD is really production limited on 7nm. It bet everything on it, but that was too much also for TSMC production capacity, seen that AMD is not the only customer on that node.

Others have exchanged "high performances at lower power consumption" with "higher availability". And seen the period we are witnessing, there is no doubts on who made the best bet.
Linustechtips12#6900xt - Monday, May 3, 2021 - link
i don't think the 4000 series came out of OEM production just for them because they are special

NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems

First Two Supercomputer Customers: CSCS and LANL

Post Your Comment

119 Comments

View All Comments

mode_13h - Wednesday, April 14, 2021 - link

Santoval - Tuesday, April 13, 2021 - link

mode_13h - Wednesday, April 14, 2021 - link

JayNor - Sunday, May 2, 2021 - link

yetanotherhuman - Tuesday, April 13, 2021 - link

CiccioB - Tuesday, April 13, 2021 - link

Linustechtips12#6900xt - Tuesday, April 13, 2021 - link

Linustechtips12#6900xt - Tuesday, April 13, 2021 - link

CiccioB - Tuesday, April 13, 2021 - link

Linustechtips12#6900xt - Monday, May 3, 2021 - link

Log in

Don't have an account? Sign up now