NVIDIA’s DGX-2: Sixteen Tesla V100s, 30 TB of NVMe, only $400K

by Ian Cutress on March 27, 2018 2:00 PM EST

Ever wondered why the consumer GPU market is not getting much love from NVIDIA’s Volta architecture yet? This is a minefield of a question, nuanced by many different viewpoints and angles – even asking the question pokes the proverbial hornets' nest of different possibilities inside my own mind. Here is one angle to consider: NVIDIA is currently loving the data center and deep learning markets, and making money hand over fist. The Volta architecture, with its Tensor cores, is bringing high performance to these markets, and the customers are willing to pay for it. So, introducing the latest monster from NVIDIA: the DGX-2.

DGX-2 builds upon DGX-1 in several ways. Firstly, it introduces NVIDIA’s new NVSwitch, enabling 300 GB/s chip-to-chip communication at 12 times the speed of PCIe. This, with NVLink2, enables sixteen GPUs to be grouped together in a single system, for a total bandwidth going beyond 14 TB/s. Add in a pair of Xeon CPUs, 1.5 TB of memory, and 30 TB of NVMe storage, and we get a system that consumes 10 kW and weighs 350 lbs, but offers easily double the performance of the DGX-1. NVIDIA likes to tout that this means it offers a total of ~2 PFLOPs of compute performance in a single system when using the tensor cores.

NVIDIA DGX Series (with Volta)
	DGX-2	DGX-1
CPUs	2 x Intel Xeon Platinum	2 x Intel Xeon E5-2600 v4
GPUs	16 x NVIDIA Tesla V100 32 GB HBM2	8 x NVIDIA Tesla V100 16 GB HBM2
System Memory	Up to 1.5 TB DDR4	Up to 0.5 TB DDR4
GPU Memory	512 GB HBM2 (16 x 32 GB)	128 GB HBM2 (8 x 16 GB)
Storage	30 TB NVMe (up to 60 TB)	4 x 1.92 TB NVMe
Networking	8 x InfiniBand or 8 x 100 GbE	4 x InfiniBand + 2 x 10 GbE
Power	10 kW	3.5 kW
Weight	350 lbs	134 lbs
GPU Throughput	Tensor: 1920 TFLOPs; FP16: 480 TFLOPs; FP32: 240 TFLOPs; FP64: 120 TFLOPs	Tensor: 960 TFLOPs; FP16: 240 TFLOPs; FP32: 120 TFLOPs; FP64: 60 TFLOPs
Cost	$399,000	$149,000
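
As a quick sanity check on the throughput rows, the sketch below divides each aggregate figure by the GPU count. The numbers are copied from the table; the dictionary and function names are illustrative only and not anything NVIDIA publishes.

```python
# Aggregate GPU throughput in TFLOPs, copied from the table above.
DGX2 = {"gpus": 16, "tensor": 1920, "fp16": 480, "fp32": 240, "fp64": 120}
DGX1 = {"gpus": 8, "tensor": 960, "fp16": 240, "fp32": 120, "fp64": 60}


def per_gpu(system):
    """Divide each aggregate throughput figure by the system's GPU count."""
    return {
        precision: total / system["gpus"]
        for precision, total in system.items()
        if precision != "gpus"
    }


dgx2_per_gpu = per_gpu(DGX2)
dgx1_per_gpu = per_gpu(DGX1)

print("DGX-2 per GPU:", dgx2_per_gpu)
print("DGX-1 per GPU:", dgx1_per_gpu)
print("Same per-GPU throughput:", dgx2_per_gpu == dgx1_per_gpu)
```

By these table figures, both systems work out to the same per-GPU throughput, so the doubling in the DGX-2 column comes from the GPU count rather than from faster GPUs.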

NVIDIA’s overall topology relies on a dual stacked system. The high-level concept photo provided indicates that there are actually 12 NVSwitches (216 ports) in the system in order to maximize the amount of bandwidth available between the GPUs. With 6 NVLink ports per Tesla V100 GPU, each GPU being the larger 32 GB HBM2 configuration, the Teslas alone would take up 96 of those ports if NVIDIA has them fully wired up to maximize individual GPU bandwidth within the topology.
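
The port arithmetic in that paragraph can be laid out as a short sketch. The four input figures come from the text; the constant names are mine, and what the remaining switch ports are used for is not stated here, so the last number is simply a leftover count rather than a claim about the wiring.

```python
# Figures quoted in the paragraph above.
NVSWITCH_COUNT = 12
TOTAL_SWITCH_PORTS = 216
GPU_COUNT = 16
NVLINK_PORTS_PER_GPU = 6

# Ports on each individual NVSwitch, assuming all switches are identical.
ports_per_switch = TOTAL_SWITCH_PORTS // NVSWITCH_COUNT

# Switch ports consumed if every GPU has all of its NVLink ports wired up.
gpu_facing_ports = GPU_COUNT * NVLINK_PORTS_PER_GPU

# Ports left over for other purposes (usage not specified in the article).
remaining_ports = TOTAL_SWITCH_PORTS - gpu_facing_ports

print(f"Ports per NVSwitch: {ports_per_switch}")
print(f"GPU-facing ports:   {gpu_facing_ports}")
print(f"Remaining ports:    {remaining_ports}")
```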

AlexNet, the network that 'started' the latest machine learning revolution, now takes 18 minutes

Notably here, the topology of the DGX-2 means that all 16 GPUs are able to pool their memory into a unified memory space, though with the usual tradeoffs involved if going off-chip. Much like the Tesla V100 memory capacity increase, then, one of NVIDIA’s goals here is to build a system that can keep workloads in memory that would be too large for an 8-GPU cluster. As one example, NVIDIA says that the DGX-2 can complete the training process for FAIRSEQ – a neural network model for language translation – 10x faster than a DGX-1 system, bringing it down to less than two days rather than 15.

Otherwise, similar to its DGX-1 counterpart, the DGX-2 is designed to be a powerful server in its own right. Exact specifications are still TBD, but NVIDIA has already told us that it’s based around a pair of Xeon Platinum CPUs, which in turn can be paired with up to 1.5 TB of RAM. On the storage side the DGX-2 comes with 30 TB of NVMe-based solid state storage, which can be further expanded to 60 TB. And for clustering or further inter-system communications, it also offers up to eight InfiniBand or 100GigE connections.

The new NVSwitches mean that the PCIe lanes of the CPUs can be redirected elsewhere, most notably towards storage and networking connectivity.

Ultimately the DGX-2 is being pitched at an even higher-end segment of the deep-learning market than the DGX-1 is. Pricing for the system runs at $400k, rather than the $150k for the original DGX-1. For more than double the money, the user gets Xeon Platinums (rather than Xeon E5-2600 v4), double the V100 GPUs each with double the HBM2, triple the DRAM, and roughly 4x the NVMe storage by default (30 TB versus 4 x 1.92 TB).
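
To put the pricing in perspective, here is a rough calculation using the list prices and tensor throughput from the spec table. The variable names are mine, and cost per tensor TFLOP is only one crude way to compare the two systems; it ignores memory capacity, interconnect, and everything else in the box.

```python
# List prices (USD) and aggregate tensor throughput (TFLOPs) from the spec table.
DGX2_PRICE, DGX2_TENSOR_TFLOPS = 399_000, 1920
DGX1_PRICE, DGX1_TENSOR_TFLOPS = 149_000, 960

# How much more the DGX-2 costs, and how much more tensor throughput it lists.
price_ratio = DGX2_PRICE / DGX1_PRICE
tensor_ratio = DGX2_TENSOR_TFLOPS / DGX1_TENSOR_TFLOPS

# Dollars paid per tensor TFLOP for each system.
dgx2_cost_per_tflop = DGX2_PRICE / DGX2_TENSOR_TFLOPS
dgx1_cost_per_tflop = DGX1_PRICE / DGX1_TENSOR_TFLOPS

print(f"Price ratio (DGX-2 / DGX-1):  {price_ratio:.2f}x")
print(f"Tensor throughput ratio:      {tensor_ratio:.2f}x")
print(f"DGX-2 cost per tensor TFLOP: ${dgx2_cost_per_tflop:,.2f}")
print(f"DGX-1 cost per tensor TFLOP: ${dgx1_cost_per_tflop:,.2f}")
```

On raw tensor throughput alone the DGX-2 costs more per TFLOP than the DGX-1; the premium is presumably for the NVSwitch fabric, the larger pooled memory, and the extra storage and networking.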

NVIDIA has stated that DGX-2 is already certified for the major cloud providers.

28 Comments

WithoutWeakness - Tuesday, March 27, 2018 - link
But can it run Crysis in 4K?
ToTTenTranz - Tuesday, March 27, 2018 - link
Asking the real questions.
Holliday75 - Tuesday, March 27, 2018 - link
We've moved on from Crysis and 4K. Now it's how many coins can it mine?
Notmyusualid - Wednesday, March 28, 2018 - link
@ Holliday75

Indeed, and beat me to it.

I'm guessing ~100Mh/s from each GPU, x16 = 1.6Gh/s

Power will be ~220W each (maybe less for these newer babies) x16 = 3.52kW

Add in a couple of meaty Platinum 8180s, that can draw 205W each, but will likely draw 40 something watts (off the cuff guess), whilst idling away. Couple that with some mammoth m/b that wants a couple hundred watts, all that RAM & unused NVMe SSDs, and we'll round that up to 500W.

So my guesstimate is ~4kW. Let's throw in 5% PSU conversion losses (I suppose they are gonna be good), so we are looking at 4.2kW continuous power draw for Ethereum mining. Less than 5kW total for sure though.

So $?

Annually:
46.46973093 coins mined.
Power Cost [10c/kWh] (in USD) $3,679.20
Profit (in USD) $17,398.55 per year.
Days to break even: 8391.50 Day(s).

If we re-run this with my UK energy costs:

Days to break even: 9610.94 Day(s).

Clearly I am on holiday too - to even bother with this response.

In addition, someone ran the V100's on an Amazon AWS cluster for nearly an hour, and with all costs considered, came out at *negative* $25k USD/year. Interesting though, and well written up.
SiSiX - Tuesday, March 27, 2018 - link
I don't know about 4k, but I would think it could finally play it at 640x480 at the lowest settings...probably. ;)
Jon Tseng - Wednesday, March 28, 2018 - link
Think you'll struggle. May still have to dial down the FSAA settings. :-p
Santoval - Friday, March 30, 2018 - link
I am pretty sure it can run Crysis 4 at 16K with plenty of GPU and CPU power to spare for other stuff.
THE1ABOVEALL - Thursday, May 3, 2018 - link
Sarcasm got deleted out of your dictionary, I see.
The Hardcard - Tuesday, March 27, 2018 - link
It would be interesting to see a comparison between the DGX-2 and the POWER9 systems with NVLink to the processors. I don’t know offhand how many GPUs you can stuff in the IBM, but it seems like there is a lot more bandwidth.

It is notable that NVIDIA went with Xeons. Because POWER would be redundant, or some combination of price/performance/energy advantage.
The Hardcard - Tuesday, March 27, 2018 - link
OK, quick check - 6 GPUs max in IBM. But, if they built an NVLink switch for it, it would attach to PCIe 4 vs. PCIe 3. But again, at what price and energy usage.

Probably also a question of software availability and development. That bandwidth tho.
