The NVIDIA Titan V Deep Learning Deep Dive: It's All About The Tensor Cores

Author: Nate Oh

by Nate Oh on July 3, 2018 10:15 AM EST

NVIDIA Caffe2 Docker: ResNet50 and ImageNet

Kernels and deep learning math operations may be useful, but in the end models are trained with real datasets. Using the standard ILSVRC 2012 image set, we run the standard ResNet-50 training implementation that is included in NVIDIA's Caffe2 Docker image. The model trains on ImageNet and gives us some throughput data.

While there were separate switches for FP16 and tensor cores, running FP16 mode with tensor cores enabled and disabled produced identical results for the Titan V.

DL Training: NVIDIA Caffe2 Docker - ResNet-50 with ImageNet Performance
No score indicates card ran out of video memory

In terms of pure throughput, the Titan V takes the lead at all batch sizes. In fact, with tensor cores enabled it is able to go beyond a batch size of 64, unlike the other cards, even though they all have 12 GB of VRAM. The reason is that FP16 consumes less video memory.

DL Training: NVIDIA Caffe2 - ResNet-50 with ImageNet VRAM Utilization
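As a rough illustration of why FP16 storage leaves room for larger batches, the sketch below compares the bytes needed for a single activation tensor in FP32 and FP16. The tensor shape is a hypothetical example and is not taken from the benchmark; real training also holds weights, gradients, and workspace buffers (and often FP32 master copies of the weights), so the overall saving is typically less than half.

```python
import numpy as np

def tensor_bytes(shape, dtype):
    """Bytes needed to store one dense tensor of the given shape and dtype."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

# Hypothetical activation tensor: batch of 64, 256 channels, 56x56 feature map.
shape = (64, 256, 56, 56)

fp32_bytes = tensor_bytes(shape, np.float32)
fp16_bytes = tensor_bytes(shape, np.float16)

print(f"FP32: {fp32_bytes / 2**20:.1f} MiB")  # 196.0 MiB
print(f"FP16: {fp16_bytes / 2**20:.1f} MiB")  # 98.0 MiB
```

Halving the per-element size halves this tensor's footprint, which is the effect that lets the FP16 runs fit batch sizes the FP32 runs cannot.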

The issue with raw throughput metrics is that real-world performance for deep learning is never so simple. For one, many models might be optimized for throughput but sacrifice accuracy and/or training time. Peak or even sustained images trained per second may not be useful if the model takes an extended amount of time to converge. This is particularly relevant for Volta with FP16 storage and tensor cores, as there may be a number of necessary mitigations like loss scaling or single precision batch normalization, which wouldn't be directly accounted for in throughput metrics.

That being said, it has been a struggle to find modern benchmarks that are Volta-aware, reasonably close to state-of-the-art, provide better metrics, go beyond CNNs for computer vision, and are accessible to non-researchers. Throughput benchmarks are easier to validate and create, but in many situations they are better suited for identifying bottlenecks, platform differences, and optimization points.
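To make the loss scaling mitigation mentioned above more concrete, here is a minimal NumPy sketch of the underlying numerical problem. The gradient value and the scale factor are illustrative assumptions, not values from the benchmark or from any framework's defaults.

```python
import numpy as np

# Hypothetical small gradient, chosen to fall below the smallest magnitude
# FP16 can represent (about 6e-8).
true_grad = 1e-8
loss_scale = 1024.0  # illustrative; real values are tuned or adjusted dynamically

# Without scaling, the gradient underflows to zero when stored as FP16.
unscaled_fp16 = np.float16(true_grad)

# With scaling, the loss (and therefore every gradient) is multiplied by
# loss_scale before the FP16 store, then divided out again in FP32.
scaled_fp16 = np.float16(true_grad * loss_scale)
recovered = np.float32(scaled_fp16) / np.float32(loss_scale)

print("FP16 without scaling:", unscaled_fp16)
print("Recovered after scaling:", recovered)
```

The extra multiply and divide are cheap, but they are work that a pure images-per-second figure does not separate out, and a poorly chosen scale can still hurt convergence.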

65 Comments

SirPerro - Thursday, July 5, 2018 - link
The fancy girls with the MacBook in Starbucks are not exactly the target demographic for a deep learning desktop card. Or if they are, their laptop plays no part in that.

This card is meant for professionals who can spend more than its price on Google Cloud training neural networks. For everyone else it makes absolutely no sense.
tipoo - Tuesday, July 3, 2018 - link
That Vega 56/64 in the iMac Pro pitched for deep learning also looks pretty underwhelming...
Demiurge - Wednesday, July 4, 2018 - link
First of all, who "pitched" an iMac Pro for Deep Learning? Why would Apple put a $3k GPU in a model that is typically selling for $4-5k?

Second, what model are you training on Vega that isn't sufficient with 2-4x the FP16/FP8/INT8 throughput of a 1080 Ti? How is that underwhelming?
mode_13h - Wednesday, July 4, 2018 - link
The GP102 in the 1080 Ti and Titan Xp doesn't support double-rate fp16. Just 4x int8 dot product, AFAIK, which you can't really use for training.
Demiurge - Friday, July 6, 2018 - link
My point exactly, since Vega does support double-rate FP16, among other things that the consumer GPUs typically don't support.

As for the DP4A instruction, it is very much used in training.

INT8 datatype support is becoming more important, as is FP8 for reducing training time. Two more features Vega supports free of additional charge.
mode_13h - Friday, July 6, 2018 - link
If 8-bit int were acceptable for training, then why would anyone bother with fp16?

Vega 10 doesn't have meaningful packed 8-bit support of any kind. It has only a couple such instructions that are intended for video compression. Vega 20 will change that, even adding support for packed 4-bit. But your comment seems oriented towards the current Vega.
Demiurge - Sunday, July 8, 2018 - link
Here's some reading (read the first line of the conclusion of the paper: Dettmers, 8-Bit Approximations for Parallelism in Deep Learning, ICLR 2016 https://arxiv.org/pdf/1511.04561.pdf):
https://www.xilinx.com/support/documentation/white...

We shall disagree on "Vega 10 does not have meaningful packed 8-bit support". I'll let someone else argue with you, but I know what you mean. I don't agree, but I think I understand where you are coming from.
mode_13h - Monday, July 9, 2018 - link
Aww... don't pick a fight, then walk away!

Here's the current Vega ISA doc. The only 8-bit packed arithmetic I see is unweighted blending and sum of absolute differences. AFAIK, this is not a useful degree of functionality for deep learning. If I'm wrong, show me.

http://developer.amd.com/wordpress/media/2013/12/V...

As for your first link, that deals with a *custom* 8-bit datatype, from what I can tell - not the int8 supported by Nvidia's DP4A or Vega 20 (from what we know).

Finally, your second link appears to deal *exclusively* with inference. Just like I said.
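For readers following the DP4A discussion in this thread, the operation is a four-way dot product of packed signed 8-bit integers added to a 32-bit accumulator. The function below is a plain Python emulation for illustration only; the function name and the wrap-around behavior on overflow are assumptions of this sketch, not a description of any vendor's implementation.

```python
def dp4a(a, b, acc):
    """Emulate a 4-way signed int8 dot product with a 32-bit accumulator."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dp4a operates on exactly four int8 lanes per operand")
    if any(not -128 <= v <= 127 for v in (*a, *b)):
        raise ValueError("all lanes must fit in a signed 8-bit integer")
    total = acc + sum(x * y for x, y in zip(a, b))
    # Wrap to the signed 32-bit range (an assumption of this sketch).
    return (total + 2**31) % 2**32 - 2**31

result = dp4a([1, -2, 3, -4], [5, 6, -7, 8], acc=10)
print(result)  # 5 - 12 - 21 - 32 + 10 = -50
```

Each call performs four multiplies and four adds on values with only 256 distinct levels, which is why the thread debates whether this precision is enough for training as opposed to inference.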
Nate Oh - Tuesday, July 10, 2018 - link
AMD says in the Vega whitepaper that INT8 SAD is applicable to several machine learning applications. It’s not new to Vega though. Various types of INT8 SAD date back to earlier versions of GCN, and Kepler/Maxwell have single cycle packed INT8 SAD anyhow. But technically, according to AMD it is a useful degree of functionality.
The real caveat to this is real-world DL performance has never been about raw operations per second, new instructions or not. This is one of the main points I wanted to convey with the article. (And to ward off any concerns, DeepBench does not fall under that because A) it uses DL kernels representative of DL applications and B) results are all in microseconds that are converted to TFLOPS using the kernel dimensions; TFLOPS is much easier to present as a measure of performance.)
These instructions are only as good as their DL support. Even with ROCm/HIP/etc, Vega isn’t a drop-in replacement for a 1080 Ti, where you expect the hardware advantage to ‘just work’ in training. You have to port the model and retune with ROCm/MIOpen, HIP or OpenCL, etc., troubleshoot and make sure the hardware features you want to use are actually supported in MIOpen (if MIOpen support is even production-ready in your framework of choice), and the list goes on. Tuning guides for AMD architectures are not yet filled out on their ROCm documentation, and I couldn’t find out if packed INT8 xSAD is well-integrated as some DL primitive in MIOpen. MIOpen also still doesn’t support FP16 for RNNs, or training with CNNs, so no Rapid Packed Math there. Unless you implement these things yourself. I hope you know your GCN assembly.
What I’m trying to say is that for DL hardware (especially GPUs), software support and ecosystem are basically more important right now. If you’re more focused on the DL and not on the GPU side, then you’re more interested in the models and neural networking, less interested in low-level GPU tuning for a new architecture, only familiar with the CUDA ecosystem, and less willing to be a ROCm adventurer without immediate results that you need to publish or use.
So it’s true that Vega brings features to consumer GPUs that they don’t usually support, but using them for DL is not trivial. It’s easy to just say that it’ll work; I can tell you that sitting back in my chair I am super curious about Radeon SSG and DL’s perennial main memory bandwidth/size issue, but somebody has to go develop that implementation.
Which is a long way of stating, Vega doesn’t have packed 8-bit support as influential as Demiurge has claimed.

Links/References
https://radeon.com/_downloads/vega-whitepaper-11.6...
https://www.amd.com/Documents/GCN_Architecture_whi... (SAD for pixel shaders introduced)
http://rocm-documentation.readthedocs.io/en/latest...
http://rocm-documentation.readthedocs.io/en/latest...
https://www.hotchips.org/wp-content/uploads/hc_arc...
https://devtalk.nvidia.com/default/topic/966491/te...
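The microseconds-to-TFLOPS conversion described in the reply above can be sketched for a dense GEMM as follows. This assumes the common convention of counting 2·M·N·K operations for an M×K by K×N matrix multiply; the function name and the timing in the example are hypothetical and are not measurements from the article.

```python
def gemm_tflops(m, n, k, time_us):
    """Convert a GEMM kernel time in microseconds to an effective TFLOPS rate."""
    ops = 2 * m * n * k          # one multiply and one add per inner-product term
    seconds = time_us * 1e-6
    return ops / seconds / 1e12

# Hypothetical timing: a 4096 x 4096 x 4096 GEMM finishing in 1500 microseconds.
rate = gemm_tflops(4096, 4096, 4096, time_us=1500.0)
print(f"{rate:.1f} TFLOPS")
```

Because the operation count comes from the kernel dimensions rather than from hardware counters, the result is an effective rate for that problem size, not a measure of peak capability.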
mode_13h - Tuesday, July 10, 2018 - link
> AMD says in the Vega whitepaper that INT8 SAD is applicable to several machine learning applications.

"The NCU also supports a set of 8-bit integer SAD (Sum of Absolute Differences) operations. These operations are important for a wide range of video and image processing algorithms, including image classification for machine learning, motion detection, gesture recognition, stereo depth extraction, and computer vision."

Eh, I still don't think they mean deep learning. Probably, they're referring to some classical image processing techniques, or maybe preprocessing prior to feeding a CNN. Ideally, you might ask them for examples of where it's used, or maybe at least paper citations (which are conspicuously absent from that part of their whitepaper). But I know this was an ambitious article, so maybe it's something to keep in mind for your coverage of Vega 20.

> results are all in microseconds that are converted to TFLOPS using the kernel dimensions

Dude, that's messed up. At the very least, it shouldn't be TFLOPS unless you're actually using floating-point arithmetic. And if you're using a framework that optimizes your model (such as TensorRT, I think), I wouldn't report end performance as if it were actually using the unoptimized kernel.

> Tuning guides for AMD architectures are not yet filled out on their ROCm documentation, and I couldn’t find out if packed INT8 xSAD is well-integrated as some DL primitive in MIOpen.

Well, "Open" means open source, in this case. One could try and have a look. That said, it's a huge article and nobody could've reasonably expected you to do more. It'd be more digestible if broken into a couple installments, actually.

Anyway, the rough state of MIOpen is actually one of the main reasons we don't currently use AMD. I hope the situation changes by the time Vega 20 launches.

Anyhow, thanks for the comprehensive reply, not to mention the article. There are still a few parts I need to go back & read!
