A Look at the Deep Learning Benchmark Landscape

For a new and sometimes impenetrable field like deep learning, where so much is customized to the hardware at hand – from frameworks and models to APIs and libraries – it's no surprise that there is little in the way of industry-accepted, publicly accessible benchmarking tools. Like HPC, deep learning has much of its roots in academic research, but its GPU-led arrival in the workstation-class hardware space is new. In a short time, we've heard of and seen deep learning datacenters, deep learning software, and deep learning on basically every hardware implementation, running the gamut from CPUs, GPUs, and SoCs to ASICs, FPGAs, and just about anything else you can fab on silicon.

The applications of deep learning are accordingly less familiar to end-users, except when they surface as buzzwords describing future products or current devices with mediocre inferencing capabilities. But ultimately, because deep learning encompasses both training and inferencing, it has a legitimate reason to involve every type of hardware. That's partially what makes it so enticing, though the situation is somewhat of a chicken-and-egg scenario; cryptomining and blockchains were treated very differently before their latest surge in popularity (and infamy).

In terms of benchmarking GPUs for traditional HPC and workstation performance, there are several standardized suites (e.g. SPECviewperf, SiSoft Sandra) that produce relatively consumer-accessible data, not to mention direct comparisons to real-world performance in ISV workstation software. That is not yet the case for deep learning.

Modern DL Benchmarking

The past couple of years have seen a renewed effort to create standardized, external benchmark suites, but the mainstays have remained the reference implementations bundled with DL frameworks like TensorFlow. With the impact of ImageNet and some of the models that have emerged from it (AlexNet, VGGNet, Inception, ResNet, to name a few), training on the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) image dataset is considered an industry-standard task of sorts.

As it plays out in the media, reference models in framework repositories are often run in isolation and offered up as raw peak throughput numbers; for image recognition, this means 'images trained per second.' Given the amount of configuration sometimes needed, this is understandable.
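
For reference, the metric itself is straightforward to reproduce. Below is a minimal sketch of how such an 'images per second' figure is typically measured, using PyTorch with a synthetic batch; the model, batch size, and iteration counts are arbitrary choices for illustration, not any suite's official configuration.

```python
import time
import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size, iters = 64, 50
images = torch.randn(batch_size, 3, 224, 224, device=device)   # synthetic data
labels = torch.randint(0, 1000, (batch_size,), device=device)  # random labels

def train_step():
    # One full training iteration: forward, backward, weight update
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

for _ in range(5):            # warm-up (memory allocation, cuDNN autotuning)
    train_step()
if device.type == "cuda":
    torch.cuda.synchronize()

start = time.time()
for _ in range(iters):
    train_step()
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
elapsed = time.time() - start

print(f"{batch_size * iters / elapsed:.1f} images/sec")
```

Numbers produced this way say nothing about whether the model would actually converge to a useful accuracy, which is exactly the gap the newer suites try to close.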

The recent releases of third-party deep learning benchmark suites very much look to solve that issue with standardization and accessible data. Of these, Fathom and TBD are more conventional benchmark suites, with tests configured for specific frameworks and models and covering many of the different machine learning applications. Meanwhile, the recent Deep Learning Frameworks Comparison focuses on comparing performance for a given model and dataset across frameworks.

Comparison of Selected Deep Learning Benchmark Suites
Test Suite | Published | Collaborators
Fathom | 9/2016 | Harvard University; C-FAR
Baidu DeepBench | 9/2016 | Baidu Research; NVIDIA, Intel, Arm, AMD
Stanford DAWN Deep Learning Benchmark (DAWNBench) | 11/2017 | Stanford DAWN Project (incl. Intel, Microsoft, and Google)
HPE Deep Learning Benchmark Suite (DLBS) | 11/2017 | HPE
Training Benchmark for DNNs (TBD) | 3/2018 | University of Toronto; Microsoft Research
Deep Learning Frameworks Comparison | 3/2018 | Microsoft Machine Learning
MLPerf | 5/2018 ("Alpha") | Harvard, Stanford, Berkeley, University of Minnesota, University of Toronto; Google, Baidu, Intel, AMD, and others

As for the bulk of our results today, DeepBench does not use frameworks per se, instead using low-level libraries to evaluate the performance of machine learning operations across devices and machines with various preset kernels. While it does not directly measure framework/model/application performance the way the other tests do, it provides metrics that are representative of mathematical operations and of hardware capability as optimized by vendors; the binaries for each product are compiled with the libraries that the hardware vendors (NVIDIA, Intel, Arm, AMD) provide and implement. This gives us a point of comparison between devices that is independent of frameworks and datasets.
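
As a rough illustration of how DeepBench-style kernel results translate into the familiar ops-per-second figures, the sketch below converts a GEMM kernel's dimensions and measured time into TFLOPS. The kernel shape and timing are illustrative stand-ins, not actual DeepBench output.

```python
def gemm_tflops(m, n, k, time_usec):
    """An M x N x K matrix multiply performs roughly 2*M*N*K floating-point
    operations (multiply + add); dividing by the measured time gives TFLOPS."""
    flops = 2.0 * m * n * k
    return flops / (time_usec * 1e-6) / 1e12

# e.g. a hypothetical 5124 x 700 x 2048 training GEMM measured at ~600 microseconds
print(f"{gemm_tflops(5124, 700, 2048, 600.0):.2f} TFLOPS")
```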

DAWNBench stands apart from the others: it is not so much a benchmark suite as a competition-like reporting of training and inference results for three datasets: ImageNet, CIFAR10, and SQuAD. The focus here is on real-world applicable data, namely end-to-end metrics of time-to-accuracy and cost, as opposed to raw accuracy or throughput.
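
Conceptually, DAWNBench's headline metrics reduce to something like the sketch below: find the first point in training where validation accuracy crosses the target threshold (93% top-5 for ImageNet), and optionally price the elapsed time. The log values and instance price here are invented for illustration.

```python
# Hypothetical training log: (elapsed hours, top-5 validation accuracy)
epoch_log = [
    (0.5, 0.71), (1.0, 0.84), (1.5, 0.902), (2.0, 0.928), (2.5, 0.935),
]
THRESHOLD = 0.93          # DAWNBench's ImageNet top-5 accuracy target
PRICE_PER_HOUR = 3.06     # assumed hourly cloud GPU instance price, USD

# Time-to-accuracy: first checkpoint at or above the threshold
time_to_acc = next((hours for hours, acc in epoch_log if acc >= THRESHOLD), None)
if time_to_acc is not None:
    print(f"Time to {THRESHOLD:.0%} top-5: {time_to_acc} h, "
          f"cost ~${time_to_acc * PRICE_PER_HOUR:.2f}")
else:
    print("Model never reached the target accuracy")
```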

HPE DLBS, part of HPE's Deep Learning Cookbook, is largely GPU-focused and sticks to TensorFlow, MXNet, PyTorch, and Caffe-type frameworks, with additional TensorRT testing. While the implementation has well-featured multi-test batching, logging, monitoring, and reporting, it outputs purely performance and time metrics, without any end-to-end measurements of time-to-accuracy or cost.

The most recent high-profile benchmark suite, MLPerf, includes researchers and engineers who previously worked on DAWNBench and other suites; for all intents and purposes, the DAWNBench project has now been superseded by MLPerf. Explicitly aspiring to do for machine learning what SPEC does for general-purpose compute and TPC does for database systems, MLPerf looks to combine Fathom's approach of cross-domain ML tests with DAWNBench's focus on end-to-end time to train a model above a threshold accuracy. Being so new, however, it is currently in an alpha release, and the reference benchmarks are stated as not suitable for accurate hardware comparisons. For that reason, we have not incorporated any MLPerf testing in this review.

Modern DL being such a new and rapidly changing field, new benchmarks appear quickly, perhaps even as recently as last week, while old ones drift away, like the defunct DeepMark (also created by Soumith Chintala) and BenchIP. MLPerf, for its part, does seem to be building off of all the lessons learned by its predecessors.

Benchmark Accuracy and Metrics

The Deep Learning Frameworks Comparison benchmark brings up a useful point: differences between frameworks can easily lead to unintended consequences, and thus invalid benchmarks, an issue that affected the Deep Learning Frameworks Comparison itself, as noted by Yuxin Wu (creator of tensorpack for TensorFlow). And in general, most DL benchmarks are invalid or less accurate in this way – something that Soumith Chintala (creator of convnet-benchmarks and PyTorch) has pointed out. In the end, without a background in machine learning, there is no easy way to independently validate the accuracy and scope of DL benchmarks, which is something the MLPerf project appears to be trying to address.

Another issue is the difficulty of tracking down model variants or reproducing published results; benchmark implementations often originate from publications, reference model implementations, or ML competitions like Kaggle.

For our purposes, though, the situation is slightly different, as we are testing GPU performance rather than framework or model performance – though un-optimized benchmarks would skew GPU performance results anyhow. For these reasons, micro-benchmarks such as DeepBench, along with 32-bit CNN benchmarks, can still be useful for comparing performance between GPUs and between hardware vendors.

Models, Frameworks, and Datasets

The other factor is the sheer number of deep learning models, frameworks, and datasets. Fortunately, benchmark suites tend to use the same models and datasets, while competition-style suites like DAWNBench forgo a mandated framework or model altogether.

As far as frameworks go, essentially all modern DL frameworks support CUDA and cuDNN. For Volta, all frameworks with FP16 storage support also support tensor core acceleration; if FP16 storage is enabled, tensor core acceleration is automatically enabled as well. We will want to utilize these frameworks in order to look at tensor core performance.
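
As a quick illustration of what enabling FP16 storage looks like from the framework side, here is a minimal PyTorch sketch (model, batch size, and input shape are arbitrary choices): casting the model and data to half precision is what lets cuDNN/cuBLAS route eligible convolutions and GEMMs onto Volta's tensor cores.

```python
import torch
import torchvision.models as models

# Requires a CUDA-capable GPU; FP16 storage means the weights and
# activations themselves are stored and computed in half precision.
model = models.resnet50().cuda().half()
images = torch.randn(32, 3, 224, 224).cuda().half()

with torch.no_grad():
    out = model(images)      # eligible layers can be dispatched to tensor cores
print(out.dtype)             # torch.float16
```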

Comparison of Selected Deep Learning Frameworks
Framework | Support for cuDNN | Support for FP16 Storage | Support for Tensor Core Math
NVCaffe | Yes | Yes | Yes
Caffe2 | Yes | Yes | Yes
MXNet | Yes | Yes | Yes
PyTorch | Yes | Yes | Yes
Torch | Yes | No | No
Chainer | Yes | No → Yes* | No → Yes*
TensorFlow | Yes | Yes | Yes
Theano | Yes | Yes | Yes
Microsoft Cognitive Toolkit (formerly CNTK) | Yes | No → Yes* | No → Yes*
* Corrected from "No"; see the update below.

Update (7/16/2018): Microsoft reached out to clarify that CNTK has supported FP16 and tensor cores since version 2.4, which released in January 2018. The information was originally sourced from NVIDIA's Mixed Precision Training Guide, and Microsoft is working with NVIDIA to correct it. In light of this, we have also found that Chainer 4 has supported FP16/tensor cores to some degree since at least April 2018.

That being said, just because a framework can exploit FP16 storage and tensor cores doesn't mean it will; the mixed precision guidelines we discussed earlier are very much applicable. A benchmark or test on a given model is not necessarily configured to utilize FP16 and tensor cores out of the box, even if it is built on a compatible framework. And even if it is, the model may not converge without further modification.
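
To make the convergence point concrete, below is a simplified sketch of the static loss-scaling step described in those mixed precision guidelines, written as generic PyTorch; the scale factor, parameter lists, and bookkeeping are illustrative assumptions, not any framework's actual implementation.

```python
import torch

LOSS_SCALE = 128.0  # static scale factor; an illustrative assumption

def scaled_step(loss, optimizer, fp16_params, fp32_master_params):
    """One step with static loss scaling: scale the loss so small FP16 gradients
    don't underflow to zero, unscale into FP32 master gradients, update the FP32
    master weights, then copy the result back into the FP16 model."""
    (loss * LOSS_SCALE).backward()
    for p16, p32 in zip(fp16_params, fp32_master_params):
        if p16.grad is not None:
            p32.grad = p16.grad.detach().float() / LOSS_SCALE  # unscale in FP32
    optimizer.step()                      # optimizer holds the FP32 master params
    for p16, p32 in zip(fp16_params, fp32_master_params):
        p16.data.copy_(p32.data)          # refresh the FP16 copy of the weights
        p16.grad = None                   # clear FP16 gradients for the next step
    optimizer.zero_grad()
```

Without this kind of modification (or a framework that applies it for you), a naively converted FP16 model can train quickly yet never reach its FP32 accuracy.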

In the future, we can look forward to interoperable framework formats like ONNX and NNEF as another datapoint.
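
As a small taste of what that interoperability looks like today, here is a brief sketch of exporting a PyTorch model to the ONNX format so it can be consumed by another framework or runtime; the model choice and file name are arbitrary.

```python
import torch
import torchvision.models as models

model = models.resnet50().eval()                 # any exportable model
dummy_input = torch.randn(1, 3, 224, 224)        # example input to trace the graph
torch.onnx.export(model, dummy_input, "resnet50.onnx")
```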

Comments

  • SirPerro - Thursday, July 5, 2018 - link

The fancy girls with the MacBook in Starbucks are not exactly the target demographic for a deep learning desktop card. Or if they are, their laptop plays no part in that.

    This card is meant for professionals who can spend more than its price in google cloud training neural networks. For everyone else it makes absolutely no sense.
  • tipoo - Tuesday, July 3, 2018 - link

    That Vega 56/64 in the iMac Pro pitched for deep learning also looks pretty underwhelming...
  • Demiurge - Wednesday, July 4, 2018 - link

First of all, who "pitched" an iMac Pro for Deep Learning? Why would Apple put a $3k GPU in a model that is typically selling for $4-5k?

    Second, what model are you training on Vega that isn't sufficient with 2-4x the FP16/FP8/INT8 throughput of a 1080 Ti? How is that underwhelming?
  • mode_13h - Wednesday, July 4, 2018 - link

    The GP102 in the 1080 Ti and Titan Xp doesn't support double-rate fp16. Just 4x int8 dot product, AFAIK, which you can't really use for training.
  • Demiurge - Friday, July 6, 2018 - link

My point exactly, since Vega does support double-rate FP16, among other things that consumer GPUs typically don't support.

    As for the DP4A instruction, it is very much used in training.

    INT8 datatype support is becoming more important, as is FP8 for reducing training time. Two more features Vega supports free of additional charge.
  • mode_13h - Friday, July 6, 2018 - link

    If 8-bit int were acceptable for training, then why would anyone bother with fp16?

    Vega 10 doesn't have meaningful packed 8-bit support of any kind. It has only a couple such instructions that are intended for video compression. Vega 20 will change that, even adding support for packed 4-bit. But your comment seems oriented towards the current Vega.
  • Demiurge - Sunday, July 8, 2018 - link

    Here's some reading (read the first line of the conclusion of the paper: Dettmers, 8-Bit Approximations for Parallelism in Deep Learning, ICLR 2016 https://arxiv.org/pdf/1511.04561.pdf):
    https://www.xilinx.com/support/documentation/white...

    We shall disagree on "Vega 10 does not have meaningful packed 8-bit support". I'll let someone else argue with you, but I know what you mean. I don't agree, but I think I understand where you are coming from.
  • mode_13h - Monday, July 9, 2018 - link

    Aww... don't pick a fight, then walk away!

    Here's the current Vega ISA doc. The only 8-bit packed arithmetic I see is unweighted blending and sum of absolute differences. AFAIK, this is not a useful degree of functionality for deep learning. If I'm wrong, show me.

    http://developer.amd.com/wordpress/media/2013/12/V...

    As for your first link, that deals with a *custom* 8-bit datatype, from what I can tell - not the int8 supported by Nvidia's DP4A or Vega 20 (from what we know).

    Finally, your second link appears to deal *exclusively* with inference. Just like I said.
  • Nate Oh - Tuesday, July 10, 2018 - link

AMD says in the Vega whitepaper that INT8 SAD is applicable to several machine learning applications. It’s not new to Vega though. Various types of INT8 SAD date back to earlier versions of GCN, and Kepler/Maxwell have single-cycle packed INT8 SAD anyhow. But technically, according to AMD, it is a useful degree of functionality.
    The real caveat to this is that real-world DL performance has never been about raw operations per second, new instructions or not. This is one of the main points I wanted to convey with the article. (And to ward off any concerns, DeepBench does not fall under that, because A) it uses DL kernels representative of DL applications and B) results are all in microseconds that are converted to TFLOPS using the kernel dimensions; TFLOPS is much easier to present as a measure of performance.)
    These instructions are only as good as their DL support. Even with ROCm/HIP/etc, Vega isn’t a drop-in replacement for a 1080 Ti, where you expect the hardware advantage to ‘just work’ in training. You have to port the model and retune with ROCm/MIOpen, HIP or OpenCL, etc., troubleshoot and make sure the hardware features you want to use are actually supported in MIOpen (if MIOpen support is even production-ready in your framework of choice), and the list goes on. Tuning guides for AMD architectures are not yet filled out on their ROCm documentation, and I couldn’t find out if packed INT8 xSAD is well-integrated as some DL primitive in MIOpen. MIOpen also still doesn’t support FP16 for RNNs, or training with CNNs, so no Rapid Packed Math there. Unless you implement these things yourself. I hope you know your GCN assembly.
    What I’m trying to say is that for DL hardware (especially GPUs), software support and ecosystem are basically more important right now. If you’re more focused on the DL and not on the GPU side, then you’re more interested in the models and neural networking, less interested in low-level GPU tuning for a new architecture, only familiar with the CUDA ecosystem, and less willing to be a ROCm adventurer without immediate results that you need to publish or use.
    So it’s true that Vega brings features to consumer GPUs that they don’t usually support, but using them for DL is not trivial. It’s easy to just say that it’ll work; I can tell you that sitting back in my chair I am super curious about Radeon SSG and DL’s perennial main memory bandwidth/size issue, but somebody has to go develop that implementation.
    Which is a long way of stating, Vega doesn’t have packed 8-bit support as influential as Demiurge has claimed.

    Links/References
    https://radeon.com/_downloads/vega-whitepaper-11.6...
    https://www.amd.com/Documents/GCN_Architecture_whi... (SAD for pixel shaders introduced)
    http://rocm-documentation.readthedocs.io/en/latest...
    http://rocm-documentation.readthedocs.io/en/latest...
    https://www.hotchips.org/wp-content/uploads/hc_arc...
    https://devtalk.nvidia.com/default/topic/966491/te...
  • mode_13h - Tuesday, July 10, 2018 - link

    > AMD says in the Vega whitepaper that INT8 SAD is applicable to several machine learning applications.

    "The NCU also supports a set of 8-bit integer SAD (Sum of Absolute Differences) operations. These operations are important for a wide range of video and image processing algorithms, including image classification for machine learning, motion detection, gesture recognition, stereo depth extraction, and computer vision."

    Eh, I still don't think they mean deep learning. Probably, they're referring to some classical image processing techniques, or maybe preprocessing prior to feeding a CNN. Ideally, you might ask them for examples of where it's used, or maybe at least paper citations (which are conspicuously absent from that part of their whitepaper). But I know this was an ambitious article, so maybe it's something to keep in mind for your coverage of Vega 20.

    > results are all in microseconds that are converted to TFLOPS using the kernel dimensions

    Dude, that's messed up. At the very least, it shouldn't be TFLOPS unless you're actually using floating-point arithmetic. And if you're using a framework that optimizes your model (such as TensorRT, I think), I wouldn't report end performance as if it were actually using the unoptimized kernel.

    > Tuning guides for AMD architectures are not yet filled out on their ROCm documentation, and I couldn’t find out if packed INT8 xSAD is well-integrated as some DL primitive in MIOpen.

    Well, "Open" means open source, in this case. One could try and have a look. That said, it's a huge article and nobody could've reasonably expected you to do more. It'd be more digestible if broken into a couple installments, actually.

    Anyway, the rough state of MIOpen is actually one of the main reasons we don't currently use AMD. I hope the situation changes by the time Vega 20 launches.

    Anyhow, thanks for the comprehensive reply, not to mention the article. There are still a few parts I need to go back & read!
