A Look at the Deep Learning Benchmark Landscape

For a new and sometimes impenetrable field like deep learning, where so much is customized to the hardware-at-hand – from frameworks and models to APIs and libraries – it's no surprise that there is little in the way of industry-accepted and publicly accessible benchmarking tools. Like HPC, much of its roots are in academic research, but deep learning's GPU-led arrival into the workstation-class hardware space is new. In a short time, we've heard of and seen deep learning datacenters, deep learning software, and basically every hardware implementation, running the gamut from CPUs, GPUs, and SoCs, to ASICs, FPGAs, and just about anything else you can fab on silicon.

So the applications of deep learning are less familiar to end-users, except when used as buzzwords to describe future products or current devices with mediocre inferencing capabilities. But ultimately, because deep learning encapsulates both training and inferencing, it has legitimate reason to include all types of hardware. That's partially what makes it so enticing, though the situation is somewhat of a chicken-and-egg scenario; cryptomining and blockchains were treated very differently before the latest surge in popularity (and infamy).

In terms of benchmarking GPUs for traditional HPC and workstation performance, there are several standardized suites (e.g. SPECviewperf, SiSoftSandra) that produce relatively consumer-accessible data, not to mention direct comparisons to real-world performance in ISV workstation software. This is not the case here.

Modern DL Benchmarking

The past couple years has seen a renewed effort to create a type of external benchmark suite, but the mainstays have been many of the reference implementations of DL frameworks like TensorFlow. With the impact of ImageNet and some of the models that have emerged from it (AlexNet, VGGNet, Inception, Resnet, to name a few), training on the Imagenet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) image dataset is considered to be an industry-standard task of sorts.

As it plays out in the media, reference models in framework repositories are often run in isolation and are offered up as raw peak throughput numbers; for image recognition, this would be 'images trained per second.' Though given the amount of configuration sometimes needed, this is understandable.

The recent releases of third-party deep learning benchmark suites very much look to solve that issue with standardization and accessible data. Of these, Fathom and TBD are more conventional benchmark suites with tests configured for specific frameworks and models, covering many of the different machine learning applications. Meanwhile, the recent Deep Learning Frameworks focuses on comparing performance for a given model and dataset across frameworks.

Comparison of Selected Deep Learning Benchmark Suites
Test Suite Published Collaborators
Fathom 9/2016 Harvard University
C-FAR
Baidu DeepBench 9/2016 Baidu Research
NVIDIA, Intel, Arm, AMD
Stanford DAWN Deep Learning Benchmark (DAWNBench) 11/2017 Stanford DAWN Project
(incl. Intel, Microsoft, and Google)
HPE Deep Learning Benchmark Suite (DLBS) 11/2017 HPE
Training Benchmark for DNNs (TBD) 3/2018 University of Toronto
Microsoft Research
Deep Learning Frameworks Comparison 3/2018 Microsoft Machine Learning
MLPerf 5/2018
("Alpha")
Harvard, Stanford, Berkeley, University of Minnesota, University of Toronto
Google, Baidu, Intel, AMD, and others

As for the bulk of our results today, DeepBench does not use frameworks per se, instead using low-level libraries to evaluate performance of machine learning operations across devices and machines with various preset kernels. On its own, while it does not directly implicate framework/model/application performance as other tests, instead it provides metrics that are representative of mathmatical operations and hardware capability as optimized by vendors; the binaries for each product are compiled with libraries that the hardware vendors (NVIDIA, Intel, Arm, AMD) provide and implement. This allows us to have a point-of-comparison between devices independent of frameworks and datasets.

One of the more different ones is DAWNBench, which is not so much a benchmark suite as it is a competition-like reporting of training and inference results for three datasets: ImageNet, CIFAR10, and SQuAD. The focus here is on real-world applicable data, namely end-to-end metrics of computation time-to-accuracy and cost, as opposed to raw accuracy or thoroughput.

For HPE DLBS, as part of HPE's Deep Learning Cookbook, it is largely GPU-focused and sticks to TensorFlow, MXNet, PyTorch, and Caffe-type frameworks, and additionally includes TensorRT testing. While the implementation has well-featured multi-test batching, logging, monitoring, and reporting, it outputs purely performance and time metrics, without any end-to-end measurements of time-to-accuracy or cost.

The most recent high profile benchmark suite, MLPerf, includes researchers and engineers previously working on DAWNBench and other suites; for all intents and purposes, the DAWNBench project has now been superseded by MLPerf. Explicitly aspiring to do for machine learning what SPEC does for general-purpose compute and TPC does for database systems, MLPerf is looking to include Fathom's approach with cross-domain ML tests, as well as DAWNBench's focus on end-to-end computation time of a model above a threshold accuracy. Being so new, however, it is currently on an alpha release, and the reference benchmarks are stated as not suitable for accurate hardware comparisons. For that reason, we have not incorporated any MLPerf testing in this review.

Modern DL being such a new and rapidly changing field, new benchmarks appear quickly, perhaps even as recent as last week. And old ones drift away, like the defunct DeepMark (also created by Chintala) and BenchIP. But for MLPerf, it does seem to be building off of all the lessons learned prior.

Benchmark Accuracy and Metrics

The Deep Learning Frameworks Comparison benchmark allows us to bring up a useful point: differences between frameworks can easily lead to unintended consequences, and thus invalid benchmarks, which affected the Deep Learning Frameworks Comparison as mentioned by Yuxin Wu (creator of tensorpack for TensorFlow). And in general, most DL benchmarks are invalid or less accurate in this way – something that Soumith Chintala (creator of convnet-benchmarks and PyTorch) noted. In the end, without a background in machine learning, there is no easy way of independently validating the accuracy and scope of DL benchmarks, which the MLPerf project appears to try to address.

Another issue is the difficulty in tracking down model variants or reproducing published results; many times, benchmark implementations originate from publications, reference model implementations, or otherwise ML competitions like Kaggle.

For our purposes though, the situation is slightly different, as we are testing GPU performance rather than framework or model performance. But ultimately un-optimized benchmarks would skew GPU performance results anyhow. For these reasons, micro-benchmarks such as DeepBench and 32-bit CNN benchmarks can still be useful in comparing performance between GPUs and between hardware vendors.

Models, Frameworks, and Datasets

The other factor is the sheer amount of deep learning models, frameworks, and datasets. Fortunately, benchmark suites tend to use the same models and datasets, and with competition-style suites like DAWNBench, forgo a mandated framework or model altogether.

As far as frameworks go, essentially all modern DL frameworks support CUDA and cuDNN. For Volta, all frameworks with FP16 storage support also support tensor core acceleration; if FP16 storage is enabled, tensor core acceleration is automatically enabled as well. We will want to utilize these frameworks in order to look at tensor core performance.

Comparison of Selected Deep Learning Benchmark Frameworks
Framework Support for cuDNN Support for FP16 Storage Support for Tensor Core Math
NVCaffe Yes Yes Yes
Caffe2 Yes Yes Yes
MXNet Yes Yes Yes
PyTorch Yes Yes Yes
Torch Yes No No
Chainer Yes No
Yes
No
Yes
TensorFlow Yes Yes Yes
Theano Yes Yes Yes
Microsoft Cognitive Toolkit
(formerly CNTK)
Yes No
Yes
No
Yes

Update (7/16/2018): Microsoft reached out to clarify that CNTK has supported FP16 and tensor cores since 2.4, which released in January 2018. The information was originally sourced to NVIDIA's Mixed Precision Training Guide, and Microsoft is working with NVIDIA to correct this. In light of this, we have found that Chainer 4 supports FP16/tensor cores to some degree since at least April 2018.

That being said, just because a framework can exploit FP16 storage and tensor cores, doesn't mean it will; the mixed precision guidelines we discussed earlier are very much applicable. A benchmark or test on a given model is not necessarily configured to utilize FP16 and tensor cores out-of-the-box, even if it is built on a compatible framework. And even if it is, the model may not converge without further modification.

In the future, we can look forward to interoperable framework formats like ONNX and NNEF as another datapoint.

Revisiting Volta: How to Accelerate Deep Learning Methodology & Testing: Deep Learning Edition
Comments Locked

65 Comments

View All Comments

  • mode_13h - Wednesday, July 4, 2018 - link

    It's not that hard, really. They're just saying Nvidia made a library (cuDNN), so that different deep learning frameworks don't each have to hand-optimize code for things like its exotic tensor cores.

    For their part, AMD has a similar library they call MIOpen.
  • philehidiot - Wednesday, July 4, 2018 - link

    Why thank you. That now does help it make a little more sense. The maths does make sense but the computer science is generally beyond me.
  • aelizo - Wednesday, July 4, 2018 - link

    At that price point, I would have liked to see some comparison to 2xTinan Xp, or even some comparison to 3x1080Ti's.
    Last year I saw some comparison between this sets on pytorch:
    https://medium.com/@u39kun/titan-v-vs-1080-ti-head...
  • mode_13h - Wednesday, July 4, 2018 - link

    I'm suspicious that he's not actually using the tensor cores. The V100/GV100 also has double-rate fp16, like the P100/GP100 before it. So, a < 2x improvement from going to 16-bit suggests it might only be using the packed half-precision instructions, rather than the tensor cores.

    Either that or he's not using batching and is completely limited by memory bottlenecks.
  • aelizo - Wednesday, July 4, 2018 - link

    I suspect something similar, that is Why Nate could have done a great job with a similar comparison.
  • Nate Oh - Monday, July 9, 2018 - link

    Unfortunately, we only have 1 Titan Xp, which is actually on loan from TH. These class of devices are (usually) not sampled by NVIDIA so we could not have pursued what you suggest. We split custody of Titan V, and that alone was not an insignificant investment.

    Additionally, mGPU DL analysis introduces a whole new can of worms. As some may have noticed, I have not mentioned NCCL/MPI, NVLink, Volta mGPU enhancements, All Reduce, etc. It's definitely a topic for further investigation if the demand and resources match.
  • mode_13h - Tuesday, July 10, 2018 - link

    Multi-GPU scaling is getting somewhat esoteric, but perhaps a good topic for future articles.

    Would be cool to see the effect of NVLink, if you can get access to such a system in the cloud. Maybe Nvidia will give you some sort of "press" access to their cloud?
  • ballsystemlord - Saturday, July 7, 2018 - link

    Here are some spelling/grammar corrections. You write far fewer than most of the other authors at anandtech ( If Ian had written this I would have need 2 pages for all the corrections :) ). Good job!

    "And Volta does has those separate INT32 units."
    You mean "have".
    And Volta does have those separate INT32 units.

    "For our purposes, the tiny image dataset of CIFAR10 works fine as running a single-node on a dataset like ImageNet with non-professional hardware that could be old as Kepler"...
    Missing "as".
    For our purposes, the tiny image dataset of CIFAR10 works fine as running a single-node on a dataset like ImageNet with non-professional hardware that could be as old as Kepler...

    "Moving forward, we're hoping that MLPerf and similar efforts make good headway, so that we can tease out a bit more secrets from GPUs."
    Grammar error.
    Moving forward, we're hoping that MLPerf and similar efforts make good headway, so that we can tease out a bit more of the secrets from GPUs.
  • mode_13h - Saturday, July 7, 2018 - link

    Yeah, if that's the worst you found, no one would even *suspect* him for being a lolcat.
  • Vanguarde - Monday, July 9, 2018 - link

    I purchased this card to get better frames in Witcher 3 at 4K everything maxed out, heavily modded. Never dips below 60fps and usually near 80-100fps

Log in

Don't have an account? Sign up now