Convolutional Neural Network Training

For a long time, the way forward in CNNs was to add more layers – increasing the network depth for "even deeper learning". As you can probably guess, this approach ran into diminishing returns, and it made the already complex neural networks even harder to tune, with training error actually going up as networks got deeper.

The ResNet-50 benchmark is based upon residual networks (hence ResNet), which have the merit of keeping training error under control even as the network gets deeper.
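The key ingredient is the skip connection: each block only has to learn a residual on top of its input, and the identity path gives gradients a direct route through the network, which is what keeps training error from ballooning at large depths. Below is a minimal sketch of such a block in TensorFlow/Keras; the filter count, input size, and layer ordering are illustrative choices, not the exact ResNet-50 bottleneck configuration used by the benchmark.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    # Two 3x3 convolutions form the residual branch F(x).
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=1, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the block changes resolution or width.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    # The skip connection: output = F(x) + x.
    return layers.ReLU()(layers.Add()([y, shortcut]))

# Tiny usage example: one residual block on a 224x224 RGB input.
inputs = tf.keras.Input(shape=(224, 224, 3))
model = tf.keras.Model(inputs, residual_block(inputs, filters=64))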

Meanwhile, as a little bit of internal housekeeping here, for regular readers I’ll note that the benchmark below is not directly comparable to the one that Nate ran for our Titan V review. It is the same benchmark, but Nate ran the standard ResNet-50 training implementation that is included in NVIDIA's Caffe2 Docker image. However, since my group mostly uses TensorFlow as its deep learning framework, we tend to stick with it. All benchmarking was therefore done with TensorFlow's tf_cnn_benchmarks script, invoked as follows:

tf_cnn_benchmarks.py --num_gpus=1  --model=resnet50 --variable_update=parameter_server

The model trains on ImageNet and gives us throughput data.
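The numbers in the chart below come from varying the batch size and the precision; with tf_cnn_benchmarks this presumably boils down to changing the --batch_size and --use_fp16 flags of the same command. The sketch below shows one way to script such a sweep; the interpreter path, the flag values, and the decision to leave every other flag at its default are assumptions on my part, not the exact harness used for these runs.

import subprocess

# Sweep batch size and precision with the same tf_cnn_benchmarks invocation.
# The script itself prints throughput in images/sec for each run.
for batch in (64, 128, 256, 512):
    for fp16 in (False, True):
        cmd = [
            "python", "tf_cnn_benchmarks.py",
            "--num_gpus=1",
            "--model=resnet50",
            "--variable_update=parameter_server",
            f"--batch_size={batch}",
        ]
        if fp16:
            cmd.append("--use_fp16")
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=False)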

DL Training: ResNet50 Tensorflow

Several results are missing from the chart, and for good reason: running a batch size of 512 training samples at FP32 precision on the Titan RTX results in an "out of memory" error, as the card "only" has 24 GB available.
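As a rough sanity check on that limit: activation memory grows linearly with the batch size, and a ballpark figure can be obtained by summing the layer output shapes of the stock Keras ResNet-50. The sketch below is strictly back-of-the-envelope; it counts every intermediate tensor once and ignores weights, gradients, and cuDNN workspace, so it is not the card's exact memory footprint, but it makes clear why 512 FP32 samples per batch blow past 24 GB while 128 still fit.

import numpy as np
import tensorflow as tf

# Sum the size of every layer output of a standard 224x224 ResNet-50.
model = tf.keras.applications.ResNet50(weights=None)
values_per_image = sum(
    int(np.prod(layer.output.shape.as_list()[1:])) for layer in model.layers
)

# FP32 activations only; halving the precision to FP16 roughly halves this.
for batch in (128, 512):
    gib = values_per_image * 4 * batch / 2**30
    print(f"batch {batch:3d} @ FP32: ~{gib:.1f} GiB of activations")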

Meanwhile, on the Intel CPUs, half precision (FP16) is not (yet) available. AVX512_BF16 (bfloat16) will only arrive in Cascade Lake's successor, Cooper Lake.

It has been observed that using a larger batch size can cause significant degradation in the quality of the model, as measured by its ability to generalize. So although larger batch sizes (512) make better use of the massive parallelism inside the GPU, the results at the lower batch size (128) are useful too. The model typically loses only a few percentage points of accuracy, but in many applications even a loss of a few percent is significant.
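To make that trade-off concrete, the usual way to quantify it is to train the same model twice, changing nothing but the batch size, and compare validation accuracy. The sketch below does exactly that; the dataset (CIFAR-10 instead of ImageNet), the short epoch budget, and the optimizer settings are placeholders chosen to keep the example small, so the gap it reports will not match the ImageNet numbers discussed here.

import tensorflow as tf

def val_accuracy(batch_size, epochs=3):
    # Train ResNet-50 from scratch with a given batch size and report top-1
    # validation accuracy, so the accuracy gap between batch sizes can be measured.
    (x_tr, y_tr), (x_va, y_va) = tf.keras.datasets.cifar10.load_data()
    x_tr, x_va = x_tr.astype("float32") / 255.0, x_va.astype("float32") / 255.0
    model = tf.keras.applications.ResNet50(weights=None, input_shape=(32, 32, 3), classes=10)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_tr, y_tr, batch_size=batch_size, epochs=epochs, verbose=0)
    return model.evaluate(x_va, y_va, verbose=0)[1]

acc_128 = val_accuracy(128)
acc_512 = val_accuracy(512)
print(f"batch 128: {acc_128:.3f}, batch 512: {acc_512:.3f}, gap: {acc_128 - acc_512:+.3f}")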

So while you could quickly conclude that the Titan RTX is seven times faster than the best CPU, it is more accurate to say that it is between 4.5 and 7 times faster, depending on the accuracy you want.

Inception (v3)

Inception is based upon GoogLeNet. In contrast to the earlier dense neural networks, GoogLeNet was built on the idea that neural networks can be much more efficient if you do not connect every neuron in every layer to the next one. The downside of this optimization is that it results in sparse matrices, which are far from optimal for the typical SIMD/GPU architectures and their BLAS software.

Overall, the main goal of "Inception" was to turn GoogLeNet into a neural network that would result in dense matrix multiplication. Or in other words, something that ran a lot faster on a GPU or SIMD hardware. In the end, version 3 of this neural network has proven to be even more accurate than ResNet-50.
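The building block that makes this possible is the Inception module: several small convolutions (1x1, 3x3, a factorized 5x5) plus a pooling branch run in parallel on the same input and are concatenated along the channel axis, so the "sparse" connectivity idea is approximated entirely with dense operations that map well onto GPUs and BLAS libraries. A minimal sketch in Keras follows; the filter counts are illustrative and not the exact Inception v3 values.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3_reduce=48, f3=64, f5_reduce=8, f5=16, f_pool=32):
    # Branch 1: plain 1x1 convolution.
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 reduction followed by a 3x3 convolution.
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 reduction followed by a 5x5 convolution
    # (Inception v3 factorizes this further into stacked 3x3s).
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)
    # Branch 4: pooling with a 1x1 projection.
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(f_pool, 1, padding="same", activation="relu")(b4)
    # Every branch is a dense convolution, so the whole module stays GPU/BLAS friendly.
    return layers.Concatenate()([b1, b2, b3, b4])

# Tiny usage example on an Inception-sized input.
inputs = tf.keras.Input(shape=(299, 299, 3))
model = tf.keras.Model(inputs, inception_module(inputs))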

DL Training: Inceptionv3 Tensorflow

This time, the GPU is about 3 to 5 times faster, depending on the batch size. It is interesting to note that ResNet is more GPU-friendly than Inception. But of course, this only matters to academics and hardware enthusiasts.

Software engineers who have to build AI models will, however, quickly point out that a $3k GPU is at least 3 times faster than a $20k+ (or worse) CPU configuration. And they are right: there is no contest. When it comes to convolutional neural networks, the rock stars of AI, a good GPU (with a good software stack) will mop the floor with even the best CPUs. In a datacenter you typically encounter NVIDIA's Tesla GPUs instead, which cost around four times more, but offer anywhere from 1.5x to 2x the performance of comparable Titan cards.

Comments

  • C-4 - Monday, July 29, 2019 - link

    It's interesting that optimizations did so much for the Intel processors (but relatively less for the AMD ones). Who made these optimizations? How much time was devoted to doing this? How close are the algorithms to being "fully optimized" for the AMD and nVidia chips?
  • quorm - Monday, July 29, 2019 - link

    I believe these optimizations largely take advantage of AVX512, and are therefore intel specific, as amd processors do not incorporate this feature.
  • RSAUser - Monday, July 29, 2019 - link

    As quorm said, I'd assume it's due to AVX512 optimizations; the next generation of AMD Epyc CPUs should support it, and I am hoping for closer to 3GHz clock speeds on the 64 core chips, since it seems the new ceiling is around the 4GHz mark for 16 cores all-core.

    It will be an interesting Q3/Q4 for Intel in the server market this year.
  • SarahKerrigan - Monday, July 29, 2019 - link

    Next generation? You mean Rome? Zen2 doesn't have any AVX512.
  • HStewart - Tuesday, July 30, 2019 - link

    I believe AMD's AVX2 is dual 128-bit instead of 256-bit - so AVX-512 would probably be quad 128-bit.
  • jospoortvliet - Tuesday, July 30, 2019 - link

    That’s not really how it works, in the sense that you explicitly need to support the new instructions... and amd doesn’t (plan to, as far as we know).
  • Qasar - Tuesday, July 30, 2019 - link

    from wikipedia :
    " AVX2 is now fully supported, with an increase in execution unit width from 128-bit to 256-bit. "

    " AMD has increased the execution unit width from 128-bit to 256-bit, allowing for single-cycle AVX2 calculations, rather than cracking the calculation into two instructions and two cycles."
    which is from here : https://www.anandtech.com/show/14525/amd-zen-2-mic...

    looks like AVX2 is single 256 bit :-)
  • name99 - Monday, July 29, 2019 - link

    Regarding the limits of large batches: while this is true in principle, the maximum size of those batches can be very large, is hard to predict (at least right now), and there is ongoing work to increase it further. This link describes some of the issues and what’s known:

    http://ai.googleblog.com/2019/03/measuring-limits-...

    I think Intel would be foolish to pin many hopes on the assumption that batch scaling will soon end the superior performance of GPUs and even more specialized hardware...
  • brunohassuna - Monday, July 29, 2019 - link

    Some information about energy consumption would be very useful in comparisons like this.
  • ozzuneoj86 - Monday, July 29, 2019 - link

    My first thought when clicking this article was how much more visibly-complex CPUs have gotten in the past ~35 years.

    Compare the bottom of that Xeon to the bottom of a CLCC package 286:
    https://en.wikipedia.org/wiki/Intel_80286#/media/F...

    And that doesn't even touch the difference internally... 134,000 transistors to 8 billion, and from 16 MHz to 4,000 MHz. The mind boggles.
