Convolutional Neural Network Training

For a long time, the way forward in CNNs was to add more layers – increasing the network depth for "even deeper learning". As you can probably guess, this approach ran into diminishing returns, and it made the already complex neural networks even harder to tune, leading to more training errors.

The ResNet-50 benchmark is based upon residual networks (hence ResNet), which have the merit of fewer training errors as the network gets deeper.

Meanwhile, as a little bit of internal housekeeping here, I'll note for regular readers that the benchmark below is not directly comparable to the one that Nate ran for our Titan V review. It is the same benchmark, but Nate ran the standard ResNet-50 training implementation that is included in NVIDIA's Caffe2 Docker image. However, since my group mostly uses TensorFlow as its deep learning framework, we tend to stick with it. All benchmarking was done with:

tf_cnn_benchmarks.py --num_gpus=1  --model=resnet50 --variable_update=parameter_server

The model trains on ImageNet and gives us throughput data.
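
For readers who want a feel for what the script measures without the full harness, the sketch below is a minimal, hypothetical stand-in: it times a few ResNet-50 training steps on synthetic data with tf.keras and reports images per second. It is not the tf_cnn_benchmarks code, and the batch size and step count are arbitrary choices.

# Minimal throughput sketch (illustrative, not tf_cnn_benchmarks itself).
import time
import tensorflow as tf

batch_size = 128                                   # assumed; the article tests 128 and 512
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# One synthetic ImageNet-sized batch keeps the example self-contained.
images = tf.random.uniform((batch_size, 224, 224, 3))
labels = tf.random.uniform((batch_size,), maxval=1000, dtype=tf.int32)

model.train_on_batch(images, labels)               # warm-up: graph build and allocation
start = time.time()
steps = 10
for _ in range(steps):
    model.train_on_batch(images, labels)
print(f"~{steps * batch_size / (time.time() - start):.1f} images/sec")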

DL Training: ResNet50 Tensorflow

Several benchmarks are missing, and for a good reason. Running a batch size of 512 training samples at FP32 precision on the Titan RTX results in an "out of memory" error, as the card "only" has 24 GB available.
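
A quick back-of-the-envelope check makes clear why memory runs out: the weights themselves are tiny. The hedged tf.keras sketch below (not the benchmark code) simply counts ResNet-50's parameters; in FP32 they occupy roughly 0.1 GB, so it is the per-sample activations, which grow linearly with batch size, that push a 512-sample FP32 run past 24 GB.

import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
params = model.count_params()                      # roughly 25.6 million parameters
print(f"parameters: {params / 1e6:.1f} M "
      f"(~{params * 4 / 1e9:.2f} GB stored as FP32 weights)")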

Meanwhile, on the Intel CPUs, half precision (FP16) is not (yet) available. AVX512_BF16 (bfloat16) will arrive in Cascade Lake's successor, Cooper Lake.
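
As an aside on how lower precision is requested in practice: the benchmark script has its own FP16 switch, but in current tf.keras the equivalent is a global mixed-precision policy. The snippet below is purely illustrative of that API, with bfloat16 as the CPU-side analogue once AVX512_BF16 hardware is available.

import tensorflow as tf

# On NVIDIA GPUs with tensor cores: compute in FP16, keep master weights in FP32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# On CPUs with AVX512_BF16 (Cooper Lake and later), bfloat16 would be the analogue:
# tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
print(model.dtype_policy)                          # the new model picks up the global policy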

It has been observed that using a larger batch can cause significant degradation in the quality of the model, as measured by its ability to generalize. So although larger batch sizes (512) make better use of the massive parallelism inside the GPU, the results at the lower batch size (128) are useful too. The model only loses a few percentage points of accuracy, but in many applications a loss of even a few percent is significant.
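
One widely used mitigation, which we did not apply here, is the "linear scaling" heuristic popularized by large-minibatch SGD work (e.g., Goyal et al., 2017): when the batch size grows, the learning rate is scaled proportionally to limit the loss of generalization. A minimal illustration, with assumed base values:

def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling heuristic: grow the learning rate with the batch size."""
    return base_lr * batch_size / base_batch

for bs in (128, 256, 512):
    print(f"batch {bs:>3}: lr = {scaled_lr(bs):.3f}")   # 0.050, 0.100, 0.200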

So while you could quickly conclude that the Titan RTX is seven times faster than the best CPU, it is more accurate to say that it is between 4.5 and 7 times faster, depending on the accuracy you want.

Inception (v3)

Inception is based upon GoogLeNet. In contrast to earlier dense neural networks, GoogLeNet was built on the idea that neural networks can be much more efficient if you do not connect every neuron in every layer to every neuron in the next one. The downside of this optimization is that it results in sparse matrices, which are far from optimal for typical SIMD/GPU architectures and their BLAS software.

Overall, the main goal of "Inception" was to turn GoogLeNet into a neural network that would result in dense matrix multiplication. Or in other words, something that ran a lot faster on a GPU or SIMD hardware. In the end, version 3 of this neural network has proven to be even more accurate than ResNet-50.
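
To get a rough structural feel for the two networks, the short sketch below instantiates the stock tf.keras.applications versions and prints their layer and parameter counts. The benchmark itself uses the tf_cnn_benchmarks implementations, so this is purely illustrative.

import tensorflow as tf

resnet = tf.keras.applications.ResNet50(weights=None, classes=1000)
inception = tf.keras.applications.InceptionV3(weights=None, classes=1000)

for name, net in (("ResNet-50", resnet), ("Inception v3", inception)):
    print(f"{name}: {len(net.layers)} layers, "
          f"{net.count_params() / 1e6:.1f} M parameters")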

DL Training: Inceptionv3 Tensorflow

This time, the GPU is about 3 to 5 times faster, depending on the batch size. It is interesting to note that ResNet is more GPU friendly than Inception. But of course, this only matters for academics and hardware enthusiasts.

Software engineers who have to build AI models will, however, quickly point out that a $3k GPU is at least 3 times faster than a $20k+ (or worse) CPU configuration. And they are right: there is no contest. When it comes to convolutional neural networks, the rock stars of AI, a good GPU (with a good software stack) will mop the floor with even the best CPUs. In the datacenter you typically encounter NVIDIA's Tesla GPUs instead, which cost around four times more, but offer anywhere from 1.5x to 2x the performance of the comparable Titan cards.

Comments

  • ballsystemlord - Saturday, August 3, 2019

    Spelling and grammar errors:

    "But it will have a impact on total energy consumption, which we will discuss."
    "An" not "a":
    "But it will have an impact on total energy consumption, which we will discuss."

    "We our newest servers into virtual clusters to make better use of all those core."
    Missing "s" and missing word. I guessed "combine".
    "We combine our newest servers into virtual clusters to make better use of all those cores."

    "For reasons unknown to us, we could get our 2.7 GHz 8280 to perform much better than the 2.1 GHz Xeon 8176."
    The 8280 is only slightly faster in the table than the 8176. It is the 8180 that is missing from the table.

    "However, since my group is mostly using TensorFlow as a deep learning framework, we tend to with stick with it."
    Excess "with":
    "However, since my group is mostly using TensorFlow as a deep learning framework, we tend to stick with it."

    "It has been observed that using a larger batch can causes significant degradation in the quality of the model,..."
    Remove plural form:
    "It has been observed that using a larger batch can cause significant degradation in the quality of the model,..."

    "...but in many applications a loss of even a few percent is a significant."
    Excess "a":
    "...but in many applications a loss of even a few percent is significant."

    "LSTM however come with the disadvantage that they are a lot more bandwidth intensive."
    Add an "s":
    "LSTMs however come with the disadvantage that they are a lot more bandwidth intensive."

    "LSTMs exhibit quite inefficient memory access pattern when executed on mobile GPUs due to the redundant data movements and limited off-chip bandwidth."
    "pattern" should be plural because "LSTMs" is plural, I choose an "s":
    "LSTMs exhibit quite inefficient memory access patterns when executed on mobile GPUs due to the redundant data movements and limited off-chip bandwidth."

    "Of course, you have the make the most of the available AVX/AVX2/AVX512 SIMD power."
    "to" not "the":
    "Of course, you have to make the most of the available AVX/AVX2/AVX512 SIMD power."

    "Also, this another data point that proves that CNNs might be one of the best use cases for GPUs."
    Missing "is":
    "Also, this is another data point that proves that CNNs might be one of the best use cases for GPUs."

    "From a high-level workflow perfspective,..."
    A joke, or a misspelling?

    "... it's not enough if the new chips have to go head-to-head with a GPU in a task the latter doesn't completely suck at."
    Traditionally, AT has avoided that kind of language.
    "... it's not enough if the new chips have to go head-to-head with a GPU in a task the latter is good at."

    "It is been going on for a while,..."
    "has" not "is":
    "It has been going on for a while,..."
  • ballsystemlord - Saturday, August 3, 2019

    Thanks for the cool article!
  • tmnvnbl - Tuesday, August 6, 2019

    Great read, especially liked the background and perspective next to the benchmark details
  • dusk007 - Tuesday, August 6, 2019

    Great article.
    I wouldn't call Apache Arrow a database though. It is a data format, more akin to a file format like CSV or Parquet. It is not something that stores data for you and serves it back; it is a specification for how to store data in memory, just as CSV and Parquet specify how to store data in files. The result is more efficiency, less redundancy, and less overhead when the data is accessed from different runtimes (TensorFlow, Spark, Pandas, ...).

    Love the article, I hope we get more of those. It is also striking that huge performance optimizations are possible in this field through software alone. Often, though, renting compute in the cloud is cheaper than the man-hours required to optimize.
  • Emrickjack - Thursday, August 8, 2019

    Johan's first new piece in 14 months! Looking forward to your Rome review.