Inference: ResNet-50

After training your model on training data, the real test awaits. Your AI model should now be able to apply those learnings in the real world and do the same for new real-world data. That process is called inference. Inference requires no back propagation as the model is already trained – the model has already determined the weights. Inference also can make use of lower numerical precision, and it has been shown that even the accuracy from using 8-bit integers is sometimes acceptable. 

From a high-level workflow perfspective, a working AI model is basically controlled by a service that, in turn, is called from another software service. So the model should respond very quickly, but the total latency of the application will be determined by the different services. To cut a long story short: if inference performance is high enough, the perceived latency might shift to another software component. As a result, Intel's task is to make sure that Xeons can offer high enough inference performance. 

DL Inference: ResNet50

Intel has a special "recipe" for reaching top inference performance on the Cascade Lake, courtesy of the DL Boost technology. DLBoost includes the Vector Neural Network Instructions, which allows the use of INT8 ops instead of FP32. Integer operations are intrinsically faster, and by using only 8 bits, you get a theoretical peak, which is four times higher. 

Complicating matters, we were experimenting with inference when our Cascade Lake server crashed. For what it is worth, we never reached more than 2000 images per second. But since we could not experiment any further, we gave Intel the benefit of the doubt and used their numbers.

Meanwhile the publication of the 9282 caused quite a stir, as Intel claimed that the latest Xeons outperformed NVIDIA's flagship accelerator (Tesla V100) by a small margin: 7844 vs 7636 images per second. NVIDIA reacted immediately by emphasizing performance/watt/dollar and got a lot of coverage in the press. However, the most important point in our humble opinion is that the Tesla V100 results are not comparable, as those 7600 images per second were obtained in mixed mode (FP32/16) and not INT8.

Once we enable INT8, the $2500 Titan RTX is no less than 3 times faster than a pair of $10k Xeons 8280s.

Intel cannot win this fight, not by a long shot. Still, Intel's efforts and NIVIDA’s poking in response show how important it is for Intel to improve both inference and training performance; to convince people to invest in high end Xeons instead of a low end Xeon with a Tesla V100. In some cases, 3 times slower than NVIDIA's offering might be good enough as the inference software component is just one part of the software stack. 

In fact, to really analyze all of the angles of the situation, we should also measure the latency on a full-blown AI application instead of just measuring inference throughput. But that will take us some more time to get that one right....

Recurrent Neural Networks: LSTM Exploring Parallel HPC
Comments Locked

56 Comments

View All Comments

  • ballsystemlord - Saturday, August 3, 2019 - link

    Spelling and grammar errors:

    "But it will have a impact on total energy consumption, which we will discuss."
    "An" not "a":
    "But it will have an impact on total energy consumption, which we will discuss."

    "We our newest servers into virtual clusters to make better use of all those core."
    Missing "s" and missing word. I guessed "combine".
    "We combine our newest servers into virtual clusters to make better use of all those cores."

    "For reasons unknown to us, we could get our 2.7 GHz 8280 to perform much better than the 2.1 GHz Xeon 8176."
    The 8280 is only slightly faster in the table than the 8176. It is the 8180 that is missing from the table.

    "However, since my group is mostly using TensorFlow as a deep learning framework, we tend to with stick with it."
    Excess "with":
    "However, since my group is mostly using TensorFlow as a deep learning framework, we tend to stick with it."

    "It has been observed that using a larger batch can causes significant degradation in the quality of the model,..."
    Remove plural form:
    "It has been observed that using a larger batch can cause significant degradation in the quality of the model,..."

    "...but in many applications a loss of even a few percent is a significant."
    Excess "a":
    "...but in many applications a loss of even a few percent is significant."

    "LSTM however come with the disadvantage that they are a lot more bandwidth intensive."
    Add an "s":
    "LSTMs however come with the disadvantage that they are a lot more bandwidth intensive."

    "LSTMs exhibit quite inefficient memory access pattern when executed on mobile GPUs due to the redundant data movements and limited off-chip bandwidth."
    "pattern" should be plural because "LSTMs" is plural, I choose an "s":
    "LSTMs exhibit quite inefficient memory access patterns when executed on mobile GPUs due to the redundant data movements and limited off-chip bandwidth."

    "Of course, you have the make the most of the available AVX/AVX2/AVX512 SIMD power."
    "to" not "the":
    "Of course, you have to make the most of the available AVX/AVX2/AVX512 SIMD power."

    "Also, this another data point that proves that CNNs might be one of the best use cases for GPUs."
    Missing "is":
    "Also, this is another data point that proves that CNNs might be one of the best use cases for GPUs."

    "From a high-level workflow perfspective,..."
    A joke, or a misspelling?

    "... it's not enough if the new chips have to go head-to-head with a GPU in a task the latter doesn't completely suck at."
    Traditionally, AT has had no language.
    "... it's not enough if the new chips have to go head-to-head with a GPU in a task the latter is good at."

    "It is been going on for a while,..."
    "has" not "is":
    "It has been going on for a while,..."
  • ballsystemlord - Saturday, August 3, 2019 - link

    Thanks for the cool article!
  • tmnvnbl - Tuesday, August 6, 2019 - link

    Great read, especially liked the background and perspective next to the benchmark details
  • dusk007 - Tuesday, August 6, 2019 - link

    Great Article.
    I wouldn't call Apache Arrow a database though. It is a data format more akin to a file format like csv or parquet. It is not something that stores data for you and gives it to you. It is the how to store data in memory. Like CSV or Parquet are a "how to" store data in Files. More efficient less redundancy less overhead when access from different runtimes (Tensorflow, Spark, Pandas,..).

    Love the article, I hope we get more of those. Also that huge performance optimizations are possible in this field just in software. Often renting compute in the cloud is cheaper than the man hours required to optimize though.
  • Emrickjack - Thursday, August 8, 2019 - link

    Johan's new piece in 14 months! Looking forward to your Rome review
  • Emrickjack - Thursday, August 8, 2019 - link

    It More Information http://americanexpressconfirmcard.club/

Log in

Don't have an account? Sign up now