Inference: ResNet-50

After training your model on training data, the real test awaits. Your AI model should now be able to apply those learnings in the real world and do the same for new real-world data. That process is called inference. Inference requires no back propagation as the model is already trained – the model has already determined the weights. Inference also can make use of lower numerical precision, and it has been shown that even the accuracy from using 8-bit integers is sometimes acceptable. 

From a high-level workflow perfspective, a working AI model is basically controlled by a service that, in turn, is called from another software service. So the model should respond very quickly, but the total latency of the application will be determined by the different services. To cut a long story short: if inference performance is high enough, the perceived latency might shift to another software component. As a result, Intel's task is to make sure that Xeons can offer high enough inference performance. 

DL Inference: ResNet50

Intel has a special "recipe" for reaching top inference performance on the Cascade Lake, courtesy of the DL Boost technology. DLBoost includes the Vector Neural Network Instructions, which allows the use of INT8 ops instead of FP32. Integer operations are intrinsically faster, and by using only 8 bits, you get a theoretical peak, which is four times higher. 

Complicating matters, we were experimenting with inference when our Cascade Lake server crashed. For what it is worth, we never reached more than 2000 images per second. But since we could not experiment any further, we gave Intel the benefit of the doubt and used their numbers.

Meanwhile the publication of the 9282 caused quite a stir, as Intel claimed that the latest Xeons outperformed NVIDIA's flagship accelerator (Tesla V100) by a small margin: 7844 vs 7636 images per second. NVIDIA reacted immediately by emphasizing performance/watt/dollar and got a lot of coverage in the press. However, the most important point in our humble opinion is that the Tesla V100 results are not comparable, as those 7600 images per second were obtained in mixed mode (FP32/16) and not INT8.

Once we enable INT8, the $2500 Titan RTX is no less than 3 times faster than a pair of $10k Xeons 8280s.

Intel cannot win this fight, not by a long shot. Still, Intel's efforts and NIVIDA’s poking in response show how important it is for Intel to improve both inference and training performance; to convince people to invest in high end Xeons instead of a low end Xeon with a Tesla V100. In some cases, 3 times slower than NVIDIA's offering might be good enough as the inference software component is just one part of the software stack. 

In fact, to really analyze all of the angles of the situation, we should also measure the latency on a full-blown AI application instead of just measuring inference throughput. But that will take us some more time to get that one right....

Recurrent Neural Networks: LSTM Exploring Parallel HPC
Comments Locked

56 Comments

View All Comments

  • tipoo - Monday, July 29, 2019 - link

    Fyi, when on page 2 and clicking "convolutional, etc" for page 3, it brings me back to the homepage
  • Ryan Smith - Monday, July 29, 2019 - link

    Fixed. Sorry about that.
  • Eris_Floralia - Monday, July 29, 2019 - link

    Johan's new piece in 14 months! Looking forward to your Rome review :)
  • JohanAnandtech - Monday, July 29, 2019 - link

    Just when you think nobody noticed you were gone. Great to come home again. :-)
  • Eris_Floralia - Tuesday, July 30, 2019 - link

    Your coverage on server processors are great!
    Can still well remember Nehalem, Barcelona, and especially Bulldozer aftermath articles
  • djayjp - Monday, July 29, 2019 - link

    Not having a Tesla for such an article seems like a glaring omission.
  • warreo - Monday, July 29, 2019 - link

    Doubt Nvidia is sourcing AT these cards, so it's likely an issue of cost and availability. Titan is much cheaper than a Tesla, and I'm not even sure you can get V100's unless you're an enterprise customer ordering some (presumably large) minimum quantity.
  • olafgarten - Monday, July 29, 2019 - link

    It is available https://www.scan.co.uk/products/32gb-pny-nvidia-te...
  • abufrejoval - Tuesday, July 30, 2019 - link

    Those bottlenecks are over now and P100, V100 can be bought pretty freely, as well as RTX6000/8000 (Turings). Actually the "T100" is still missing and the closest siblings (RTX 6000/8000) might never get certified for rackmount servers, because they have active fans while the P100/V100 are designed to be cooled by server fans. I operate a handful of each and getting budget is typically the bigger hurdle than purchasing.
  • SSNSeawolf - Monday, July 29, 2019 - link

    I've been trying to find more information on Cascade Lake's AI/VNNI performance, but came up dry. Thanks, Johan. Eagerly putting this aside for my lunch reading today.

Log in

Don't have an account? Sign up now