Intel’s View on AI: Do What NV Doesn't

On the whole, Intel has a good point that there is "a wide range of AI applications"; in other words, there is AI life beyond CNNs. In many real-life scenarios, traditional machine learning techniques outperform CNNs, and not all deep learning is done with the ultra-scalable CNNs. In other real-world cases, having massive amounts of RAM is another big performance advantage, both when training a model and when using it to run inference on new data.
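To make the "beyond CNNs" point concrete, here is a minimal sketch of such a traditional workload: gradient-boosted trees on tabular data via scikit-learn, a class of model that runs entirely on the CPU and is often competitive with deep learning for business data. The dataset and hyperparameters are purely illustrative and not taken from any benchmark in this review.

```python
# Minimal sketch (illustrative only): a "traditional" ML workload that runs
# well on CPUs and needs no GPU at all.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic tabular data stands in for a typical business dataset.
X, y = make_classification(n_samples=20_000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted trees: often competitive with deep learning on tabular data.
model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```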

So despite NVIDIA’s massive advantage in running CNNs, high-end Xeons can offer a credible alternative in the data analytics market. To be sure, nobody expects the new Cascade Lake Xeons to outperform NVIDIA GPUs in CNN training, but there are plenty of cases where Intel might be able to convince customers to invest in a more potent Xeon instead of an expensive Tesla accelerator:

  • Inference of AI models that require a lot of memory (see the sketch after this list)
  • "Light" AI models that do not require long training times.
  • Data architectures where the batch or stream processing time is more important than the model training time.
  • AI models that depend on traditional “non-neural network” statistical models
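To give a rough feel for the first bullet above, the sketch below mimics memory-bound inference: scoring candidates against an embedding table whose full-size parameters (the sizes are hypothetical) would exceed the memory of any single GPU of this era, but fit comfortably in a large-memory Xeon server. Only a small slice is allocated so the snippet actually runs.

```python
# Sketch of memory-bound inference (all sizes hypothetical).
import numpy as np

# Full-size model: 200 million items x 64-dim float32 embeddings ~= 51 GB of
# parameters, i.e. more than a single GPU of this era offers, but easy for a
# multi-hundred-GB Xeon server.
full_items, dim = 200_000_000, 64

# Allocate only a small slice here so the example runs on any machine.
demo_items = 1_000_000
rng = np.random.default_rng(0)
embeddings = rng.random((demo_items, dim), dtype=np.float32)

def score(user_vector, candidate_ids):
    # A memory-bound gather (embedding lookup) plus a tiny matrix-vector product.
    return embeddings[candidate_ids] @ user_vector

user = rng.random(dim, dtype=np.float32)
print(score(user, np.array([3, 17, 42])))
```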

As a result, there might be an opportunity for Intel to keep NVIDIA at bay until it has a reasonable alternative to NVIDIA’s GPUs for CNN workloads. Intel has been feverishly adding features to the Xeon Scalable family and optimizing its software stack to combat NVIDIA’s AI hegemony. That stack includes optimized AI software such as Intel’s own distribution of Python, the Intel Math Kernel Library for Deep Neural Networks (MKL-DNN), and the Intel Data Analytics Acceleration Library (DAAL), the latter aimed mostly at traditional machine learning.
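A quick, unofficial way to see part of that software story in action (this is a sketch, not an Intel tool): the Intel Distribution for Python ships NumPy linked against MKL, and the snippet below simply checks whether the local NumPy build reports MKL as its BLAS/LAPACK backend.

```python
# Sketch: check whether the local NumPy build is linked against Intel MKL,
# as the NumPy from the Intel Distribution for Python is.
import contextlib
import io

import numpy as np

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built with
print("MKL-backed NumPy:", "mkl" in buf.getvalue().lower())
```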

All told then, for the second generation of Intel’s Xeon Scalable processors, the company has added new AI hardware features under the Deep Learning Boost (DL Boost) name. This primarily consists of the Vector Neural Network Instructions (VNNI), which can do in one instruction what would previously have taken three. Even further down the line, Cooper Lake, the third-generation Xeon Scalable processor, will add support for bfloat16, further improving training performance.
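To illustrate what VNNI actually fuses, the sketch below emulates in plain NumPy what a single VPDPBUSD lane computes: four unsigned-8-bit by signed-8-bit products summed into a 32-bit accumulator, work that previously required a three-instruction AVX-512 sequence (VPMADDUBSW, VPMADDWD, VPADDD). It also shows why bfloat16, coming with Cooper Lake, is conceptually simple: it is float32 with the low 16 mantissa bits dropped. All values are arbitrary examples.

```python
# Sketch: emulate one 32-bit lane of the DL Boost VPDPBUSD instruction.
import numpy as np

def vnni_lane(acc, a_u8, b_s8):
    # acc (int32) += sum of four u8*s8 products, as one VPDPBUSD lane does.
    products = a_u8.astype(np.int32) * b_s8.astype(np.int32)
    return acc + products.sum(dtype=np.int32)

a = np.array([10, 200, 3, 7], dtype=np.uint8)   # unsigned 8-bit activations
b = np.array([-5, 2, 100, -1], dtype=np.int8)   # signed 8-bit weights
print(vnni_lane(np.int32(0), a, b))             # -> 643

# bfloat16 (Cooper Lake) is float32 with the low 16 mantissa bits truncated:
# same dynamic range as float32, reduced precision.
x = np.array([3.14159265], dtype=np.float32)
bf16 = (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)
print(bf16)  # ~3.140625
```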

In summary, Intel is trying to recapture the market for “lighter” AI workloads while making a firm stand in the rest of the data analytics market, all the while adding very specialized hardware (FPGAs, ASICs) to its portfolio. This is of critical importance to Intel's competitiveness in the IT market: the company has repeatedly said that the data center group (DCG), its “enterprise part”, is expected to be its main growth engine in the years ahead.

Comments

  • C-4 - Monday, July 29, 2019 - link

    It's interesting that optimizations did so much for the Intel processors (but relatively less for the AMD ones). Who made these optimizations? How much time was devoted to doing this? How close are the algorithms to being "fully optimized" for the AMD and nVidia chips?
  • quorm - Monday, July 29, 2019 - link

    I believe these optimizations largely take advantage of AVX-512, and are therefore Intel-specific, as AMD processors do not incorporate this feature.
  • RSAUser - Monday, July 29, 2019 - link

    As quorm said, I'd assume it's due to AVX-512 optimizations. The next generation of AMD Epyc CPUs should support it, and I am hoping for closer to 3GHz clock speeds on the 64-core chips, since it seems the new ceiling is around the 4GHz mark for all-core on the 16-core chips.

    It will be an interesting Q3/Q4 for Intel in the server market this year.
  • SarahKerrigan - Monday, July 29, 2019 - link

    Next generation? You mean Rome? Zen2 doesn't have any AVX512.
  • HStewart - Tuesday, July 30, 2019 - link

    I believe AMD's AVX2 is dual 128-bit instead of 256-bit, so AVX-512 would probably be quad 128-bit.
  • jospoortvliet - Tuesday, July 30, 2019 - link

    That’s not really how it works, in the sense that you explicitly need to support the new instructions... and AMD doesn’t (plan to, as far as we know).
  • Qasar - Tuesday, July 30, 2019 - link

    from wikipedia :
    " AVX2 is now fully supported, with an increase in execution unit width from 128-bit to 256-bit. "

    " AMD has increased the execution unit width from 128-bit to 256-bit, allowing for single-cycle AVX2 calculations, rather than cracking the calculation into two instructions and two cycles."
    which is from here : https://www.anandtech.com/show/14525/amd-zen-2-mic...

    looks like AVX2 is single 256 bit :-)
  • name99 - Monday, July 29, 2019 - link

    Regarding the limits of large batches: while this is true in principle, the maximum size of those batches can be very large, is hard to predict (at least right now), and there is ongoing work to increase the sizes. This link describes some of the issues and what’s known:

    http://ai.googleblog.com/2019/03/measuring-limits-...

    I think Intel would be foolish to pin many hopes on the assumption that batch scaling will soon end the superior performance of GPUs and even more specialized hardware...
  • brunohassuna - Monday, July 29, 2019 - link

    Some information about energy consumption would be very useful in comparisons like this.
  • ozzuneoj86 - Monday, July 29, 2019 - link

    My first thought when clicking this article was how much more visibly complex CPUs have gotten in the past ~35 years.

    Compare the bottom of that Xeon to the bottom of a CLCC package 286:
    https://en.wikipedia.org/wiki/Intel_80286#/media/F...

    And that doesn't even touch the difference internally... 134,000 transistors to 8 billion, and from 16MHz to 4,000MHz. The mind boggles.
