Convolutional, Recurrent, & Scalability: Finding a Balance

Despite the fact that Intel's Xeon Phi was a market failure as an accelerator and has been discontinued, Intel has not given up on the concept. The company still wants a bigger piece of the AI market, including pieces that may otherwise be going to NVIDIA.

To quote Intel’s Naveen Rao:

Customers are discovering that there is no single “best” piece of hardware to run the wide variety of AI applications, because there’s no single type of AI.

And Naveen makes a salient point. Although NVIDIA has never claimed to provide the best hardware for all types of AI, a superficial look at the most-cited benchmarks in press releases across the industry (ResNet, Inception, etc.) would almost make you believe there is only one type of AI that matters. Convolutional Neural Networks (CNNs or ConvNets) dominate the benchmarks and product presentations, as they are the most popular technology for analyzing images and video. Anything that can be expressed as “2D input” is a potential candidate for the input layers of these popular neural networks.
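
To make the “2D input” idea concrete, here is a minimal sketch of such a network, assuming PyTorch; the TinyConvNet name and the 32/64-channel layer widths are purely illustrative, not any specific published architecture:

```python
import torch
import torch.nn as nn

# A minimal ConvNet for "2D input" such as 224x224 RGB images.
# Layer widths are hypothetical, chosen for illustration only.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 2D convolution over height x width
            nn.ReLU(),
            nn.MaxPool2d(2),                              # halve the spatial resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # global average pooling
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)                              # (N, 64, 1, 1)
        return self.classifier(x.flatten(1))              # (N, num_classes)

model = TinyConvNet()
logits = model(torch.randn(8, 3, 224, 224))               # a batch of 8 synthetic "images"
print(logits.shape)                                       # torch.Size([8, 1000])
```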

Some of the most spectacular breakthroughs in recent years have been made with CNNs. It’s no accident that ResNet performance has become so popular, for example. The associated ImageNet database, a collaboration between Stanford University and Princeton University, contains fourteen million images; and until the last decade, AI performance on recognizing those images was very poor. CNNs changed that in short order, and it has been one of the most popular AI challenges ever since, as companies look to outdo each other in categorizing this database faster and more accurately than ever before.

To put all of this on a timeline, as early as 2012, AlexNet, a relatively simple neural network, achieved significantly better accuracy than the traditional machine learning techniques in an ImageNet classification competition. In that test, it achieved an 85% accuracy rate (a 15% error rate), nearly half the 27% error rate of the more traditional approaches, which reached only 73% accuracy.

In 2015, the famous Inception V3 achieved a 3.58% error rate in classifying the images, which is similar to (or even slightly better than) a human. The ImageNet challenge got harder, but CNNs got better even without increasing the number of layers, courtesy of residual learning. This led to the well-known “ResNet” CNN, now one of the most popular AI benchmarks. To cut a long story short, CNNs are the rockstars of the AI universe. They get by far the most attention, testing, and research.

CNNs are also very scalable: a network’s training time drops (almost) linearly as you add more GPUs. Put bluntly, CNNs are a gift from the heavens for NVIDIA, and they are the most common reason why people invest in NVIDIA’s expensive DGX servers ($400k) or buy multiple Tesla GPUs ($7k+).
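
That near-linear scaling comes from data parallelism: each GPU processes its own slice of every mini-batch and the resulting gradients are averaged. Below is a minimal sketch of the idea, assuming PyTorch and its simple DataParallel wrapper; real multi-node setups would more likely use DistributedDataParallel, and the toy model and batch sizes are illustrative only:

```python
import torch
import torch.nn as nn

# Data parallelism in a nutshell: each GPU gets a slice of the mini-batch and
# the gradients are averaged afterwards. Model and data shapes are illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # splits every batch across the visible GPUs
model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x = torch.randn(64, 3, 32, 32, device=device)   # one synthetic mini-batch
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                                  # gradients are averaged over the GPU shards
optimizer.step()
print(loss.item())
```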

Still, there is more to AI than CNNs. Recurrent Neural Networks, for example, are also popular for speech recognition, language translation, and time series analysis.
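
Unlike a CNN, an RNN consumes its input one timestep at a time, with each step depending on the previous hidden state, which is also part of why RNNs are harder to parallelize. A minimal sketch, assuming PyTorch and purely illustrative sizes:

```python
import torch
import torch.nn as nn

# An LSTM processes a sequence step by step; every timestep depends on the
# hidden state of the one before it. Sizes below are illustrative only.
rnn = nn.LSTM(input_size=80, hidden_size=256, num_layers=2, batch_first=True)

frames = torch.randn(4, 100, 80)      # batch of 4 sequences, 100 timesteps, 80 features each
outputs, (h_n, c_n) = rnn(frames)     # outputs: one 256-dim vector per timestep
print(outputs.shape, h_n.shape)       # torch.Size([4, 100, 256]) torch.Size([2, 4, 256])
```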

This is why the MLPerf benchmark initiative is so important. For the first time, we are getting a benchmark that is not completely dominated by CNNs.

Taking a quick look at MLPerf, the image and object classification benchmarks are CNNs of course, but RNNs (via neural machine translation) and collaborative filtering are also represented. Meanwhile, even the recommendation engine test is based on a neural network, so technically speaking there is no "traditional" machine learning test included, which is unfortunate. But as this is version 0.5 and the organization is inviting more feedback, it is certainly promising, and once it matures we expect it to be the best benchmark available.

Looking at some of the first data, however, via Dell’s benchmarks, it is crystal clear that not all neural networks are as scalable as CNNs. While the ResNet CNN easily quadruples its performance when you move to four times the number of GPUs (and add a second CPU), the collaborative filtering method offers only 50% higher performance.
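
A quick way to compare the two results is scaling efficiency: the achieved speedup divided by the ideal (linear) speedup. The sketch below uses only the rough figures quoted above and makes no claim about Dell's exact methodology:

```python
# Back-of-the-envelope scaling efficiency from the numbers quoted above:
# going from 1 to 4 GPUs, ResNet roughly quadruples its throughput, while the
# collaborative filtering test gains only about 50%.
def scaling_efficiency(speedup: float, gpu_factor: int) -> float:
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / gpu_factor

print(scaling_efficiency(4.0, 4))   # ResNet:                  1.0   -> ~100% efficient
print(scaling_efficiency(1.5, 4))   # collaborative filtering: 0.375 -> ~38% efficient
```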

In fact, quite a bit of academic research revolves around optimizing and adapting CNNs so that they handle these sequence modelling workloads just as well as RNNs, and as a result can replace their less scalable counterparts.
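
The usual approach is a causal, dilated 1D convolution, the building block of so-called temporal convolutional networks: it only looks at past timesteps, yet an entire sequence can be processed in parallel. A minimal sketch, assuming PyTorch; the CausalConv1d block below is illustrative, and a real TCN would stack several such blocks with residual connections:

```python
import torch
import torch.nn as nn

# A causal, dilated 1D convolution: output[t] only depends on inputs up to t,
# but the whole sequence is computed in one parallel pass (unlike an RNN).
class CausalConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left-pad so no future leaks in
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))        # pad only on the left (the past)
        return torch.relu(self.conv(x))

block = CausalConv1d(channels=64, dilation=2)
seq = torch.randn(8, 64, 100)                          # 8 sequences, 64 channels, 100 steps
print(block(seq).shape)                                # torch.Size([8, 64, 100])
```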

56 Comments

  • Drumsticks - Monday, July 29, 2019 - link

    It's an interesting, valuable take on the challenges of responding to many of the ML workloads of today with a general purpose CPU, thanks! A third party review of Intel's latest against Nvidia, and even throwing AMD into the mix, is pretty helpful as the two companies have been going at it for a while now.

    Intel has a lot of stuff going that should make the next few years quite interesting. If they manage to follow through on the Nervana Coprocessor/NNP-I that Toms talked about, or on their discrete GPUs, they'll have a potent lineup. The execution definitely isn't guaranteed, especially given the software reliance these products will have, but if Intel really can manage to transform their product stack, and do it in the next few years, they'll be well on their way to competing in a much larger market, and defending their current one.

    OTOH, if they fail with all of them, it'll definitely be bad news for their future. They obviously won't go bankrupt (they'll continue to be larger than AMD for the foreseeable future), but it'll be exponentially harder if not impossible to get back into those markets they missed.
  • JohanAnandtech - Monday, July 29, 2019 - link

    Thanks! Indeed, Nervana coprocessors are Intel's most promising technology in this area.
  • p1esk - Monday, July 29, 2019 - link

    No one in their right mind would think "gee, should I get CPU or GPU for my DL app?" More concerning for Intel should be the fact that I bought a Threadripper for my latest DL build.
  • Smell This - Monday, July 29, 2019 - link

    You got a Radeon VII?

    I'm thinking Intel, and to a lesser extent, nVidia, is waiting for the next shoe(s) to drop in **Big Compute** --- Cascade Lake has been left at the starting gate.

    An AMD Radeon Instinct 'cluster' on a dense specialized 'chiplet' server with hundreds of CPU cores/threads is where this train is headed ...
  • JohanAnandtech - Monday, July 29, 2019 - link

    Spinning up a GPU based instance on Amazon is much more expensive than a CPU one. So for development purposes, this question is asked.
  • p1esk - Tuesday, July 30, 2019 - link

    Then you should be answering precisely that question: which instance should I spin up? Your article does not help with that because the CPU you test is more expensive than the GPU.
  • JohnnyClueless - Monday, July 29, 2019 - link

    Really surprised Intel, and to a lesser extent AMD, are even trying to fight this battle with nVidia on these terms. It’s a lot like going to a gun fight and developing an extra sharp samurai sword rather than bringing the usual switchblade knife. The sword may be awesome, but it’s always going to be the wrong tool for the gun fight.

    IMO, a better approach to capture market share in DL/AI/HPC might be to develop a low core count (by 2019 standards) CPU that excelled at sequential single threaded performance. Something like 6-10 GHz. That would provide a huge and tangible boost to any workload that is at least partially single core frequency limited, and that is most DL/AI/HPC workloads. Leave the parallel computing to chips and devices designed to excel at such workloads!
  • Eris_Floralia - Monday, July 29, 2019 - link

    Still living in early 2000s?
  • FunBunny2 - Monday, July 29, 2019 - link

    "Something like 6-10 GHz. "

    IIRC, all the chip makers tried to get near that, but couldn't. It's not nice to fool Mother Nature.
  • Santoval - Monday, July 29, 2019 - link

    "Something like 6-10 GHz."
    Google "Dennard scaling" (which ended in ~2005) to find out why this is impossible, at least with silicon based MOSFET transistors (including the GAA-FET based ones of the next decade). Wikipedia has a very informative page with multiple links to various sources for even more. The gist of the end of Dennard scaling is that single core clocks higher than ~5 GHz (at a reasonable TDP of up to ~100W) are explicitly forbidden at *any* node.

    When Dennard scaling ended (in combination with the slowing down of Moore's Law) there was another, related consequence: Koomey's law started to slow down. Koomey's law is all about power efficiency, i.e. how many computations you can extract from each Wh or kWh.

    Before the early 2000s the number of computations per unit of energy doubled on average every 1.57 years. In 2011 Koomey himself re-evaluated his law and got an average doubling of computations every 2.6 years for the previous decade, a substantial slowdown in the rate of efficiency gains. Since 2011 Koomey's law has obviously slowed down further.
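
    As a rough illustration of what those doubling times imply over a decade (a back-of-the-envelope sketch using only the 1.57 and 2.6 year figures above):

```python
# Efficiency improvement over a decade implied by each doubling time.
def efficiency_gain(years: float, doubling_time: float) -> float:
    return 2 ** (years / doubling_time)

print(f"{efficiency_gain(10, 1.57):.0f}x")   # ~83x more computations per kWh per decade
print(f"{efficiency_gain(10, 2.6):.0f}x")    # ~14x per decade at the slower pace
```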

    To make a long story short, Moore's law puts a limit on the number of transistors we can fit in each mm^2, and that limit is not too far away. Dennard scaling once allowed us to raise clocks with each new node at the same TDP, and this is ancient history in computing terms. Koomey's law, finally, puts a limit on the power efficiency of our CPUs/GPUs, and it continues to slow down due to the slowing down of Moore's Law (when Moore's Law ends Koomey's law will also end, thus all three fundamental computing laws will be "dead").

    Unless we ditch silicon (and even CMOS transistors, if required) and adopt a new computing paradigm, we will have neither 6 - 10 GHz clocked CPUs in a couple of decades nor will we be able to speed up CPUs, GPUs and computers at all.
