Recurrent Neural Networks: LSTM 

Our loyal readers know that we love real-world enterprise benchmarks. So in our quest for better benchmarks and better data, Pieter Bovijn, the head of research at the MCT IT Bachelor (dutch), turned a real-world AI model into a benchmark.  

The input of the model is time series data, which is used to make predictions on how the time series will behave in the future. As this is a typical sequence prediction problem, we used a Long Short-Term Memory (LSTM) network as neural network. A type of RNN, LSTM selectively "remembers" patterns over a certain duration of time.

LSTM however come with the disadvantage that they are a lot more bandwidth intensive. We quote a recent paper on the topic:

LSTMs exhibit quite inefficient memory access pattern when executed on mobile GPUs due to the redundant data movements and limited off-chip bandwidth.

So we were very curious about how the LSTM network would behave. After all, our server Xeons have ample bandwidth, with a massive 38.5 MB of L3 and six channels of DDR4-2666/2933 (128-141 GB/s per socket). We run this test with 50 GB of data, and train the model for 5 epochs.

Of course, you have the make the most of the available AVX/AVX2/AVX512 SIMD power. That is why we tested with 3 different setups

  1. We used out of the box TensorFlow with conda
  2. We tested with the Intel optimized TensorFlow from PyPi repo
  3. We optimized from source using Bazel. This allowed us to use the very latest version of TensorFlow.

The results are very interesting.

LSTM MCT Benchmark

The most intensive TensorFlow applications are typically run on GPUs, so extra care must be taken when you test on a CPU. AMD's Zen core only has two 128-bit FMACs, and is limited to (256-bit) AVX2. Intel's high-end Xeons have two 256-bit FMACs and one 512-bit FMAC. In other words, on paper Intel's Xeon can deliver four times more FLOPs per clock cycle than AMD. But only if the software is right. Intel has been working intensively with Google to optimize TensorFlow for Intel new Xeons out of necessity: it has to offer a credible alternative in those situations where an NVIDIA Tesla is simply too expensive. Meanwhile, AMD hopes that ROCm catches on and that in the future software engineers run TensorFlow on a Radeon Pro. 

Of course, the big question is how this compares to a GPU. Let us see how our NVIDIA Titan RTX deals with this workload.

LSTM MCT Benchmark (vs GPU)

First of all, we noticed that FP16 did not make much of a difference. Secondly, we were quite amazed that our Titan RTX was less than 3 times faster than our dual Xeon setup.

Investigating further with NVIDIA's System Management Interface (SMI), we found out that GPU did run at a its highest turbo speed: 1.9 GHz, which is higher than the expected 1.775 GHz. Meanwhile utilization dropped to 40% from time to time.

Ultimately this is another example of how real-world applications behave differently from benchmarks, and how important software optimization is. If we would have just used conda, the results above would be very different. Using the right optimized software made the application run 2 to 6 times faster. Also, this another data point that proves that CNNs might be one of the best use cases for GPUs. You should use a GPU to decrease training times of complex LSTMs of course. Still, this kind of neural network is a bit more tricky - you cannot simply add more GPUs to further decrease training time. 

Convolutional Neural Network Training Inference: ResNet-50
POST A COMMENT

56 Comments

View All Comments

  • C-4 - Monday, July 29, 2019 - link

    It's interesting that optimizations did so much for the Intel processors (but relatively less for the AMD ones). Who made these optimizations? How much time was devoted to doing this? How close are the algorithms to being "fully optimized" for the AMD and nVidia chips? Reply
  • quorm - Monday, July 29, 2019 - link

    I believe these optimizations largely take advantage of AVX512, and are therefore intel specific, as amd processors do not incorporate this feature. Reply
  • RSAUser - Monday, July 29, 2019 - link

    As quorm said, I'd assume it's due to AVX512 optimizations, the next generation of AMD Epyc CPU's should support it, and I am hoping closer to 3GHz clock speeds on the 64 core chips, since it seems the new ceiling is around the 4GHz mark for 16 all-core.

    It will be an interesting Q3/Q4 for Intel in the server market this year.
    Reply
  • SarahKerrigan - Monday, July 29, 2019 - link

    Next generation? You mean Rome? Zen2 doesn't have any AVX512. Reply
  • HStewart - Tuesday, July 30, 2019 - link

    I believe AMD AVX 2 is dual-128 bit instead of 256bit - so AVX 512 would probably be quad 128bit . Reply
  • jospoortvliet - Tuesday, July 30, 2019 - link

    That’s not really how it works, in the sense that you explicitly need to support the new instructions... and amd doesn’t (plan to, as far as we know). Reply
  • Qasar - Tuesday, July 30, 2019 - link

    from wikipedia :
    " AVX2 is now fully supported, with an increase in execution unit width from 128-bit to 256-bit. "

    " AMD has increased the execution unit width from 128-bit to 256-bit, allowing for single-cycle AVX2 calculations, rather than cracking the calculation into two instructions and two cycles."
    which is from here : https://www.anandtech.com/show/14525/amd-zen-2-mic...

    looks like AVX2 is single 256 bit :-)
    Reply
  • name99 - Monday, July 29, 2019 - link

    Regarding the limits of large batches: while this is true in principle, the maximum size of those batches can be very large, is hard to predict (at leas right now) and there is on-going work to increase the sizes, This link describes some of the issue and what’s known:

    http://ai.googleblog.com/2019/03/measuring-limits-...

    I think Intel would be foolish to pin many hopes on the assumption that batch scaling will soon end the superior performance of GPUs and even more specialized hardware...
    Reply
  • brunohassuna - Monday, July 29, 2019 - link

    Some information about energy consumption would very useful in comparisons like that Reply
  • ozzuneoj86 - Monday, July 29, 2019 - link

    My first thought when clicking this article was how much more visibly-complex CPUs have gotten in the past ~35 years.

    Compare the bottom of that Xeon to the bottom of a CLCC package 286:
    https://en.wikipedia.org/wiki/Intel_80286#/media/F...

    And that doesn't even touch the difference internally... 134,000 transistors to 8 million and from 16Mhz to 4,000Mhz. The mind boggles.
    Reply

Log in

Don't have an account? Sign up now