Intel's Xeon Cascade Lake vs. NVIDIA Turing: An Analysis in AI
by Johan De Gelas on July 29, 2019 8:30 AM EST
Convolutional Neural Network Training
For a long time, the way forward in CNNs was to increase the number of layers – increasing the network depth for "even deeper learning". As you can probably guess, this resulted in diminishing returns and made the already complex neural networks even harder to tune, leading to more training errors.
The ResNet-50 benchmark is based upon residual networks (hence ResNet), which have the merit of fewer training errors as the network gets deeper.
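To make the residual idea concrete, here is a minimal sketch in plain Python. It is not the ResNet-50 architecture (which uses convolutions, batch normalization and bottleneck blocks); the helper names `linear`, `relu` and `residual_block` are made up for this illustration. It only shows the skip connection, y = relu(F(x) + x), and why an extra block can fall back to the identity when its weights contribute nothing.

```python
from typing import List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(weights: List[Vector], v: Vector) -> Vector:
    """Plain matrix-vector product; stands in for the block's learned layers F(x)."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(v: Vector, weights: List[Vector]) -> Vector:
    """Compute relu(F(x) + x); the '+ x' term is the skip connection."""
    fx = linear(weights, v)
    return relu([f + x for f, x in zip(fx, v)])


# Hypothetical example: with all-zero weights, F(x) is zero and the block
# passes a non-negative input through unchanged.
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
out = residual_block([1.0, 2.0], zero_weights)
print(out)
```

Because the block only has to learn a correction on top of its input, stacking more of them does not force the network to relearn the identity mapping at every layer.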
Meanwhile, as a little bit of internal housekeeping here, for regular readers I’ll note that the benchmark below is not directly comparable to the one that Nate ran for our Titan V review. It is the same benchmark, but Nate ran the standard ResNet-50 training implementation that is included in NVIDIA's Caffe2 Docker image. However, since my group is mostly using TensorFlow as a deep learning framework, we tend to stick with it. All benchmarking
tf_cnn_benchmarks.py --num_gpus=1 --model=resnet50 --variable_update=parameter_server
The model trains on ImageNet and gives us throughput data.
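For readers who want to sweep the batch sizes and precisions discussed here, the following sketch builds one command line per combination. The base flags are the ones shown above; `--batch_size` and `--use_fp16` are assumptions about the script's options and should be checked against its `--help` output before use. The sketch only prints the commands and does not launch them.

```python
from itertools import product
from typing import List

# Flags taken from the command shown in the article.
BASE_COMMAND = [
    "python",
    "tf_cnn_benchmarks.py",
    "--num_gpus=1",
    "--model=resnet50",
    "--variable_update=parameter_server",
]


def build_commands(batch_sizes: List[int], fp16_options: List[bool]) -> List[List[str]]:
    """Return one argument list per (batch size, precision) combination."""
    commands = []
    for batch_size, use_fp16 in product(batch_sizes, fp16_options):
        # --batch_size and --use_fp16 are assumed flag names; verify them first.
        extra = [f"--batch_size={batch_size}", f"--use_fp16={use_fp16}"]
        commands.append(BASE_COMMAND + extra)
    return commands


commands = build_commands([128, 512], [False, True])
for command in commands:
    print(" ".join(command))
```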
Several benchmarks are missing, and for a good reason. Running a batch size of 512 training samples at FP32 precision on the Titan RTX results in an "out of memory" error, as the card "only" has 24 GB available.
Meanwhile, on the Intel CPUs, half precision (FP16) is not (yet) available. AVX512_BF16 (bfloat16) will be available in Cascade Lake's successor, Cooper Lake.
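As a rough illustration of what bfloat16 stores: it keeps the sign bit, the full 8-bit exponent and the top 7 mantissa bits of a 32-bit float, so it can be seen as the upper half of an FP32 value. The sketch below converts by simple truncation, which is an assumption made for brevity; real hardware and libraries typically round to nearest. The function names are made up for this example.

```python
import struct


def float32_to_bfloat16_bits(value: float) -> int:
    """Return the 16-bit bfloat16 pattern obtained by truncating a float32."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16  # drop the 16 low mantissa bits


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def roundtrip(value: float) -> float:
    """Show how much precision survives a trip through bfloat16."""
    return bfloat16_bits_to_float(float32_to_bfloat16_bits(value))


print(hex(float32_to_bfloat16_bits(1.0)))
print(roundtrip(3.14159))
```

The range matches FP32 because the exponent is untouched; what is given up is precision, with only two to three significant decimal digits left.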
It has been observed that using a larger batch can cause significant degradation in the quality of the model, as measured by its ability to generalize. So although larger batch sizes (512) make better use of the massive parallelism inside the GPU, the results with the lower batch sizes (128) are useful too. The accuracy of the model loses only a few percent, but in many applications a loss of even a few percent is significant.
So while you could quickly conclude that Titan RTX is seven times faster than the best CPU, it is more accurate to say that it is between 4.5 and 7 times faster depending on the accuracy you want.
Inception (v3)
Inception is based upon GoogLeNet. Contrary to the earlier dense neural networks, GoogLeNet was based on the idea that neural networks can be much more efficient if you do not connect every neuron in every layer to the next one. The downside of this optimization is that this results in sparse matrices, which are far from optimal for the typical SIMD/GPU architectures and their BLAS software.
Overall, the main goal of "Inception" was to turn GoogLeNet into a neural network that would result in dense matrix multiplication. Or in other words, something that ran a lot faster on a GPU or SIMD hardware. In the end, version 3 of this neural network has proven to be even more accurate than ResNet-50.
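A heavily simplified sketch of the idea behind an Inception-style module: several small dense filters of different sizes run side by side over the same input, and their outputs are stacked as channels. The real modules use 2-D convolutions over image tensors; this 1-D, pure-Python version and its function names are assumptions made purely for illustration.

```python
from typing import List


def conv1d_same(signal: List[float], kernel: List[float]) -> List[float]:
    """Dense 1-D cross-correlation with zero padding; expects an odd kernel length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + signal + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def inception_like_module(signal: List[float], kernels: List[List[float]]) -> List[List[float]]:
    """Run each kernel as its own branch and stack the results as channels."""
    return [conv1d_same(signal, kernel) for kernel in kernels]


# Hypothetical input and all-ones kernels of width 1, 3 and 5.
signal = [1.0, 2.0, 3.0, 4.0]
branches = inception_like_module(signal, [[1.0], [1.0] * 3, [1.0] * 5])
for channel in branches:
    print(channel)
```

Each branch is an ordinary dense operation, which is the kind of work SIMD units and GPUs handle well.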
This time, the GPU is about 3 to 5 times faster, depending on the batch size. It is interesting to note that ResNet is more GPU friendly than Inception. But of course, this only matters for academics and hardware enthusiasts.
Software engineers who have to build AI models will, however, quickly point out that a $3k GPU is at least 3 times faster than a $20k+ (or worse) CPU configuration. And they are right: there is no contest. When it comes to Convolutional Neural Networks, the rock stars of AI, a good GPU (with a good software stack) will mop the floor with even the best CPUs. In a datacenter you typically encounter the NVIDIA Tesla GPUs, which cost around four times more but offer anywhere from 1.5x to 2x the performance of similar Titan cards.
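The price/performance argument can be put into numbers with a few lines of arithmetic. The sketch below uses only the rough figures quoted in this section (a $3k GPU, a $20k CPU configuration, a 3x speedup, and Tesla cards at about 4x the price for 1.5x to 2x the performance); the function name is made up for this example, and the results are back-of-the-envelope ratios, not measurements.

```python
def perf_per_dollar_ratio(speedup: float, price_fast: float, price_slow: float) -> float:
    """How many times more throughput per dollar the faster device delivers."""
    return speedup * price_slow / price_fast


# Titan RTX versus the CPU configuration, using the article's rough figures.
gpu_vs_cpu = perf_per_dollar_ratio(3.0, 3_000.0, 20_000.0)

# Tesla versus Titan, with prices expressed relative to the Titan (4x versus 1x).
tesla_low = perf_per_dollar_ratio(1.5, 4.0, 1.0)
tesla_high = perf_per_dollar_ratio(2.0, 4.0, 1.0)

print(gpu_vs_cpu, tesla_low, tesla_high)
```

On these figures the GPU delivers roughly 20 times the CNN training throughput per dollar of the CPU configuration, while a Tesla card delivers between about 0.4 and 0.5 times the throughput per dollar of a Titan.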
56 Comments
Bp_968 - Tuesday, July 30, 2019 - link
Oh no, not 8 million, 8 *billion* (for the 8180 xeon), and 19.2 *billion* for the last gen AMD 32 core epyc! I don't think they have released much info on the new epyc yet but it's safe to assume it's going to be 36-40 billion! (I don't know how many transistors are used in the I/O controller).
And like you said, the connections are crazy! The xeon has a 5903 BGA connection so it doesn't even socket, it's soldered to the board.
ozzuneoj86 - Sunday, August 4, 2019 - link
Doh! Thanks for correcting the typo!
Yes, 8 BILLION... it's incredible! It's even more difficult to fathom that these things, with billions of "things" in such a small area are nowhere near as complex or versatile as a similarly sized living organism.
s.yu - Sunday, August 4, 2019 - link
Well the current magnetic storage is far from the storage density of DNA, in this sense.
FunBunny2 - Monday, July 29, 2019 - link
"As a single SQL query is nowhere near as parallel as Neural Networks – in many cases they are 100% sequential"
hogwash. SQL, or rather the RM which it purports to implement, is embarrassingly parallel; these are set operations which care not a fig for order. the folks who write SQL engines, OTOH, are still stuck in C land. with SSD seq processing so much faster than HDD, app developers are reverting to 60s tape processing methods. good for them.
bobhumplick - Tuesday, July 30, 2019 - link
so cpus will become more gpu like and gpus will become more cpu like. you got your avx in my cuda core. no, you got your cuda core in my avx......mmmmmm
bobhumplick - Tuesday, July 30, 2019 - link
intel need to get those gpus out quick
Amiba Gelos - Tuesday, July 30, 2019 - link
LSTM in 2019? At least try GRU or transformer instead.
LSTM is notorious for its non-parallelizability, skewing the result toward cpu.
Rudde - Tuesday, July 30, 2019 - link
I believe that's why they benchmarked LSTM. They benchmarked gpu stronghold CNNs to show great gpu performance and benchmarked LSTM to show great cpu performance.
Amiba Gelos - Tuesday, July 30, 2019 - link
Recommendation pipeline already demonstrates the necessity of good cpus for ML.
Imho benching LSTM to showcase cpu perf is misleading. It is slow, performing equally or worse than alts, and got replaced by transformer and cnn in NMT and NLP.
Heck why not wavenet? That's real world app.
I bet cpu would perform even "better" lol.