The NVIDIA Titan V Preview - Titanomachy: War of the Titans

Name: The NVIDIA Titan V Preview - Titanomachy: War of the Titans
Item: The NVIDIA Titan V Preview - Titanomachy: War of the Titans

by Ryan Smith & Nate Oh on December 20, 2017 11:30 AM EST

111 Comments | Add A Comment

111 Comments

Compute Performance

Diving in, I want to start with a look at compute performance. Titan V is first and foremost a compute product, so what makes or breaks the card – what justifies its existence – is how well it does at compute.

This is actually a more challenging aspect to test than it may first appear, as NVIDIA’s top-down architecture launch strategy means that there’s not a lot of commercial or otherwise widely-used software written that can take full advantage of the Volta architecture. Even in the major deep learning frameworks, their support for Volta is still in the early phases or still under active development. So I’ve opted to excise the deep learning tests for the moment, while we get a better handle on those for our full review. Meanwhile software that can take advantage of Volta’s fast FP16 and FP64 modes is also quite rare, especially as fast FP16 is not a consumer NVIDIA feature at this time.

Compute overall is harder to test, as unlike games or even professional visualization tasks, there’s far fewer standard workloads. Even tasks using major compute frameworks are often heavily tailored towards a specific need. So while our compute benchmarks are meant to be enlightening, software developers and compute users will want to carefully study how their workloads map to GPU hardware.

Anyhow, before we get too deep here, I want to take a quick look at FMA latency. Over successive generations of GPU architectures NVIDIA has continued to find ways to reduce the latency of what’s perhaps the most critical instruction in the GPU space, and Volta continues this trend.

Synthetic: Beyond3D Suite - Estimated FMA Latency

With Volta NVIDIA has been able to knock another two clock cycles off of the latency for an FMA, reducing it from 6 cycles on Pascal/Maxwell to 4 cycles. This is thanks to NVIDIA’s separating the INT32 and FP32 units in the Volta architecture, which allows more of the overall FMA operation to be performed in parallel. The end result is that for any instructions that depend on the result of an FMA operation, they can kick off 33% sooner, improving overall compute/shader efficiency.

Interestingly this was one area of compute where NVIDIA was trailing AMD – who has an average latency of around 4.5 cycles – so it’s great to see NVIDIA close the gap here. And this shouldn’t be considered a trivial change, as reducing the number of clock cycles for any fundamental operation is quite difficult. Case in point, I’m not sure just how NVIDIA could get FMAs any lower than 4 cycles.

General Matrix Multiply (GEMM) Performance

Starting things off, let’s take a look at General Matrix Multiply (GEMM) performance. GEMM measures the performance of dense matrix multiplication, which is an important operation in a variety of scientific fields. GEMM is highly parallel and typically compute heavy, and one of the first tests of performance and efficiency on any parallel architecture geared towards HPC workloads.

Since matrix multiplication is such a fundamental operation, it’s one of the operations included in NVIIDA’s CUBLAS library, meaning NVIDIA offers a highly optimized version of the function. Importantly, it also means that NVIDIA offers Volta-optimized implementations of the function, including operations that run on the densest matrix multiplication core of them all, the tensor core. So GEMM provides us a fantastic opportunity to look at the performance of not only the usual FP16/FP32/FP64 operations, but also the FP16 tensor cores themselves. This is very much an ideal case, but it gives us the means to see what kind of maximum throughput Titan V can reach.

Compute: General Matrix Multiply Performance (GEMM)

Burying the lede here, let’s start with single precision (FP32) performance. As mentioned briefly in our look at the specifications of the Titan V, the Titan V is not immensely more powerful than the Titan Xp when it comes to standard FP32 operations. It has a much greater number of FP32 CUDA cores, but the compute-centric chip trades that off with lower average clockspeeds, which dulls the difference some.

None the less, the Titan V is 23% faster in FP32 GEMM, which is actually a greater improvement than the 14% increase that we’d expect going by the specifications. In fact at 13.4 TFLOPS, the Titan V is at 97% of its rated 13.8 TFLOPS FP32 performance, an amazingly efficient metric. The Titan Xp isn’t as efficient here – hitting 10.9 of 12.1 TFLOPs for a 90% rating – underscoring the important architectural improvements Volta brings. I am not 100% sure on just which factors lead to this improvement, but the separation of INT32 and FP32 cores along with the scheduling changes are likely candidates. Titan V also benefits from improvements in caching and more memory bandwidth, which further reduces any potential bottlenecks in feeding the beast.

As for double precision (FP64) and half precision (FP16) performance, these are in line with our expectations. The official FP64 and FP16 rates for Titan V are 1:2 and 2:1 FP32 respectively, and this is what we find on GEMM. Compared to the Titan Xp, which didn’t offer fast FP64 or FP16 modes, the difference is night and day: Titan V is 16x and 121x faster respectively.

Last but not least, let’s talk tensors. I almost removed the tensor core results from this graph, but ultimately I left them in to highlight just what the tensor cores mean for compute developers who have tasks well-suited for them. When configured to use the tensor cores, our GEMM benchmark runs hit 97.5 TFLOPs, 3.7x faster than Titan V’s already fast FP16 performance. Or to put this another way, this one card is doing 97 trillion floating point operations in a single second.

Relative to the high efficiency marks we saw earlier with the CUDA cores, the tensor cores don’t come quite as close. 97.5 TFLOPs is about 89% of the Titan V’s rated 110 TFLOPs tensor core performance. Besides the fact that the tensor cores are extremely dense – they’ll give your power and cooling a good run for their money – I suspect that Titan V is feeling the pinch of memory bandwidth. The loss of a stack of HBM2 means that it’s trying to perform trillions of operations on only billions of bytes of memory bandwidth.

Finally, I can’t reiterate enough that when looking at tensor core performance, context is king here. The tensor cores are very fast, but they are highly specialized. GPUs in general are only good for a subset of overall computing tasks, and the tensor cores take that one step further. In super dense matrix multiplication scenarios they excel, but they have little to offer outside of that. Thankfully for NVIDIA and its newly found deep learning empire, neural networks map very well to tensor operations.

SiSoftware Sandra

Moving on from our low-level GEMM benchmark, let’s move up a level to something a bit more practical with some of SiSoftware’s Sandra’s GPU compute benchmarks. Unfortunately for reasons I haven’t yet been able to determine, Sandra’s CUDA benchmarks aren’t working with NVIDIA’s latest drivers, which limits us to working with OpenCL and DirectX compute shaders. Still, this gives us a chance a look at performance outside of specialized CUDA applications.

Compute: SiSoft Sandra 2017 - GP Processing (Compute Shader)

Relatively speaking, Titan V once again punches above its weight. In FP32 mode Sandra’s compute shader benchmark clocks the card as processing 25 Gigapixels per second, versus 18.4 for the Titan Xp, a 37% difference and once again exceeding the on-paper differences between these products. I suggest caution in reading too much into these synthetic compute results, as real-world workloads aren’t nearly as neat and tidy. But it continues to point to the Volta architecture having improved things by quite a bit under the hood.

Meanwhile, because NVIDA doesn’t expose fast FP16 mode outside of CUDA, the Titan V’s FP16 performance doesn’t get used to its fullest extent here. Fast FP64 on the other hand is exposed, and Titan V leaves every other Titan in the dust.

Compute: SiSoftware Sandra 2017 - Video Shader Compute (DX11)

The story is much the same using DX11 video (pixel) shaders. Though it’s very interesting that Titan V isn’t showing the same kind of sizable performance gains here as it did using a compute shader.

Compute: SiSoft Sandra 2017 - GP Financial Analysis (OpenCL)

A bit more real-world is Sandra’s financial analysis benchmark, which performs a series of operations including the ever-important Monte Carlo simulation. Titan V is once again well in the lead here, punching above its weight relative to the Titan Xp. FP64 performance is also much stronger than the other Titans, though on a relative basis, the performance drop-off for moving to FP64 is much more than the 50% performance hit that we’d expect on paper.

Meet Titan V, A Note on Graphics, & the Test Compute Performance: Geekbench 4, Folding @ Home, & CompuBench

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

111 Comments

View All Comments

Ryan Smith - Wednesday, December 20, 2017 - link
The "But can it run Crysis" joke started with the original Crysis in 2007. So it was only appropriate that we use it for that test. Especially since it let us do something silly like running 4x supersample anti-aliasing.
crysis3? - Wednesday, December 20, 2017 - link
ah
SirPerro - Wednesday, December 20, 2017 - link
They make it pretty clear everywhere this card is meant for ML training.

It's the only scenario where it makes sense financially.

Gaming is a big NO at 3K dollars per card. Mining is a big NO with all the cheaper specific chips for the task.

On ML it may mean halving or cutting by 4 the training time on a workstation, and if you have it running 24/7 for hyperparameter tuning it pays itself compared to the accumulated costs of Amazon or Google cloud machines.

An SLI of titans and you train huge models under a day in a local machine. That's a great thing to have.
mode_13h - Wednesday, December 27, 2017 - link
The FP64 performance indicates it's also aimed at HPC. One has to wonder how much better it could be at each, if it didn't also have to do the other.

And for multi-GPU, you really want NVlink - not SLI.
takeshi7 - Wednesday, December 20, 2017 - link
Will game developers be able to use these tensor cores to make the AI in their games smarter? That would be cool if AI shifted from the CPU to the GPU.
DanNeely - Wednesday, December 20, 2017 - link
First and formost, that depends if mainstream Volta cards get tensor cores.

Beyond that I'm not sure how much it'd help there directly, AFAIK what Google/etc are doing with machine learning and neural networks is very different from typical game AI.
tipoo - Wednesday, December 20, 2017 - link
They're more for training the neural nets than actually executing a games AI routine.
hahmed330 - Wednesday, December 20, 2017 - link
Finally a card that can properly nail Crysis!
crysis3? - Wednesday, December 20, 2017 - link
closer to 55fps if it were crysis 3 maxed out
crysis3? - Wednesday, December 20, 2017 - link
because he benchmarked the first crysis

The NVIDIA Titan V Preview - Titanomachy: War of the Titans

Compute Performance

General Matrix Multiply (GEMM) Performance

SiSoftware Sandra

Post Your Comment

111 Comments

View All Comments

Ryan Smith - Wednesday, December 20, 2017 - link

crysis3? - Wednesday, December 20, 2017 - link

SirPerro - Wednesday, December 20, 2017 - link

mode_13h - Wednesday, December 27, 2017 - link

takeshi7 - Wednesday, December 20, 2017 - link

DanNeely - Wednesday, December 20, 2017 - link

tipoo - Wednesday, December 20, 2017 - link

hahmed330 - Wednesday, December 20, 2017 - link

crysis3? - Wednesday, December 20, 2017 - link

crysis3? - Wednesday, December 20, 2017 - link

Log in

Don't have an account? Sign up now