The NVIDIA Titan V Preview - Titanomachy: War of the Titansby Ryan Smith & Nate Oh on December 20, 2017 11:30 AM EST
Diving in, I want to start with a look at compute performance. Titan V is first and foremost a compute product, so what makes or breaks the card – what justifies its existence – is how well it does at compute.
This is actually a more challenging aspect to test than it may first appear, as NVIDIA’s top-down architecture launch strategy means that there’s not a lot of commercial or otherwise widely-used software written that can take full advantage of the Volta architecture. Even in the major deep learning frameworks, their support for Volta is still in the early phases or still under active development. So I’ve opted to excise the deep learning tests for the moment, while we get a better handle on those for our full review. Meanwhile software that can take advantage of Volta’s fast FP16 and FP64 modes is also quite rare, especially as fast FP16 is not a consumer NVIDIA feature at this time.
Compute overall is harder to test, as unlike games or even professional visualization tasks, there’s far fewer standard workloads. Even tasks using major compute frameworks are often heavily tailored towards a specific need. So while our compute benchmarks are meant to be enlightening, software developers and compute users will want to carefully study how their workloads map to GPU hardware.
Anyhow, before we get too deep here, I want to take a quick look at FMA latency. Over successive generations of GPU architectures NVIDIA has continued to find ways to reduce the latency of what’s perhaps the most critical instruction in the GPU space, and Volta continues this trend.
With Volta NVIDIA has been able to knock another two clock cycles off of the latency for an FMA, reducing it from 6 cycles on Pascal/Maxwell to 4 cycles. This is thanks to NVIDIA’s separating the INT32 and FP32 units in the Volta architecture, which allows more of the overall FMA operation to be performed in parallel. The end result is that for any instructions that depend on the result of an FMA operation, they can kick off 33% sooner, improving overall compute/shader efficiency.
Interestingly this was one area of compute where NVIDIA was trailing AMD – who has an average latency of around 4.5 cycles – so it’s great to see NVIDIA close the gap here. And this shouldn’t be considered a trivial change, as reducing the number of clock cycles for any fundamental operation is quite difficult. Case in point, I’m not sure just how NVIDIA could get FMAs any lower than 4 cycles.
General Matrix Multiply (GEMM) Performance
Starting things off, let’s take a look at General Matrix Multiply (GEMM) performance. GEMM measures the performance of dense matrix multiplication, which is an important operation in a variety of scientific fields. GEMM is highly parallel and typically compute heavy, and one of the first tests of performance and efficiency on any parallel architecture geared towards HPC workloads.
Since matrix multiplication is such a fundamental operation, it’s one of the operations included in NVIIDA’s CUBLAS library, meaning NVIDIA offers a highly optimized version of the function. Importantly, it also means that NVIDIA offers Volta-optimized implementations of the function, including operations that run on the densest matrix multiplication core of them all, the tensor core. So GEMM provides us a fantastic opportunity to look at the performance of not only the usual FP16/FP32/FP64 operations, but also the FP16 tensor cores themselves. This is very much an ideal case, but it gives us the means to see what kind of maximum throughput Titan V can reach.
Burying the lede here, let’s start with single precision (FP32) performance. As mentioned briefly in our look at the specifications of the Titan V, the Titan V is not immensely more powerful than the Titan Xp when it comes to standard FP32 operations. It has a much greater number of FP32 CUDA cores, but the compute-centric chip trades that off with lower average clockspeeds, which dulls the difference some.
None the less, the Titan V is 23% faster in FP32 GEMM, which is actually a greater improvement than the 14% increase that we’d expect going by the specifications. In fact at 13.4 TFLOPS, the Titan V is at 97% of its rated 13.8 TFLOPS FP32 performance, an amazingly efficient metric. The Titan Xp isn’t as efficient here – hitting 10.9 of 12.1 TFLOPs for a 90% rating – underscoring the important architectural improvements Volta brings. I am not 100% sure on just which factors lead to this improvement, but the separation of INT32 and FP32 cores along with the scheduling changes are likely candidates. Titan V also benefits from improvements in caching and more memory bandwidth, which further reduces any potential bottlenecks in feeding the beast.
As for double precision (FP64) and half precision (FP16) performance, these are in line with our expectations. The official FP64 and FP16 rates for Titan V are 1:2 and 2:1 FP32 respectively, and this is what we find on GEMM. Compared to the Titan Xp, which didn’t offer fast FP64 or FP16 modes, the difference is night and day: Titan V is 16x and 121x faster respectively.
Last but not least, let’s talk tensors. I almost removed the tensor core results from this graph, but ultimately I left them in to highlight just what the tensor cores mean for compute developers who have tasks well-suited for them. When configured to use the tensor cores, our GEMM benchmark runs hit 97.5 TFLOPs, 3.7x faster than Titan V’s already fast FP16 performance. Or to put this another way, this one card is doing 97 trillion floating point operations in a single second.
Relative to the high efficiency marks we saw earlier with the CUDA cores, the tensor cores don’t come quite as close. 97.5 TFLOPs is about 89% of the Titan V’s rated 110 TFLOPs tensor core performance. Besides the fact that the tensor cores are extremely dense – they’ll give your power and cooling a good run for their money – I suspect that Titan V is feeling the pinch of memory bandwidth. The loss of a stack of HBM2 means that it’s trying to perform trillions of operations on only billions of bytes of memory bandwidth.
Finally, I can’t reiterate enough that when looking at tensor core performance, context is king here. The tensor cores are very fast, but they are highly specialized. GPUs in general are only good for a subset of overall computing tasks, and the tensor cores take that one step further. In super dense matrix multiplication scenarios they excel, but they have little to offer outside of that. Thankfully for NVIDIA and its newly found deep learning empire, neural networks map very well to tensor operations.
Moving on from our low-level GEMM benchmark, let’s move up a level to something a bit more practical with some of SiSoftware’s Sandra’s GPU compute benchmarks. Unfortunately for reasons I haven’t yet been able to determine, Sandra’s CUDA benchmarks aren’t working with NVIDIA’s latest drivers, which limits us to working with OpenCL and DirectX compute shaders. Still, this gives us a chance a look at performance outside of specialized CUDA applications.
Relatively speaking, Titan V once again punches above its weight. In FP32 mode Sandra’s compute shader benchmark clocks the card as processing 25 Gigapixels per second, versus 18.4 for the Titan Xp, a 37% difference and once again exceeding the on-paper differences between these products. I suggest caution in reading too much into these synthetic compute results, as real-world workloads aren’t nearly as neat and tidy. But it continues to point to the Volta architecture having improved things by quite a bit under the hood.
Meanwhile, because NVIDA doesn’t expose fast FP16 mode outside of CUDA, the Titan V’s FP16 performance doesn’t get used to its fullest extent here. Fast FP64 on the other hand is exposed, and Titan V leaves every other Titan in the dust.
The story is much the same using DX11 video (pixel) shaders. Though it’s very interesting that Titan V isn’t showing the same kind of sizable performance gains here as it did using a compute shader.
A bit more real-world is Sandra’s financial analysis benchmark, which performs a series of operations including the ever-important Monte Carlo simulation. Titan V is once again well in the lead here, punching above its weight relative to the Titan Xp. FP64 performance is also much stronger than the other Titans, though on a relative basis, the performance drop-off for moving to FP64 is much more than the 50% performance hit that we’d expect on paper.