Compute Performance

Diving in, I want to start with a look at compute performance. Titan V is first and foremost a compute product, so what makes or breaks the card – what justifies its existence – is how well it does at compute.

This is actually a more challenging aspect to test than it may first appear, as NVIDIA’s top-down architecture launch strategy means that there’s not a lot of commercial or otherwise widely-used software written that can take full advantage of the Volta architecture. Even in the major deep learning frameworks, their support for Volta is still in the early phases or still under active development. So I’ve opted to excise the deep learning tests for the moment, while we get a better handle on those for our full review. Meanwhile software that can take advantage of Volta’s fast FP16 and FP64 modes is also quite rare, especially as fast FP16 is not a consumer NVIDIA feature at this time.

Compute overall is harder to test, as unlike games or even professional visualization tasks, there’s far fewer standard workloads. Even tasks using major compute frameworks are often heavily tailored towards a specific need. So while our compute benchmarks are meant to be enlightening, software developers and compute users will want to carefully study how their workloads map to GPU hardware.

Anyhow, before we get too deep here, I want to take a quick look at FMA latency. Over successive generations of GPU architectures NVIDIA has continued to find ways to reduce the latency of what’s perhaps the most critical instruction in the GPU space, and Volta continues this trend.

Synthetic: Beyond3D Suite - Estimated FMA Latency

With Volta NVIDIA has been able to knock another two clock cycles off of the latency for an FMA, reducing it from 6 cycles on Pascal/Maxwell to 4 cycles. This is thanks to NVIDIA’s separating the INT32 and FP32 units in the Volta architecture, which allows more of the overall FMA operation to be performed in parallel. The end result is that for any instructions that depend on the result of an FMA operation, they can kick off 33% sooner, improving overall compute/shader efficiency.

Interestingly this was one area of compute where NVIDIA was trailing AMD – who has an average latency of around 4.5 cycles – so it’s great to see NVIDIA close the gap here. And this shouldn’t be considered a trivial change, as reducing the number of clock cycles for any fundamental operation is quite difficult. Case in point, I’m not sure just how NVIDIA could get FMAs any lower than 4 cycles.

General Matrix Multiply (GEMM) Performance

Starting things off, let’s take a look at General Matrix Multiply (GEMM) performance. GEMM measures the performance of dense matrix multiplication, which is an important operation in a variety of scientific fields. GEMM is highly parallel and typically compute heavy, and one of the first tests of performance and efficiency on any parallel architecture geared towards HPC workloads.

Since matrix multiplication is such a fundamental operation, it’s one of the operations included in NVIIDA’s CUBLAS library, meaning NVIDIA offers a highly optimized version of the function. Importantly, it also means that NVIDIA offers Volta-optimized implementations of the function, including operations that run on the densest matrix multiplication core of them all, the tensor core. So GEMM provides us a fantastic opportunity to look at the performance of not only the usual FP16/FP32/FP64 operations, but also the FP16 tensor cores themselves. This is very much an ideal case, but it gives us the means to see what kind of maximum throughput Titan V can reach.

Compute: General Matrix Multiply Performance (GEMM)

Burying the lede here, let’s start with single precision (FP32) performance. As mentioned briefly in our look at the specifications of the Titan V, the Titan V is not immensely more powerful than the Titan Xp when it comes to standard FP32 operations. It has a much greater number of FP32 CUDA cores, but the compute-centric chip trades that off with lower average clockspeeds, which dulls the difference some.

None the less, the Titan V is 23% faster in FP32 GEMM, which is actually a greater improvement than the 14% increase that we’d expect going by the specifications. In fact at 13.4 TFLOPS, the Titan V is at 97% of its rated 13.8 TFLOPS FP32 performance, an amazingly efficient metric. The Titan Xp isn’t as efficient here – hitting 10.9 of 12.1 TFLOPs for a 90% rating – underscoring the important architectural improvements Volta brings. I am not 100% sure on just which factors lead to this improvement, but the separation of INT32 and FP32 cores along with the scheduling changes are likely candidates. Titan V also benefits from improvements in caching and more memory bandwidth, which further reduces any potential bottlenecks in feeding the beast.

As for double precision (FP64) and half precision (FP16) performance, these are in line with our expectations. The official FP64 and FP16 rates for Titan V are 1:2 and 2:1 FP32 respectively, and this is what we find on GEMM. Compared to the Titan Xp, which didn’t offer fast FP64 or FP16 modes, the difference is night and day: Titan V is 16x and 121x faster respectively.

Last but not least, let’s talk tensors. I almost removed the tensor core results from this graph, but ultimately I left them in to highlight just what the tensor cores mean for compute developers who have tasks well-suited for them. When configured to use the tensor cores, our GEMM benchmark runs hit 97.5 TFLOPs, 3.7x faster than Titan V’s already fast FP16 performance. Or to put this another way, this one card is doing 97 trillion floating point operations in a single second.

Relative to the high efficiency marks we saw earlier with the CUDA cores, the tensor cores don’t come quite as close. 97.5 TFLOPs is about 89% of the Titan V’s rated 110 TFLOPs tensor core performance. Besides the fact that the tensor cores are extremely dense – they’ll give your power and cooling a good run for their money – I suspect that Titan V is feeling the pinch of memory bandwidth. The loss of a stack of HBM2 means that it’s trying to perform trillions of operations on only billions of bytes of memory bandwidth.

Finally, I can’t reiterate enough that when looking at tensor core performance, context is king here. The tensor cores are very fast, but they are highly specialized. GPUs in general are only good for a subset of overall computing tasks, and the tensor cores take that one step further. In super dense matrix multiplication scenarios they excel, but they have little to offer outside of that. Thankfully for NVIDIA and its newly found deep learning empire, neural networks map very well to tensor operations.

SiSoftware Sandra

Moving on from our low-level GEMM benchmark, let’s move up a level to something a bit more practical with some of SiSoftware’s Sandra’s GPU compute benchmarks. Unfortunately for reasons I haven’t yet been able to determine, Sandra’s CUDA benchmarks aren’t working with NVIDIA’s latest drivers, which limits us to working with OpenCL and DirectX compute shaders. Still, this gives us a chance a look at performance outside of specialized CUDA applications.

Compute: SiSoft Sandra 2017 - GP Processing (Compute Shader)

Relatively speaking, Titan V once again punches above its weight. In FP32 mode Sandra’s compute shader benchmark clocks the card as processing 25 Gigapixels per second, versus 18.4 for the Titan Xp, a 37% difference and once again exceeding the on-paper differences between these products. I suggest caution in reading too much into these synthetic compute results, as real-world workloads aren’t nearly as neat and tidy. But it continues to point to the Volta architecture having improved things by quite a bit under the hood.

Meanwhile, because NVIDA doesn’t expose fast FP16 mode outside of CUDA, the Titan V’s FP16 performance doesn’t get used to its fullest extent here. Fast FP64 on the other hand is exposed, and Titan V leaves every other Titan in the dust.

Compute: SiSoftware Sandra 2017 - Video Shader Compute (DX11)

The story is much the same using DX11 video (pixel) shaders. Though it’s very interesting that Titan V isn’t showing the same kind of sizable performance gains here as it did using a compute shader.

Compute: SiSoft Sandra 2017 - GP Financial Analysis (OpenCL)

A bit more real-world is Sandra’s financial analysis benchmark, which performs a series of operations including the ever-important Monte Carlo simulation. Titan V is once again well in the lead here, punching above its weight relative to the Titan Xp. FP64 performance is also much stronger than the other Titans, though on a relative basis, the performance drop-off for moving to FP64 is much more than the 50% performance hit that we’d expect on paper.

Meet Titan V, A Note on Graphics, & the Test Compute Performance: Geekbench 4, Folding @ Home, & CompuBench
Comments Locked


View All Comments

  • mode_13h - Wednesday, December 27, 2017 - link

    I don't know if you've heard of OpenCL, but there's not reason why a GPU needs to be programmed in a proprietary language.

    It's true that OpenCL has some minor issues with performance portability, but the main problem is Nvidia's stubborn refusal to support anything past version 1.2.

    Anyway, lots of businesses know about vendor lock-in and would rather avoid it, so it sounds like you have some growing up to do if you don't understand that.
  • CiccioB - Monday, January 1, 2018 - link

    Grow up.
    I repeat. None is wasting millions in using not certified, supported libraries. Let's avoid talking about entire frameworks.
    If you think that researches with budgets of millions are nerds working in a garage with avoiding lock-in strategies as their first thought in the morning, well, grow up kid.
    Nvidia provides the resources to allow them to exploit their expensive HW at the most of its potential reducing time and other associated costs. Also when upgrading the HW with a better one. That's what counts when investing millions for a job.
    For you kid's home made AI joke, you can use whatever alpha library with zero support and certification. Others have already grown up.
  • mode_13h - Friday, January 5, 2018 - link

    No kid here. I've shipped deep-learning based products to paying customers for a major corporation.

    I've no doubt you're some sort of Nvidia shill. Employee? Maybe you bought a bunch of their stock? Certainly sounds like you've drunk their kool aid.

    Your line of reasoning reminds me of how people used to say businesses would never adopt Linux. Now, it overwhelmingly dominates cloud, embedded, and underpins the Android OS running on most of the world's handsets. Not to mention it's what most "researchers with budgets of millions" use.
  • tuxRoller - Wednesday, December 20, 2017 - link

    "The integer units have now graduated their own set of dedicates cores within the GPU design, meaning that they can be used alongside the FP32 cores much more freely."

    Yay! Nvidia caught up to gcn 1.0!
    Seriously, this goes to show how good the gcn arch was. It was probably too ambitious for its time as those old gpus have aged really well it took a long time for games to catch up.
  • CiccioB - Thursday, December 21, 2017 - link

    <blockquote>Nvidia caught up to gcn 1.0!</blockquote>
    Yeah! It is known to the entire universe that it is nvidia that trails AMD performances.
    Luckly they managed to get this Volta out in time before the bankruptcy.
  • tuxRoller - Wednesday, December 27, 2017 - link

    I'm speaking about architecture not performance.
  • CiccioB - Monday, January 1, 2018 - link

    New bigger costier architectures with lower performance = fail
  • tuxRoller - Monday, January 1, 2018 - link

    Ah, troll.
  • CiccioB - Wednesday, December 20, 2017 - link

    Useless card
    Vega = #poorvolta
  • StrangerGuy - Thursday, December 21, 2017 - link

    AMD can pay me half their marketing budget and I will still do better than doing exactly nothing. Their marketing is worse than being in a state of non-existence.

Log in

Don't have an account? Sign up now