The Turing Architecture: Volta in Spirit

Diving straight into the microarchitecture, the new Turing SM looks very different to the Pascal SM, but those who’ve been keeping track of Volta will notice a lot of similarities to NVIDIA’s more recent microarchitecture. In fact, at a high level the Turing SM is fundamentally the same, with the notable exception of a new IP block: the RT Core. Putting the RT Cores and Tensor Cores aside for now, the most drastic changes from Pascal are the same ones that differentiated Volta from Pascal. Turing’s advanced shading features are also in the same bucket in needing explicit developer support.

Like Volta, the Turing SM is partitioned into 4 sub-cores (or processing blocks), with each sub-core having a single warp scheduler and dispatch unit, as opposed to Pascal’s 2-partition setup with two dispatch ports per sub-core warp scheduler. There are some fairly major implications to this change; broadly speaking, it means that Volta/Turing loses the capability to issue a second, non-dependent instruction from a thread in a single clock cycle. Turing is presumably identical to Volta in executing instructions over two cycles but with schedulers that can issue an independent instruction every cycle, so Turing can ultimately maintain 2-way instruction level parallelism (ILP) this way, while still having twice the number of schedulers of Pascal.
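
To make the ILP point concrete, here is a minimal CUDA sketch of our own (a hypothetical kernel, not an NVIDIA example) with two independent FMA dependency chains; because neither chain waits on the other, a scheduler that issues one instruction per cycle always has ready work even though each FP32 instruction spends two cycles in execution.

```cuda
// Hypothetical kernel illustrating 2-way ILP: two independent accumulator
// chains give the scheduler a ready instruction to issue every cycle even
// though each FP32 instruction takes two cycles to execute.
__global__ void ilp_demo(const float* __restrict__ x, const float* __restrict__ y,
                         float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc0 = 0.0f, acc1 = 0.0f;              // independent dependency chains
    #pragma unroll
    for (int k = 0; k < 8; ++k) {
        acc0 = fmaf(x[i], 1.5f + k, acc0);       // chain 0: depends only on acc0
        acc1 = fmaf(y[i], 2.5f + k, acc1);       // chain 1: depends only on acc1
    }
    out[i] = acc0 + acc1;                        // chains combine once at the end
}
```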

Like we saw in Volta, these changes go hand-in-hand with the new scheduling/execution model based on independent thread scheduling, which Turing also has, though any differences were not disclosed at this time. Rather than per-warp as in Pascal, Volta and Turing have per-thread scheduling resources, with a program counter and stack per thread to track thread state, as well as a convergence optimizer to intelligently group active same-warp threads together into SIMT units. So all threads are equally concurrent, regardless of warp, and can yield and reconverge.
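
As a rough illustration of what per-thread scheduling enables (again a sketch of our own, with a hypothetical kernel), threads that have diverged into different branches can be explicitly reconverged and then exchange data, since each thread carries its own program counter and stack:

```cuda
// Hypothetical kernel: lanes diverge, then explicitly reconverge with
// __syncwarp() before a warp-wide shuffle exchange. Under the per-thread
// scheduling model, threads from both branches may interleave execution
// rather than one branch running to completion before the other starts.
__global__ void divergence_demo(int* data)
{
    int lane  = threadIdx.x % 32;
    int value = data[threadIdx.x];

    if (lane % 2 == 0) {
        value *= 2;          // even lanes take this path
    } else {
        value += 1;          // odd lanes take this path
    }

    __syncwarp();            // name the reconvergence point explicitly
    int neighbour = __shfl_xor_sync(0xffffffffu, value, 1);  // swap with adjacent lane
    data[threadIdx.x] = value + neighbour;
}
```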

In terms of the CUDA cores and ALUs, the Turing sub-core has 16 INT32 cores, 16 FP32 cores, and 2 Tensor Cores, the same setup as the Volta sub-core. With the split INT/FP datapath model like Volta, Turing can also concurrently execute FP and INT instructions, which as we will see, is much more relevant with the RT cores involved. Where Turing differs is in lacking Volta’s full complement of FP64 cores, instead having a token amount (2 per SM) for compatibility reasons and resulting in FP64 throughput being 1/32 the TFLOP rate of FP32. Maimed FP64 is standard for NVIDIA’s consumer GPUs, but what has not been standard until now is Turing’s full 2x FP16 throughput, which was available in GP100 but was crippled in the other Pascal GPUs.
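
For reference, full-rate FP16 is normally exploited by packing two half-precision values into a single register so that one instruction operates on a pair of values; the sketch below (a hypothetical kernel using CUDA's standard cuda_fp16.h intrinsics) shows the idea, with the integer addressing math landing on the separate INT32 pipe.

```cuda
#include <cuda_fp16.h>

// Hypothetical packed-FP16 AXPY: __hfma2 performs a fused multiply-add on two
// half values at once, which is where the "2x FP32 rate" FP16 figure comes from.
__global__ void fp16x2_axpy(const __half2* __restrict__ x, __half2* __restrict__ y,
                            __half2 alpha, int num_pairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // integer addressing: INT32 pipe
    if (i < num_pairs) {
        y[i] = __hfma2(alpha, x[i], y[i]);           // packed FP16 math: FP16 pipe
    }
}
```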

While these details may be more on the technical side of things, in Volta this design seemed inextricably linked to extracting the maximum performance from the tensor cores while minimizing disruption to the parallelism or coordination of other compute workloads. The same is most likely true with Turing’s 2nd generation tensor cores and RT cores, where 4 independently scheduled sub-cores and granular thread manipulation would be very useful in extracting the most performance out of mixed gaming-oriented workloads, where rendering a single frame pulls in multiple blocks of the GPU to work in conjunction. This is actually a concept that underpins the RTX-OPS metric, and we will revisit it in depth later.
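
For context on how a sub-core's pair of tensor cores is typically fed, the standard route is CUDA's warp-level WMMA API, where a whole warp cooperates on a small matrix multiply-accumulate; the minimal sketch below (our own, with assumed pointers and a fixed 16x16x16 tile) shows the shape of that interface.

```cuda
#include <mma.h>
using namespace nvcuda;

// Minimal WMMA sketch: one warp computes a 16x16x16 matrix multiply-accumulate
// with FP16 inputs and FP32 accumulation. Pointers and leading dimensions are
// assumed for illustration only.
__global__ void wmma_tile(const __half* A, const __half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                 // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);             // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // the tensor core operation
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```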

Memory-wise, every sub-core now has an L0 instruction cache like Volta, along with an identically sized 64 KB register file. In Volta, this was important in reducing latency when the tensor cores were in play, and in Turing this likely benefits the RT cores similarly, which we will discuss in a later section. Otherwise, the Turing SM also has 4 load/store units per sub-core, down from 8 in Volta, but still maintains 4 texture units.

Further up the memory hierarchy is the new L1 data cache and Shared Memory (SMEM), which has been revamped and unified into a single partitionable memory block, another Volta innovation. For Turing, this looks to be a combined 96 KB of L1/SMEM, which traditional graphics workloads divide as 64 KB of dedicated graphics shader RAM and 32 KB of texture cache and register file spill area. Meanwhile, compute workloads can partition the L1/SMEM with up to 64 KB as L1 and the remaining 32 KB as SMEM, or vice versa. For Volta, SMEM can be configured up to 96 KB.
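
On the software side, compute code can hint at how that unified block should be split. Below is a small host-side sketch (hypothetical kernel name, and the driver may round to a supported split) using the shared memory carveout attribute that CUDA exposes for Volta and later:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel, for illustration only.
__global__ void my_kernel(float* data) { data[threadIdx.x] *= 2.0f; }

// Hypothetical host-side hint: prefer roughly two thirds of the unified
// L1/SMEM block as shared memory (e.g. 64 KB SMEM / 32 KB L1 on Turing's
// 96 KB array), leaving the remainder as L1 cache.
void configure_carveout()
{
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 66);
}
```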

Though many of these details are only of value to developers, there are several important points to make here. One is simply how similar Turing and Volta are, as opposed to Pascal; after all, they are in the same generational compute family. Another is how compute-oriented Volta – and by extension, Turing – are, and the fact that this is being brought to consumers as part of NVIDIA’s proclaimed ‘future of gaming.’ Part of that is, of course, permitting fast FP16 in potential gaming workloads, but Turing goes far beyond that. At the low level, Turing is less about maximizing traditional gaming performance, and more about maximizing gaming with special technologies such as real-time raytracing.

For their part, NVIDIA points to Turing’s leap in performance over Pascal, from memory hierarchy bandwidth uplifts to 50% more shader performance per core, but unfortunately for today we can’t connect this with any real world data or performance. On concurrent FP/INT execution in gaming, the company is keen to point out that around 36 INT instructions per 100 FP instructions could be freed up by moving them to their own pipe, which nevertheless doesn’t describe Turing performance, only the applicability of its concurrent execution feature in games.
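
To give a sense of what that concurrent-execution claim refers to (our own sketch, not NVIDIA's example), a typical shader-style inner loop mixes integer addressing work with floating-point math; on Volta/Turing the two can issue down separate pipes rather than competing for a single shared datapath:

```cuda
// Hypothetical filtering loop: the index/offset arithmetic is INT32 work that
// can overlap with the FP32 fused multiply-adds, instead of serialising with
// them as on architectures with one shared math datapath.
__global__ void mixed_int_fp(const float* __restrict__ src, float* __restrict__ dst,
                             int width, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    for (int tap = 0; tap < 4 && i + tap < n; ++tap) {
        int row    = (i + tap) / width;          // INT32 pipe: addressing math
        int col    = (i + tap) % width;          // INT32 pipe
        int offset = row * width + col;          // INT32 pipe
        sum = fmaf(src[offset], 0.25f, sum);     // FP32 pipe: the actual filtering
    }
    dst[i] = sum;
}
```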

It becomes a bit of a complex scenario, as we know that Volta already improved on Pascal in these aspects with concurrent execution, a brand new ISA, and a reworked SM. And Turing doesn’t appear to bring architectural changes aimed at significant clockspeed enhancements a la Pascal over Maxwell, though of course on the process side the 12nm FFN is a factor. So it comes down to special gaming workloads and real-world performance. The latter is not available today, but the former is so important to Turing that it merited dropping ‘GTX’ for ‘RTX’. And of those special workloads, real-time raytracing and RT cores take center stage.

Comments

  • BurntMyBacon - Monday, September 17, 2018 - link

    Good article. It would have been nice to get more information as to exactly what nVidia is doing with the RT cores to optimize ray tracing, but I can understand why they would want to keep that a secret at this point. One oversight in an otherwise excellent article:

    @Nate Oh (article): "The net result is that with nearly every generation, the amount of memory bandwidth available per FLOP, per texture lookup, and per pixel blend has continued to drop. ... Turing, in turn, is a bit of an interesting swerve in this pattern thanks to its heavy focus on ray tracing and neural network inferencing. If we're looking at memory bandwidth merely per CUDA core FLOP, then bandwidth per FLOP has actually gone up, since RTX 2080 doesn't deliver a significant increase in (on-paper) CUDA core throughput relative to GTX 1080."

    The trend has certainly been downward, but I was curious as to why the GTX 780 wasn't listed. When I checked it out, I found that it is another "swerve" in the pattern similar to the RTX2080. The specifications for the NVIDIA Memory Bandwidth per FLOP (In Bits) chart are:
    GTX 780 - 0.58 bits | 3.977 TFLOPS | 288GB/sec

    This is easily found information and its omission is pretty noticeable (at least to me), so I assume it got overlooked (easy to do in an article this large). While it doesn't match your initial always downward observation, it also clearly doesn't change the trend. It just means the trend is not strictly monotonic.
  • nboelter - Tuesday, September 18, 2018 - link

    I had to solve the problem of “random memory accesses from the graphics card memory are the main bottleneck for the performance of the molecular dynamics simulation” when I did some physics on CUDA, and got great results with Hilbert space-filling curves (there is a fabulous german paper from 1891 about this newfangled technology) to - essentially - construct BVHs. Only difference really is that I had grains of sand instead of photons. Now I really wonder if these RT cores could be used for physics simulations!
  • webdoctors - Tuesday, September 18, 2018 - link

    This will likely get lost in the 100 comments, but this is really huge and getting ignored by the pricing.

    I've often wondered and complained for years to my friends why we keep going to higher resolutions from 720p to 4K rather than actually improving the graphics. Look at a movie on DVD from 20 yrs ago at 480p resolution, and the graphics are so much more REALISTIC than the 4K stuff you see in games today because it's either real ppl on film or, if CG, raytraced offline with full lighting. Imagine getting REAL TIME renders that look like real life video; that's a huge breakthrough. Sure, we've had raytracing for decades, but never real time on non-datacenter-size clusters.

    Rasterized 4K or 8K content will never look as REAL as 1080p raytraced content. It might look nicer, but it won't look REAL. It's great we'll have hardware where we can choose whether we want to use the fake rasterization cartoony path or the REAL path.

    A 2080TI that costs $1200 will be $120 in 10 years, but it won't change the fact that now you're getting REAL vs fake. 2 years ago, you didn't have the option; you couldn't say I'll pay you $5k to give me the ray traced option in the game. Now we'll get (hopefully) developer support and see this go mainstream. Probably can use AWS to gamestream this instead of buying a video card and then get the raytrace now too.

    If you're happy with non-ray tracing, just buy a 1070 and stick to playing games in 1080p. You'll never be perf limited for any games and move on.
  • eddman - Wednesday, September 19, 2018 - link

    You are not getting REAL with 20 series, not even close.
  • MadManMark - Wednesday, September 19, 2018 - link

    His point is that we are getting CLOSER to "real," not that it is CLOSE or IS real. Would have thought that was obvious, but guess it isn't to everyone.
  • eddman - Thursday, September 20, 2018 - link

    It seems you are the one who misread. From his comment: "it won't change the fact that now you're getting REAL vs fake"

    So, yes, he does think with 20 series you get the REAL thing.
  • sudz - Wednesday, September 19, 2018 - link

    "as opposed Pascal’s 2 partition setup with two dispatch ports per sub-core warp scheduler."

    So in conclusion: RTX has more warp cores.

    Engage!
  • ajp_anton - Friday, September 21, 2018 - link

    This comment is a bit late, but your math for memory efficiency is wrong.

    If bandwidth+compression gives a 50% increase, and bandwidth alone is a 27% increase, you can't just subtract them to get the compression increase. In this example, the compression increase is 1.5/1.27 ≈ 1.18, or 18%. Not the 23% that you get by subtracting.

    This also means you have to re-write the text where you think it's weird how this is higher than the last generation increase, because it no longer is higher.
  • Overmind - Thursday, September 27, 2018 - link

    There are many inconsistencies in the article.
  • Overmind - Thursday, September 27, 2018 - link

    If the 102 with 12 complete functional modules has 72 RTCs (RTX-ops), how can the 2080 Ti with 11 functional modules have 78 RTCs? The correct value is clearly 68.
