TU116: When Turing Is Turing… And When It Isn’t

Diving a bit deeper into matters, we have the new TU116 GPU at the heart of the GTX 1660 Ti. While NVIDIA does not announce future products in advance, you can expect that this will be the first of at least a couple of GPUs in what’s now the TU11x family, as NVIDIA is going to want to follow the same streamlined strategy for the eventual successors to GP107 and possible GP108.

TU116 is an interesting piece of kit, both because of the decisions that lead to this point and because of both the drawbacks and advantages of excising some of Turing’s functionality. As mentioned earlier in this article, NVIDIA has made a very deliberate decision to cut out their RTX functionality – the ray tracing cores and tensor cores – in order to produce a GPU that’s better suited for traditional rendering. The end result is a smaller, cheaper to produce GPU. But it also means that NVIDIA has to change how they go about promoting cards based on this GPU.

With a die size of 284mm2, TU116 tells a story in and of itself. This makes it 40% smaller than the next-smallest Turing GPU, TU106. Similarly, the transistor count has come down from 10.8 billion to 6.6 billion. This greatly improves the manufacturability of the GPU and drives down its costs, especially since NVIDIA will be going into more competitive markets with it than the other TU10x GPUs. Still, TU116 is some 42% bigger than the 200mm2 GP106 die that it replaces, so even though it’s more efficient, NVIDIA is still dealing with a significant increase in die size on a generation-by-generation basis.

Unfortunately, TU116 doesn’t give us a terribly good baseline for determining how much of a TU10x SM was composed of RTX hardware. TU116 doesn’t just drop the RTX hardware in its SMs, but it’s a smaller design overall; fewer SMs, fewer memory channels, and fewer ROPs. So we can’t fully separate the savings of dropping RTX from the savings of making a lighter GPU in general. However it’s interesting to note that on a relative basis, the transistor count difference between TU116 and TU106 is almost exactly the same as GP106 and GP104: there are 39% fewer transistors when stepping down. So later on it will give us an opportunity to look at performance and see if the performance gap between the GTX 1660 Ti and RTX 2070 – the full-fat cards of their respective GPUs – is anything like the sizable gap between the GTX 1060 6GB and the GTX 1080.

NVIDIA Turing GPU Comparison
  TU102 TU104 TU106 TU116
CUDA Cores 4608 3072 2304 1536
SMs 72 48 36 24
Texture Units 288 192 144 96
RT Cores 72 48 36 N/A
Tensor Cores 576 384 288 N/A
ROPs 96 64 64 48
Memory Bus Width 384-bit 256-bit 256-bit 192-bit
L2 Cache 6MB 4MB 4MB 1.5MB
Register File (Total) 18MB 12MB 9MB 6MB
Architecture Turing Turing Turing Turing
Manufacturing Process TSMC 12nm "FFN" TSMC 12nm "FFN" TSMC 12nm "FFN" TSMC 12nm "FFN"
Die Size 754mm2 545mm2 445mm2 284mm2

But getting back to architecture, this launch is one of a handful of times we’ve seen NVIDIA use dissimilar GPUs in their consumer cards, and it’s a situation without a good parallel. NVIDIA had done plenty of non-homogenous families in the past, but typically the black sheep of the family is the high-end server GPU, e.g. GP100, where it gets additional features not found in the consumer lineup. Instead the Turing family ends up having a split right down the middle.

The good news for consumers is that, outside of RTX functionality, TU116 and its ilk – which for the sake of simplicity I’m going to call Turing Minor from here on out – is functionally equivalent to TU102/TU104/TU106 (Turing Major). Turing Minor has the exact same DirectX feature set, the exact same core compute architecture (right on down to cache sizes), the exact same video and display blocks, etc. The RT and tensor cores really are the only thing that’s changed.

The situation looks much the same for programmers & developers as well: on the current press drivers the GTX 1660 TI reports itself as a Compute Capability 7.5 card – the same CC version as all of the Turing Major cards – so developers won’t have to even compile separate code for Turing Minor cards. So long as their code can handle a lack of tensor cores, at least.

(As a brief aside, as a performance exercise we ran the tensor version of our HGEMM benchmark on the GTX 1660 Ti. And it completed?! Performance was a bit lower, at 10.8 TFLOPS versus 11 TFLOPS with tensors disabled, but it did complete. Which indicates that either NVIDIA has been less than forthcoming on TU116, or in order to keep all Turing parts on CC 7.5, they are sending tensor ops through the CUDA cores on Turing Minor cards)

Looking at the TU116 SM, what we find is something almost identical to the SM diagrams used for Turing Major, with the SM arranged into 4 partitions, each with their own warp schedule and set of CUDA cores, while all 4 partitions share the L1 cache and texture units. Cache sizes and register file sizes are all unchanged here, so average throughput and register pressure are similarly unchanged as well. The one standout is that in replacing the tensor cores in their diagram, NVIDIA has opted to draw in FP16 cores, which is a bit of a stretch given what we know about the Turing architecture. NVIDIA only sent out this diagram yesterday, so I’m still checking with them to see if this is the company taking a creative liberty to highlight Turing’s other functionality, or if there’s more to it that NVIDIA is downplaying to keep things simple (ala Kepler and GK104).

Update: NVIDIA has gotten back to me this morning. As it turns out, the FP16 cores in the diagram are quite literal. For more information, please see below.

The Curious Case of FP16: Tensor Cores vs. Dedicated Cores

Even though Turing-based video cards have been out for over 5 months now, every now and then I’m still learning something new about the architecture. And today is one of those days.

Something that escaped my attention with the original TU102 GPU and the RTX 2080 Ti was that for Turing, NVIDIA changed how standard FP16 operations were handled. Rather than processing it through their FP32 CUDA cores, as was the case for GP100 Pascal and GV100 Volta, NVIDIA instead started routing FP16 operations through their tensor cores.

The tensor cores are of course FP16 specialists, and while sending standard (non-tensor) FP16 operations through them is major overkill, it’s certainly a valid route to take with the architecture. In the case of the Turing architecture, this route offers a very specific perk: it means that NVIDIA can dual-issue FP16 operations with either FP32 operations or INT32 operations, essentially giving the warp scheduler a third option for keeping the SM partition busy. Note that this doesn’t really do anything extra for FP16 performance – it’s still 2x FP32 performance – but it gives NVIDIA some additional flexibility.

Of course, as we just discussed, the Turing Minor does away with the tensor cores in order to allow for a learner GPU. So what happens to FP16 operations? As it turns out, NVIDIA has introduced dedicated FP16 cores!

These FP16 cores are brand new to Turing Minor, and have not appeared in any past NVIDIA GPU architecture. Their purpose is functionally the same as running FP16 operations through the tensor cores on Turing Major: to allow NVIDIA to dual-issue FP16 operations alongside FP32 or INT32 operations within each SM partition. And because they are just FP16 cores, they are quite small. NVIDIA isn’t giving specifics, but going by throughput alone they should be a fraction of the size of the tensor cores they replace.

To users and developers this shouldn’t make a difference – CUDA and other APIs abstract this and FP16 operations are simply executed wherever the GPU architecture intends for them to go – so this is all very transparent. But it’s a neat insight into how NVIDiA has optimized Turing Minor for die size while retaining the basic execution flow of the architecture.

Now the bigger question in my mind: why is it so important to NVIDIA to be able to dual-issue FP32 and FP16 operations, such that they’re willing to dedicate die space to fixed FP16 cores? Are they expecting these operations to be frequently used together within a thread? Or is it just a matter of execution ports and routing? But that is a question we’ll have to save for another day.

Turing Minor: Turing Sans RTX

For better or worse, the launch of GTX 1660 Ti and Turing Minor means that NVIDIA has needed to adjust how they go about promoting the new cards and the Turing architecture. While Turing launched with a laundry list of features, most of which had nothing to do with RTX, the broader consumer zeitgeist definitely focused on RTX and for good reason: compared to all of the low-level architectural changes under the hood, ray tracing, DLSS, and other RTX features are a lot more visible, and for NVIDIA they were easier to promote. This means that for Turing Minor NVIDIA instead has to focus on the low-level architectural improvements in Turing, which I think is great since these were largely overlooked at the Turing launch.

While I won’t recap our entire Turing deep dive here, relative to Pascal The big difference here is the numerous steps NVIDIA has taken to improve their IPC and overall efficiency. For example, Turing made the surprising move to ditch regular forms of Instruction Level Parallelism (ILP) by dropping the second warp scheduler dispatch port. Instead, each warp scheduler fires off a single set of instructions on each clock, taking advantage of the fact that it takes 2 (or more) clocks to issue a full warp in order to interleave a second instruction in.

This ILP change goes hand-in-hand with partitioning the SM into 4 blocks instead of 2, which serves to help better control resource contention among the warps and CUDA cores. In fact at a high level, a Turing SM looks a lot more like some of NVIDIA’s server-focused GPUs than their consumer-focused GPUs; there’s a lot more plumbing here in various forms to support the CUDA cores and to help them achieve better performance, rather than just throwing more CUDA cores at the problem. The net result is that while we don’t have metrics from NVIDIA, I fully expect that the ratio of supporting hardware and glue logic to CUDA cores is significantly higher on Turing than it was GP106 Pascal. Though by the same token, I expect the SMs as a whole are larger than Pascal’s as well, which is certainly reflected in the die size.

A big part of this change, in turn, is the fact that NVIDIA broke out their Integer cores into their own block. Previously a separate branch of the FP32 CUDA cores, the INT32 cores can now be addressed separately from the FP32 cores, which combined with instruction interleaving allows NVIDIA to keep both occupied at the same time. Now make no mistake: floating point math is still the heart and soul of shading and GPU compute, however integer performance has been slowly increasing in importance over time as well, especially as shaders get more complex and there’s increased usage of address generation and other INT32 functions. This change is a big part of the IPC gains NVIDIA is claiming for Turing architecture.

Speaking of CUDA cores, like all other Turing parts, TU116 and Turing Minor get NVIDIA’s fast FP16 functionality. This means that these GPUs can process FP16 operations at twice the rate of FP32 operations – via the GPU’s dedicated FP16 cores – which for GTX 1660 Ti works out to 11 TFLOPS of performance. Using FP16 shaders in PC games is still relatively new – the baseline 8th gen consoles don’t support it and NVIDIA previously limited this feature to server parts – but it’s more widely used in mobile games where FP16 support is common. There, as it will be in the PC space, FP16 shaders allow for developers to trade off between performance and shader precision by using a lower precision format; not all shader programs require a full FP32’s worth of precision, and when done right it can improve performance and reduce memory bandwidth needs without any real image quality impact.

Meanwhile, looking at the rest of the GPU, the memory and cache system is a bit of a grab bag. On the one hand, Turing implements NVIDIA’s latest lossless memory compression technology. This has proven to be one of NVIDIA’s bigger advantages over AMD, and continues to allow them to get away with less memory bandwidth than we’d otherwise expect some of their GPUs to need. The actual savings vary from game to game, but for the GTX 2080 Ti launch, NVIDIA reported that they were seeing reductions in traffic between 18% and 33%


From the RTX 2080 Ti Launch

However, distinct to TU116 versus its Turing Major siblings, the latest GPU has a less L2 cache per ROP partition. Turing Major GPUs all have 512KB of L2 cache per partition, giving TU106 a total of 4MB of L2, for example. TU116 on the other hand has just 256KB of L2 per partition for a total of 1.5MB of L2, which happens to be the same amount of cache and cache ratios as on GP106. The performance impact of this is hard to measure given all of the other changes in the GPU, but clearly NVIDIA had traded off some die size at the cost of some increases in cache misses. The wildcard in all of this being how much the additional bandwidth of GDDR6 helps to offset those misses.

Finally on the graphics front, Turing Minor also retains Turing’s adaptive shading capabilities. Not unlike RTX, this is a new feature that is going to take some time to get adopted, so we’ve only seen a handful of games (such as Wolfenstein II) implement it thus far. But by reducing the pixel shader granularity/rate used at various points in a scene, the technology makes it possible to improve performance by reducing the overall shading workload.

The trick with adaptive shading – and why it’s a feature rather than an immediate and transparent means of improving performance – is that it’s making a very direct quality/speed tradeoff with pixel shaders; the reduced shading rate can reduce the overall image quality by reducing clarity and creating aliasing artifacts. So developers are still in their infancy playing with the technology to figure out where they can use it without noticeably hurting image quality. In practice I expect we’re going to see it more widely deployed in VR games at first, as the tech is much easier to use there (reduce the rate anywhere the user isn’t looking), as opposed to traditional games.

The end result of all of this is that while Turing Minor has some very important feature differences from Turing Major, at the end of the day it’s still Turing. NVIDIA for their part is going to have to grapple with the fact that not all of their current-generation cards feature RTX functionality, but that’s going to be marketing’s problem. As for consumers, unless you’re specifically seeking out NVIDIA’s ray tracing and tensor core functionality, GTX 1660 Ti is just another Turing.

The NVIDIA GeForce GTX 1660 Ti Review: Featuring EVGA Meet the EVGA GeForce GTX 1660 Ti XC Black
POST A COMMENT

157 Comments

View All Comments

  • Retycint - Tuesday, February 26, 2019 - link

    AMD selling overpriced cards does not subtract from the point that Nvidia is also attempting to raise the price as well. Both companies have put out underwhelming products this gen Reply
  • Rocket321 - Friday, February 22, 2019 - link

    "finally puts a Turing card in competition with their Pascal cards" should say Polaris. Reply
  • Ryan Smith - Friday, February 22, 2019 - link

    Boy I can't wait for Navi, since it sounds nothing like Turing...

    Thanks!
    Reply
  • Kogan - Friday, February 22, 2019 - link

    Aww, I was hoping this release would lower the price on those used 1070's. Oh well. I'll still probably go for a used 1070 over this one. Nearly identical in every way and can be found for as low as $200. Reply
  • Hamm Burger - Friday, February 22, 2019 - link

    Reading "Turing Sheds" in the headline makes me wonder what he could have done with a couple of these at Bletchley Park (which, for anybody passing, is well worth the steep entry fee — see bletchleypark.org.uk).

    Sorry for the interruption. I'll return you to the normal service.
    Reply
  • Colin1497 - Friday, February 22, 2019 - link

    "Now the bigger question in my mind: why is it so important to NVIDIA to be able to dual-issue FP32 and FP16 operations, such that they’re willing to dedicate die space to fixed FP16 cores? Are they expecting these operations to be frequently used together within a thread? Or is it just a matter of execution ports and routing?"

    It seems pretty likely that they added the FP16 cores because it simplified design, drivers, etc. It was easier to just drop in a few (as you mentioned) tiny FP16 cores than it was to change behavior of the architecture.
    Reply
  • CiccioB - Friday, February 22, 2019 - link

    FP16 is a way to simplify shading computing over the common used FP32.
    They allow for higher bandwidth (x2) and higher speed (x2, so half the energy for the same work) with the same HW space occupation. It was a feature used in HPC where bandwidth, power consumption and of course computation time are quite critical. They then ended in game class architecture just because they have find a way to exploit it there too.
    Some games have started using FP16 for their shading. On AMD fence, only Vega class cards support packed FP16 math.

    The use of a INT ALU that executes integer instructions together with the FP ones is instead an exclusive feature that can really improve shading performance much more than any other complex feature like high threaded (constantly interrupted) mechanism that is needed on architectures that cannot keep the ALUs feed.
    In fact we see that with less CUDA cores Turing can do the same work of Pascal even using less energy. And no magic ACE is present.
    Reply
  • Yojimbo - Friday, February 22, 2019 - link

    They didn't just drop in a few. It seems they have enough for 2x FP32 performance. Why are they dual issue? My guess is it is because that is what's necessary for Tensor Core operation. I think NVIDIA is being a bit secretive about the Tensor Cores. It's clear they took the RT Core circuitry out of the Turing minor die. As far as the Tensor Cores, I'm not so sure. Think about it this way: suppose Tensor Cores really are specialized separate cores. Then they also happen to have the capability of non tensor FP16 operation in dual issue with FP32 CUDA cores? Because if they don't then whatever functionality NVIDIA has planned for the FP16 cores on Turing minor would be incompatible with Turing major and Volta. I don't see how that can be the case, however, because, according to this review, Turing major is listed as the same CUDA compute generation as Turing minor. Now if the Tensor Cores can double as general purpose FP16 CUDA cores, then what's to say that FP16 and FP32 CUDA cores can't double as Tensor Cores? That is, if the Tensor Core can be made with two data flow paths, one following general purpose FP16 operations and one following Tensor Core instruction operations, then commutatively a general purpose CUDA core can be made with two data flow paths, one following general purpose operations and one following Tensor Core instruction operations.

    When Turing came out with Tensor Core operations but with FP64 cores cut from the die and no increase in FP32 CUDA cores per SM over Volta I was surprised. But with this new information from the Turing Minor launch it makes more sense to me. I don't know if they have the dedicated FP16 cores on Volta. If they do then the FP64 cores don't need to play the following role, but if they are able to use the FP64 cores as FP16 cores then hypothetically they have enough cores to account for the 64 FMA operations per clock per SM of the 8 Tensor Cores per SM. But on Turing major they just didn't have the cores to account for the Tensor Core performance. These FP16 cores on Turing minor seem to be exactly what would be necessary to make up for the shortfall. So, my guess is that Turing major also has these same cores. The difference is either entirely one of firmware/drivers that allows the Tensor Core data path to be operated on Turing major but not Turing minor or Turing major has some extra circuitry that allows the CUDA cores to be lashed together with an alternate data flow path that doesn't exist in Turing minor.
    Reply
  • GreenReaper - Friday, February 22, 2019 - link

    Agreed. It seems likely that most of the hardware is present, just not active.

    Frankly, it's not clear why these couldn't be binned versions of the higher-level chips that haven't met the QA requirements, which would be one reason it took this long to release - you need enough stock to be able to distribute it. If it's planned out in advance, they just need X good CUDA cores and Y ROPs that run at Z Mhz, combined with at least [n] MB of cache. Fuse off the bad or unwanted portions to save on power and you're good.

    Of course it *could* be like Intel, which truly make smaller derivatives. If so that suggests they'll be selling a lot of these cards. Even then, though, Yojimbo's supposition about the core design being essentially the same is likely to be true.
    Reply
  • Yojimbo - Saturday, February 23, 2019 - link

    Yeah the die size and transistor count is still large for the number of CUDA cores, being that this review claims the 1660Ti has all SMs on the TU116 enabled. I said it was clear they took RT circuitry out. But I was wrong, that's not clear. It seems the die area per CUDA core and transistors per CUDA core of the TU116 are extremely close to the TU106, which is fully-enabled in RTX 2070. If this is the result of the INT32 and FP16 cores of the TU116 then where exactly do any cost savings of removing the Tensor Cores and RT Cores come from? Definitely the cost of completely re-architecting another GPU would outweigh the slight reduction in die size they seem to have achieved.

    On the other hand, I'd imagine TU116 will be such a high volume part that unless yields are really lousy, binning alone won't provide enough chips (and where are the fully enabled versions of the 284 mm^2 RTX dies going, anywhere? No such product has thus far been announced.) Perhaps such a small number of RT cores was judged to be insufficient for RTX gaming. Even if not impossible to create some useful effects including that many RT cores, if developers were incentivized to target such few RT cores with their RTX efforts because the volume of such RT-enabled cards was significant then they may reduce the scope and scale of RTX enhancements they undertake, putting a drag on the adoption of the technology. So NVIDIA opted to disable the RT cores, and perhaps the Tensor Cores, present on the dies even when they are actually fully functioning. Perhaps it was simply cheaper to eat the wasted die space per chip than to design an entirely new GPU with the RT cores and Tensor Cores removed.
    Reply

Log in

Don't have an account? Sign up now