TU116: When Turing Is Turing… And When It Isn’t

Diving a bit deeper into matters, we have the new TU116 GPU at the heart of the GTX 1660 Ti. While NVIDIA does not announce future products in advance, you can expect that this will be the first of at least a couple of GPUs in what’s now the TU11x family, as NVIDIA is going to want to follow the same streamlined strategy for the eventual successors to GP107 and possibly GP108.

TU116 is an interesting piece of kit, both because of the decisions that led to this point and because of both the drawbacks and advantages of excising some of Turing’s functionality. As mentioned earlier in this article, NVIDIA has made a very deliberate decision to cut out their RTX functionality – the ray tracing cores and tensor cores – in order to produce a GPU that’s better suited for traditional rendering. The end result is a smaller, cheaper-to-produce GPU. But it also means that NVIDIA has to change how they go about promoting cards based on this GPU.

With a die size of 284mm², TU116 tells a story in and of itself. This makes it 36% smaller than the next-smallest Turing GPU, the 445mm² TU106. Similarly, the transistor count has come down from 10.8 billion to 6.6 billion. This greatly improves the manufacturability of the GPU and drives down its costs, especially since NVIDIA will be going into more competitive markets with it than with the TU10x GPUs. Still, TU116 is some 42% bigger than the 200mm² GP106 die that it replaces, so even though it’s more efficient, NVIDIA is still dealing with a significant increase in die size on a generation-by-generation basis.
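For anyone who wants to check the die size comparisons, here is a minimal sketch of the arithmetic. The three die areas are the ones quoted in this article; the helper function names are my own and purely illustrative.

```python
# Die areas in mm^2, as quoted in the text and the comparison table.
DIE_AREA_MM2 = {"TU106": 445, "TU116": 284, "GP106": 200}


def percent_smaller(small: float, large: float) -> float:
    """How much smaller `small` is than `large`, as a percentage of `large`."""
    return (1 - small / large) * 100


def percent_larger(large: float, small: float) -> float:
    """How much larger `large` is than `small`, as a percentage of `small`."""
    return (large / small - 1) * 100


# TU116 against the next Turing die up, and against the Pascal die it replaces.
vs_tu106 = percent_smaller(DIE_AREA_MM2["TU116"], DIE_AREA_MM2["TU106"])
vs_gp106 = percent_larger(DIE_AREA_MM2["TU116"], DIE_AREA_MM2["GP106"])

print(f"TU116 vs TU106: {vs_tu106:.0f}% smaller")
print(f"TU116 vs GP106: {vs_gp106:.0f}% larger")
```

Note that "smaller than" and "larger than" use different baselines, which is why the two helpers divide by different dies.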

Unfortunately, TU116 doesn’t give us a terribly good baseline for determining how much of a TU10x SM was composed of RTX hardware. TU116 doesn’t just drop the RTX hardware in its SMs; it’s a smaller design overall, with fewer SMs, fewer memory channels, and fewer ROPs. So we can’t fully separate the savings of dropping RTX from the savings of making a lighter GPU in general. However, it’s interesting to note that on a relative basis, the transistor count difference between TU116 and TU106 is almost exactly the same as between GP106 and GP104: there are 39% fewer transistors when stepping down. So later on it will give us an opportunity to look at performance and see if the performance gap between the GTX 1660 Ti and RTX 2070 – the full-fat cards of their respective GPUs – is anything like the sizable gap between the GTX 1060 6GB and the GTX 1080.
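The transistor step-down comparison can be sketched the same way. The Turing counts are the ones quoted above; the Pascal counts (7.2 billion for GP104, 4.4 billion for GP106) are NVIDIA's published figures and are not quoted in this section, so treat them as an outside input.

```python
# Transistor counts in billions.
# TU106/TU116 are quoted in the text; GP104/GP106 are NVIDIA's published
# figures and are an outside input, not something stated in this section.
TRANSISTORS_BN = {"TU106": 10.8, "TU116": 6.6, "GP104": 7.2, "GP106": 4.4}


def step_down_percent(smaller: str, larger: str) -> float:
    """Percentage of transistors lost when stepping from `larger` to `smaller`."""
    return (1 - TRANSISTORS_BN[smaller] / TRANSISTORS_BN[larger]) * 100


turing_step = step_down_percent("TU116", "TU106")
pascal_step = step_down_percent("GP106", "GP104")

print(f"TU106 -> TU116: {turing_step:.1f}% fewer transistors")
print(f"GP104 -> GP106: {pascal_step:.1f}% fewer transistors")
```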

NVIDIA Turing GPU Comparison
                        TU102             TU104             TU106             TU116
CUDA Cores              4608              3072              2304              1536
SMs                     72                48                36                24
Texture Units           288               192               144               96
RT Cores                72                48                36                N/A
Tensor Cores            576               384               288               N/A
ROPs                    96                64                64                48
Memory Bus Width        384-bit           256-bit           256-bit           192-bit
L2 Cache                6MB               4MB               4MB               1.5MB
Register File (Total)   18MB              12MB              9MB               6MB
Architecture            Turing            Turing            Turing            Turing
Manufacturing Process   TSMC 12nm "FFN"   TSMC 12nm "FFN"   TSMC 12nm "FFN"   TSMC 12nm "FFN"
Die Size                754mm²            545mm²            445mm²            284mm²

But getting back to architecture, this launch is one of a handful of times we’ve seen NVIDIA use dissimilar GPUs in their consumer cards, and it’s a situation without a good parallel. NVIDIA has done plenty of non-homogeneous families in the past, but typically the black sheep of the family is the high-end server GPU, e.g. GP100, where it gets additional features not found in the consumer lineup. Instead the Turing family ends up having a split right down the middle.

The good news for consumers is that, outside of RTX functionality, TU116 and its ilk – which for the sake of simplicity I’m going to call Turing Minor from here on out – are functionally equivalent to TU102/TU104/TU106 (Turing Major). Turing Minor has the exact same DirectX feature set, the exact same core compute architecture (right on down to SM-level cache sizes), the exact same video and display blocks, etc. The RT and tensor cores really are the only things that have changed.
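One quick way to see the shared SM design is to divide the per-GPU totals from the comparison table by the SM count. This is only a sketch of that arithmetic; the dictionary layout and function name are my own.

```python
# Unit totals per GPU, copied from the Turing comparison table.
GPUS = {
    "TU102": {"sms": 72, "cuda": 4608, "tex": 288, "regfile_mb": 18},
    "TU104": {"sms": 48, "cuda": 3072, "tex": 192, "regfile_mb": 12},
    "TU106": {"sms": 36, "cuda": 2304, "tex": 144, "regfile_mb": 9},
    "TU116": {"sms": 24, "cuda": 1536, "tex": 96, "regfile_mb": 6},
}


def per_sm(spec: dict) -> dict:
    """Reduce GPU-wide totals to per-SM figures."""
    sms = spec["sms"]
    return {
        "cuda_per_sm": spec["cuda"] // sms,
        "tex_per_sm": spec["tex"] // sms,
        "regfile_kb_per_sm": spec["regfile_mb"] * 1024 // sms,
    }


ratios = {name: per_sm(spec) for name, spec in GPUS.items()}
for name, r in ratios.items():
    print(name, r)
```

Every GPU in the table, TU116 included, reduces to the same per-SM figures, which is what you would expect if only the RT and tensor hardware differ inside the SM.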

The situation looks much the same for programmers & developers as well: on the current press drivers the GTX 1660 Ti reports itself as a Compute Capability 7.5 card – the same CC version as all of the Turing Major cards – so developers won’t even have to compile separate code for Turing Minor cards. So long as their code can handle a lack of tensor cores, at least.
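To illustrate why the shared CC version isn't the whole story, here is a hypothetical sketch of the kind of check an application might need. None of this is a real CUDA or driver API: the `DeviceInfo` record, the `has_tensor_cores` flag (which I'm assuming the application fills in itself, for example from a lookup by device name) and the path names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceInfo:
    """Hypothetical device record, not a real CUDA or driver structure."""
    name: str
    compute_capability: tuple  # (major, minor), as reported by the driver
    has_tensor_cores: bool     # assumed to come from the app's own lookup


def pick_fp16_gemm_path(dev: DeviceInfo) -> str:
    """Pick a half-precision GEMM path; the path names are made up."""
    if dev.compute_capability >= (7, 5) and dev.has_tensor_cores:
        return "tensor-core"
    # Same CC 7.5 target, but no tensor cores to aim at.
    return "cuda-core-fp16"


rtx_2070 = DeviceInfo("RTX 2070", (7, 5), has_tensor_cores=True)
gtx_1660_ti = DeviceInfo("GTX 1660 Ti", (7, 5), has_tensor_cores=False)

for dev in (rtx_2070, gtx_1660_ti):
    print(f"{dev.name}: CC {dev.compute_capability} -> {pick_fp16_gemm_path(dev)}")
```

The point of the sketch is simply that both cards report the same compute capability, so a check keyed on the CC version alone cannot tell them apart.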

(As a brief aside, as a performance exercise we ran the tensor version of our HGEMM benchmark on the GTX 1660 Ti. And it completed?! Performance was a bit lower, at 10.8 TFLOPS versus 11 TFLOPS with tensors disabled, but it did complete. Which indicates that either NVIDIA has been less than forthcoming on TU116, or in order to keep all Turing parts on CC 7.5, they are sending tensor ops through the CUDA cores on Turing Minor cards.)

Looking at the TU116 SM, what we find is something almost identical to the SM diagrams used for Turing Major, with the SM arranged into 4 partitions, each with its own warp scheduler and set of CUDA cores, while all 4 partitions share the L1 cache and texture units. Cache sizes and register file sizes are all unchanged here, so average throughput and register pressure are similarly unchanged as well. The one standout is that in replacing the tensor cores in their diagram, NVIDIA has opted to draw in FP16 cores, which is a bit of a stretch given what we know about the Turing architecture. NVIDIA only sent out this diagram yesterday, so I’m still checking with them to see if this is the company taking a creative liberty to highlight Turing’s other functionality, or if there’s more to it that NVIDIA is downplaying to keep things simple (à la Kepler and GK104).

Update: NVIDIA has gotten back to me this morning. As it turns out, the FP16 cores in the diagram are quite literal. For more information, please see below.

The Curious Case of FP16: Tensor Cores vs. Dedicated Cores

Even though Turing-based video cards have been out for over 5 months now, every now and then I’m still learning something new about the architecture. And today is one of those days.

Something that escaped my attention with the original TU102 GPU and the RTX 2080 Ti was that for Turing, NVIDIA changed how standard FP16 operations were handled. Rather than processing them through their FP32 CUDA cores, as was the case for GP100 Pascal and GV100 Volta, NVIDIA instead started routing FP16 operations through their tensor cores.

The tensor cores are of course FP16 specialists, and while sending standard (non-tensor) FP16 operations through them is major overkill, it’s certainly a valid route to take with the architecture. In the case of the Turing architecture, this route offers a very specific perk: it means that NVIDIA can dual-issue FP16 operations with either FP32 operations or INT32 operations, essentially giving the warp scheduler a third option for keeping the SM partition busy. Note that this doesn’t really do anything extra for FP16 performance – it’s still 2x FP32 performance – but it gives NVIDIA some additional flexibility.

Of course, as we just discussed, Turing Minor does away with the tensor cores in order to allow for a leaner GPU. So what happens to FP16 operations? As it turns out, NVIDIA has introduced dedicated FP16 cores!

These FP16 cores are brand new to Turing Minor, and have not appeared in any past NVIDIA GPU architecture. Their purpose is functionally the same as running FP16 operations through the tensor cores on Turing Major: to allow NVIDIA to dual-issue FP16 operations alongside FP32 or INT32 operations within each SM partition. And because they are just FP16 cores, they are quite small. NVIDIA isn’t giving specifics, but going by throughput alone they should be a fraction of the size of the tensor cores they replace.

To users and developers this shouldn’t make a difference – CUDA and other APIs abstract this and FP16 operations are simply executed wherever the GPU architecture intends for them to go – so this is all very transparent. But it’s a neat insight into how NVIDIA has optimized Turing Minor for die size while retaining the basic execution flow of the architecture.

Now the bigger question in my mind: why is it so important to NVIDIA to be able to dual-issue FP32 and FP16 operations, such that they’re willing to dedicate die space to fixed FP16 cores? Are they expecting these operations to be frequently used together within a thread? Or is it just a matter of execution ports and routing? But that is a question we’ll have to save for another day.

Turing Minor: Turing Sans RTX

For better or worse, the launch of GTX 1660 Ti and Turing Minor means that NVIDIA has needed to adjust how they go about promoting the new cards and the Turing architecture. While Turing launched with a laundry list of features, most of which had nothing to do with RTX, the broader consumer zeitgeist definitely focused on RTX and for good reason: compared to all of the low-level architectural changes under the hood, ray tracing, DLSS, and other RTX features are a lot more visible, and for NVIDIA they were easier to promote. This means that for Turing Minor NVIDIA instead has to focus on the low-level architectural improvements in Turing, which I think is great since these were largely overlooked at the Turing launch.

While I won’t recap our entire Turing deep dive here, relative to Pascal the big difference here is the numerous steps NVIDIA has taken to improve their IPC and overall efficiency. For example, Turing made the surprising move to ditch regular forms of Instruction Level Parallelism (ILP) by dropping the second warp scheduler dispatch port. Instead, each warp scheduler fires off a single set of instructions on each clock, taking advantage of the fact that it takes 2 (or more) clocks to issue a full warp in order to interleave a second instruction in.

This ILP change goes hand-in-hand with partitioning the SM into 4 blocks instead of 2, which serves to help better control resource contention among the warps and CUDA cores. In fact at a high level, a Turing SM looks a lot more like some of NVIDIA’s server-focused GPUs than their consumer-focused GPUs; there’s a lot more plumbing here in various forms to support the CUDA cores and to help them achieve better performance, rather than just throwing more CUDA cores at the problem. The net result is that while we don’t have metrics from NVIDIA, I fully expect that the ratio of supporting hardware and glue logic to CUDA cores is significantly higher on Turing than it was on GP106 Pascal. Though by the same token, I expect the SMs as a whole are larger than Pascal’s as well, which is certainly reflected in the die size.

A big part of this change, in turn, is the fact that NVIDIA broke out their integer cores into their own block. Previously a separate branch of the FP32 CUDA cores, the INT32 cores can now be addressed separately from the FP32 cores, which combined with instruction interleaving allows NVIDIA to keep both occupied at the same time. Now make no mistake: floating point math is still the heart and soul of shading and GPU compute. However, integer performance has been slowly increasing in importance over time as well, especially as shaders get more complex and there’s increased usage of address generation and other INT32 functions. This change is a big part of the IPC gains NVIDIA is claiming for the Turing architecture.
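To get a feel for why interleaving FP32 and INT32 work matters, here is a deliberately crude toy model of a single SM partition. It assumes 16 lanes per pipe (64 CUDA cores per SM divided across 4 partitions, per the table), one instruction issued per clock, and no dependencies or latencies at all. The "shared pipe" case is a hypothetical baseline for comparison and is not meant as a model of Pascal; the instruction mix is arbitrary.

```python
import math

WARP_SIZE = 32
LANES_PER_PIPE = 16  # assumed: 64 CUDA cores per SM / 4 partitions

# Clocks a pipe stays busy executing one warp-wide instruction.
CLOCKS_PER_WARP_INSTR = math.ceil(WARP_SIZE / LANES_PER_PIPE)


def clocks_shared_pipe(n_fp32: int, n_int32: int) -> int:
    """Hypothetical baseline: FP32 and INT32 instructions queue on one pipe."""
    return (n_fp32 + n_int32) * CLOCKS_PER_WARP_INSTR


def clocks_split_pipes(n_fp32: int, n_int32: int) -> int:
    """Separate FP32 and INT32 pipes, fed by one issued instruction per clock."""
    issue_limit = n_fp32 + n_int32                 # scheduler: 1 instr/clock
    fp32_busy = n_fp32 * CLOCKS_PER_WARP_INSTR     # time the FP32 pipe needs
    int32_busy = n_int32 * CLOCKS_PER_WARP_INSTR   # time the INT32 pipe needs
    return max(issue_limit, fp32_busy, int32_busy)


# Arbitrary illustrative mix: 100 FP32 and 40 INT32 warp instructions.
shared = clocks_shared_pipe(100, 40)
split = clocks_split_pipes(100, 40)
print(f"shared pipe: {shared} clocks, split pipes: {split} clocks")
```

In this toy model the split-pipe figure is a best-case lower bound: the integer work hides entirely behind the floating point work as long as the scheduler always has an independent instruction of the other type ready to issue.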

Speaking of CUDA cores, like all other Turing parts, TU116 and Turing Minor get NVIDIA’s fast FP16 functionality. This means that these GPUs can process FP16 operations at twice the rate of FP32 operations – via the GPU’s dedicated FP16 cores – which for GTX 1660 Ti works out to 11 TFLOPS of performance. Using FP16 shaders in PC games is still relatively new – the baseline 8th gen consoles don’t support it and NVIDIA previously limited this feature to server parts – but it’s more widely used in mobile games where FP16 support is common. There, as it will be in the PC space, FP16 shaders allow for developers to trade off between performance and shader precision by using a lower precision format; not all shader programs require a full FP32’s worth of precision, and when done right it can improve performance and reduce memory bandwidth needs without any real image quality impact.
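As a sanity check on that throughput figure, here is a minimal sketch of the usual peak-rate arithmetic. The core count comes from the comparison table and the 2x FP16 rate from the text; the 1770MHz boost clock is the GTX 1660 Ti's reference boost clock, which is not stated in this section, so treat it as an assumption.

```python
CUDA_CORES = 1536        # TU116, from the comparison table
BOOST_CLOCK_GHZ = 1.77   # assumed: GTX 1660 Ti reference boost clock
FLOPS_PER_FMA = 2        # a fused multiply-add counts as two operations
FP16_RATE_VS_FP32 = 2    # FP16 runs at twice the FP32 rate, per the text

# cores * ops per clock * GHz gives GFLOPS; divide by 1000 for TFLOPS.
fp32_tflops = CUDA_CORES * FLOPS_PER_FMA * BOOST_CLOCK_GHZ / 1000
fp16_tflops = fp32_tflops * FP16_RATE_VS_FP32

print(f"FP32: {fp32_tflops:.2f} TFLOPS")
print(f"FP16: {fp16_tflops:.2f} TFLOPS")
```

These are theoretical peaks at the boost clock; real-world clocks and sustained throughput will vary from card to card.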

Meanwhile, looking at the rest of the GPU, the memory and cache system is a bit of a grab bag. On the one hand, Turing implements NVIDIA’s latest lossless memory compression technology. This has proven to be one of NVIDIA’s bigger advantages over AMD, and continues to allow them to get away with less memory bandwidth than we’d otherwise expect some of their GPUs to need. The actual savings vary from game to game, but for the RTX 2080 Ti launch, NVIDIA reported that they were seeing reductions in traffic between 18% and 33%.


From the RTX 2080 Ti Launch

However, distinct to TU116 versus its Turing Major siblings, the latest GPU has less L2 cache per ROP partition. Turing Major GPUs all have 512KB of L2 cache per partition, giving TU106 a total of 4MB of L2, for example. TU116 on the other hand has just 256KB of L2 per partition for a total of 1.5MB of L2, which happens to be the same amount of cache and cache ratios as on GP106. The performance impact of this is hard to measure given all of the other changes in the GPU, but clearly NVIDIA has saved some die size at the cost of some increase in cache misses. The wildcard in all of this is how much the additional bandwidth of GDDR6 helps to offset those misses.
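The L2 totals fall out of the ROP counts fairly directly. The sketch below assumes 8 ROPs per ROP/L2 partition, which is not stated in this section but is consistent with every L2 figure in the comparison table; the per-partition cache sizes are the ones given above.

```python
ROPS_PER_PARTITION = 8  # assumption: 8 ROPs per ROP/L2 partition

# name: (ROPs from the table, L2 per partition in KB from the text)
GPUS = {
    "TU102": (96, 512),
    "TU104": (64, 512),
    "TU106": (64, 512),
    "TU116": (48, 256),
}


def l2_total_mb(rops: int, l2_kb_per_partition: int) -> float:
    """Total L2 in MB, given the ROP count and the L2 slice per partition."""
    partitions = rops // ROPS_PER_PARTITION
    return partitions * l2_kb_per_partition / 1024


l2_totals = {name: l2_total_mb(*spec) for name, spec in GPUS.items()}
for name, total in l2_totals.items():
    print(f"{name}: {total} MB of L2")
```

Had TU116 kept the 512KB slices of its bigger siblings, the same arithmetic would give it 3MB of L2 rather than 1.5MB.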

Finally on the graphics front, Turing Minor also retains Turing’s adaptive shading capabilities. Not unlike RTX, this is a new feature that is going to take some time to get adopted, so we’ve only seen a handful of games (such as Wolfenstein II) implement it thus far. But by reducing the pixel shader granularity/rate used at various points in a scene, the technology makes it possible to improve performance by reducing the overall shading workload.

The trick with adaptive shading – and why it’s a feature rather than an immediate and transparent means of improving performance – is that it’s making a very direct quality/speed tradeoff with pixel shaders; the reduced shading rate can reduce the overall image quality by reducing clarity and creating aliasing artifacts. So developers are still in the early days of experimenting with the technology to figure out where they can use it without noticeably hurting image quality. In practice I expect we’re going to see it more widely deployed in VR games at first, as the tech is much easier to use there (reduce the rate anywhere the user isn’t looking), as opposed to traditional games.
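As a rough illustration of the workload side of that tradeoff, here is a back-of-the-envelope sketch. It assumes one pixel shader invocation per pixel at full rate, one invocation per 2x2 block in coarse-shaded regions, and it ignores overdraw, MSAA and everything else a real renderer does; the function and its parameters are hypothetical and not part of any graphics API.

```python
def shader_invocations(width, height, coarse_fraction, coarse_block=(2, 2)):
    """Estimate pixel shader invocations per frame.

    `coarse_fraction` is the share of the screen shaded once per
    `coarse_block` of pixels instead of once per pixel.
    """
    pixels = width * height
    block_pixels = coarse_block[0] * coarse_block[1]
    full_rate = pixels * (1 - coarse_fraction)
    coarse_rate = pixels * coarse_fraction / block_pixels
    return full_rate + coarse_rate


baseline = shader_invocations(1920, 1080, coarse_fraction=0.0)
half_coarse = shader_invocations(1920, 1080, coarse_fraction=0.5)
saving = 1 - half_coarse / baseline

print(f"baseline: {baseline:.0f} invocations")
print(f"half the screen at 2x2: {half_coarse:.0f} invocations ({saving:.1%} fewer)")
```

The image quality cost of those coarse regions is the part this arithmetic can't capture, and it is exactly what developers have to tune by hand.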

The end result of all of this is that while Turing Minor has some very important feature differences from Turing Major, at the end of the day it’s still Turing. NVIDIA for their part is going to have to grapple with the fact that not all of their current-generation cards feature RTX functionality, but that’s going to be marketing’s problem. As for consumers, unless you’re specifically seeking out NVIDIA’s ray tracing and tensor core functionality, GTX 1660 Ti is just another Turing.


157 Comments


  • GreenReaper - Friday, February 22, 2019

    Your point is a lie, though, as you clearly didn't buy it on his recommendation. How can we believe anything you say after that?
  • Questor - Wednesday, March 6, 2019

    Not criticizing, simply adding:
    Several times in the past, honest review sites did comparisons of electrical costs in several places around the States and a few other countries with regard to brand A video card at a lower power draw than brand B video card. The idea was to calculate a reasonable overall cost for the extra power draw and if it was worth worrying about/worth specifically buying the lower draw card. In each case it was negligible in terms of additional power use by dollar (or whatever currency). A lot of these great sites have died out or been bought out and are gone now. It's a darned shame. We used to actually get real useful information about products and what all these values actually mean to the user/customer/consumer. We used to see the same for power supplies too. I haven't seen anything like that in years now. Too bad. It proved how little a lot of the numbers mattered in real life to real bill paying consumers.
  • Icehawk - Friday, February 22, 2019

    Man this sucks, clearly this card isn't enough for 4k and I'm not willing to spend on a RTX 2070. Can I hope for a GTX 1170 at like $399? 8gb of RAM please. I'm not buying a new card until it's $400 or less and has 8gb+, my 970 runs 1440p maxed or close to it in almost all AAA games and even 4k in some (like Overwatch) so I'm not going for a small improvement - after 2 gens I should be looking at close to double the performance but it sure doesn't look like that's happening currently.
  • eva02langley - Friday, February 22, 2019

    Navi is your only hope.
  • CiccioB - Friday, February 22, 2019

    And I think he will be even more disappointed if he's looking for a 4K card that is able to play modern games.

    BTW: No 1170 will be made. This card is the top Turing without RT+TC and so it's the best performance you can get at the lowest price. Other Turing with no RT+TC will be slower (though probably cheaper, but you are not looking for just a cheap card, you are looking for x2 the performance of your current one).
  • catavalon21 - Sunday, February 24, 2019

    I am curious, what are you basing "no 1170" on?
  • CiccioB - Monday, February 25, 2019

    Huh, let's see...
    designing a new chip costs a lot of money, especially when it is not that tiny.
    A chip bigger than this TU116 will be just faster than the 2060, which has a 445mm^2 die size which has to be sold with some margins (unlike AMD that sells Vega GPU+HBM at the price of bread slices and at the end of the quarter reports gains in the amount of the fractions number of nvidia, but that's good for AMD fans, it is good that the company looses money to make them happy with oversized and HW that performs like mainstream competition one).
    So creating a 1170 simply means killing the 2060 (and probably 2070), just defeating the original purpose of these cards as first lower HW (possible mainstream) capable of RT.

    Unless you are supposing nvidia is going to scrap completely their idea that RT is the future and its support will be expanded in future generations, there's no valid, rational reason for them to create a new GPU that will replace the cut version of TU106.

    All this without considering that AMD is probably not going to compete on 7nm as with that PP they will probably manage to reach Pascal performance while at 7nm nvidia is going to blow any AMD solution away under the point of view of absolute performance, performance per W and performance per mm^2 (despite the addition of the new computational units that will find more and more usage in the future.. none still has thought of using tensor core for advanced AI, for example).

    So, no, there will be no 1170 unless it will be a further cut of TU106 that at the end will perform just like TU116 but will be just a mere recycle of broken silicon.

    Now, let me hear what makes you believe that a 1170 will be created.
  • catavalon21 - Tuesday, February 26, 2019

    I do not know if they will create an 1170 or not; to be fair, I am surprised they even created the 1160. You have a very good point, upon reflection, it is quite likely such a product would impact RTX sales. I was just curious what had you thinking that way.

    Thank you for the response.
  • Oxford Guy - Saturday, February 23, 2019

    Our only hope is capitalism.

    That's not going to happen, though.

    Instead, we get duopoly/quasi-monopoly.
  • douglashowitzer - Friday, February 22, 2019

    Hey not sure if you're opposed to used GPUs... but you can get a used, overclocked, 3rd party GTX 1080 with 8GB vram on eBay for about $365-$400. In my opinion it's an amazing deal and I can tell you from experience that it would satisfy the performance jump that you're looking for. It's actually the exact situation I was in back in June of 2016 when I upgraded my 970 to a 1080. Being a proper geek, I maintained a spreadsheet of my benchmark performance improvements and the LOWEST improvement was an 80% gain. The highest was a 122% gain in Rise of the Tomb Raider (likely VRAM related but impressive nonetheless). Honestly I don't believe I've ever experienced a performance improvement that felt so "game changing" as when I went from my 970 to the 1080. Maybe waaay back when I upgraded my AMD 6950 to a GTX 670 :). If "used" doesn't turn you off, the upgrade of your dreams is waiting for you. Good luck to you!
