TU116: When Turing Is Turing… And When It Isn’t

Diving a bit deeper into matters, we have the new TU116 GPU at the heart of the GTX 1660 Ti. While NVIDIA does not announce future products in advance, you can expect that this will be the first of at least a couple of GPUs in what’s now the TU11x family, as NVIDIA is going to want to follow the same streamlined strategy for the eventual successors to GP107 and possible GP108.

TU116 is an interesting piece of kit, both because of the decisions that lead to this point and because of both the drawbacks and advantages of excising some of Turing’s functionality. As mentioned earlier in this article, NVIDIA has made a very deliberate decision to cut out their RTX functionality – the ray tracing cores and tensor cores – in order to produce a GPU that’s better suited for traditional rendering. The end result is a smaller, cheaper to produce GPU. But it also means that NVIDIA has to change how they go about promoting cards based on this GPU.

With a die size of 284mm2, TU116 tells a story in and of itself. This makes it 40% smaller than the next-smallest Turing GPU, TU106. Similarly, the transistor count has come down from 10.8 billion to 6.6 billion. This greatly improves the manufacturability of the GPU and drives down its costs, especially since NVIDIA will be going into more competitive markets with it than the other TU10x GPUs. Still, TU116 is some 42% bigger than the 200mm2 GP106 die that it replaces, so even though it’s more efficient, NVIDIA is still dealing with a significant increase in die size on a generation-by-generation basis.

Unfortunately, TU116 doesn’t give us a terribly good baseline for determining how much of a TU10x SM was composed of RTX hardware. TU116 doesn’t just drop the RTX hardware in its SMs, but it’s a smaller design overall; fewer SMs, fewer memory channels, and fewer ROPs. So we can’t fully separate the savings of dropping RTX from the savings of making a lighter GPU in general. However it’s interesting to note that on a relative basis, the transistor count difference between TU116 and TU106 is almost exactly the same as GP106 and GP104: there are 39% fewer transistors when stepping down. So later on it will give us an opportunity to look at performance and see if the performance gap between the GTX 1660 Ti and RTX 2070 – the full-fat cards of their respective GPUs – is anything like the sizable gap between the GTX 1060 6GB and the GTX 1080.

NVIDIA Turing GPU Comparison
  TU102 TU104 TU106 TU116
CUDA Cores 4608 3072 2304 1536
SMs 72 48 36 24
Texture Units 288 192 144 96
RT Cores 72 48 36 N/A
Tensor Cores 576 384 288 N/A
ROPs 96 64 64 48
Memory Bus Width 384-bit 256-bit 256-bit 192-bit
L2 Cache 6MB 4MB 4MB 1.5MB
Register File (Total) 18MB 12MB 9MB 6MB
Architecture Turing Turing Turing Turing
Manufacturing Process TSMC 12nm "FFN" TSMC 12nm "FFN" TSMC 12nm "FFN" TSMC 12nm "FFN"
Die Size 754mm2 545mm2 445mm2 284mm2

But getting back to architecture, this launch is one of a handful of times we’ve seen NVIDIA use dissimilar GPUs in their consumer cards, and it’s a situation without a good parallel. NVIDIA had done plenty of non-homogenous families in the past, but typically the black sheep of the family is the high-end server GPU, e.g. GP100, where it gets additional features not found in the consumer lineup. Instead the Turing family ends up having a split right down the middle.

The good news for consumers is that, outside of RTX functionality, TU116 and its ilk – which for the sake of simplicity I’m going to call Turing Minor from here on out – is functionally equivalent to TU102/TU104/TU106 (Turing Major). Turing Minor has the exact same DirectX feature set, the exact same core compute architecture (right on down to cache sizes), the exact same video and display blocks, etc. The RT and tensor cores really are the only thing that’s changed.

The situation looks much the same for programmers & developers as well: on the current press drivers the GTX 1660 TI reports itself as a Compute Capability 7.5 card – the same CC version as all of the Turing Major cards – so developers won’t have to even compile separate code for Turing Minor cards. So long as their code can handle a lack of tensor cores, at least.

(As a brief aside, as a performance exercise we ran the tensor version of our HGEMM benchmark on the GTX 1660 Ti. And it completed?! Performance was a bit lower, at 10.8 TFLOPS versus 11 TFLOPS with tensors disabled, but it did complete. Which indicates that either NVIDIA has been less than forthcoming on TU116, or in order to keep all Turing parts on CC 7.5, they are sending tensor ops through the CUDA cores on Turing Minor cards)

Looking at the TU116 SM, what we find is something almost identical to the SM diagrams used for Turing Major, with the SM arranged into 4 partitions, each with their own warp schedule and set of CUDA cores, while all 4 partitions share the L1 cache and texture units. Cache sizes and register file sizes are all unchanged here, so average throughput and register pressure are similarly unchanged as well. The one standout is that in replacing the tensor cores in their diagram, NVIDIA has opted to draw in FP16 cores, which is a bit of a stretch given what we know about the Turing architecture. NVIDIA only sent out this diagram yesterday, so I’m still checking with them to see if this is the company taking a creative liberty to highlight Turing’s other functionality, or if there’s more to it that NVIDIA is downplaying to keep things simple (ala Kepler and GK104).

Update: NVIDIA has gotten back to me this morning. As it turns out, the FP16 cores in the diagram are quite literal. For more information, please see below.

The Curious Case of FP16: Tensor Cores vs. Dedicated Cores

Even though Turing-based video cards have been out for over 5 months now, every now and then I’m still learning something new about the architecture. And today is one of those days.

Something that escaped my attention with the original TU102 GPU and the RTX 2080 Ti was that for Turing, NVIDIA changed how standard FP16 operations were handled. Rather than processing it through their FP32 CUDA cores, as was the case for GP100 Pascal and GV100 Volta, NVIDIA instead started routing FP16 operations through their tensor cores.

The tensor cores are of course FP16 specialists, and while sending standard (non-tensor) FP16 operations through them is major overkill, it’s certainly a valid route to take with the architecture. In the case of the Turing architecture, this route offers a very specific perk: it means that NVIDIA can dual-issue FP16 operations with either FP32 operations or INT32 operations, essentially giving the warp scheduler a third option for keeping the SM partition busy. Note that this doesn’t really do anything extra for FP16 performance – it’s still 2x FP32 performance – but it gives NVIDIA some additional flexibility.

Of course, as we just discussed, the Turing Minor does away with the tensor cores in order to allow for a learner GPU. So what happens to FP16 operations? As it turns out, NVIDIA has introduced dedicated FP16 cores!

These FP16 cores are brand new to Turing Minor, and have not appeared in any past NVIDIA GPU architecture. Their purpose is functionally the same as running FP16 operations through the tensor cores on Turing Major: to allow NVIDIA to dual-issue FP16 operations alongside FP32 or INT32 operations within each SM partition. And because they are just FP16 cores, they are quite small. NVIDIA isn’t giving specifics, but going by throughput alone they should be a fraction of the size of the tensor cores they replace.

To users and developers this shouldn’t make a difference – CUDA and other APIs abstract this and FP16 operations are simply executed wherever the GPU architecture intends for them to go – so this is all very transparent. But it’s a neat insight into how NVIDiA has optimized Turing Minor for die size while retaining the basic execution flow of the architecture.

Now the bigger question in my mind: why is it so important to NVIDIA to be able to dual-issue FP32 and FP16 operations, such that they’re willing to dedicate die space to fixed FP16 cores? Are they expecting these operations to be frequently used together within a thread? Or is it just a matter of execution ports and routing? But that is a question we’ll have to save for another day.

Turing Minor: Turing Sans RTX

For better or worse, the launch of GTX 1660 Ti and Turing Minor means that NVIDIA has needed to adjust how they go about promoting the new cards and the Turing architecture. While Turing launched with a laundry list of features, most of which had nothing to do with RTX, the broader consumer zeitgeist definitely focused on RTX and for good reason: compared to all of the low-level architectural changes under the hood, ray tracing, DLSS, and other RTX features are a lot more visible, and for NVIDIA they were easier to promote. This means that for Turing Minor NVIDIA instead has to focus on the low-level architectural improvements in Turing, which I think is great since these were largely overlooked at the Turing launch.

While I won’t recap our entire Turing deep dive here, relative to Pascal The big difference here is the numerous steps NVIDIA has taken to improve their IPC and overall efficiency. For example, Turing made the surprising move to ditch regular forms of Instruction Level Parallelism (ILP) by dropping the second warp scheduler dispatch port. Instead, each warp scheduler fires off a single set of instructions on each clock, taking advantage of the fact that it takes 2 (or more) clocks to issue a full warp in order to interleave a second instruction in.

This ILP change goes hand-in-hand with partitioning the SM into 4 blocks instead of 2, which serves to help better control resource contention among the warps and CUDA cores. In fact at a high level, a Turing SM looks a lot more like some of NVIDIA’s server-focused GPUs than their consumer-focused GPUs; there’s a lot more plumbing here in various forms to support the CUDA cores and to help them achieve better performance, rather than just throwing more CUDA cores at the problem. The net result is that while we don’t have metrics from NVIDIA, I fully expect that the ratio of supporting hardware and glue logic to CUDA cores is significantly higher on Turing than it was GP106 Pascal. Though by the same token, I expect the SMs as a whole are larger than Pascal’s as well, which is certainly reflected in the die size.

A big part of this change, in turn, is the fact that NVIDIA broke out their Integer cores into their own block. Previously a separate branch of the FP32 CUDA cores, the INT32 cores can now be addressed separately from the FP32 cores, which combined with instruction interleaving allows NVIDIA to keep both occupied at the same time. Now make no mistake: floating point math is still the heart and soul of shading and GPU compute, however integer performance has been slowly increasing in importance over time as well, especially as shaders get more complex and there’s increased usage of address generation and other INT32 functions. This change is a big part of the IPC gains NVIDIA is claiming for Turing architecture.

Speaking of CUDA cores, like all other Turing parts, TU116 and Turing Minor get NVIDIA’s fast FP16 functionality. This means that these GPUs can process FP16 operations at twice the rate of FP32 operations – via the GPU’s dedicated FP16 cores – which for GTX 1660 Ti works out to 11 TFLOPS of performance. Using FP16 shaders in PC games is still relatively new – the baseline 8th gen consoles don’t support it and NVIDIA previously limited this feature to server parts – but it’s more widely used in mobile games where FP16 support is common. There, as it will be in the PC space, FP16 shaders allow for developers to trade off between performance and shader precision by using a lower precision format; not all shader programs require a full FP32’s worth of precision, and when done right it can improve performance and reduce memory bandwidth needs without any real image quality impact.

Meanwhile, looking at the rest of the GPU, the memory and cache system is a bit of a grab bag. On the one hand, Turing implements NVIDIA’s latest lossless memory compression technology. This has proven to be one of NVIDIA’s bigger advantages over AMD, and continues to allow them to get away with less memory bandwidth than we’d otherwise expect some of their GPUs to need. The actual savings vary from game to game, but for the GTX 2080 Ti launch, NVIDIA reported that they were seeing reductions in traffic between 18% and 33%


From the RTX 2080 Ti Launch

However, distinct to TU116 versus its Turing Major siblings, the latest GPU has a less L2 cache per ROP partition. Turing Major GPUs all have 512KB of L2 cache per partition, giving TU106 a total of 4MB of L2, for example. TU116 on the other hand has just 256KB of L2 per partition for a total of 1.5MB of L2, which happens to be the same amount of cache and cache ratios as on GP106. The performance impact of this is hard to measure given all of the other changes in the GPU, but clearly NVIDIA had traded off some die size at the cost of some increases in cache misses. The wildcard in all of this being how much the additional bandwidth of GDDR6 helps to offset those misses.

Finally on the graphics front, Turing Minor also retains Turing’s adaptive shading capabilities. Not unlike RTX, this is a new feature that is going to take some time to get adopted, so we’ve only seen a handful of games (such as Wolfenstein II) implement it thus far. But by reducing the pixel shader granularity/rate used at various points in a scene, the technology makes it possible to improve performance by reducing the overall shading workload.

The trick with adaptive shading – and why it’s a feature rather than an immediate and transparent means of improving performance – is that it’s making a very direct quality/speed tradeoff with pixel shaders; the reduced shading rate can reduce the overall image quality by reducing clarity and creating aliasing artifacts. So developers are still in their infancy playing with the technology to figure out where they can use it without noticeably hurting image quality. In practice I expect we’re going to see it more widely deployed in VR games at first, as the tech is much easier to use there (reduce the rate anywhere the user isn’t looking), as opposed to traditional games.

The end result of all of this is that while Turing Minor has some very important feature differences from Turing Major, at the end of the day it’s still Turing. NVIDIA for their part is going to have to grapple with the fact that not all of their current-generation cards feature RTX functionality, but that’s going to be marketing’s problem. As for consumers, unless you’re specifically seeking out NVIDIA’s ray tracing and tensor core functionality, GTX 1660 Ti is just another Turing.

The NVIDIA GeForce GTX 1660 Ti Review: Featuring EVGA Meet the EVGA GeForce GTX 1660 Ti XC Black
POST A COMMENT

157 Comments

View All Comments

  • CiccioB - Tuesday, March 5, 2019 - link

    Kid, as I said you lack basic intelligence to recognize when you are just arguing about nothing.
    The number I'm using are those published by AMD and nvidia in their quarter results.
    Now, if you are asking me the links for those reports it means you don't have the minimum idea of what I'm talking about AND you cannot do a simple search with Google.
    So I stand my "insults": you have not the intelligence to argue about this simple topic, so stop writing completely on this site that has much better readers than you and is not gaining anything by your presence.
    Reply
  • Korguz - Tuesday, March 5, 2019 - link

    ahh here are the insults and name calling... and you are calling me a kid ??

    i can, and have done a simple google search.. BUT, i would like to see the SAME info YOU are looking at as well, but again.. i guess that is just too much to ask of you, is it wrong to want to be able to compare the same facts as you are looking at ? i guess so.. cause you STILL refuse to post where you get your facts and info from, i sure dont have the time so spend who knows how long to do a simple google search...

    by standing by our insults, just shows YOU are the child here.. NOT me, as only CHILDREN resort to insults and name calling...

    as i said in my reply farther down :
    you refuse post links, OR mention your sources, simply because YOU DONT HAVE ANY.. most of what you post.. is probably made up, or rumor, if AT posted things like you do, with no sources, you probably would be all over them asking for links, proof and the like... and by YOUR previous posts, all of your info is made up and false..

    maybe YOU should stop posting your info and facts from rumor sites, and learn to talk with some intelligence your self.
    Reply
  • Qasar - Tuesday, March 5, 2019 - link

    sorry CiccioB, but i agree with Korguz.. i have tried to find the " facts " on some of the things you have posted in this thread.. and i cant find them.. i also would like to know where you get your " opinions " from. Reply
  • CiccioB - Wednesday, March 6, 2019 - link

    Quarter results!
    It's not really difficult to find them..
    Try "AMD quarter results" and then "nvidia quarter results" in Google search engine and.. voilà, les jeux sont faits. Two clicks and you can read them. Back some years if you want,so you can have a history of what has happened during the last years apart the useless comments by fanboys you find on forums.
    Now, if you have further problems at understanding all those tables and numbers or you do not know what is a gross margin vs a net income, then, you can't come here and argue I have no facts. It's you that can't understand publicly available data.

    So if you want already chewed numbers that someone has interpreted for you, you can read them here:
    https://www.anandtech.com/show/13917/amd-earnings-...

    I wonder what you were looking for for not finding those numbers that have been commented by every site that is about technology.

    @Korguz
    You a re definitely a kid. You surely do not scare me with all those nonsense you write when the solution for YOUR problem (not mine) was simply to read more and write less.
    Reply
  • Korguz - Thursday, March 7, 2019 - link

    CiccioB you are hilarious !!!
    did you look up the word hypnotize, to see what it means, and how it even relates to this ? as, and i quote YOU " Or you may try to hypnotize the third view " what does that even mean ??

    i KNEW the ONLY link you would mention.. is the EASY one to find.

    BUT... what about all of these LIES :

    " GCN was dead at its launch time"
    " 9 years of discounted sell are not enough to show you that GCN simply started as half a generation old architecture to end as a generation obsolete one? "
    " that is Hawaii, which was so discounted "
    " starting with Fiji and it's monster BOM cost "
    " AMD is selling Ryzen CPU at a discount like GPUs and both have a 0.2% net margin "
    " One is that AMD is discounting every product (GPU and CPU) to a ridiculous margin "
    " a "panic plan" that required about 3 years to create the chips. 3 years ago they already know that they would have panicked at the RTX cards launch and so they made the RT-less chip as well "

    i did the simple google search for the above comments from you, as well as variations.. and guess what.. NOTHING comes up. THESE are the links i would like you to provide, as i cant find any of these LIES . so " It's you that can't understand publicly available data. " the above quotes, are not publicly available data, even your sacred " simple google search " cant find them.

    lastly.. your insults and name calling ( and the fact that you stand by them ).. the only people i hear things like this from.. are from my friends and coworkers TEENAGE CHILDREN. NOT adults.. adults don't resort to things like this, at least the ones that i know... this alone.. shows how immature and childish you really are... i am pretty sure.. you WILL never post links to 98% of the LIES, RUMORS, or your personal speculation, and opinions, because of the simple fact, you just CAN'T, as your sources for all this... simply doesn't exist.

    when you are able to reply to people with out having to resort to name calling and insults, then maybe you might be taken seriously. till then... you are nothing but a lying, immature child, who needs to grow up, and learn how to talk to other people in a respectful manner... maybe YOU should take your OWN advice, and simply read more and write less. Have a good day.
    Reply
  • D. Lister - Saturday, February 23, 2019 - link

    @CiccioB

    Navi will still be GCN unfortunately.
    Reply
  • CiccioB - Monday, February 25, 2019 - link

    If so don't cry if price will remain high (if not higher) for the next 3 years.
    We already know why.
    Reply
  • Simplex - Friday, February 22, 2019 - link

    "EVGA Precision remains some of the best overclocking software on the market."

    Better than MSI Afterburner?
    Reply
  • Ryan Smith - Friday, February 22, 2019 - link

    I consider both of them to be in the same tier, for what it's worth. Reply
  • Rudde - Friday, February 22, 2019 - link

    In what way does 12nm FFN improve over 16nm? The transistor density is roughly the same, the frequencies see little to no improvement and the power-efficiency has only seen small improvements. Worse yet, the space used per SM has gotten worse. I do know that Turing brings architectural improvements, but are they at the cost of die space? Seems odd that Nvidia wouldn't care about die area when their flagships are huge chips that would benefit from a more dense architecture.

    Or could it be that Turing adds some kind of (sparse) logic that they haven't mentioned?
    Reply

Log in

Don't have an account? Sign up now