The NVIDIA Turing GPU Architecture Deep Dive: Prelude to GeForce RTX

Name: The NVIDIA Turing GPU Architecture Deep Dive: Prelude to GeForce RTX
Item: The NVIDIA Turing GPU Architecture Deep Dive: Prelude to GeForce RTX
Author: Nate Oh

by Nate Oh on September 14, 2018 12:30 PM EST

111 Comments | Add A Comment

111 Comments

Feeding the Beast (2018): GDDR6 & Memory Compression

Memory bandwidth has always been a challenge for video cards, and that challenge only continues to get harder. Thanks to the mechanics of Moore’s Law, GPU transistor counts – and therefore the quantities of various cores – is growing at a rapid pace. Meanwhile DRAM, whose bandwidth is not subject to the same laws, has grown at a much smaller pace.

The net result is that with nearly every generation, the amount of memory bandwidth available per FLOP, per texture lookup, and per pixel blend has continued to drop. So to keep GPU performance scaling – to feed the great graphical beast – GPU manufacturers and the memory industry as a whole have to look for new ways to boost memory bandwidth for future memory technologies while reducing the amount of memory bandwidth they’re using right now. Neither is easy, and both are areas where NVIDIA has been executing well on over most of the past decade, making it an architectural strength for the company.

NVIDIA Memory Bandwidth per FLOP (In Bits)
GPU	Bandwidth/FLOP	Total CUDA FLOPs	Total Bandwidth
RTX 2080	0.36 bits	10.06 TFLOPs	448GB/sec
GTX 1080	0.29 bits	8.87 TFLOPs	320GB/sec
GTX 980	0.36 bits	4.98 TFLOPs	224GB/sec
GTX 680	0.47 bits	3.25 TFLOPs	192GB/sec
GTX 580	0.97 bits	1.58 TFLOPs	192GB/sec

Turing, in turn, is a bit of an interesting swerve in this pattern thanks to its heavy focus on ray tracing and neural network inferencing. If we're looking at memory bandwidth merely per CUDA core FLOP, then bandwidth per FLOP has actually gone up, since RTX 2080 doesn't deliver a significant increase in (on-paper) CUDA core throughput relative to GTX 1080. However RTX 2080 also now has to feed ray tracing cores and tensor cores, both of which are very bandwidth hungry on their own. So while in pure, FP32 core compute scenarios the situation has improved a bit, once the entire GPU is put to work, the amount of contention for memory bandwidth is still higher than ever.

In terms of memory technologies, the 16nm/14nm generation of GPUs saw an interesting and atypical divergence in the memory space. GDDR5, which has been with us for a decade now, has been ripe for replacement. The JEDEC, the industry standardization body responsible for setting common memory standards, initially approached this in two directions. The first route was more traditional, developing a successor technology to GDDR5, which became GDDR5X. Meanwhile the more radical approach was ultra-wide I/O technologies, which became the High Bandwidth Memory (HBM) standards.

NVIDIA for their part embraced both, but at different levels. HBM is very powerful, but the realities of wide I/O make it harder to manufacture and more costly to put on a product, thanks to the need for a silicon interposer. As a result, HBM (and specifically, HBM2) has only ever been used on NVIDIA’s flagship compute GPUs, the GP100 and GV100. For everything else, NVIDIA turned to GDDR5X. And this is where things get a bit odd.

GDDR5X is a JEDEC standard like GDDR5 before it, but it simply never saw the same kind of adoption that GDDR5 did. This goes both for memory vendors and GPU vendors. Only Micron ever produced the memory, and only NVIDIA ever used it. So the fastest Pascal cards – GTX 1080, GTX 1080 Ti, & Titan Xp – are outliers in that they’re the only (consumer) cards using this memory technology. GDDR5X was an important piece of the Pascal puzzle as it allowed NVIDIA to better feed their fastest cards, but in time it has essentially became a dead-end branch of GDDR memory technology as it never saw the kind of adoption required to reach critical mass.

So where did GDDR branch to instead? This brings us to GDDR6, the latest and greatest in GDDR memory technology. And unlike GDDR5X before it, GDDR6 has the full backing of the Big 3 memory manufacturers – Samsung, SK Hynix, and Micron – so the memory industry as a whole has a much larger stake in the technology. For NVIDIA’s products, this is evident right off the bat: NVIDIA is using Samsung’s 16Gb capacity GDDR6 modules for their Quadro cards, and meanwhile they’re tapping Micron’s 8Gb modules for the new GeForce RTX cards.

The performance impact of GDDR6, in turn, depends in part on what we’re comparing it to. Relative to GDDR5X, GDDR6 is not quite as big of a step up as some past memory generations, as many of GDDR6’s innovations were already baked into GDDR5X. GDDR5 officially topped out at 8Gbps per pin (with NV working with partners to do 9Gbps overclocked SKUs), while NVIDIA shipped GDDR5X cards clocked as high as 11.4Gbps. GDDR6, in turn, is going to be starting at 14Gbps in graphics cards, with future generations of the technology set to reach 16Gbps and higher. So for the likes of NVIDIA’s x70 cards, the switch from GDDR5 to GDDR6 is going to be one of those massive once-in-a-generation bandwidth jumps. However for NVIDIA’s x80 cards, the upgrade from GDDR5X to GDDR6 is going to give those products a healthy increase in memory bandwidth, it just won’t be a huge jump.

Diving a bit deeper here, there are really two core changes coming from GDDR5 that enable GDDR6’s big bandwidth boost. The first is the implementation of Quad Data Rate (QDR) signaling on the memory bus. Whereas GDDR5’s memory bus would transfer data twice per write clock (WCK) via DDR, GDDR6 (& 5X) extends this to four transfers per clock. All other things held equal, this allows GDDR6 to transfer twice as much data per clock as GDDR5.

The challenge in doing this, of course, is that the more you pump a memory bus, the tighter the signal integrity requirements. So while it’s simple to say “let’s just double the memory bus bandwidth”, doing it is another matter. In practice a lot of work goes into the GPU memory controller, the memory itself, and the PCB to handle these transmission speeds.

Every time NVIDIA launches support for a new memory technology, they like to roll out a new “eye diagram” signal analysis graph. And while at a high level these things don’t tell us anything we don’t already know – that NVIDIA has the technology working, and that, if it wasn’t, they wouldn’t publish these – they’re none the less neat to see. In this case we’re looking at a fairly clean eye diagram, illustrating the very tight 70ps transitions between data transfers. NVIDIA says that they were able to reduce signal crosstalk by 40% here, which is one of the key signal integrity changes required to make GDDR6’s high speed signaling possible.

Moving on, the second big change for GDDR6 is that how data is read out of the DRAM cells themselves has changed. For many generations the solution has been to just read and write in larger strides – the prefetch value – with GDDR5 taking this to 8n and GDDR5X taking it to 16n. However the resulting access granularities of 32 bytes and 64 bytes respectively were on the path of becoming increasingly suboptimal for small memory operations. As a result, GDDR6 does a larger prefetch and yet it does not.

Whereas both GDDR5 and GDDR5X used a single 32-bit channel per chip, GDDR6 instead uses a pair of 16-bit channels. This means that in a single memory core clock cycle (ed: not to be confused with the memory bus), 32 bytes will be fetched from each channel for a total of 64 bytes. This means that each GDDR6 memory chip can fetch twice as much data per clock as a GDDR5 chip, but it doesn’t have to be one contiguous chunk of memory. In essence, each GDDR6 memory chip can function like two chips.

For graphics this doesn’t have much of an impact since GPUs already read and write to RAM in massive sequential parallelism. However it’s a more meaningful change for other markets. In this case the smaller memory channels will help with random access performance, especially compared to GDDR5X and its massive 64 byte access granularity.

Moving on, GDDR6 also implements some changes to further reduce power consumption – or perhaps it’s better to say that these keep power consumption from continuing to grow. The standard operating voltage for the memory technology is 1.35v; this is identical to GDDR5X’s 1.35v voltage, but down from 1.5v for standard GDDR5.

The actual power savings are a bit hard to quantify here, as NVIDIA has rolled that data into all of their memory controller optimizations. But at least publicly, what they are saying is that in conjunction with “extensive” clock gating, they’ve been able to improve power efficiency by 20% over Pascal and GDDR5X, and undoubtedly by more versus Pascal paired with GDDR5. That said, these numbers should be taken with a small grain of salt, as these numbers appear to be averages rather than peaks. NVIDIA’s clock gating enhancements are primarily about reducing power consumption when GDDR6 is not running at full utilization, so peak power may be another story.

GPU Memory Math: GDDR6 vs. HBM2 vs. GDDR5X
	NVIDIA GeForce RTX 2080 Ti (GDDR6)	NVIDIA GeForce RTX 2080 (GDDR6)	NVIDIA Titan V (HBM2)	NVIDIA Titan Xp	NVIDIA GeForce GTX 1080 Ti	NVIDIA GeForce GTX 1080
Total Capacity	11 GB	8 GB	12 GB	12 GB	11 GB	8 GB
B/W Per Pin	14 Gb/s		1.7 Gb/s	11.4 Gbps	11 Gbps
Chip capacity	1 GB (8 Gb)		4 GB (32 Gb)	1 GB (8 Gb)
No. Chips/KGSDs	11	8	3	12	11	8
B/W Per Chip/Stack	56 GB/s		217.6 GB/s	45.6 GB/s	44 GB/s
Bus Width	352-bit	256-bit	3092-bit	384-bit	352-bit	256-bit
Total B/W	616 GB/s	448GB/s	652.8 GB/s	547.7 GB/s	484 GB/s	352 GB/s
DRAM Voltage	1.35 V		1.2 V (?)	1.35 V

All told then, NVIDIA will be the first GPU manufacturer to roll out GDDR6 support. And with the GTX 2070 on-up having all been announced already, it’s already going to be a wider roll-out than what we saw for GDDR5X. And to put things in numbers, relative to the GTX 10 series, the RTX 2080 Ti will get 27% more memory bandwidth, the RTX 2080 40% more bandwidth, and the RTX 2070 a whopping 75% more memory bandwidth than its predecessor.

However as this is a brand-new memory technology, I’m not sure whether we’re going to see it on the obligatory 2060 & 2050 cards. In transition periods, these tiers have been known to use older memory for cost and supply reasons – so we’ll have to see what happens.

Finally, just as GDDR6 is already seeing greater adoption on the memory manufacturer side, I’m expecting the same on the GPU side. AMD hasn’t announced their plans thus far, but I will be greatly surprised if we see them skip out on GDDR6 like they did GDDR5X.

Turing: Memory Compression Iterated

As I stated at the start of this section, the key to improving GPUs is to tackle the problem from two directions: increase the available memory bandwidth, and then decrease how much of it you use. For the latter, NVIDIA has employed a number of tricks over the years. Perhaps the most potent of which (that they’re willing to talk about, at least) being their memory compression technology.

The cornerstone of memory compression is what is called data color compression. DCC is a per-buffer/per-frame compression method that breaks down a frame into tiles, and then looks at the differences between neighboring pixels – their deltas. By utilizing a large pattern library, NVIDIA is able to try different patterns to describe these deltas in as few pixels as possible, ultimately conserving bandwidth throughout the GPU, not only reducing DRAM bandwidth needs, but also L2 bandwidth needs and texture unit bandwidth needs (in the case of reading back a compressed render target).

With Pascal, NVIDIA rolled out their 4^th generation technology, and now with Turing we’re at the 5^th generation. Unfortunately, details on what the 5^th generation entails are very slim; NVIDIA just isn’t talking about the technology much. The nature of DCC is such that it’s meant to be expandable: more silicon can be devoted to allowing more patterns to be checked at once. So it’s practically guaranteed that NVIDIA has once again expanded their library of patterns. However what those expanded patterns are, we don’t know.

However of the limited information NVIDIA has released, they’ve offered some effective memory bandwidth graphs, with the results broken down into how much of that gain came from actual memory bandwidth increases, and then how much of that came from efficiency increases. In NVIDIA’s examples the effective increase in bandwidth varies by game; as this is the RTX 2080 Ti, GDDR6 provides a constant 27% of the improvement, while the rest of the improvement is variable based on how useful NVIDIA’s memory compression updates are to the given game.

Overall, NVIDIA is seeing anywhere between a 45% increase in effective memory bandwidth for Grand Theft Auto V, up to a 60% increase for Ashes of the Singularity. Which, if we subtract out the base 27% memory clock increase, means that the efficiency increases are between 18% and 33%.

More broadly speaking, NVIDIA is claiming a 50% increase in effective memory bandwidth for the RTX 2080 Ti versus the GTX 1080 Ti. Which again subtracting the base 27% memory bandwidth increase from GDDR6, leaves us with an average efficiency improvement of 23%.

Relative to Pascal then, this is a smaller increase in effective memory bandwidth, but a slightly larger increase in compression efficiency. For Pascal – and specifically, GTX 1080 – NVIDIA claimed a 70% effective memory bandwidth increase, of which 20% was compression improvements.

So while NVIDIA isn’t gaining as much effective memory bandwidth this time around due to the smaller step up from GDDR5X to GDDR6, their compression gains have actually improved a bit between generations. Which is actually a bit surprising, as I would have otherwise expected diminishing returns in the gains from memory compression. After all, NVIDIA started with the most commonly seen pixel patterns, and each generation of DCC would be adding less common patterns.

The Turing Trio: TU102, TU104, & TU106 Unpacking 'RTX', 'NGX', and Game Support

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

111 Comments

View All Comments

Spunjji - Monday, September 17, 2018 - link
There's no such thing as a bad product, just bad pricing. AMD aren't out of the game but they are playing in an entirely different league.
siberian3 - Friday, September 14, 2018 - link
Good architectural leap for nvidia but it is sad very few of gamers can afford the new cards.
And AMD is not doing anything for 2018 and probably navi will be mid range on 7nm
V900 - Friday, September 14, 2018 - link
Meh, it’s always been that way with the newest, fastest GPUs.

Wait 6 months to a year, and prices will be where people with more modest budgets can play along.
B3an - Friday, September 14, 2018 - link
You must literally live under a rock while also being absurdly naive.

It's never been this way in the 20 years that i've been following GPUs. These new RTX GPUs are ridiculously expensive, way more than ever, and the prices will not be changing much at all when there's literally zero competition. The GPU space right now is worse than it's ever been before in history.
Amandtec - Friday, September 14, 2018 - link
I read somewhere that
8800GTX + inflation = 2080ti price
Without factoring in inflation the prices seem unprecedented.
Yojimbo - Saturday, September 15, 2018 - link
And you must factor in inflation, otherwise you are just pushing numbers around.
Yojimbo - Saturday, September 15, 2018 - link
And comparing the 2080 Ti to previous flagship launch cards is not really proper. The 2080 Ti is a different tier of card. The die size is so much larger than any previous launch GPU. It's just a demonstration of the increase in the amount of resources people are willing to devote to their GPUs, not an indication of an inflation of GPU prices.
eddman - Saturday, September 15, 2018 - link
2006 $600 at 2018 dollar value = $750
Samus - Saturday, September 15, 2018 - link
What inflation, exactly are you talking about. The dollar hasn't had a substantial change in valuation for 20 years (compared to other first-world currency.)

The USD inflation rate has averaged around 2.7%/year since 2000. That means one dollar in 2000 is now worth slightly less than $1.50 today. That means the top-of-the-line GPU released in 2000, I'd take a guess it was the Geforce2 GTS and/or the 3Dfx Voodoo5 5500, both cost $300.

For those who want to throw in cards like the Geforce 2 Ultra and the Voodoo5 6000, the former a card for nVidia to 'probe' the market for how much they could milk it going forward (and creating the situation we have today) and the other a card that never actually "launched"...we can include them for fun. The Ultra launched at $500 (even though it was slower than the Geforce 3 that launched 3 months later) and the Voodoo5 6000 had an MSRP set by 3Dfx at $500.

These were the most expensive gaming-focused GPU's ever made up until that date. Even SLI setups didn't cost $500 (the most expensive Voodoo2 card in the 90's was from Creative Labs @$229/ea - you needed two cards of course - so $460.)

Ok, so you have the absolute cream-of-the-crop cards in 2000 at $500, one was a marketing stunt, and the other never launched because nobody would have bought it. Realistically the most expensive cards were $300. But we will go with $500.

The most expensive high-end gaming focused cards now are $1000+

That would assume an inflation rate of over 5% annually, or the value of the dollar DOUBLING over 2 decades. Which it didn't come close to doing.

Stop using inflation as an excuse. It's bullshit. These companies are fucking greedy. Especially nVidia. They are effectively charging FOUR TIMES more than they used to for the same market segment card. 20 years ago you would have bought a TNT2 Ultra for $230 bucks and had the ultimate card available. Most people purchased entirely capable mainstream cards for $100-$150 like the TNT2 Pro or the Geforce2 MX400 that ran the most demanding games of the day like Counter Strike and Half-Life at 1024x768 in maximum detail.

http://www.in2013dollars.com/2000-dollars-in-2018?...
Yojimbo - Saturday, September 15, 2018 - link
"What inflation, exactly are you talking about."

CPI. Consumer Price Index. Even though inflation has been low for quite a while, $649 in 2013 is $697 today. That's almost $50 more, and it's enough to make up the difference between the 2013 launch price of the GTX 780 and the 2018 launch price of the RTX 2080.

I'm not sure why you are talking about cards from 20+ years ago. It's not relevant to my reply. In any case, those cards were completely different. The die sizes were much smaller and the cards were much less capable. They did a lot less of the work, as much of it was done on the CPU. The CPU was much more important to the game performance than today, as was the RAM and other components that were worth spending money on to significantly improve the gaming performance/experience.

"Stop using inflation as an excuse."

I'm not using inflation as an excuse. I'm using inflation as a tool to accurately compare the prices of cards from different years. And doing so clearly shows that the claim that the OP made is wrong. My reply had nothing to do with whether cards were in general cheaper 20 years ago or not. It was in response to "These new RTX GPUs are ridiculously expensive, way more than ever". That's provably untrue. Why are you replying to me and arguing about some entirely different point I wasn't ever talking about?

The NVIDIA Turing GPU Architecture Deep Dive: Prelude to GeForce RTX

Feeding the Beast (2018): GDDR6 & Memory Compression

Turing: Memory Compression Iterated

Post Your Comment

111 Comments

View All Comments

Spunjji - Monday, September 17, 2018 - link

siberian3 - Friday, September 14, 2018 - link

V900 - Friday, September 14, 2018 - link

B3an - Friday, September 14, 2018 - link

Amandtec - Friday, September 14, 2018 - link

Yojimbo - Saturday, September 15, 2018 - link

Yojimbo - Saturday, September 15, 2018 - link

eddman - Saturday, September 15, 2018 - link

Samus - Saturday, September 15, 2018 - link

Yojimbo - Saturday, September 15, 2018 - link

Log in

Don't have an account? Sign up now