Designing GP104: Running Up the Clocks

So if GP104’s per-unit throughput is identical to GM204, and the SM count has only been increased from 2048 to 2560 (25%), then what makes GTX 1080 60-70% faster than GTX 980? The answer there is that instead of vastly increasing the number of functional units for GP104 or increasing per-unit throughput, NVIDIA has instead opted to significantly raise the GPU clockspeed. And this in turn goes back to the earlier discussion on TSMC’s 16nm FinFET process.

With every advancement in fab technology, chip designers have been able to increase their clockspeeds thanks to the basic physics at play. However because TSMC’s 16nm node adds FinFETs for the first time, it’s extra special. What’s happening here is a confluence of multiple factors, but at the most basic level the introduction of FinFETs means that the entire voltage/frequency curve gets shifted. The reduced leakage and overall “stronger” FinFET transistors can run at higher clockspeeds at lower voltages, allowing for higher overall clockspeeds at the same (or similar) power consumption. We see this effect to some degree with every node shift, but it’s especially potent when making the shift from planar to FinFET, as has been the case for the jump from 28nm to 16nm.

Given the already significant one-off benefits of such a large jump in the voltage/frequency curve, for Pascal NVIDIA has decided to fully embrace the idea and run up the clocks as much as is reasonably possible. At an architectural level this meant going through the design to identify bottlenecks in the critical paths – logic sections that couldn’t run at as high a frequency as NVIDIA would have liked – and reworking them to operate at higher frequencies. As GPUs typically (and still are) relatively low clocked, there’s not as much of a need to optimize critical paths in this matter, but with NVIDIA’s loftier clockspeed goals for Pascal, this changed things.

From an implementation point of view this isn’t the first time that NVIDIA has pushed for high clockspeeds, as most recently the 40nm Fermi architecture incorporated a double-pumped shader clock. However this is the first time NVIDIA has attempted something similar since they reined in their power consumption with Kepler (and later Maxwell). Having learned their lesson the hard way with Fermi, I’m told a lot more care went into matters with Pascal in order to avoid the power penalties NVIDIA paid with Fermi, exemplified by things such as only adding flip-flops where truly necessary.

Meanwhile when it comes to the architectural impact of designing for high clockspeeds, the results seem minimal. While NVIDIA does not divulge full information on the pipeline of a CUDA core, all of the testing I’ve run indicates that the latency (in clock cycles) of the CUDA cores is identical to Maxwell. Which goes hand in hand with earlier observations about throughput. So although optimizations were made to the architecture to improve clockspeeds, it doesn’t look like NVIDIA has made any more extreme optimizations (e.g. pipeline lengthening) that detectably reduces Pascal’s per-clock performance.

Beyond3D Suite - Estimated MADD Latency

Finally, more broadly speaking, while this is essentially a one-time trick for NVIDIA, it’s an interesting route for them to go. By cranking up their clockspeeds in this fashion, they avoid any real scale-out issues, at least for the time being. Although graphics are the traditional embarrassingly parallel problem, even a graphical workload is subject to some degree of diminishing returns as GPUs scale farther out. A larger number of SMs is more difficult to fill, not every aspect of the rendering process is massively parallel (shadow maps being a good example), and ever-increasing pixel shader lengths compound the problem. Admittedly NVIDIA’s not seeing significant scale-out issues quite yet, but this is why GTX 980 isn’t quite twice as fast as GTX 960, for example.

Just increasing the clockspeed, comparatively speaking, means that the entire GPU gets proportionally faster without shifting the resource balance; the CUDA cores are 43% faster, the geometry frontends are 43% faster, the ROPs are 43% faster, etc. The only real limitation in this regard isn’t the GPU itself, but whether you can adequately feed it. And this is where GDDR5X comes into play.

FP16 Throughput on GP104: Good for Compatibility (and Not Much Else) Feeding Pascal: GDDR5X
Comments Locked

200 Comments

View All Comments

  • eddman - Wednesday, July 20, 2016 - link

    That puts a lid on the comments that Pascal is basically a Maxwell die-shrink. It's obviously based on Maxwell but the addition of dynamic load balancing and preemption clearly elevates it to a higher level.

    Still, seeing that using async with Pascal doesn't seem to be as effective as GCN, the question is how much of a role will it play in DX12 games in the next 2 years. Obviously async isn't be-all and end-all when it comes to performance but can Pascal keep up as a whole going forward or not.

    I suppose we won't know until more DX12 are out that are also optimized properly for Pascal.
  • javishd - Wednesday, July 20, 2016 - link

    Overwatch is extremely popular right now, it deserves to be a staple in gaming benchmarks.
  • jardows2 - Wednesday, July 20, 2016 - link

    Except that it really is designed as an e-sport style game, and can run very well with low-end hardware, so isn't really needed for reviewing flagship cards. In other words, if your primary desire is to find a card that will run Overwatch well, you won't be looking at spending $200-$700 for the new video cards coming out.
  • Ryan Smith - Wednesday, July 20, 2016 - link

    And this is why I really wish Overwatch was more demanding on GPUs. I'd love to use it and DOTA 2, but 100fps at 4K doesn't tell us much of use about the architecture of these high-end cards.
  • Scali - Wednesday, July 20, 2016 - link

    Thanks for the excellent write-up, Ryan!
    Especially the parts on asynchronous compute and pre-emption were very thorough.
    A lot of nonsense was being spread about nVidia's alleged inability to do async compute in DX12, especially after Time Spy was released, and actually showed gains from using multiple queues.
    Your article answers all the criticism, and proves the nay-sayers wrong.
    Some of them went so far in their claims that they said nVidia could not even do graphics and compute at the same time. Even Maxwell v2 could do that.
    I would say you have written the definitive article on this matter.
  • The_Assimilator - Wednesday, July 20, 2016 - link

    Sadly that won't stop the clueless AMD fanboys from continuing to harp on that NVIDIA "doesn't have async compute" or that it "doesn't work". You've gotta feel for them though, NVIDIA's poor performance in a single tech demo... written with assistance from AMD... is really all the red camp has to go on. Because they sure as hell can't compete in terms of performance, or power usage, or cooler design, or adhering to electrical specifications...
  • tipoo - Wednesday, July 20, 2016 - link

    Pretty sure critique was of Maxwell. Pascals async was widely advertised. It's them saying "don't worry, Maxwell can do it" to questions about it not having it, and then when Pascal is released, saying "oh yeah, performance would have tanked with it on Maxwell", that bugs people as it should
  • Scali - Wednesday, July 20, 2016 - link

    Nope, a lot of critique on Time Spy was specifically *because* Pascal got gains from the async render path. People said nVidia couldn't do it, so FutureMark must be cheating/bribed.
  • darkchazz - Thursday, July 21, 2016 - link

    It won't matter much though because they won't read anything in this article or Futuremark's statement on Async use in Time Spy.
    And they will keep linking some forum posts that claim nvidia does not support Async Compute.

    Nothing will change their minds that it is a rigged benchmark and the developers got bribed by nvidia.
  • Scali - Friday, July 22, 2016 - link

    Yea, not even this official AMD post will: http://radeon.com/radeon-wins-3dmark-dx12/

Log in

Don't have an account? Sign up now