Designing GP104: Running Up the Clocks

So if GP104’s per-unit throughput is identical to GM204, and the CUDA core count has only been increased from 2048 to 2560 (25%), then what makes GTX 1080 60-70% faster than GTX 980? The answer is that rather than vastly increasing the number of functional units for GP104 or increasing per-unit throughput, NVIDIA has opted to significantly raise the GPU clockspeed. And this in turn goes back to the earlier discussion on TSMC’s 16nm FinFET process.

With every advancement in fab technology, chip designers have been able to increase their clockspeeds thanks to the basic physics at play. However because TSMC’s 16nm node adds FinFETs for the first time, it’s extra special. What’s happening here is a confluence of multiple factors, but at the most basic level the introduction of FinFETs means that the entire voltage/frequency curve gets shifted. The reduced leakage and overall “stronger” FinFET transistors can run at higher clockspeeds at lower voltages, allowing for higher overall clockspeeds at the same (or similar) power consumption. We see this effect to some degree with every node shift, but it’s especially potent when making the shift from planar to FinFET, as has been the case for the jump from 28nm to 16nm.
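To make the voltage/frequency trade-off concrete, here is a minimal sketch using the textbook CMOS dynamic power approximation, P = C × V² × f. The operating points below are hypothetical numbers picked purely for illustration, not measured GM204 or GP104 values, and the model ignores leakage and any change in switched capacitance between nodes.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Textbook CMOS dynamic power approximation: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical operating points (illustrative only, not vendor data).
PLANAR_VOLTAGE, PLANAR_CLOCK = 1.20, 1.2e9   # volts, hertz
FINFET_VOLTAGE, FINFET_CLOCK = 1.00, 1.7e9   # volts, hertz

# Capacitance is normalized to 1.0 for both cases (an assumption).
planar = dynamic_power(1.0, PLANAR_VOLTAGE, PLANAR_CLOCK)
finfet = dynamic_power(1.0, FINFET_VOLTAGE, FINFET_CLOCK)

power_ratio = finfet / planar
clock_ratio = FINFET_CLOCK / PLANAR_CLOCK

print(f"clock ratio: {clock_ratio:.2f}x")
print(f"dynamic power ratio: {power_ratio:.2f}x")
```

Because power scales with the square of voltage but only linearly with frequency, a modest voltage reduction can pay for a large clockspeed increase in this simplified model, which is the shape of the shift described above.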

Given the already significant one-off benefits of such a large jump in the voltage/frequency curve, for Pascal NVIDIA has decided to fully embrace the idea and run up the clocks as much as is reasonably possible. At an architectural level this meant going through the design to identify bottlenecks in the critical paths – logic sections that couldn’t run at as high a frequency as NVIDIA would have liked – and reworking them to operate at higher frequencies. As GPUs have typically been (and still are) relatively low clocked, there hasn’t been as much of a need to optimize critical paths in this manner, but NVIDIA’s loftier clockspeed goals for Pascal changed things.

From an implementation point of view this isn’t the first time that NVIDIA has pushed for high clockspeeds, as most recently the 40nm Fermi architecture incorporated a double-pumped shader clock. However this is the first time NVIDIA has attempted something similar since they reined in their power consumption with Kepler (and later Maxwell). Having learned their lesson the hard way with Fermi, I’m told a lot more care went into matters with Pascal in order to avoid the power penalties NVIDIA paid with Fermi, exemplified by things such as only adding flip-flops where truly necessary.

Meanwhile when it comes to the architectural impact of designing for high clockspeeds, the results seem minimal. While NVIDIA does not divulge full information on the pipeline of a CUDA core, all of the testing I’ve run indicates that the latency (in clock cycles) of the CUDA cores is identical to Maxwell, which goes hand in hand with earlier observations about throughput. So although optimizations were made to the architecture to improve clockspeeds, it doesn’t look like NVIDIA has made any more extreme optimizations (e.g. pipeline lengthening) that detectably reduce Pascal’s per-clock performance.

Beyond3D Suite - Estimated MADD Latency

Finally, more broadly speaking, while this is essentially a one-time trick for NVIDIA, it’s an interesting route for them to go. By cranking up their clockspeeds in this fashion, they avoid any real scale-out issues, at least for the time being. Although graphics are the traditional embarrassingly parallel problem, even a graphical workload is subject to some degree of diminishing returns as GPUs scale farther out. A larger number of SMs is more difficult to fill, not every aspect of the rendering process is massively parallel (shadow maps being a good example), and ever-increasing pixel shader lengths compound the problem. Admittedly NVIDIA’s not seeing significant scale-out issues quite yet, but this is why GTX 980 isn’t quite twice as fast as GTX 960, for example.

Just increasing the clockspeed, comparatively speaking, means that the entire GPU gets proportionally faster without shifting the resource balance; the CUDA cores are 43% faster, the geometry frontends are 43% faster, the ROPs are 43% faster, etc. The only real limitation in this regard isn’t the GPU itself, but whether you can adequately feed it. And this is where GDDR5X comes into play.
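The difference between scaling out and scaling up can be sketched with a little arithmetic. The example below takes the 25% core increase and 43% clockspeed increase from the text, and uses Amdahl's law as a stand-in model for scale-out diminishing returns. The parallel fraction of 0.95 is a hypothetical value chosen only for illustration, and the model assumes the GPU is never starved for memory bandwidth.

```python
def scale_out_speedup(parallel_fraction, unit_ratio):
    """Amdahl's law: only the parallel fraction benefits from more units."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / unit_ratio)


def clock_speedup(clock_ratio):
    """A uniform clock increase speeds up every stage by the same factor."""
    return clock_ratio


CORE_RATIO = 2560 / 2048   # from the text: 25% more CUDA cores
CLOCK_RATIO = 1.43         # from the text: 43% higher clockspeed
PARALLEL_FRACTION = 0.95   # hypothetical, for illustration only

# Upper bound if both gains scaled perfectly and multiplied together.
ideal = CORE_RATIO * CLOCK_RATIO

# Same 43% budget spent two ways: more units versus higher clocks.
wider_only = scale_out_speedup(PARALLEL_FRACTION, CLOCK_RATIO)
faster_only = clock_speedup(CLOCK_RATIO)

print(f"ideal combined speedup: {ideal:.4f}x")
print(f"43% more units:         {wider_only:.2f}x")
print(f"43% higher clock:       {faster_only:.2f}x")
```

Under these assumptions the ideal combined figure works out to 1.7875x, a ceiling that sits somewhat above the 60-70% real-world gain cited earlier. The simple model also shows why a clock increase is the cleaner lever: it lifts the serial portions of the work along with the parallel ones.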


200 Comments

  • jcardel - Wednesday, July 27, 2016 - link

    This is exactly the same situation as me. I got a 770 sitting in my rig, and am looking hard at the 1070, maybe soon. Although my 770 is still up to the task in most games, I really play only Blizzard games these days and they are not hard on your hardware.

    My biggest issue is really that it is rather noisy, so I will be looking for a solution with the lowest dB.

    Great article, it was totally worth waiting for.. I only read this sort of stuff here so have been waiting till now for any 1080 review.

    Thanks!
  • D. Lister - Thursday, July 21, 2016 - link

    Nice job, Ryan. Good comeback. Keep it up.
  • Saeid92 - Thursday, July 21, 2016 - link

    What is 99th percentile framerate?
  • Ryan Smith - Thursday, July 21, 2016 - link

    If you sorted the framerate from highest to lowest, this would be the framerate of the slowest 1%. It's basically a more accurate/meaningful metric for minimum frame rates.
  • Eris_Floralia - Thursday, July 21, 2016 - link

    This is why I love Anandtech. Deep in reviews. Well I even wanted to be one of your editors if you have the plan to create a Chinese translated version of these reviews.
  • daku123 - Thursday, July 21, 2016 - link

    Typo on FP16 Throughput page. In second paragraph, it should be Tegra X1 (not Tesla X1?).
  • Ryan Smith - Thursday, July 21, 2016 - link

    Eyup. Thanks!
  • Badelhas - Thursday, July 21, 2016 - link

    Great detailed review, as always. But I have to ask once again:
    why didn't you do some kind of VR Benchmarks? That's what drives my choices now, to be honest.

    Cheers
  • Ranger1065 - Thursday, July 21, 2016 - link

    After over 2 months of reading GTX1080 reviews I felt a distinct lack of excitement
    as I read Anandtech kicking off their review of the finfet generation. Could it
    prove to be anything but an anticlimax?

    Sadly and unsurprisingly...NOT.

    It was however amusing to see the faithful positively gushing praises for Anandtech
    now that the "greatly anticipated" review is finally out.

    Yes folks, 20 or so pages of (well written) information, mostly already covered by other tech sites,
    finally published, it's as if a magic wand has been waved, the information has been presented with
    that special Anandtech sauce, new insights have been illuminated and all is well in Anandtechland again.

    (AT LEAST UNTIL THE NEXT 2 MONTH DELAY.) LOL.

    I do like the way Anandtech presents the FPS charts.

    Back to sleep now Anandtech :)
  • mkaibear - Thursday, July 21, 2016 - link

    You've hit the nail on the head here Ranger.

    The info which is included within the article is indeed mostly already covered by other tech sites.

    Emphasis on the "mostly" and the plural "sites".

    Those of us who have jobs which keep us busy and have an interest in this sort of thing often don't have the time to trawl round many different sites to get reviews and pertinent technical data so we rely upon those sites which we trust to produce in-depth articles, even if they take a bit longer.

    As an IT Manager for (most recently) a manufacturing firm and then a school, I don't care about bleeding edge, get the new stuff as soon as it comes out, I care about getting the right stuff, and a two month delay to get a proper review is absolutely fine. If I need quick benchmarks I'll use someone like Hexus or HardOCP, but getting a deep dive into the architecture so I can justify purchases to the Art and Media departments, or the programmers, is essential. You don't get that anywhere else.
