The Kepler Architecture: Fermi Distilled

As GPU complexity has increased over the years, so has the amount of time it takes to design a GPU. Definitions of what constitutes planning vary, but for modern GPUs such as Fermi, Kepler, and Tahiti, planning starts 4+ years ahead of time. At this point GPUs have development periods as long as those of CPUs, and that alone comes with some ramifications.

The biggest ramification of this style of planning is that because designing a GPU is such a big undertaking, it’s not something you want to do frequently. In the CPU space Intel has its tick-tock strategy, which delivers an incremental architecture update every 2 years. While neither NVIDIA nor AMD has something quite like that in the GPU space – new architectures and new process nodes still tend to premiere together – there is a similar need to spread architectures out over multiple years.

For NVIDIA, Kepler is the embodiment of that concept. Kepler brings with it some very important architectural changes compared to Fermi, but at the same time it’s still undeniably Fermi. At a high level Kepler is organized just like Fermi: it’s still built out of CUDA cores, SMs, and GPCs, and how warps are executed has not significantly changed. Nor, for that matter, has the rendering side significantly changed; rendering is still handled in a distributed fashion through raster engines, PolyMorph Engines, and of course the ROPs. The fact that NVIDIA has chosen to draw up Kepler like Fermi is no accident or coincidence; at the end of the day Kepler is the next generation of Fermi, tweaked and distilled to improve on Fermi’s strengths while correcting its weaknesses.

For our look at Kepler’s architecture, we’re going to be primarily comparing it to GF114, aka Fermi Lite. As you may recall, with Fermi NVIDIA had two designs: a multipurpose architecture for gaming and graphics (GF100/GF110), and a streamlined architecture built with a stronger emphasis on graphics than compute (GF104, etc) that was best suited for use in consumer graphics. As hinted at by the name alone, GK104 is designed to fill the same consumer graphics role as GF114, and consequently NVIDIA built GK104 off of GF114 rather than GF110.

So what does GK104 bring to the table? Perhaps it’s best to start with the SMs, as that’s where most of the major changes have happened.

In GF114 each SM contained 48 CUDA cores, organized into 3 groups of 16. Joining those 3 groups of CUDA cores were 16 load/store units, 16 interpolation SFUs, 8 special function SFUs, and 8 texture units. Feeding all of those blocks was a pair of warp schedulers, each of which could issue up to 2 instructions per core clock cycle, for a total of up to 4 instructions issued per cycle.

GF104/GF114 SM Functional Units

  • 16 CUDA cores (#1)
  • 16 CUDA cores (#2)
  • 16 CUDA cores, FP64 capable (#3)
  • 16 Load/Store Units
  • 16 Interpolation SFUs (not on NVIDIA's diagrams)
  • 8 Special Function SFUs
  • 8 Texture Units
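
To put rough numbers on that layout, here is a minimal Python sketch tallying the GF114 SM's execution resources; the counts come straight from the list above, while the dictionary keys are ad hoc names of our own.

```python
# GF114 SM functional units, per the list above.
gf114_sm_units = {
    "cuda_cores": 3 * 16,   # 3 blocks of 16; one block is FP64 capable
    "load_store": 16,
    "interp_sfu": 16,
    "special_sfu": 8,
    "texture": 8,
}

warp_schedulers = 2
issue_per_scheduler = 2      # superscalar dual-issue per scheduler

print(f"CUDA cores per SM: {gf114_sm_units['cuda_cores']}")                # 48
print(f"Peak issue: {warp_schedulers * issue_per_scheduler} instr/clock")  # 4
```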

Within the SM itself different units operated on different clocks, with the schedulers and texture units operating on the core clock, while the CUDA cores, load/store units, and SFUs operated on the shader clock, which ran at twice the core clock. As NVIDIA’s warp size is 32 threads, if you do the quick math you realize that warps are twice as large as any block of functional units, which is where the shader clock comes in. With Fermi, a warp would be split up and executed over 2 cycles of the shader clock; 16 threads would go first, and then the other 16 threads over the next clock. The shader clock is what allowed NVIDIA to execute a full warp over a single core clock cycle while only using enough hardware for half of a warp.
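
As an illustration of that math, here's a minimal sketch (a simplified timing model, not a cycle-accurate description of the hardware) showing why a 16-wide block at a doubled shader clock and a 32-wide block at the core clock both retire one warp per core clock:

```python
WARP_SIZE = 32

# Fermi: 16-wide execution block, shader clock = 2x core clock.
# Kepler: 32-wide execution block, everything on the core clock.
designs = [("Fermi", 16, 2), ("Kepler", 32, 1)]

for name, lanes, clock_ratio in designs:
    threads_per_core_clock = lanes * clock_ratio
    warps_per_core_clock = threads_per_core_clock / WARP_SIZE
    print(f"{name}: {threads_per_core_clock} threads/core clock "
          f"= {warps_per_core_clock:.0f} warp(s)")
```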

So how does GK104 change this? The single most important aspect of GK104, the thing that in turn dictates the design of everything else, is that NVIDIA has dropped the shader clock. Now the entire chip, from ROP to CUDA core, runs on the same core clock. As a consequence, rather than executing two half-warps in quick succession GK104 is built to execute a whole warp at once, and GK104’s hardware has changed dramatically as a result.

Because NVIDIA has essentially traded a smaller number of higher clocked units for a larger number of lower clocked units, NVIDIA had to go in and double the size of each functional unit inside their SM. Whereas a block of 16 CUDA cores would do when there was a shader clock, now a full 32 CUDA cores are necessary. The same is true for the load/store units and the special function units, all of which have been doubled in size to adjust for the lack of a shader clock. Consequently, this is why we can’t just compare the CUDA core counts of GK104 and GF114 and call GK104 4 times as powerful; half of that additional hardware is just making up for the lack of a shader clock.

But of course NVIDIA didn’t stop there, as swapping out the shader clock for larger functional units only gives us the same throughput in the end. After doubling the size of the functional units in an SM, NVIDIA then doubled the number of functional units in each SM in order to grow the performance of the SM itself. 3 groups of CUDA cores became 6 groups of CUDA cores, 1 group of load/store units became 2, 8 texture units became 16, etc. At the same time, with twice as many functional units NVIDIA also doubled the other execution resources, with 2 warp schedulers becoming 4 warp schedulers, and the register file doubling from 32K entries to 64K entries.

Ultimately, while the doubling of the size of the functional units is what allowed NVIDIA to drop the shader clock, it’s the second doubling of resources that makes GK104 much more powerful than GF114. The SMX is in nearly every significant way twice as powerful as a GF114 SM. At the end of the day NVIDIA already had a strong architecture in Fermi, so with Kepler they’ve gone and done the most logical thing to improve their performance: they’ve simply doubled Fermi.
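
To make "doubled Fermi" concrete, here's a back-of-the-envelope peak FP32 throughput comparison. The clocks below are the reference shipping clocks of the flagship cards (1644MHz shader on the GTX 560 Ti, 1006MHz base on the GTX 680), and each FMA counts as 2 floating point operations; treat it as a sketch rather than a benchmark.

```python
def peak_fp32_gflops(cuda_cores: int, alu_clock_mhz: float) -> float:
    """Peak FP32 rate: cores x 2 ops (FMA) x ALU clock."""
    return cuda_cores * 2 * alu_clock_mhz / 1000.0

gf114 = peak_fp32_gflops(384, 1644)    # GTX 560 Ti, shader clock: ~1263 GFLOPS
gk104 = peak_fp32_gflops(1536, 1006)   # GTX 680, core clock: ~3090 GFLOPS

print(f"GF114: {gf114:.0f} GFLOPS")
print(f"GK104: {gk104:.0f} GFLOPS ({gk104 / gf114:.2f}x)")
```

The ratio comes out above 2x only because the GTX 680's core clock is also higher than the GTX 560 Ti's core clock; normalized per core clock, the SMX itself is the 2x part.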

Altogether the SMX now has 15 functional unit blocks that the warp schedulers can call on. Each of the 4 schedulers in turn can issue up to 2 instructions per clock if there’s ILP to be extracted from its respective warp, allowing the schedulers as a whole to issue instructions to up to 8 of the 15 functional unit blocks in any given clock cycle; a toy sketch of these issue limits follows the list below.

GK104 SMX Functional Units

  • 32 CUDA cores (#1)
  • 32 CUDA cores (#2)
  • 32 CUDA cores (#3)
  • 32 CUDA cores (#4)
  • 32 CUDA cores (#5)
  • 32 CUDA cores (#6)
  • 16 Load/Store Units (#1)
  • 16 Load/Store Units (#2)
  • 16 Interpolation SFUs (#1)
  • 16 Interpolation SFUs (#2)
  • 16 Special Function SFUs (#1)
  • 16 Special Function SFUs (#2)
  • 8 Texture Units (#1)
  • 8 Texture Units (#2)
  • 8 CUDA FP64 cores
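
Here's a toy Python model of those issue limits, assuming (as a simplification) that each scheduler always has a ready warp and dual-issues only when that warp exposes an independent second instruction:

```python
SCHEDULERS = 4
FUNCTIONAL_UNIT_BLOCKS = 15   # the 15 blocks listed above

def instructions_issued(warps_with_ilp: int) -> int:
    """Toy model: every scheduler issues 1 instruction; schedulers whose
    warp exposes ILP issue a 2nd, independent instruction as well."""
    assert 0 <= warps_with_ilp <= SCHEDULERS
    return SCHEDULERS + warps_with_ilp

print(instructions_issued(0))   # 4: no ILP, single-issue only
print(instructions_issued(4))   # 8: full dual-issue -> 8 of the 15 blocks fed
```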

While that covers the operation of the SMX in a nutshell, there are a couple of other things relating to the SMX that need to be touched upon. There is still only a single PolyMorph Engine per SMX, and since GK104 has the same number of SMXes as GF114 had SMs, the number of PolyMorph Engines hasn’t been doubled like most of the other hardware. Instead the capabilities of the PolyMorph Engine itself have been doubled, making each PolyMorph Engine 2.0 twice as powerful as a GF114 PolyMorph Engine. In absolute terms, this means each PolyMorph Engine can now spit out a polygon in 2 cycles, versus 4 cycles on GF114, for a total of 4 polygons/clock across GK104.
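
The polygon throughput arithmetic works out as follows; a quick sketch, assuming (per the above) 8 SM(X)es with one PolyMorph Engine apiece on both chips:

```python
POLYMORPH_ENGINES = 8   # one per SM(X) on both GF114 and GK104

gk104_polys_per_clock = POLYMORPH_ENGINES / 2   # 1 polygon per 2 cycles
gf114_polys_per_clock = POLYMORPH_ENGINES / 4   # 1 polygon per 4 cycles

print(gk104_polys_per_clock, gf114_polys_per_clock)  # 4.0 2.0
```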

The other change coming from GF114 involves the mysterious 15th block, the CUDA FP64 block. In order to conserve die space while still offering FP64 capabilities on GF114, NVIDIA made only one of the three CUDA core blocks FP64 capable. That block executed FP64 instructions at ¼ the FP32 rate, which gave the SM a total FP64 throughput of 1/12th its FP32 throughput. In GK104 none of the regular CUDA core blocks are FP64 capable; in their place we have what we’re calling the CUDA FP64 block.

The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and do not appear in any of NVIDIA’s diagrams. These CUDA cores can only execute FP64 math, and are used for nothing else. What's more, the CUDA FP64 block has a very special execution rate: 1/1 FP32. With only 8 CUDA cores in this block it takes NVIDIA 4 cycles to execute a whole warp, but each quarter of the warp is executed at full speed as opposed to the ½, ¼, or other fractional speeds that previous architectures have operated at. Altogether GK104’s FP64 performance is very low at only 1/24 FP32 (1/6 * ¼), but the mere existence of the CUDA FP64 block is quite interesting because it’s the very first time we’ve seen a 1/1 FP32 execution rate. Big Kepler may not end up resembling GK104, but if it’s built out of CUDA FP64 blocks it could be an extremely potent FP64 processor.
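
Both FP64 ratios fall straight out of the core counts given above, as this quick sketch shows:

```python
# GF114 SM: one 16-core block runs FP64 at 1/4 its FP32 rate.
fp64_per_clock = 16 * 0.25                           # 4 FP64 ops/clock
print(f"GF114: 1/{48 / fp64_per_clock:.0f} FP32")    # 48 FP32 lanes -> 1/12

# GK104 SMX: a dedicated 8-core block runs FP64 at the full 1/1 rate.
fp64_per_clock = 8 * 1.0                             # 8 FP64 ops/clock
print(f"GK104: 1/{192 / fp64_per_clock:.0f} FP32")   # 192 FP32 lanes -> 1/24
```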

Moving on, now that we’ve dissected GK104’s SMX, let’s take a look at the bigger picture. Above the SMXes we have the GPCs. Each GPC contains multiple SMXes and, more importantly, a Raster Engine responsible for rasterization. As with the many other things being doubled in GK104, the number of GPCs has been doubled from 2 to 4, thereby doubling the number of Raster Engines present. This in turn changes the GPC-to-SM(X) ratio from 1:4 on GF114 to 1:2 on GK104. The ratio itself is not particularly important, but it’s worth noting that it does take more work for NVIDIA to lay down and connect 4 GPCs than it does 2.

Last, but certainly not least is the complete picture. The 4 GPCs are combined with the rest of the hardware that makes GK104 tick, including the ROPs, the memory interfaces, and the PCIe interface. The ROPs themselves are virtually identical to those found on GF114; theoretical performance is the same, and at a lower level the only notable changes are an incremental increase in compression efficiency and improved polygon merging. Similarly, the organization of the ROPs has not changed: they are still broken up into 4 blocks of 8, with each ROP block tied to a 64-bit memory controller and 128KB of L2 cache. Altogether there are 32 ROPs, giving us 512KB of L2 cache and a 256-bit memory bus.
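
Putting the memory subsystem numbers together is simple arithmetic; a quick sketch, where the 6GHz data rate is the GTX 680's shipping memory speed discussed below:

```python
PARTITIONS = 4              # ROP/memory controller partitions
ROPS_PER_PARTITION = 8
MC_WIDTH_BITS = 64          # per-partition memory controller width
L2_PER_PARTITION_KB = 128
GDDR5_DATA_RATE_GHZ = 6.0   # effective data rate, see below

print(f"ROPs: {PARTITIONS * ROPS_PER_PARTITION}")            # 32
print(f"Bus width: {PARTITIONS * MC_WIDTH_BITS}-bit")        # 256-bit
print(f"L2 cache: {PARTITIONS * L2_PER_PARTITION_KB}KB")     # 512KB
bandwidth = PARTITIONS * MC_WIDTH_BITS * GDDR5_DATA_RATE_GHZ / 8
print(f"Peak bandwidth: {bandwidth:.0f} GB/s")               # 192 GB/s
```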

On a final tangent, the memory controllers ended up being an unexpected achievement for NVIDIA. As you may recall, Fermi’s memory controllers simply never reached their intended targets – NVIDIA hasn’t told us what those targets were, but ultimately Fermi was meant to reach memory clocks higher than 4GHz. With GK104 NVIDIA has improved their memory controllers and then some. GK104’s memory controllers can clock some 50% higher than GF114’s, leading to GTX 680 shipping at 6GHz.

On a technical level, getting to 6GHz is hard, really hard. GDDR5 RAM can reach 7GHz and beyond on the right voltage, but memory controllers and the memory bus are another story. As we have mentioned a couple of times before, the memory bus tends to give out long before anything else does, which is what’s keeping actual shipping memory speeds well below 7GHz. With GK104 NVIDIA’s engineers managed to put together a chip and board that are good enough to run at 6GHz, and this alone is quite remarkable given how long GDDR5 has dogged NVIDIA and AMD.


GK104 GDDR5 Signal Analysis

Perhaps the icing on the cake for NVIDIA though is how many revisions it took them to get to 6GHz: one. NVIDIA was able to hit 6GHz on the very first revision of GK104, which after Fermi’s lackluster memory performance is a remarkable turn of events. And ultimately while NVIDIA says that they’re most proud of the end result of GK104, the fact of the matter is that everyone seems just a bit prouder of the memory controller, and for good reason.

Comments

  • Slayer68 - Saturday, March 24, 2012 - link

    Being able to run 3 screens off of one card is new for Nvidia. Barely even mentioned it in your review. It would be nice to see Nvidia surround / Eyefinity compared on these new cards. Especially interested in scaling at 5760 x 1080 between a 680 and 7970.....
  • ati666 - Saturday, March 24, 2012 - link

    does the gtx680 still have the same anisotropic filtering pattern as the gtx470/480/570/580 (octagonal pattern), or is it like AMD's HD7970 with all-angle-independent anisotropic filtering (circular pattern)?
  • Ryan Smith - Saturday, March 24, 2012 - link

    It's not something we were planning on publishing, but it is something we checked. It's still the same octagon pattern as Fermi. It would be nice if NVIDIA did have angle-independent AF, but to be honest the difference between that and what NVIDIA does has been so minor that it's not something we've ever been able to create a noticeable issue with in the real world.

    Now Intel's AF on the other hand...
  • ati666 - Saturday, March 24, 2012 - link

    thanks for the reply, now i can finally make a decision to buy the hd7970 or gtx680..
  • CeriseCogburn - Saturday, March 24, 2012 - link

    Yes I thank him too for finally coming clean and noting the angle independent amd algorithm he's been fanboy over for a long time has absolutely no real world gaming advantage whatsoever.
    It's a big fat zero of nothing but FUD for fanboys.
    It would be nice if notional advantages actually showed up in games, and when they don't or for the life of the reviewer cannot be detected in games, that be clearly stated and the insane "advantage" declared be called what it really is, a useless talking point of deception that fools purchasers instead of enlightening them.
    The biased emphasis with zero advantage is as unscientific as it gets. Worse yet, within the same area, the "perfectly round algorithm" yielded in-game transition lines with the amd cards, denied by the reviewer for what, a year? Then a race game finally convinced him, and in this 7000 series release we find another flaw apparently attached to the "perfectly round algorithm": a "poor transition resolution" - rather crudely large instead of fine like Nvidia's, which caused excessive amd shimmering in game, and we are treated to that information only now, after the 7000 series "solved" the issue and brought it near or up to the GTX long time standard.
    So this whole "perfectly round algorithm" has been nothing but fanboy lies for amd all along, while ignoring at least 2 large IQ issues when it was "put to use" in game. (transition shading and shimmering)
    I'm certain an explanation could be given that there are other factors with differing descriptive explanation, like the fineness of textural changes as one goes toward center of the image not directly affecting roundness one way or another, used as an excuse, perhaps the self deceptive justification that allowed such misbehavior to go on for so long.
  • _vor_ - Saturday, March 24, 2012 - link

    Will you seriously STFU already? It's hard to read this discussion with your blatant and belligerent jackassery all over it.

    You love NVIDIA. Great. Now STFU and stop posting.
  • CeriseCogburn - Saturday, March 24, 2012 - link

    Great attack, did I get anything wrong at all? I guess not.
  • silverblue - Monday, March 26, 2012 - link

    Could you provide a link to an article based on this subject, please? Not an attack; just curious.
  • CeriseCogburn - Tuesday, March 27, 2012 - link

    http://www.anandtech.com/show/5261/amd-radeon-hd-7...

    http://forums.anandtech.com/showpost.php?p=3152067...

    " So what then is going on that made Civ V so much faster for NVIDIA? Admittedly I had to press NVIDIA for this - performance practically doubled on high-end GPUs, which is unheard of. Until they told me what exactly they did, I wasn't convinced it was real or if they had come up with a really sweet cheat. It definitely wasn't a cheat.

    If you recall from our articles, I keep pointing to how we seem to be CPU limited at the time. "

    (YES, SO THAT'S WHAT WE GOT, THEY'RE CHEATING IT'S FAKE WE'RE CPU LIMITED- ALL WRONG ALL LIES)

    Since AMD’s latest changes are focused on reducing shimmering in motion we’ve put together a short video of the 3D Center Filter Tester running the tunnel test with the 7970, the 6970, and GTX 580. The tunnel test makes the differences between the 7970 and 6970 readily apparent, and at this point both the 7970 and GTX 580 have similarly low levels of shimmering.

    with both implementing DX9 SSAA with the previous generation of GPUs, and AMD catching up to NVIDIA by implementing Enhanced Quality AA (their version of NVIDIA’s CSAA) with Cayman. Between Fermi and Cayman the only stark differences are that AMD offers their global faux-AA MLAA filter, while NVIDIA has support for true transparency and super sample anti-aliasing on DX10+ games.

    (AMD FINALLY CATCHES UP IN EQAA PART, NVIDIA TRUE STANS AND SUPER SAMPLE HIGH Q STUFF, AMD CHEAT AND BLUR AND BLUR TEXT)

    Thus I had expected AMD to close the gap from their end with Southern Islands by implementing DX10+ versions of Adaptive AA and SSAA, but this has not come to pass.

    ( AS I INTERPRETED AMD IS WAY BEHIND STILL A GAP TO CLOSE ! )

    AMD has not implemented any new AA modes compared to Cayman, and as a result AAA and SSAA continue to only be available in DX9 titles.

    Finally, while AMD may be taking a break when it comes to anti-aliasing they’re still hard at work on tessellation

    ( BECAUSE THEY'RE BEHIND IN TESSELLATION TOO.)

    Don't forget amd has a tessellation cheat in their 7000 series driver, so 3dmark 11 is cheated on as is unigine heaven, while Nvidia does no such thing.

    ---
    I do have more like the race car game admission, but I think that's enough helping you doing homework .
  • CeriseCogburn - Tuesday, March 27, 2012 - link

    So here's more mr curious ..
    " “There’s nowhere left to go for quality beyond angle-independent filtering at the moment.”

    With the launch of the 5800 series last year, I had high praise for AMD’s anisotropic filtering. AMD brought truly angle-independent filtering to gaming (and are still the only game in town), putting an end to angle-dependent deficiencies and especially AMD’s poor AF on the 4800 series. At both the 5800 series launch and the GTX 480 launch, I’ve said that I’ve been unable to find a meaningful difference or deficiency in AMD’s filtering quality, and NVIDIA was only deficient by being not quite angle-independent. I have held – and continued to hold until last week – the opinion that there’s no practical difference between the two.

    It turns out I was wrong. Whoops.

    The same week as when I went down to Los Angeles for AMD’s 6800 series press event, a reader sent me a link to a couple of forum topics discussing AF quality. While I still think most of the differences are superficial, there was one shot comparing AMD and NVIDIA that caught my attention: Trackmania."

    " The shot clearly shows a transition between mipmaps on the road, something filtering is supposed to resolve. In this case it’s not a superficial difference; it’s very noticeable and very annoying.

    AMD appears to agree with everyone else. As it turns out their texture mapping units on the 5000 series really do have an issue with texture filtering, specifically when it comes to “noisy” textures with complex regular patterns. AMD’s texture filtering algorithm was stumbling here and not properly blending the transitions between the mipmaps of these textures, resulting in the kind of visible transitions that we saw in the above Trackmania screenshot. "

    http://www.anandtech.com/show/3987/amds-radeon-687...

    WE GET THIS AFTER 6000 SERIES AMD IS RELEASED, AND DENIAL UNTIL, NOW WE GET THE SAME THING ONCE 7000 SERIES IS RELEASED, AND COMPLETE DENIAL BEFORE THAT...

    HERE'S THE 600 SERIES COVERUP THAT COVERS UP 5000 SERIES AFTER ADMITTING THE PROBLEM A WHOLE GENERATION LATE
    " So for the 6800 series, AMD has refined their texture filtering algorithm to better handle this case. Highly regular textures are now filtered properly so that there’s no longer a visible transition between them. As was the case when AMD added angle-independent filtering we can’t test the performance impact of this since we don’t have the ability to enable/disable this new filtering algorithm, but it should be free or close to it. In any case it doesn’t compromise AMD’s existing filtering features, and goes hand-in-hand with their existing angle-independent filtering."

    NOW DON'T FORGET RYAN HAS JUST ADMITTED AMD ANGLE INDEPENDENT ALGORITHM IS WORTH NOTHING IN REAL GAME- ABSOLUTELY NOTHING.
