The Kepler Architecture: Efficiency & Scheduling

So far we’ve covered how NVIDIA has improved upon Fermi; now let’s talk about why.

As we mentioned briefly in our introduction, NVIDIA’s big push with Kepler is efficiency. Of course Kepler needs to be faster (it always needs to be faster), but at the same time the market is making a gradual shift towards higher efficiency products. On the desktop, GPUs have more or less reached their limits as far as total power consumption goes, while in the mobile space products such as Ultrabooks demand GPUs that can match the low power consumption and heat dissipation levels those devices were built around. And while NVIDIA’s GPUs haven’t been inefficient, strictly speaking, AMD has held an edge in performance per mm² for quite some time, so there’s clear room for improvement.

In keeping with that ideal, for Kepler NVIDIA has chosen to focus on ways they can improve Fermi’s efficiency. As Jonah Alben, NVIDIA's VP of GPU Engineering, puts it, “[we’ve] already built it, now let's build it better.”

There are numerous small changes in Kepler that reflect that goal, but of course the biggest change was the removal of the shader clock in favor of wider functional units that execute a whole warp over a single clock cycle. The rationale is actually rather straightforward: a shader clock made sense when clockspeeds were low and die space was at a premium, but with increasingly small fabrication processes this has flipped. As we have seen in the CPU space over the last decade, higher clockspeeds become increasingly expensive until you reach a point where they’re simply too expensive – a point where just distributing the clock takes a fair bit of power on its own, not to mention the difficulty and expense of building functional units that will operate at those speeds.

With Kepler the cost of having a shader clock has finally become too much, leading NVIDIA to make the shift to a single clock. By NVIDIA’s own numbers, Kepler’s design shift saves power even if NVIDIA has to operate functional units that are twice as large. 2 Kepler CUDA cores consume 90% of the power of a single Fermi CUDA core, while the reduction in power consumption for the clock itself is far more dramatic, with clock power consumption having been reduced by 50%.
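To put those figures in perspective, here is a quick back-of-the-envelope sketch. The 90% and 50% numbers are NVIDIA’s; the one-op-per-core-per-clock throughput model is our own simplification for illustration, not anything NVIDIA has published.

```python
# Back-of-the-envelope comparison of one Fermi CUDA core on the 2x shader
# clock versus two Kepler CUDA cores on the 1x clock, using the figures
# quoted above (90% power for the cores, 50% for clock distribution).
# The one-op-per-core-per-clock throughput model is a simplification.

fermi_core_power  = 1.00   # normalized: one Fermi core running at the 2x clock
kepler_pair_power = 0.90   # two Kepler cores at the 1x clock (NVIDIA's figure)

# Ops delivered per base clock: one Fermi core issues twice per base clock,
# two Kepler cores issue once each -- the same 2 ops either way.
fermi_ops_per_clock  = 2
kepler_ops_per_clock = 2

gain = (kepler_ops_per_clock / kepler_pair_power) / (fermi_ops_per_clock / fermi_core_power)
print(f"Execution-unit perf per watt: {gain:.2f}x Fermi")   # ~1.11x
print("Clock-distribution power:     ~50% of Fermi's")      # NVIDIA's figure
```

In other words, the execution units themselves come out roughly 10% more efficient, and the savings on clock distribution come on top of that.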

Of course as NVIDIA’s own slide clearly points out, this is a true tradeoff. NVIDIA gains on power efficiency, but they lose on area efficiency as 2 Kepler CUDA cores take up more space than a single Fermi CUDA core even though the individual Kepler CUDA cores are smaller. So how did NVIDIA pay for their new die size penalty?

Obviously 28nm plays a significant part in that, but even then the ideal density increase from moving to TSMC’s 28nm process is only around 2x; that alone isn’t enough to pack 1536 CUDA cores into less space than what previously held 384, a 4x increase in core count. As it turns out, not only did NVIDIA need to work on power efficiency to make Kepler work, but they also needed to work on area efficiency. There are a few small design choices that save space, such as using 8 SMXes instead of 16 smaller SMXes, but along with dropping the shader clock NVIDIA made one other change to improve both power and area efficiency: scheduling.
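For the curious, the rough scaling arithmetic behind that statement looks like the sketch below. The square-law area scaling is an idealized assumption (real designs scale worse than this), so it only illustrates how much of the density gain has to come from the design rather than the process.

```python
# Rough scaling arithmetic: how much of GK104's density has to come from the
# design rather than the 40nm -> 28nm shrink. Square-law area scaling is an
# optimistic, idealized assumption; real designs scale worse than this.

old_node, new_node   = 40.0, 28.0     # nm
cores_old, cores_new = 384, 1536      # GF114 -> GK104 CUDA cores

ideal_area_scale = (new_node / old_node) ** 2     # ~0.49: about 2x density
core_count_scale = cores_new / cores_old          # 4x the CUDA cores

# Additional per-core density the design itself must deliver to fit
# 4x the cores into no more area than before.
design_factor = core_count_scale * ideal_area_scale
print(f"Ideal density from the shrink alone: {1 / ideal_area_scale:.1f}x")  # ~2.0x
print(f"Density still needed from the design: {design_factor:.1f}x")        # ~2.0x
```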

GF114, owing to its heritage as a compute GPU, had a rather complex scheduler. Fermi GPUs not only did basic scheduling in hardware such as register scoreboarding (keeping track of warps waiting on memory accesses and other long latency operations) and choosing the next warp from the pool to execute, but Fermi was also responsible for scheduling instructions within the warps themselves. While hardware scheduling of this nature is not difficult, it is relatively expensive on both a power and area efficiency basis as it requires implementing a complex hardware block to do dependency checking and prevent other types of data hazards. And since GK104 was to have 32 of these complex hardware schedulers, the scheduling system was reevaluated based on area and power efficiency, and eventually stripped down.
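For readers who haven’t run into the term, the sketch below is a deliberately simplified, purely illustrative model of what inter-warp scoreboarding amounts to: track which results are still in flight for each warp, and only consider a warp ready once its operands are available. It is not a description of NVIDIA’s actual hardware.

```python
# Illustrative model of inter-warp scoreboarding: the scheduler keeps a
# per-warp record of outstanding long-latency operations (e.g. memory loads)
# and only considers warps with no pending dependencies as ready to issue.
# This is a toy model for explanation, not a description of NVIDIA hardware.

from collections import defaultdict

class Scoreboard:
    def __init__(self):
        # warp id -> set of register names with results still in flight
        self.pending = defaultdict(set)

    def issue_load(self, warp, dest_reg):
        """A long-latency operation starts; its destination is now 'busy'."""
        self.pending[warp].add(dest_reg)

    def complete(self, warp, dest_reg):
        """The memory system returns data; the register becomes usable."""
        self.pending[warp].discard(dest_reg)

    def ready(self, warp, source_regs):
        """A warp may issue only if none of its sources are still in flight."""
        return not (self.pending[warp] & set(source_regs))

sb = Scoreboard()
sb.issue_load(warp=0, dest_reg="r4")
print(sb.ready(0, ["r4", "r5"]))   # False: r4 still waiting on memory
sb.complete(0, "r4")
print(sb.ready(0, ["r4", "r5"]))   # True: warp 0 can be picked again
```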

The end result is an interesting one, if only because by conventional standards it’s going in reverse. With GK104 NVIDIA is going back to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity has increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the code’s efficiency. This in turn improves the performance of the processor.

However, based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for few benefits. In particular, since Kepler’s math pipeline has a fixed latency, hardware scheduling of the instructions inside of a warp was redundant, since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions within a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling.
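To see why fixed latencies make this practical, here is a minimal, purely illustrative sketch of compile-time list scheduling. The latencies, instruction format, and scheduling policy are invented for the example; they are not NVIDIA’s ISA or compiler.

```python
# Minimal sketch of static (compile-time) list scheduling for a pipeline with
# fixed, known latencies: the compiler tracks when each result will be ready
# and, cycle by cycle, picks an instruction whose operands are available --
# work the Fermi scheduler did in hardware at run time. The latencies and
# instruction format are invented for illustration; this is not NVIDIA's ISA.

LATENCY = {"fmul": 4, "fadd": 4, "mov": 1}       # assumed fixed latencies

# (dest, op, sources), in original program order
program = [
    ("r1", "fmul", ["r2", "r3"]),
    ("r4", "fadd", ["r1", "r5"]),    # depends on r1 (4-cycle latency)
    ("r6", "mov",  ["r7"]),          # independent: can fill the bubble
]

def static_schedule(instrs):
    ready_at = {}                     # register -> cycle its value is ready
    remaining = list(instrs)
    schedule, cycle = [], 0
    while remaining:
        # Pick the first instruction whose sources are all available now.
        for instr in remaining:
            dest, op, sources = instr
            if all(ready_at.get(s, 0) <= cycle for s in sources):
                schedule.append((cycle, instr))
                ready_at[dest] = cycle + LATENCY[op]
                remaining.remove(instr)
                break
        cycle += 1                    # at most one issue per cycle in this toy
    return schedule

for cycle, (dest, op, srcs) in static_schedule(program):
    print(f"cycle {cycle}: {op} {dest}, {', '.join(srcs)}")
```

Because every latency is known up front, the compiler can slot independent work into the bubbles and emit a schedule that needs no dependency-checking hardware at run time, which is exactly the power and area savings NVIDIA is after.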

Ultimately it remains to be seen just what the impact of this move will be. Hardware scheduling makes all the sense in the world for complex compute applications, which is a big reason why Fermi had hardware scheduling in the first place, and for that matter why AMD moved to hardware scheduling with GCN. At the same time, however, when it comes to graphics workloads even complex shader programs are simple relative to compute applications, so it’s not at all clear that this will have a significant impact on graphics performance; indeed, if it did have a significant impact we can’t imagine NVIDIA would go this way.

What is clear at this time though is that NVIDIA is pitching GTX 680 specifically for consumer graphics while downplaying compute, which says a lot right there. Given their call for efficiency and how some of Fermi’s compute capabilities were already stripped for GF114, this does read like an attempt to further strip compute capabilities from their consumer GPUs in order to boost efficiency. Amusingly, whereas AMD seems to have moved closer to Fermi with GCN by adding compute performance, NVIDIA seems to have moved closer to Cayman with Kepler by taking it away.

With that said, in discussing Kepler with NVIDIA’s Jonah Alben, one thing that was made clear is that NVIDIA does consider this the better way to go. They’re pleased with the performance and efficiency they’re getting out of software scheduling, going so far as to say that had they known what they know now about software versus hardware scheduling, they would have done Fermi differently. But whether this only applies to consumer GPUs, or if it will apply to Big Kepler too, remains to be seen.

Comments

  • Targon - Thursday, March 22, 2012 - link

    Many people have been blasting AMD for price vs performance in the GPU arena in the current round of fighting. The thing is, until now, AMD had no competition, so it was expected that the price would remain high until NVIDIA released their new generation. So, expect lower prices from AMD to be released in the next week.

    You also fail to realize that with a 3 month lead, AMD is that much closer to the refresh parts being released that will beat NVIDIA for price vs. performance. Power draw may still be higher from the refresh parts, but that will be addressed for the next generation.

    Now, you and others have been claiming that NVIDIA is somehow blowing AMD out of the water in terms of performance, and that is NOT the case. Yes, the 680 is faster, but isn't so much faster that AMD couldn't EASILY counter with a refresh part that catches up or beats the 680 NEXT WEEK. The 7000 series has a LOT of overclocking room there.

    So, keep things in perspective. A 3 percent performance difference isn't enough to say that one is so much better than the other. It also remains to be seen how quickly the new NVIDIA parts will be made available.
  • SlyNine - Thursday, March 22, 2012 - link

    I still blast them, I'm not happy with the price/performance increase of this generation at all.
  • Unspoken Thought - Thursday, March 22, 2012 - link

    Finally! Logic! But it still falls on deaf ears. We finally see both sides getting their act together to get minimum feature sets in, and we can't see past our own biased squabbles.

    How about we continue to push these manufacturers on what we want and need most: more features, better algorithms, and, last and most important, revolutionize and find new ways to render, aside from vector based rendering.

    Let's start incorporating high-level mathematics for fluid dynamics and such. They have already absorbed PhysX and moved to Direct Compute. Let's see more realism in games!

    Where is the Technological Singularity when you need one.
  • CeriseCogburn - Thursday, March 22, 2012 - link

    Well, the perspective I have is that AMD had a really lousy (without drivers) and short 2.5 months when the GTX580 wasn't single core king, with the GTX590 as dual core king; the latter still is, and the former has been replaced by the GTX680.
    So right now Nvidia is the absolute king, and before now, save that very small time period, Nvidia was core king with the 580 for what .. 1.5 years ?
    That perspective is plain fact.
    FACTS- just stating those facts makes things very clear.
    We already have heard the Nvidia monster die is afoot - that came out with "all the other lies" that turned out to be true...
    I don't fail to realize anything - I just have a clear mind about what has happened.
    I certainly hope AMD has a new better core slapped down very soon, a month would be great.
    Until AMD is really winning, it's LOSING man, it's LOSING!
  • CeriseCogburn - Thursday, March 22, 2012 - link

    Since amd had no competition for 2.5 months and that excuses its $569.99 price, then certainly the $500 price on the GTX580, which had no competition for a full year and a half, was not an Nvidia fault, right ? Because you're a fair person and "finally logic!" is what another poster supported you with...
    So thanks for saying the GTX580 was never priced too high because it had no competition for 1.5 years.

  • Unspoken Thought - Saturday, March 24, 2012 - link

    Honestly, the only thing I was supporting was the fact that he is showing that perspective changes everything, a fact exacerbated when bickering over marginal differences that are driven by the economy when dealing with price vs. performance.

    Both of you have valid arguments, but it sounds like you just want to feel better about supporting nVidia.

    You should be able to see how AMD achieved its goals with nVidia following along playing leapfrog. Looking at benchmarks, no, it doesn't beat it in everything, and both are very closely matched in power consumption, heat, and noise. Features are where nVidia shines and gets my praise, but I would not fault you if you had either card.
  • CeriseCogburn - Friday, April 6, 2012 - link

    Ok Targon, now we know TiN put the 1.3V core on the 680 and it OC'ed on air to 1,420 core, surpassing every 7970 1.3V core overclock out there.
    Furthermore, Zotac has put out the 2,000MHz 680 edition...
    So it appears the truth comes down to the GTX680 has more left in the core than the 7970.
    Nice try but no cigar !
    Nice spin but amd does not win !
    Nice prediction, but it was wrong.
  • SlyNine - Thursday, March 22, 2012 - link

    Go back and look at the benchmarks idiot. 7970 wins in some situations.
  • SlyNine - Thursday, March 22, 2012 - link

    In Crysis max, 7970 gets 36 FPS while the 680 only gets 30 FPS.

    Yes, somehow the 7970 is losing. LOOK AT THE NUMBERS, HELLO!!???

    In Metro 2033 the 7970 gets 38 and the 680 gets 37. Yet in your eyes that's another loss for the 7970...

    7970 kills it in certain GPU Compute, and has hardware H.264 encoding.

    In a couple of games, where you already get massive FPS with both, the 680 boasts much, much higher FPS. But then in games where you need the FPS, the 7970 wins. Hmmm

    But no no, you're right, the 680 is total elite top shit.
  • eddieroolz - Friday, March 23, 2012 - link

    You pretty much admitted that 7970 loses in a lot of other cases by stating that:

    "7970 kills it in certain GPU compute..."

    Adding the modifier "certain" to describe a win is like admitting defeat in every other compute situation.

    Even for the games, you can only mention 2 out of what, 6 games where 7970 wins by a <10% margin. Meanwhile, GTX 680 proceeds to maul the 7970 by >15% in at least 2 of the games.

    Yes, 7970 is full of win, indeed! /s
