The Kepler Architecture: Efficiency & Scheduling

So far we’ve covered how NVIDIA has improved upon Fermi for; now let’s talk about why.

Mentioned quickly in our introduction, NVIDIA’s big push with Kepler is efficiency. Of course Kepler needs to be faster (it always needs to be faster), but at the same time the market is making a gradual shift towards higher efficiency products. On the desktop side of matters GPUs have more or less reached their limits as far as total power consumption goes, while in the mobile space products such as Ultrabooks demand GPUs that can match the low power consumption and heat dissipation levels these devices were built around. And while strictly speaking NVIDIA’s GPUs haven’t been inefficient, AMD has held an edge on performance per mm2 for quite some time, so there’s clear room for improvement.

In keeping with that ideal, for Kepler NVIDIA has chosen to focus on ways they can improve Fermi’s efficiency. As NVIDIA's VP of GPU Engineering, Jonah Alben puts it, “[we’ve] already built it, now let's build it better.”

There are numerous small changes in Kepler that reflect that goal, but of course the biggest change there was the removal of the shader clock in favor of wider functional units in order to execute a whole warp over a single clock cycle. The rationale for which is actually rather straightforward: a shader clock made sense when clockspeeds were low and die space was at a premium, but now with increasingly small fabrication processes this has flipped. As we have become familiar with in the CPU space over the last decade, higher clockspeeds become increasingly expensive until you reach a point where they’re too expensive – a point where just distributing that clock takes a fair bit of power on its own, not to mention the difficulty and expense of building functional units that will operate at those speeds.

With Kepler the cost of having a shader clock has finally become too much, leading NVIDIA to make the shift to a single clock. By NVIDIA’s own numbers, Kepler’s design shift saves power even if NVIDIA has to operate functional units that are twice as large. 2 Kepler CUDA cores consume 90% of the power of a single Fermi CUDA core, while the reduction in power consumption for the clock itself is far more dramatic, with clock power consumption having been reduced by 50%.

Of course as NVIDIA’s own slide clearly points out, this is a true tradeoff. NVIDIA gains on power efficiency, but they lose on area efficiency as 2 Kepler CUDA cores take up more space than a single Fermi CUDA core even though the individual Kepler CUDA cores are smaller. So how did NVIDIA pay for their new die size penalty?

Obviously 28nm plays a significant part of that, but even then the reduction in feature size from moving to TSMC’s 28nm process is less than 50%; this isn’t enough to pack 1536 CUDA cores into less space than what previously held 384. As it turns out not only did NVIDIA need to work on power efficiency to make Kepler work, but they needed to work on area efficiency. There are a few small design choices that save space, such as using 8 SMXes instead of 16 smaller SMXes, but along with dropping the shader clock NVIDIA made one other change to improve both power and area efficiency: scheduling.

GF114, owing to its heritage as a compute GPU, had a rather complex scheduler. Fermi GPUs not only did basic scheduling in hardware such as register scoreboarding (keeping track of warps waiting on memory accesses and other long latency operations) and choosing the next warp from the pool to execute, but Fermi was also responsible for scheduling instructions within the warps themselves. While hardware scheduling of this nature is not difficult, it is relatively expensive on both a power and area efficiency basis as it requires implementing a complex hardware block to do dependency checking and prevent other types of data hazards. And since GK104 was to have 32 of these complex hardware schedulers, the scheduling system was reevaluated based on area and power efficiency, and eventually stripped down.

The end result is an interesting one, if only because by conventional standards it’s going in reverse. With GK104 NVIDIA is going back to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity has increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the code’s efficiency. This in turn improves the performance of the processor.

However based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for few benefits. In particular, since Kepler’s math pipeline has a fixed latency, hardware scheduling of the instruction inside of a warp was redundant since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling.

Ultimately it remains to be seen just what the impact of this move will be. Hardware scheduling makes all the sense in the world for complex compute applications, which is a big reason why Fermi had hardware scheduling in the first place, and for that matter why AMD moved to hardware scheduling with GCN. At the same time however when it comes to graphics workloads even complex shader programs are simple relative to complex compute applications, so it’s not at all clear that this will have a significant impact on graphics performance, and indeed if it did have a significant impact on graphics performance we can’t imagine NVIDIA would go this way.

What is clear at this time though is that NVIDIA is pitching GTX 680 specifically for consumer graphics while downplaying compute, which says a lot right there. Given their call for efficiency and how some of Fermi’s compute capabilities were already stripped for GF114, this does read like an attempt to further strip compute capabilities from their consumer GPUs in order to boost efficiency. Amusingly, whereas AMD seems to have moved closer to Fermi with GCN by adding compute performance, NVIDIA seems to have moved closer to Cayman with Kepler by taking it away.

With that said, in discussing Kepler with NVIDIA’s Jonah Alben, one thing that was made clear is that NVIDIA does consider this the better way to go. They’re pleased with the performance and efficiency they’re getting out of software scheduling, going so far to say that had they known what they know now about software versus hardware scheduling, they would have done Fermi differently. But whether this only applies to consumer GPUs or if it will apply to Big Kepler too remains to be seen.

The Kepler Architecture: Fermi Distilled GPU Boost: Turbo For GPUs
POST A COMMENT

405 Comments

View All Comments

  • jospoortvliet - Thursday, March 22, 2012 - link

    Seeing on other sites, the AMD does overclock better than the NVIDIA card - and the difference in power usage in every day scenario's is that NVIDIA uses a few more watts in idle and a few less under load.

    I'd agree with my dutch hardware.info site which concludes that the two cards are incredibly close and that price should determine what you'd buy.

    A quick look shows that at least in NL, the AMD is about 50 bucks cheaper so unless NVIDIA lowers their price, the 7970 continues to be the better buy.

    Obviously, AMD has higher costs with the bigger die so NVIDIA should have higher margins. If only they weren't so late to market...

    Let's see what the 7990 and NVIDIA's answer to that will do; and what the 8000 and 700 series will do and when they will be released. NVIDIA will have to make sure they don't lag behind AMD anymore, this is hurting them...
    Reply
  • theartdude - Thursday, March 22, 2012 - link

    Late to market? with Battlefield DLC, Diablo III, MechWarrier Online (and many more titles approaching), this is the PERFECT TIME for an upgrade, btw, my computer is begging for an upgrade right now, just in time for summer-time LAN parties. Reply
  • CeriseCogburn - Tuesday, March 27, 2012 - link

    GTX680 overclocks to 1,280 out of the box for an average easy attempt...
    http://www.newegg.com/Product/Product.aspx?Item=N8...

    See the feedback bro.
    7970 makes it to 1200 if it's very lucky.
    Sorry, another lie is 7970 oc's better.
    Reply
  • CeriseCogburn - Tuesday, March 27, 2012 - link

    So you're telling me the LIGHTNING amd card is cheaper ? LOL
    Further, if you don't get that exact model you won't get the overclocks, and they got a pathetic 100 on the nvidia, which noobs surpass regularly, then they used 2dmark 11 which has amd tessellation driver cheating active.... (apparently they are clueless there as well).
    Furthermore, they declared the Nvidia card 10% faster overall- well worth the 50 bucks difference for your generic AMD card no Overclocked LIghtning further overclocked with the special vrm's onboard and much more expensive... then not game tested but benched in amd cheater ware 3dmark 11 tess cheat.
    Reply
  • Reaper_17 - Thursday, March 22, 2012 - link

    i agree, Reply
  • blanarahul - Tuesday, March 27, 2012 - link

    Mr. AMD Fan Boy then you should compare how was AMD doing it since since the HD 5000 Series.

    6970= 880 MHz
    GTX 580=772 MHz
    Is it a fair comparison?

    GTX 480=702 MHz
    HD 5870=850 Mhz
    Is it a fair compaison?

    According to your argument the NVIDIA cards were at a disadvantage since the AMD cards were always clocked higher. But still the NVIDIA cards were better.

    And now that NVIDIA has taken the lead in clock speeds you are crying like a baby that NVIDIA built a souped up overclocked GK104.

    First check the facts. Plus the HD 8000 series aren't gonna come so early.
    Reply
  • CeriseCogburn - Friday, April 06, 2012 - link

    LOL
    +1
    Tell 'em bro !
    (fanboys and fairness don't mix)
    Reply
  • Sabresiberian - Thursday, March 22, 2012 - link

    Yah, I agree here. Clearly, once again, your favorite game and the screen size (resolution) you run at are going to be important factors in making a wise choice.

    ;)
    Reply
  • Concillian - Thursday, March 22, 2012 - link

    "... but he's correct. The 680 does dominate in nearly every situation and category."

    Except some of the most consistently and historically demanding games (Crysis Warhead and Metro 2033) it doesn't fare so well compared to the AMD designs. What does this mean if the PC gaming market ever breaks out of it's console port funk?

    I suppose it's unlikely, but it indicates it handles easy loads well (loads that can often be handled by a lesser card,) but when it comes to the most demanding resolutions and games, it loses a lot of steam compared to the AMD offering, to the point where it goes from a >15% lead in games that don't need it (Portal 2, for example) to a 10-20% loss in Crysis Warhead at 2560x.

    That it struggles in what are traditionally the most demanding games is worrisome, but, I suppose as long as developers continue pumping out the relatively easy to render console ports, it shouldn't pose any major issues.
    Reply
  • Eugene86 - Thursday, March 22, 2012 - link

    Yes, because people are really buying both the 7970 and GTX680 to play Crysis Warhead at 2560x.... :eyeroll:

    Nobody cares about old, unoptimized games like that. How about you take a look at the benchmarks that actually, realistically, matter. Look at the benches for Battlefield 3, which is a game that people are actually playing right now. The GTX680 kills the 7970 with about 35% higher frame rates, according to the benchmarks posted in this review.

    THAT is what actually matters and that is why the GTX680 is a better card than the 7970.
    Reply

Log in

Don't have an account? Sign up now