The Kepler Architecture: Efficiency & Scheduling

So far we’ve covered how NVIDIA has improved upon Fermi for Kepler; now let’s talk about why.

As we mentioned briefly in our introduction, NVIDIA’s big push with Kepler is efficiency. Of course Kepler needs to be faster (it always needs to be faster), but at the same time the market is making a gradual shift towards higher efficiency products. On the desktop, GPUs have more or less reached their limits as far as total power consumption goes, while in the mobile space products such as Ultrabooks demand GPUs that can match the low power consumption and heat dissipation levels these devices were built around. And while strictly speaking NVIDIA’s GPUs haven’t been inefficient, AMD has held an edge in performance per mm² for quite some time, so there’s clear room for improvement.

In keeping with that ideal, for Kepler NVIDIA has chosen to focus on ways to improve Fermi’s efficiency. As Jonah Alben, NVIDIA’s VP of GPU Engineering, puts it, “[we’ve] already built it, now let's build it better.”

There are numerous small changes in Kepler that reflect that goal, but the biggest change by far was the removal of the shader clock in favor of wider functional units that can execute a whole warp over a single clock cycle. The rationale is actually rather straightforward: a shader clock made sense when clockspeeds were low and die space was at a premium, but with today’s increasingly small fabrication processes the tradeoff has flipped. As we’ve seen in the CPU space over the last decade, higher clockspeeds become increasingly expensive until you reach a point where they’re simply too expensive – a point where just distributing that clock takes a fair bit of power on its own, not to mention the difficulty and expense of building functional units that will operate at those speeds.

With Kepler the cost of having a shader clock has finally become too much, leading NVIDIA to make the shift to a single clock. By NVIDIA’s own numbers, Kepler’s design shift saves power even if NVIDIA has to operate functional units that are twice as large. 2 Kepler CUDA cores consume 90% of the power of a single Fermi CUDA core, while the reduction in power consumption for the clock itself is far more dramatic, with clock power consumption having been reduced by 50%.
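
To put those ratios in perspective, here’s a minimal back-of-the-envelope sketch in Python. The figures are normalized placeholders; the only inputs taken from NVIDIA’s numbers are the 90% and 50% ratios above, plus the assumption that two single-clocked Kepler cores do roughly the work of one hot-clocked Fermi core.

```python
# Back-of-the-envelope check on NVIDIA's tradeoff. All values are normalized
# placeholders; only the 90% and 50% ratios come from NVIDIA's own figures.
fermi_core_power = 1.0        # one hot-clocked Fermi CUDA core, normalized
kepler_pair_power = 0.90      # two single-clocked Kepler CUDA cores (90% of the above)

# At the same base clock, two 1x-clocked Kepler cores do roughly the work of one
# 2x-clocked Fermi core, so work per unit of power on the execution units improves by:
core_perf_per_watt_gain = fermi_core_power / kepler_pair_power
print(f"~{core_perf_per_watt_gain:.2f}x perf/W on the execution units alone")

# On top of that, clock distribution power is roughly halved:
clock_power_scale = 0.50
print(f"Clock distribution power: {clock_power_scale:.0%} of Fermi's")
```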

Of course as NVIDIA’s own slide clearly points out, this is a true tradeoff. NVIDIA gains on power efficiency, but they lose on area efficiency as 2 Kepler CUDA cores take up more space than a single Fermi CUDA core even though the individual Kepler CUDA cores are smaller. So how did NVIDIA pay for their new die size penalty?

Obviously 28nm plays a significant part in that, but even then the reduction in feature size from moving to TSMC’s 28nm process is less than 50%; this isn’t enough on its own to pack 1536 CUDA cores into less space than what previously held 384. As it turns out, NVIDIA not only needed to work on power efficiency to make Kepler work, they needed to work on area efficiency as well. There are a few small design choices that save space, such as using 8 large SMXes instead of 16 smaller ones, but along with dropping the shader clock NVIDIA made one other change that improves both power and area efficiency: scheduling.
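
As a rough sanity check on that claim, the sketch below assumes ideal (quadratic) area scaling from 40nm to 28nm; real designs scale worse than this, which only widens the gap that architectural area efficiency has to close.

```python
# Why the 28nm shrink alone can't account for GK104's core density.
# Assumes ideal quadratic area scaling with feature size (real scaling is worse).
old_node_nm, new_node_nm = 40.0, 28.0
ideal_area_scale = (new_node_nm / old_node_nm) ** 2    # ~0.49x area per transistor

cores_gf114, cores_gk104 = 384, 1536
core_ratio = cores_gk104 / cores_gf114                 # 4x the CUDA cores

relative_area = core_ratio * ideal_area_scale          # ~1.96x the area, even ideally
print(f"Ideal shrink: {ideal_area_scale:.2f}x area per core, {core_ratio:.0f}x cores")
print(f"Naively that's still ~{relative_area:.1f}x the area; the rest has to come "
      "from architectural area efficiency")
```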

GF114, owing to its heritage as a compute GPU, had a rather complex scheduler. Fermi GPUs not only handled basic scheduling in hardware, such as register scoreboarding (keeping track of warps waiting on memory accesses and other long-latency operations) and choosing the next warp from the pool to execute, but were also responsible for scheduling instructions within the warps themselves. While hardware scheduling of this nature is not difficult, it is relatively expensive in terms of both power and area, as it requires implementing a complex hardware block to do dependency checking and prevent other types of data hazards. And since GK104 was to have 32 of these complex hardware schedulers, the scheduling system was reevaluated on the basis of area and power efficiency, and eventually stripped down.
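
To make the division of labor concrete, here’s a toy model of the inter-warp bookkeeping described above. It is an illustrative simplification, not NVIDIA’s implementation; the class name, round-robin policy, and cycle model are all invented for the example.

```python
# Toy model of inter-warp scheduling: a scoreboard marks which warps are waiting
# on long-latency operations, and a simple policy picks the next ready warp.
from collections import deque

class WarpScoreboard:
    def __init__(self, num_warps):
        self.ready = deque(range(num_warps))   # warps eligible to issue
        self.waiting = {}                      # warp id -> cycles until its result lands

    def stall(self, warp, latency):
        """Mark a warp as blocked on a long-latency op (e.g. a global memory load)."""
        if warp in self.ready:
            self.ready.remove(warp)
        self.waiting[warp] = latency

    def tick(self):
        """Advance one cycle: wake warps whose operands have arrived."""
        for warp in list(self.waiting):
            self.waiting[warp] -= 1
            if self.waiting[warp] == 0:
                del self.waiting[warp]
                self.ready.append(warp)

    def pick_next(self):
        """Simple round-robin choice among ready warps; real policies are smarter."""
        if not self.ready:
            return None                        # every warp is stalled this cycle
        warp = self.ready.popleft()
        self.ready.append(warp)
        return warp

sb = WarpScoreboard(num_warps=4)
warp = sb.pick_next()        # warp 0 issues an instruction
sb.stall(warp, latency=3)    # it now waits ~3 cycles on a memory access
for _ in range(3):
    sb.tick()
    print(sb.pick_next())    # meanwhile the other warps keep the units busy
```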

The end result is an interesting one, if only because by conventional standards it’s going in reverse. With GK104 NVIDIA is going back to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity have increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the code’s efficiency. This in turn improves the performance of the processor.

However, based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for little benefit. In particular, since Kepler’s math pipeline has a fixed latency, hardware scheduling of the instructions inside of a warp was redundant, since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions within a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling.
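
For a sense of what moving scheduling into the compiler means in practice, here’s a minimal sketch under stated assumptions. The opcodes, latencies, and register format are hypothetical, but the principle is the one described above: with fixed math latencies, the compiler can compute every instruction’s issue cycle ahead of time, so the hardware never has to check dependencies within a warp.

```python
# Minimal sketch of compiler-side (static) scheduling with fixed instruction
# latencies. Opcodes and cycle counts are made up for illustration.
FIXED_LATENCY = {"fmul": 4, "fadd": 4, "fma": 4}   # hypothetical cycle counts

def schedule(instructions):
    """instructions: list of (op, dest_reg, src_regs). Returns issue-cycle annotations."""
    ready_at = {}          # register -> cycle its value becomes available
    schedule_out = []
    cycle = 0
    for op, dest, srcs in instructions:
        # Issue no earlier than the cycle all source registers are ready.
        cycle = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        schedule_out.append((cycle, op, dest, srcs))
        ready_at[dest] = cycle + FIXED_LATENCY[op]
        cycle += 1         # one issue slot per cycle in this toy in-order model
    return schedule_out

# r2 depends on r1, so the compiler knows it must be held back until the fmul's
# result is ready; the independent fma can issue without any hardware checks.
for entry in schedule(
    [("fmul", "r1", ["r0", "r0"]),
     ("fadd", "r2", ["r1", "r0"]),
     ("fma",  "r3", ["r0", "r0", "r0"])]
):
    print(entry)
```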

Ultimately it remains to be seen just what the impact of this move will be. Hardware scheduling makes all the sense in the world for complex compute applications, which is a big reason why Fermi had hardware scheduling in the first place, and for that matter why AMD moved to hardware scheduling with GCN. At the same time, when it comes to graphics workloads even complex shader programs are simple relative to compute applications, so it’s not at all clear that this will have a significant impact on graphics performance; indeed, if it did, we can’t imagine NVIDIA would have gone this way.

What is clear at this time though is that NVIDIA is pitching GTX 680 specifically for consumer graphics while downplaying compute, which says a lot right there. Given their call for efficiency and how some of Fermi’s compute capabilities were already stripped for GF114, this does read like an attempt to further strip compute capabilities from their consumer GPUs in order to boost efficiency. Amusingly, whereas AMD seems to have moved closer to Fermi with GCN by adding compute performance, NVIDIA seems to have moved closer to Cayman with Kepler by taking it away.

With that said, in discussing Kepler with NVIDIA’s Jonah Alben, one thing that was made clear is that NVIDIA does consider this the better way to go. They’re pleased with the performance and efficiency they’re getting out of software scheduling, going so far as to say that had they known what they know now about software versus hardware scheduling, they would have done Fermi differently. But whether this only applies to consumer GPUs, or whether it will apply to Big Kepler as well, remains to be seen.

Comments

  • Arbie - Friday, March 23, 2012 - link

    "I've always said, choose your hardware by application, not by overall results"

    Actually, that is what I said. But I wasn't as pompous about it, which may have confused you.

    ;)
  • CeriseCogburn - Thursday, March 22, 2012 - link

    Well it's a good thing fair and impartial Ryan put the two games 680 doesn't trounce the 7970 in up first in the bench line up, so it would make amd look very good to the chart and chan click through crowd.
    Yeah, I like an alphabet that goes C for Crysis then M for Metro, so in fact A for AMD comes in first !
  • Sivar - Thursday, March 22, 2012 - link

    Many Anandtech articles not written by Anand have a certain, "written by an intelligent, geeky, slightly insecure teenager" feel to them. While still much better than other tech websites, and I've been around them all for some time, Anand is a cut above.

    This article, and a few others you've written, show that you are really getting the hang of being a truly professional writer.
    - Great technical detail without paraphrasing marketing material.
    - Not even the slightest hint of "fanboyism" for one company over another.
    - Doesn't drag on and on or repeat the same thing several times in slightly different ways.
    - Anand, who usually takes the cool articles for himself, had the trust in you to let you do this one solo.

    I would request, however, that you hyperlink some of the acronyms used. Even after being a reader since the Geocities days, it's sometimes difficult to remember every term and three letter combination on an article with so much depth and breadth.
    Also, for the sake of mobile users and image quality, there really needs to be some internal discussion on when to use which image format. PNG-8 for large areas of flat or gradient color, charts, screen captures, and slides -- but only when the source is not originally a JPG (because JPG subtly corrupts the image so as to ruin PNG's chance of compression) and JPG for pretty much all photographs. I wrote a program to analyze images and suggest a format -- Look for "ImageGuide" on Google Code.

    In any case, the fact that I can think of only the most minor of suggestions, as opposed to when I read a certain other website named after its founder of a much shorter name.
  • Sabresiberian - Thursday, March 22, 2012 - link

    I agree, another thorough review by one of the better people doing it on the internet. Thanks Ryan!

    As far as the dig on Tomshardware, I don't quite agree there. I notice Chris Angelini wrote the GTX 680 article for that website, and I'm very much looking forward to reading another thorough review.

    ;)
  • Sivar - Thursday, March 22, 2012 - link

    Tom's may have improved greatly since I last gave it another chance, but since not long after they were bought out, I've found the reporting to be flagrantly sensationalist and light on fact. The entity that bought them out, and the journalists he hired, are well known for just that. Many times I read the author's conclusion and wondered if he was looking at the same bar charts that I was.

    To be blunt, at times when people quoted their site, I felt as if I'd shifted into an alternate dimension where otherwise knowledgeable people were comically oblivious to the most egregiously flawed journalism. It was as if a group of Nobel prize winners were unthinkingly quoting Bill O'Reilly or Michael Moore on a political matter as if it was assumed they were a paragon of truth and even-headedness.
  • Sabresiberian - Thursday, March 22, 2012 - link

    Very well said. (I especially like the comment using both a staunch conservative and flaming liberal as examples of poor source material.)

    I do tend to look at specific writers, and probably give Toms too much credit based on that more narrow view. I freely admit to having a somewhat fanboy feel for the site, too, since it was one of the first and set a mark, at one time, unreached by any other site I knew about.

    I have been a bit confused by some statements made by some writers on that site, conclusions that didn't seem to be supported by the data they published. Perhaps it's time to step up and comment when that happens, instead of just interpreting my confusion as a lack of careful reading on my part (which happens to the best of us).

    ;)
  • Nfarce - Sunday, March 25, 2012 - link

    "It was as if a group of Nobel prize winners were unthinkingly quoting Bill O'Reilly or Michael Moore on a political matter"

    Well, Obama, Al Gore, and Arafat were each given a Nobel Prize, so I'd hardly consider that entity a good reference point of analogy in validity. In any event, I welcome opinions from all sides. The mainstream "news" media long ago abandoned objective reporting. One is most informed by reading different takes on the same "facts" and formulating one's own opinion. Of course, you have to also research outside the spectrum for some information that the mainstream media will hide from time to time: like how bad off the US economy really is.
  • Ryan Smith - Thursday, March 22, 2012 - link

    Thanks for the kind words, though I'm not sure whether "slightly insecure teenager" is a compliment on my youthful vigor or a knock against my immaturity.;-)

    Anyhow, we usually use PNGs where it makes sense. All of my photo processing is done with Photoshop, so I know ahead of time whether JPG or PNG will spit out a smaller image, and any blurring that may result. Generally speaking we should be using the right format in the right place, but if you have any specific examples where it's not, drop me a line (it will be hard to keep track of this thread) and I'll take a look.
  • IlllI - Thursday, March 22, 2012 - link

    OK, there seems to be some confusion here. Many times in the review you directly compare it to GF114 (which I think was never present in the 580 series), yet at the same time you say the 680 is a direct replacement for the 580.
    I don't think it is. What it DOES seem like, however, is that this 680 was indeed supposed to be the mainstream part, but since the ATI competition was so weak, NVIDIA just jacked up the card number (and price).
  • CeriseCogburn - Friday, March 23, 2012 - link

    So Nvidia should have dropped the 680, their GTX580($450+) killer in at $299...
    Charlie D's $299 rumor owns internet group think brains.
