Efficiency Through Hyper-Q, Dynamic Parallelism, & More

When NVIDIA first announced K20, they stated that their goal was to offer 3x the performance per watt of their Fermi-based Tesla solutions. With wattage being held nearly constant from Fermi to Kepler, NVIDIA essentially needed to triple their total performance to reach that number.

However, as we’ve already seen from NVIDIA’s hardware specifications, K20 triples their theoretical FP32 performance but not their theoretical FP64 performance, because NVIDIA’s FP64 execution rate falls from 1/2 to 1/3 of their FP32 rate. Does that mean NVIDIA has given up on tripling their performance? No, but with Kepler the solution isn’t just going to be raw hardware; it is also the more efficient use of that hardware.
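To make that arithmetic concrete, here is a small Python sketch. It normalizes Fermi’s theoretical FP32 throughput to 1 and uses only the ratios quoted in this article (3x FP32, and FP64 at 1/2 versus 1/3 of the FP32 rate). The variable names are ours, and real clocks and SKUs will move the exact figures.

```python
from fractions import Fraction

# Normalize Fermi's theoretical FP32 throughput to 1 (an arbitrary unit).
fermi_fp32 = Fraction(1)
kepler_fp32 = 3 * fermi_fp32              # article: K20 triples theoretical FP32

# FP64 runs at a fixed fraction of the FP32 rate on each architecture.
fermi_fp64 = fermi_fp32 * Fraction(1, 2)   # Fermi: 1/2 of the FP32 rate
kepler_fp64 = kepler_fp32 * Fraction(1, 3) # GK110: 1/3 of the FP32 rate

fp64_gain = kepler_fp64 / fermi_fp64       # theoretical FP64 speedup

# Extra factor still needed to reach a 3x FP64 gain at constant power.
remaining_gain = Fraction(3) / fp64_gain

print(f"theoretical FP64 gain: {float(fp64_gain)}x")
print(f"still needed from efficiency: {float(remaining_gain)}x")
```

Under these assumptions theoretical FP64 throughput doubles rather than triples, so roughly a further 1.5x would have to come from better utilization for the 3x target to hold on FP64 work.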

Of everything Kepler, and GK110 in particular, adds to NVIDIA’s compute capabilities, the marquee features, Hyper-Q and Dynamic Parallelism, are firmly rooted in maximizing efficiency. Now that we’ve seen what NVIDIA’s hardware can do at a low level, we’ll wrap up our look at K20 and GK110 by looking at how NVIDIA intends to best feed the beast that is GK110.


Sometimes the simplest things can be the most powerful things, and this is very much the case for Hyper-Q. Simply put, Hyper-Q expands the number of hardware work queues from 1 on GF100 to 32 on GK110. This matters because with only 1 work queue, GF100 could be underoccupied at times (that is, hardware units were left without work to do) if there wasn’t enough work in that queue to fill every SM or if there were dependency issues, even with parallel kernels in play. With 32 work queues to select from, GK110 can in many circumstances achieve higher utilization by putting different program streams on what would otherwise be an idle SMX.
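The dependency problem can be illustrated with a toy model. The Python sketch below is not a description of NVIDIA’s actual scheduler. It assumes that every kernel takes one time step, that kernels within a stream must run in order, that each hardware queue is a strict FIFO where only the head may launch, and that streams are issued one after another. The function name `simulate` and its parameters are hypothetical.

```python
from collections import deque

def simulate(num_streams, kernels_per_stream, num_queues, slots=32):
    """Return the number of time steps needed to drain all streams.

    Toy assumptions: each kernel takes 1 step, kernel k of a stream needs
    kernel k-1 of the same stream to have finished, and at most `slots`
    kernels can start per step.
    """
    queues = [deque() for _ in range(num_queues)]
    # Streams are issued one after another and mapped round-robin to queues.
    for s in range(num_streams):
        for k in range(kernels_per_stream):
            queues[s % num_queues].append((s, k))

    done = set()   # kernels that finished in an earlier step
    steps = 0
    while any(queues):
        launched = []
        for q in queues:
            while q and len(launched) < slots:
                s, k = q[0]
                if k > 0 and (s, k - 1) not in done:
                    # Head is waiting on its predecessor, so everything
                    # queued behind it waits too, even unrelated streams.
                    break
                launched.append(q.popleft())
        done.update(launched)
        steps += 1
    return steps

single_queue = simulate(num_streams=3, kernels_per_stream=3, num_queues=1)
many_queues = simulate(num_streams=3, kernels_per_stream=3, num_queues=32)
print(single_queue, many_queues)
```

In this model three independent streams of three kernels each take 7 steps through a single queue but only 3 steps when each stream gets its own queue. The extra steps in the single-queue case come purely from unrelated kernels sitting behind a blocked queue head.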

The simplicity of Hyper-Q is further reinforced by the fact that it’s designed to easily map to MPI, the message passing interface commonly used in HPC. As NVIDIA succinctly puts it, legacy MPI-based algorithms that were originally designed for multi-CPU systems and that became bottlenecked by false dependencies now have a solution. By increasing the number of MPI jobs (a very easy modification) it’s possible to utilize Hyper-Q on these algorithms to improve efficiency, all without changing the core algorithm itself. Ultimately this is also one of the ways NVIDIA hopes to improve their HPC market share: by tweaking their hardware to better map to existing HPC workloads in this fashion, NVIDIA’s hardware becomes a much higher performing option.

Dynamic Parallelism

If Hyper-Q was the simple efficiency feature, then NVIDIA’s other marquee feature, Dynamic Parallelism, is the harder and more complex of the two.

Dynamic Parallelism is NVIDIA’s name for the ability of kernels to dispatch other kernels. With Fermi only the CPU could dispatch a new kernel, which incurs a certain amount of overhead from having to communicate back and forth with the CPU. By giving kernels the ability to dispatch their own child kernels, GK110 can both save time by not having to go back to the CPU, and in the process free up the CPU to work on other tasks.

The difficulty of course comes from the fact that dynamic parallelism implicitly relies on recursion, and as the saying goes, “to understand recursion, you must first understand recursion”. Recursion brings with it many benefits, so the usefulness of dynamic parallelism should not be understated, but if nothing else it’s a forward-looking feature. Recursion isn’t something that can easily be added to existing algorithms, so taking full advantage of dynamic parallelism will require new algorithms specifically designed around it. (ed: fork bombs are ready-made for this)
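As a rough illustration of why launching from the GPU saves round trips, consider the Python toy model below. It is only a sketch: real dynamic parallelism is written in CUDA, and here plain Python function calls stand in for kernel launches. The work item (an interval that is halved until it is 1 unit wide) and the names `children_of`, `host_driven`, and `device_driven` are made up for the example.

```python
def children_of(item):
    """Hypothetical refinement rule: halve an interval until it is 1 wide."""
    lo, hi = item
    if hi - lo <= 1:
        return []
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]

def host_driven(root):
    """Fermi-style model: only the host (CPU) may launch work.

    Each level of refinement is one launch from the host, and the results
    must come back to the host before the next level can be launched.
    """
    host_launches, processed = 0, 0
    frontier = [root]
    while frontier:
        host_launches += 1                 # one CPU round trip per level
        processed += len(frontier)
        frontier = [c for item in frontier for c in children_of(item)]
    return host_launches, processed

def device_driven(root):
    """GK110-style model: a running 'kernel' launches its own children."""
    def kernel(item):
        count = 1
        for child in children_of(item):
            count += kernel(child)         # child launch, no host involved
        return count
    return 1, kernel(root)                 # the host launches only the root

print(host_driven((0, 8)))
print(device_driven((0, 8)))
```

Both versions process the same 15 work items for the interval (0, 8), but the host-driven version needs 4 launches from the CPU (one per level) while the device-driven version needs only the initial one. Note that the second version only works because the algorithm is expressed recursively, which is the restructuring cost described above.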

Reduced ECC Overhead

Although this isn’t strictly a feature, one final efficiency-minded addition to GK110 is the use of a new lower-overhead ECC algorithm. As you may recall, Tesla GPUs implement DRAM ECC in software, allowing ECC to be added without requiring wider DRAM buses to account for the checkbits, and allowing for ECC to be enabled and disabled as necessary. The tradeoff for this is that enabling ECC consumes some memory bandwidth, reducing effective memory bandwidth to kernels running on the GPU. GK110 doesn’t significantly change this model, but what it does do is reduce the amount of ECC checkbit traffic that results from ECC being turned on. The amount of memory bandwidth saved is workload dependent, but NVIDIA’s own tests are showing that the performance hit from enabling ECC has been reduced by 66% for their internal test suite.

Putting It All Together: The Programmer

Bringing things to a close, while we were on the subject of efficiency, the issue of coder efficiency came up in our discussions with NVIDIA. GK110 is in many ways a direct continuation of Fermi, but at the same time it brings about a significant number of changes. Given that HPC is so performance-centric and consequently often so heavily tuned for specific processors (a problem that also extends to consumer GPGPU workloads), we asked NVIDIA just how well existing programs run on K20.

The short answer is that despite the architectural changes between Fermi and GK110, existing programs run well on K20 and are usually capable of taking advantage of the additional performance offered by the hardware. It’s clear that peak performance on K20 will typically require some rework, particularly to take advantage of features like dynamic parallelism, but otherwise we haven’t been hearing about any notable issues transitioning to K20 thus far.

Meanwhile, as part of their marketing plank, NVIDIA is also going to be focusing on bringing over additional HPC users by leveraging their support for OpenACC, MPI, and other common HPC libraries and technologies, and showcasing just how easy porting HPC programs to K20 is when using those technologies. Note that these comparisons can be a bit misleading since the core algorithms of most programs are complex yet code dense, but the main idea is not lost. For NVIDIA to continue to grow their HPC market share they will need to convert more HPC users from other systems, which means they need to make it as easy as possible to accommodate their existing code and tools.



  • DanNeely - Monday, November 12, 2012 - link

    The Tesla (and quadro) cards have always been much more expensive than their consumer equivalents. The Fermi generation M2090 and M2070Q were priced at the same several thousand dollar pricepoint as K20 family; but the gaming oriented 570/580 were at the normal several hundred dollar prices you'd expect for a high end GPU.
  • wiyosaya - Tuesday, November 13, 2012 - link

    Yes, I understand that; however, IMHO, the performance differences are not significant enough to justify the huge price difference unless you work in very high end modeling or simulation.

    To me, with this generation of chips, this changes. I paid close attention to 680 reviews, and DP performance on 680 based cards is below that of the 580 - not, of course, that it matters to the average gamer. However, I highly doubt that the chips in these Teslas would not easily adapt to use as graphics cards.

    While it is nVidia's right to sell these into any market they want, as I see it, the only market for these cards is the HPC market, and that is my point. It will be interesting to see if nVidia continues to be able to make a profit on these cards now that they are targeted only at the high-end market. With the extreme margins on these cards, I would be surprised if they are unable to make a good profit on them.

    In other words, do they sell X amount at consumer prices, or do they sell Y amount at professional prices and which target market would be the better market for them in terms of profits? IMHO, X is likely the market where they will sell many times the amount of chips than they do in the Y market, but, for example, they can only charge 5X for the Y card. If they sell ten times the chips in X market, they will have lost profits by targeting the Y market with these chips.

    Also, nVidia is writing their own ticket on these. They are making the market. They know that they have a product that every supercomputing center will have on its must buy list. I doubt that they are dumb.

    What I am saying here is that nVidia could sell these for almost any price they choose to any market. If nVidia wanted to, they could sell this into the home market at any price. It is nVidia that is making the choice of the price point. By selling the 680 at high-end enthusiast prices, they artificially push the price points of the market.

    Each time a new card comes out, we expect it to be more expensive than the last generation, and, therefore, consumers perceive that as good reason to pay more for the card. This happens in the gaming market, too. It does not matter to the average gamer that the 580 outperforms the 680 in DP operations; what matters is that games run faster. Thus, the 680 becomes worth it to the gamer and the price of the hardware gets artificially pushed higher - as I see it.

    IMHO, the problem with this is that nVidia may paint themselves into an elite market. Many companies have tried this, notably Compaq and currently Apple. Compaq failed, and Apple, depending on what analysts you listen to, is losing its creative edge - and with that may come the loss of its ability to charge high prices for its products. While nVidia may not fall into the "niche" market trap, as I see it, it is a pattern that looms on the horizon, and nVidia may fall into that trap if they are not careful.
  • CeriseCogburn - Thursday, November 29, 2012 - link

    Yep, amd is dying, rumors are it's going to be bought up after a chapter bankruptcy, restructured, saved from permadeath, and of course, it's nVidia that is in danger of killing itself... LOL
    Boinc is that insane sound in your head.
    NVidia professionals do not hear that sound, they are not insane.
  • shompa - Monday, November 12, 2012 - link

    These are not "home computer" cards. These are cards for high performance calculations "super computers". And the prices are low for this market.

    The unique thing about this year's launch is that Nvidia always before sold consumer cards first and supercomputer cards later. This time it's the other way.

    Nvidia uses the supercomputer cards for more or less subsidising its "home PC" graphic cards. Usually it's the same card but with different drivers.

    Home 500 dollars
    Workstation 1000-1500 dollars
    Supercomputing 3000+ dollars

    Three different prices for the same card.

    But 7 billion transistors on 28nm will be expensive for home computing. It costs more than 100% more to manufacture these GPUs than Nvidia 680.

    7 BILLION. Remember that the first Pentium was the first 1 MILLION transistors. This is 7000 more dense.
  • kwrzesien - Monday, November 12, 2012 - link

    All true.

    But I think what has people complaining is that this time around Nvidia isn't going to release this "big" chip to the Home market at all. They signaled this pretty clearly by putting their "middle" chip into the 680. Unless they add a new top-level part name like a 695 or something they have excluded this part from the home graphics naming scheme. Plus since it is heavily FP64 biased it may not perform well for a card that would have to be sold for ~$1000. (Remember they are already getting $500 for their middle-size chip!)

    Record profits - that pretty much sums it up.
  • DanNeely - Monday, November 12, 2012 - link

    AFAIK that was necessity speaking. The GK100 had some (unspecified) problems, forcing them to put the GK104 in both the mid and upper range of their product line. When the rest of the GK11x series chips show up and nVidia launches the 7xx series I expect to see GK110's in the top as usual. Having seen nVidia's midrange chip trade blows with their top end one, AMD is unlikely to be resting on its laurels for their 8xxx series.
  • RussianSensation - Monday, November 12, 2012 - link

    Great to see someone who understood the situation NV was in. Also, people think NV is a charity or something. When they were selling 2x 294mm^2 GTX690 for $1000, we can approximate that on a per wafer cost, it would have been too expensive to launch a 550-600mm^2 GK100/110 early in the year and maintain NV's expected profit margins. They also faced wafer shortages which explains why they re-allocated mobile Kepler GPUs and had to delay under $300 desktop Kepler allocation by 6+ months to fulfill 300+ notebook design wins. Sure, it's still Kepler's mid-range chip in the Kepler family, but NV had to use GK104 as flagship.
  • CeriseCogburn - Thursday, November 29, 2012 - link

    kwrsezien, another amd fanboy idiot loser with a tinfoil brain and rumor mongered brainwashed gourd
    Everything you said is exactly wrong.
    Perhaps an OWS gathering will help your emotional turmoil, maybe you can protest in front of the nVidia campus.
    Good luck, wear red.
  • bebimbap - Monday, November 12, 2012 - link

    Each "part" being made with the "same" chip is more expensive for a reason.

    For example Hard drives made by the same manufacturer have different price points for enterprise, small business, and home user. I remember an Intel server rep said to use parts that are designed for their workload so enterprise "should" use an enterprise drive and so forth because of costs. And he added further that with extensive testing the bearings used in home user drives will force out their lubricant fluid causing the drive to spin slower and give read/write errors if used in certain enterprise scenarios, but if you let the drive sit on a shelf after it has "failed" it starts working perfectly again because the fluids returned to where they need to be. Enterprise drives also tend to have 1 or 2 orders of magnitude better bit read error rate than consumer drives too.

    In the same way i'm sure the tesla, quadro, and gtx all have different firmwares, different accepted error rates, different loads they are tested for, and different binning. So though you say "the same card" they are different.

    And home computing has changed and has gone in a different direction. No longer are we gaming in a room that needs a separate AC unit because of the 1500w of heat coming from the computer. We have moved from using 130w CPUs to only 78w. Single gpu cards are no longer using 350w but only 170w. So we went from using +600-1500w systems using ~80% efficient PSUs to using only about ~<300-600w with +90% efficient PSUs, and that is just under high loads. If we were to compare idle power, instead of only using 1/2 we are only using 1/10. We no longer need a GK110 based GPU, and it might be said that it will not make economic sense for the home user.

    GK104 is good enough.
  • EJ257 - Monday, November 12, 2012 - link

    The consumer model of this with the fully operational die will be in the $1000 range. 7 billion transistors is a really big chip even for 28nm process.
