NVIDIA Launches Tesla K20, Cont

To put the Tesla K20's performance in perspective, this is going to be a very significant increase in the level of compute performance NVIDIA can offer with the Tesla lineup. The Fermi based M2090 offered 665 GFLOPS of performance with FP64 workloads, while the K20X will straight-up double that with 1.31 TFLOPS. Meanwhile in the 225W envelope the 1.17 TFLOPS K20 will be replacing the 515 GFLOPS M2075, more than doubling NVIDIA’s FP64 performance there. As for FP32 workloads the gains are even greater, due to the fact that NVIDIA’s FP64 rate has fallen from 1/2 on GF100/GF110 Fermi to 1/3 on GK110 Kepler; the 1.33 TFLOPS M2090, for example, is being replaced by the 3.95 TFLOPS K20X.
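
For reference, those peak figures fall straight out of the published core counts and clocks: the K20X’s 2688 CUDA cores at 732MHz work out to 2688 × 2 FLOPs (one FMA) × 0.732GHz ≈ 3.95 TFLOPS for FP32, with the 1/3-rate FP64 units good for roughly 1.31 TFLOPS, while the K20’s 2496 cores at 706MHz come out to about 3.52 TFLOPS FP32 and 1.17 TFLOPS FP64.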

Speaking of FP32 performance, when asked about the K10 NVIDIA told us that the K20 would not be replacing it; rather, the two will exist side-by-side. The K10 actually has better FP32 performance at 4.58 TFLOPS (albeit split across two GPUs), but as it’s based on the GK104 GPU it lacks some Tesla features such as on-die (SRAM) ECC protection, Hyper-Q, and Dynamic Parallelism. So the K10 will continue to serve the users it already serves well, while the K20 steps up to the plate for FP64 users and for those who need ECC and the other Tesla features, doubling as NVIDIA’s other FP32 compute powerhouse.

The Tesla K20 family will be going up against a number of competitors, both traditional and new. On a macro level the K20 family and supercomputers based on it like Titan will go up against more traditional supercomputers like those based on IBM’s BlueGene/Q hardware, which Titan is just now dethroning in the Top500 list.


A Titan compute board: 4 16-core AMD Opteron CPUs + 4 NVIDIA Tesla K20 GPUs

Meanwhile on a micro/individual level the K20 family will be going up against products like AMD’s FirePro S9000 and FirePro S10000, along with Intel’s Xeon Phi, Intel’s first product based on its GPU-like MIC architecture. Both the Xeon Phi and the FirePro S series can exceed 1 TFLOPS of FP64 performance, making them potentially strong competition for the K20. Ultimately these products aren’t going to be separated by their theoretical performance but by their real world performance, so while NVIDIA has a significant 30%+ lead in theoretical performance over its most similar competition (the FirePro S9000 and Xeon Phi), it’s too early to tell whether the real world difference will be quite that large, or conversely whether it will be even larger. Toolchains will also play a huge part here, with the K20 relying predominantly on CUDA, the FirePro S series on OpenCL, and the Xeon Phi on x86 coupled with Phi-specific tools.

Finally, let’s talk about pricing and availability. NVIDIA’s previous projection for K20 family availability was December, but they have now moved that up by a couple of weeks. K20 products are already shipping to NVIDIA’s server partners, and both NVIDIA and those partners are getting ready to ship to buyers. NVIDIA’s general guidance is November-December, so some customers should have K20 cards in their hands before the end of the month.

Meanwhile pricing will be in the $3000 to $5000 range; the wide spread owes mostly to the fact that NVIDIA’s list prices rarely line up with the retail prices of their cards, or with what their server partners charge customers for specific cards. Back at the Quadro K5000 launch NVIDIA announced an MSRP of $3199 for the K20, and we’d expect the shipping K20 to trend close to that. Meanwhile we expect the K20X to trend closer to $4000-$5000, again depending on various markup factors.


K20 Pricing As Announced During Quadro K5000 Launch

As for the total number of cards NVIDIA is looking to ship and the breakdown between K20 and K20X, NVIDIA’s professional solutions group is as mum as usual, but whatever that number is, we’re told it won’t initially be enough. NVIDIA is already taking pre-orders through its server partners, and the volume of pre-orders has already outstripped the supply of cards, creating a backlog.

Interestingly, NVIDIA tells us that their yields are terrific – a claim backed up by their latest financial statement – so the problem NVIDIA is facing appears to be one of demand and allocation rather than manufacturing. This isn’t necessarily a good problem to have, as either situation involves NVIDIA selling fewer Teslas than they’d like, but it’s the better of the two scenarios. Similarly, for the last month NVIDIA has been offering customers time on a K20 cluster, only for it to end up oversubscribed due to high demand. Suffice it to say, NVIDIA has no shortage of customers at the moment.

Ultimately the Tesla K20 launch appears to be shaping up very well for NVIDIA. Fermi was NVIDIA’s first “modern” compute architecture, and while it didn’t drive the kind of exponential growth that NVIDIA had once predicted, it was very well received regardless. Though there’s no guarantee that Tesla K20 will finally hit that billion dollar mark, the K20 enthusiasm coming out of NVIDIA is significant, legitimate, and infectious. Powering the #1 computer on the Top500 list is a critical milestone for the company’s Tesla business and just about the most positive press the company could ever hope for. With Titan behind them, Tesla K20 may be just what the company needs to finally vault into position as a premier supplier of HPC processors.

Comments

  • dcollins - Monday, November 12, 2012 - link

    It should be noted that recursive algorithms are not always more difficult to understand than their iterative counterparts. For example, the quicksort algorithm used in nvidia's demos is extremely simple to implement recursively but somewhat tricky to get right with loops.

    The ability to directly spawn sub-kernels has applications beyond supporting recursive GPU programming. I could see the ability to create your own workers simplifying some problems and leaving the CPU free to do other work. Imagine an image processing problem where a GPU kernel could do the work of sharding an image and distributing it to local workers instead of relying on a (comparatively) distant CPU to perform that task.

    In the end, this gives more flexibility to GPU compute programs which will eventually allow them to solve more problems, more efficiently.
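
    To make that concrete, here's a rough sketch of the recursive case (my own illustration, not NVIDIA's actual sample code; it assumes a compute capability 3.5 device, CUDA 5 dynamic parallelism, and compilation with relocatable device code):

    __device__ void swap_vals(int* a, int* b) { int t = *a; *a = *b; *b = t; }

    // Each launch sorts data[left..right] with a single thread, then hands the
    // two halves off to child kernels launched from the device.
    __global__ void quicksort(int* data, int left, int right)
    {
        if (left >= right) return;

        // Simple Lomuto partition around the rightmost element.
        int pivot = data[right];
        int i = left - 1;
        for (int j = left; j < right; ++j) {
            if (data[j] < pivot) { ++i; swap_vals(&data[i], &data[j]); }
        }
        swap_vals(&data[i + 1], &data[right]);
        int p = i + 1;

        // Recurse by launching child kernels on separate streams so the two
        // halves can run concurrently.
        cudaStream_t s1, s2;
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
        quicksort<<<1, 1, 0, s1>>>(data, left, p - 1);
        quicksort<<<1, 1, 0, s2>>>(data, p + 1, right);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }

    // Host side: quicksort<<<1, 1>>>(d_data, 0, n - 1); cudaDeviceSynchronize();

    A real implementation would switch to an in-block sort below some size threshold and respect the device runtime's nested-launch limits, but that's the general shape of it.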
  • mayankleoboy1 - Monday, November 12, 2012 - link

    We need compilers that can work on GPGPU to massively speed up compilation times.
  • Loki726 - Tuesday, November 13, 2012 - link

    I'm working on this. It actually isn't as hard as it might seem at first glance.

    The amount of parallelism in many compiler optimizations scales with program size, and the simplest algorithms basically boil down to for(all instructions/functions/etc) { do something; }. Everything isn't so simple though, and it still isn't clear if there are parallel versions of some algorithms that are as efficient as their sequential implementations (value-numbering is a good example).
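
    As a rough illustration of that pattern (the IR layout here is hypothetical, not the one from my actual codebase), a local constant-folding pass is just a map over the instruction stream, one thread per instruction:

    enum Opcode { OP_ADD = 1, OP_CONST = 2 /* ... */ };

    struct Instr {
        int  op;           // opcode
        long imm[2];       // operand values, valid when is_const[i] is set
        bool is_const[2];  // whether each operand is a known constant
    };

    // One thread per instruction: rewrite "add c0, c1" into a single constant.
    // A later (also data-parallel) propagation pass would patch up the uses.
    __global__ void fold_constant_adds(Instr* ir, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        Instr ins = ir[i];
        if (ins.op == OP_ADD && ins.is_const[0] && ins.is_const[1]) {
            ins.op = OP_CONST;
            ins.imm[0] += ins.imm[1];
            ir[i] = ins;
        }
    }

    // Host side: fold_constant_adds<<<(n + 255) / 256, 256>>>(d_ir, n);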

    So far the following work very well on a massively parallel processor:
    - instruction selection
    - dataflow analysis (live sets, reaching defs)
    - control flow analysis
    - dominance analysis
    - static single assignment conversion
    - linear scan register allocation
    - strength reduction, instruction simplification
    - constant propagation (local)
    - control flow simplification

    These are a bit harder and need more work:
    - register allocation (general)
    - instruction scheduling
    - instruction subgraph isomorphism (more general instruction selection)
    - subexpression elimination/value numbering
    - loop analysis
    - alias analysis
    - constant propagation (global)
    - others

    Some of these might end up being easy, but I just haven't gotten to them yet.

    The language frontend would also require a lot of work. It has been shown that it is possible to parallelize parsing, but writing a parallel parser for a language like C++ would be a very challenging software design project. It would probably make more sense to build a parallel parser generator for a framework like Bison or ANTLR than to do it by hand.
  • eachus - Wednesday, November 14, 2012 - link

    I've always assumed that the way to do compiles on a GPU or other massively parallel processor is to do the parsing in a single sequential process, then spread the semantic analysis and code generation over as many threads as you can.

    I may be biased in this since I've done a lot of work with Ada, where adding (or changing) a 10 line file can cause hundreds of re-analysis/code-generation tasks. The same thing can happen in any object-oriented language. A change to a class library, even just adding another entry point, can cause all units that depend on the class to be recompiled to some extent. In Ada you can often bypass the parser, but there are gotchas when the new function has the same (simple) name as an existing function, but a different profile.

    Anyway, most Ada compilers, including the GNAT front-end for GCC, will use as many CPU cores as are available. However, I don't know of any compiler yet that uses GPUs.
  • Loki726 - Thursday, November 15, 2012 - link

    The language frontend (semantic analysis and IR generation, not just parsing) for C++ is generally harder than languages that have concepts of import/modules or interfaces because you typically need to parse all included files for every object file. This is especially true for big code bases (with template libraries).

    GPUs need thousands-of-way parallelism rather than just one thread per file/object, so it is necessary to extract parallelism at a much finer granularity (e.g. for each instruction/value).

    A major part of the reason why GPU compilers don't exist is that they are typically large/complex codebases that don't map well onto parallel models like OpenMP/OpenACC etc. The compilers for many languages like OpenCL are also immature enough that writing and debugging a large codebase like this would be intractable.

    CUDA is about the only language right now that is stable enough and has enough language features (dynamic memory allocation, object oriented programming, templates) to try. I'm writing all of the code in C++ right now and hoping that CUDA will eventually cease to be a restricted subset of C++ and just become C++ (all it is missing is exceptions, the standard library, and some minor features that other compilers are also lax about supporting).
  • CeriseCogburn - Thursday, November 29, 2012 - link

    Don't let the AMD fans see that about OpenCL sucking so badly and being immature.
    It's their holy grail of hatred against nVidia/Cuda.
    You might want to hire some protection.
    I can only say it's no surprise to me, as the amd fanboys are idiots 100% of the time.
    Now as amd crashes completely, gets stuffed in a bankruptcy, gets dismantled and bought up as its engineers are even now being pillaged and fired, the sorry amd fanboy has "no drivers" to look forward to.
    I sure hope their 3G ram 79xx "futureproof investment" they wailed and moaned about being the way to go for months on end will work with the new games...no new drivers... 3rd tier loser engineers , sparse crew, no donuts and coffee...
    *snickering madly*
    The last laugh is coming, justice will be served !
    I'd just like to thank all the radeon amd ragers here for all the years of lies and spinning and amd tokus kissing, the giant suction cup that is your pieholes writ large will soon be able to draw a fresh breath of air, you'll need it to feed all those delicious tears.
    ROFL
    I think I'll write the second edition of "The Joys of Living".
  • inaphasia - Tuesday, November 13, 2012 - link

    Everybody seems to be fixated on the fact that the K20 doesn't have ALL its SMXes enabled and assuming this is the result of binning/poor yields, whatever...

    AFAICT the question everybody should be asking and the one I'd love to know the answer to is:
    Why does the TFLOP/W ratio actually IMPROVE when nVidia does that?

    Watt for Watt the 660Ti is slightly better at compute than the 680 and far better than the 670, and we all know they are based on the "same" GK104 chip. Why? How?

    My theory is that even if TSMC's output of the GK110 was golden, we'd still be looking at disabled SMXes. Of course since it's just a theory, it could very well be wrong.
  • frenchy_2001 - Tuesday, November 13, 2012 - link

    No, you are probably right.

    Products are more than their raw capabilities. When GF100 came out, Nvidia placed a 480 core version (out of 512) in the consumer market (at 700MHz+) and a 448 core version at 575MHz in the Quadro 6000. Power consumption, reliability and longevity were all part of that decision.

    This is part of what was highlighted in the article as a difference between the K20X and K20: the 235W vs 225W gap makes a big difference if your chassis is designed for the latter.
  • Harry Lloyd - Tuesday, November 13, 2012 - link

    Can you actually play games with these cards (drivers)?

    I reckon some enthusiasts would pick this up.
  • Ryan Smith - Wednesday, November 14, 2012 - link

    Unfortunately not. If nothing else, because there aren't any display outputs.
