Final Words

At a high level, the Titan supercomputer delivers an order-of-magnitude increase in performance over the outgoing Jaguar system at roughly the same power budget. Using over 200,000 AMD Opteron cores, Jaguar could deliver roughly 2.3 petaflops at around 7MW of power consumption. Titan approaches 300,000 AMD Opteron cores and adds nearly 19,000 NVIDIA K20 GPUs, delivering over 20 petaflops at "only" 9MW. The question remains: how can it be done again?
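To put the Jaguar and Titan figures side by side, here is a minimal Python sketch that converts them into performance per watt. The inputs are the rounded numbers quoted above ("roughly 2.3 petaflops", "over 20 petaflops"), so the outputs are estimates; the function and variable names are illustrative only.

```python
# Figures as quoted in the article; both are approximate ("roughly", "over").
JAGUAR_PFLOPS, JAGUAR_MW = 2.3, 7.0
TITAN_PFLOPS, TITAN_MW = 20.0, 9.0


def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert a petaflops rating and a megawatt power draw to GFLOPS per watt."""
    gflops = pflops * 1e6    # 1 petaflop = 1e6 gigaflops
    watts = megawatts * 1e6  # 1 megawatt = 1e6 watts
    return gflops / watts


jaguar_eff = gflops_per_watt(JAGUAR_PFLOPS, JAGUAR_MW)
titan_eff = gflops_per_watt(TITAN_PFLOPS, TITAN_MW)

perf_ratio = TITAN_PFLOPS / JAGUAR_PFLOPS   # raw performance gain
power_ratio = TITAN_MW / JAGUAR_MW          # power increase
efficiency_ratio = titan_eff / jaguar_eff   # performance-per-watt gain

print(f"Jaguar: {jaguar_eff:.2f} GFLOPS/W")
print(f"Titan:  {titan_eff:.2f} GFLOPS/W")
print(f"Performance x{perf_ratio:.1f}, power x{power_ratio:.2f}, "
      f"efficiency x{efficiency_ratio:.1f}")
```

With these rounded inputs, Titan comes out at roughly 8.7x the performance for about 1.3x the power, or a little under 7x the performance per watt. Since the Titan figure is quoted as "over" 20 petaflops, the real ratios would be somewhat higher.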

In four years, Titan will be obsolete and another round of upgrades will be needed to increase performance within the same power envelope. By 2016, ORNL hopes to be able to build a supercomputer capable of 10x the performance of Titan within a similar power envelope. The catch is that the efficiency jump from first adopting GPUs for compute is a one-time gain; you don't get it a second time. ORNL will have to rely on process node shrinks and improvements in architectural efficiency, on both the CPU and GPU fronts, to deliver the next 10x performance increase. Over the next few years we'll see more integration between the CPU and GPU with an on-die communication fabric. The march towards integration will help improve usable performance in supercomputers just as it will in client machines.

Increasing performance by 10x in four years doesn't seem so far-fetched, but breaking the 1 exaflop barrier between 2020 and 2022 will require something much more exotic. One possibility is to move from big, beefy x86 CPU cores to billions of simpler cores. Given ORNL's close relationship with NVIDIA, it's likely that the smartphone core approach is being advocated internally. Everyone involved has a different definition of what counts as a simple core (by 2020 Haswell will look pretty darn simple), but it's clear that whatever comes after Titan's replacement won't just look like a bigger, faster Titan. There will have to be more fundamental shifts in order to increase performance by two orders of magnitude over the next decade. Luckily there are many research projects that have yet to come to fruition. Die stacking and silicon photonics both come to mind, although we'll need more than just those.
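As a rough way to compare these targets, the sketch below converts each one into the compound yearly improvement it implies. It assumes a Titan baseline of 20 petaflops (the article says "over 20") and treats 1 exaflop as 1,000 petaflops; the `annual_growth` helper and the target labels are hypothetical names for illustration.

```python
def annual_growth(total_factor: float, years: float) -> float:
    """Compound yearly multiplier needed to reach total_factor in the given years."""
    return total_factor ** (1.0 / years)


TITAN_PFLOPS = 20.0       # assumed baseline, rounded down from "over 20 petaflops"
EXAFLOP_PFLOPS = 1000.0   # 1 exaflop expressed in petaflops

exaflop_factor = EXAFLOP_PFLOPS / TITAN_PFLOPS  # 50x over Titan

# (label, total speedup, years allowed), taken from the figures in the text
targets = [
    ("10x Titan by 2016", 10.0, 4),
    ("1 exaflop by 2020", exaflop_factor, 8),
    ("1 exaflop by 2022", exaflop_factor, 10),
    ("100x in a decade", 100.0, 10),
]

for label, factor, years in targets:
    rate = annual_growth(factor, years)
    print(f"{label}: {factor:.0f}x in {years} years -> about {rate:.2f}x per year")
```

This only captures raw throughput. It says nothing about holding the power envelope fixed, which is the constraint that makes the later targets hard once the one-time gain from adopting GPUs has been spent.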

It's incredible to think that the most recent increase in supercomputer performance has its roots in PC gaming. These multi-billion transistor GPUs first came about to improve performance and visual fidelity in 3D games. The first consumer GPUs were built to better simulate reality so we could have more realistic games. It's not too surprising then to think that in the research space the same demands apply, although in pursuit of a different goal: to create realistic models of the world and universe around us. It's honestly one of the best uses of compute that I've ever seen.


  • mayankleoboy1 - Wednesday, October 31, 2012

    Is there any scope for an FPGA, or a group of FPGAs, that replaces standard algorithms with hardware implementations?

    Example: Fourier transforms, matrix multiplication.
  • prashanth07 - Wednesday, October 31, 2012

    Yes, there is significant research going on. In our lab we had a pretty big group working on using FPGAs for HPC. The reconfigurable computing (RC) based supercomputer is called Novo-G. It was the world's biggest publicly known RC supercomputer.

    It is very small in physical size compared to some of the top conventional supercomputers, but for some specific compute requirements it comes close to beating top supercomputers. There was a major upgrade planned (around the time I was graduating), so it might be even better now.
    What exact type of computations? I don't remember very well (I didn't work on RC, I was mostly a s/w guy in the conventional HPC part of the lab). You might be able to get some info by checking out a few posters or paper abstracts.

    See: (very outdated, we didn't have anyone updating this site regularly) (very abstract, low on specific info)
  • Guspaz - Wednesday, October 31, 2012

    Just think, if Moore's law holds for another few decades, you'll see this performance in a smartphone in 20-30 years...
  • Montrey - Saturday, November 03, 2012

    According to the paper, it takes 6 to 8 years for the #1 computer on the list to move to #500, and then another 8 to 10 years for that performance to be available in your average notebook computer. Not sure on notebook to smartphone, but it can't be very long.
  • Doh! - Wednesday, October 31, 2012

    This kind of article keeps me coming back to Awesome stuff.
  • bl4C - Wednesday, October 31, 2012

    indeed, i was thinking:
    "now THIS is an article"

    great, thx !
  • gun_will_travel - Wednesday, October 31, 2012

    With all the settings turned up.
  • dragonsqrrl - Wednesday, October 31, 2012

    Anand, I just want to confirm the core count on the Tesla K20. So this means one of the 15 SMX blocks is disabled on the K20?
  • Ryan Smith - Wednesday, October 31, 2012

    We're basing our numbers off of the figures published by HPCWire.

    For a given clockspeed of 732MHz and DP performance of 1.3TFLOPs, it has to be 14 SMXes. The math doesn't work for anything else.
  • RussianSensation - Wednesday, October 31, 2012

    The article only states a range for DP of 1.2-1.3 TFLOPs.

    The specification could be 705MHz with GPU boost to 732MHz x 2496 CUDA cores ~ 1.22 TFLOPs

    Not saying it can't be 2688 CUDA cores, but you are using the high end of the range when the article clearly lists a range of 1.2-1.3 TFLOPs. I don't think you can just assume that it's 2688 without confirmation, given the range of values provided.
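
The arithmetic behind this exchange can be checked with a short sketch. It relies on figures that are not stated in the thread and should be read as assumptions: NVIDIA's published GK110 layout of 192 CUDA cores and 64 double-precision units per SMX, and a fused multiply-add counted as two floating-point operations per unit per clock. The `dp_tflops` helper is a hypothetical name.

```python
# Assumptions (from NVIDIA's public GK110 description, not from the thread):
CUDA_CORES_PER_SMX = 192  # single-precision CUDA cores per SMX
DP_UNITS_PER_SMX = 64     # double-precision units per SMX
FLOPS_PER_FMA = 2         # one fused multiply-add counts as two operations


def dp_tflops(smx_count: int, clock_mhz: float) -> float:
    """Theoretical peak double-precision throughput in TFLOPS."""
    flops_per_clock = smx_count * DP_UNITS_PER_SMX * FLOPS_PER_FMA
    return flops_per_clock * clock_mhz * 1e6 / 1e12


# Clock speeds mentioned in the comments above.
for clock_mhz in (705, 732):
    for smx_count in (13, 14, 15):
        cores = smx_count * CUDA_CORES_PER_SMX
        peak = dp_tflops(smx_count, clock_mhz)
        print(f"{smx_count} SMX ({cores} cores) @ {clock_mhz}MHz: {peak:.2f} TFLOPS")
```

Under these assumptions, 14 SMXes (2688 cores) at 732MHz gives about 1.31 TFLOPS and 13 SMXes (2496 cores) at 732MHz gives about 1.22 TFLOPS, so both configurations land inside the quoted 1.2-1.3 TFLOPS range. The range alone does not settle the question.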
