Inside the Titan Supercomputer: 299K AMD x86 Cores and 18.6K NVIDIA GPUs

Author: Anand Lal Shimpi

by Anand Lal Shimpi on October 31, 2012 1:28 AM EST

Physical Architecture

The physical architecture of Titan is just as interesting as the high-level core and transistor counts. I mentioned earlier that Titan is built from 200 cabinets. Inside each cabinet are Cray XK7 boards, each of which has four AMD G34 sockets and four PCIe slots. These aren't standard desktop PCIe slots, but rather much smaller SXM slots. The K20s NVIDIA sells to Cray come on little SXM cards without frivolous features like display outputs. The SXM form factor is similar to the MXM form factor used in some notebooks.

There's no way around it. ORNL techs had to install 18,688 CPUs and GPUs over the past few weeks in order to get Titan up and running. Around 10 of the formerly-Jaguar cabinets had these new XK boards but were using Fermi GPUs. I got to witness one of the older boards get upgraded to K20. The process isn't all that different from what you'd see in a desktop: remove screws, remove old card, install new card, replace screws. The form factor and scale of installation are obviously very different, but the basic premise remains.
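
As a quick sanity check on the headline numbers, the arithmetic can be sketched in a few lines of Python. This assumes the 18,688 figure counts compute nodes, each pairing one 16-core Opteron with one K20, and that every board carries four such nodes (four sockets, four SXM slots). The variable names are illustrative only.

```python
# Back-of-the-envelope check of Titan's headline counts.
# Assumption: 18,688 is the number of compute nodes, each with one CPU and one GPU.
NODES = 18_688
CORES_PER_CPU = 16      # assumed: 16-core Opteron in each G34 socket
NODES_PER_BOARD = 4     # four G34 sockets and four SXM slots per XK7 board

total_cores = NODES * CORES_PER_CPU
total_gpus = NODES
boards = NODES // NODES_PER_BOARD

print(f"x86 cores: {total_cores:,}")
print(f"GPUs:      {total_gpus:,}")
print(f"boards:    {boards:,}")
```

Under those assumptions the totals line up with the "299K cores" and "18.6K GPUs" in the title.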

As with all computer components, there's no guarantee that every single chip and card is going to work. When you're dealing with over 18,000 computers as a part of a single entity, there are bound to be failures. All of the compute nodes go through testing, and faulty hardware is swapped out, before the upgrade is technically complete.

OS & Software

Titan runs the Cray Linux Environment, which is based on SUSE 11. The OS has to be hardened and modified for operation on such a large scale. In order to prevent serialization caused by interrupts, Cray takes some of the cores and uses them to run all of the OS tasks so that applications running elsewhere aren't interrupted by the OS.
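
The idea of setting cores aside for OS work can be illustrated with ordinary Linux CPU affinity. The sketch below is not Cray's mechanism, which the article doesn't detail; it only shows the general technique of splitting a node's cores into an OS set and an application set. The helper names `split_cores` and `pin_current_process` are hypothetical, and the 16-core node with one reserved core is an assumed example.

```python
import os

def split_cores(core_ids, n_os_cores):
    """Reserve the lowest n_os_cores IDs for OS tasks; the rest run the application."""
    ordered = sorted(core_ids)
    if not 0 <= n_os_cores < len(ordered):
        raise ValueError("must leave at least one core for the application")
    return set(ordered[:n_os_cores]), set(ordered[n_os_cores:])

def pin_current_process(core_ids):
    """Restrict the calling process to core_ids (Linux only; does nothing elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, core_ids)

# Assumed example: a 16-core node with one core set aside for the OS.
os_cores, app_cores = split_cores(range(16), 1)
print("OS cores: ", sorted(os_cores))
print("App cores:", sorted(app_cores))
```

An application launcher would call `pin_current_process(app_cores)` before starting its compute threads, so that OS housekeeping confined to the reserved core can't preempt them.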

Jobs are batch scheduled on Titan using Moab and Torque.

AMD CPUs and NVIDIA GPUs

If you're curious about why Titan uses Opterons, the explanation is pretty simple. Titan is a large installation of Cray XK7 cabinets, so CPU support is defined by Cray. Back in 2005 when Jaguar made its debut, AMD's Opterons were superior to the Intel Xeon alternative. The evolution of Cray's XT/XK lines simply stemmed from that point, with Opteron being the supported CPU of choice.

The GPU decision was just as simple. NVIDIA has been focusing on non-gaming compute applications for its GPUs for years now. The decision to partner with NVIDIA on the Titan project was made around three years ago. At the time, AMD didn't have a competitive GPU compute roadmap. If you remember our first Fermi architecture article from 2009, I wrote the following:

"By adding support for ECC, enabling C++ and easier Visual Studio integration, NVIDIA believes that Fermi will open its Tesla business up to a group of clients that would previously not so much as speak to NVIDIA. ECC is the killer feature there."

At the time I didn't know it, but ORNL was one of those clients. With almost 19,000 GPUs, errors are bound to happen. Having ECC support was a must-have for GPU-enabled Jaguar and Titan compute nodes. The ORNL folks tell me that CUDA was also a big selling point for NVIDIA.

Finally, some of the new features specific to K20/GK110 (e.g. Hyper-Q and GPUDirect) made Kepler the right point to go all-in with GPU compute.

Power Delivery & Cooling

Titan's cabinets require 480V input to reduce overall cable thickness compared to standard 208V cabling. Total power consumption for Titan should be around 9 megawatts under full load and around 7 megawatts during typical use. The building that Titan is housed in has over 25 megawatts of power delivered to it.
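
To see why the higher voltage allows thinner cabling, compare the current needed to deliver the same power at each voltage. The sketch below assumes the 9 MW full-load figure is spread evenly across the 200 cabinets (about 45 kW each), a three-phase feed, and a power factor of 1. None of those details are given in the article, and `three_phase_current` is an illustrative helper.

```python
import math

def three_phase_current(power_w, line_voltage, power_factor=1.0):
    """Line current in amps for a balanced three-phase load."""
    return power_w / (math.sqrt(3) * line_voltage * power_factor)

# Assumption: 9 MW at full load divided evenly across 200 cabinets.
cabinet_power_w = 9_000_000 / 200

amps_480 = three_phase_current(cabinet_power_w, 480)
amps_208 = three_phase_current(cabinet_power_w, 208)
ratio = amps_208 / amps_480

print(f"480 V: {amps_480:.0f} A per cabinet")
print(f"208 V: {amps_208:.0f} A per cabinet")
print(f"208 V needs {ratio:.2f}x the current")
```

Whatever the real per-cabinet draw is, the ratio depends only on the two voltages: 208V cabling has to carry roughly 2.3 times the current, and conductor size scales with current.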

In the event of a power failure there's no cost-effective way to keep the compute portion of Titan up and running (remember, 9 megawatts), but you still want IO and networking operational. Flywheel-based UPSes kick in when power is interrupted. They can power Titan's network and IO long enough to give diesel generators time to come online.

The cabinets themselves are air cooled; however, the air is chilled using liquid cooling before entering the cabinet. ORNL has over 6,600 tons of cooling capacity just to keep the recirculated air going into these cabinets cool.
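
For a sense of scale, the cooling figure can be converted to megawatts using the standard definition of a ton of refrigeration (12,000 BTU/h, about 3.517 kW). This is only a rough comparison against the 9 megawatt full-load figure quoted above. It says nothing about how the capacity is actually allocated in the building.

```python
# One ton of refrigeration is 12,000 BTU/h, roughly 3.517 kW of heat removal.
KW_PER_TON = 3.517

cooling_tons = 6600     # "over 6600 tons" per the article
full_load_mw = 9        # Titan's quoted full-load power draw

cooling_mw = cooling_tons * KW_PER_TON / 1000
headroom = cooling_mw / full_load_mw

print(f"cooling capacity: about {cooling_mw:.1f} MW of heat removal")
print(f"ratio to Titan's full load: {headroom:.1f}x")
```

That works out to a little over 23 MW of heat removal, comfortably more than Titan's full-load draw and in the same range as the 25 megawatts delivered to the building.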

130 Comments

galaxyranger - Sunday, November 4, 2012 - link
I am not intelligent in any way but I enjoy reading the articles on this site a great deal. It's probably my favorite site.

What I would like to know is how does Titan compare in power to the CPU that was at the center of the star ship Voyager?

Also, surely a supercomputer like Titan is powerful enough to become self aware, if it had the right software made for it?
Hethos - Tuesday, November 6, 2012 - link
For your second question, if it has the right software then any high-end consumer desktop PC could become self-aware. It would work rather sluggishly, compared to some sci-fi AIs like those in the Halo universe, but would potentially start learning and teaching itself.
Daggarhawk - Tuesday, November 6, 2012 - link
Hethos, that is not by any stretch certain. Since "self awareness" or "consciousness" has never been engineered or simulated, it is still quite uncertain what the specific requirements would be to produce it. Yet here you're not only postulating that all it would take would be the right OS but also how well it would perform. My guess is that Titan would be able to simulate a brain (and therefore be able to learn, think, dream, and do all the things that brains do) much sooner than it would /become/ "a brain". It took a 128-core computer a 10-hour run to render a few-minute simulation of a complete single-celled organism. Hard to say how much more compute power it would take to fully simulate a brain and be able to interact with it in real time. As for other methods of AI, it may take totally different kinds of hardware and networking altogether.
quirksNquarks - Sunday, November 4, 2012 - link
Thank You,

this was a perfectly timed article - as people have forgotten why it is important the Technology keeps pushing boundaries regardless of *daily use* stagnation.

Also is a great example of why AMD does offer 16-core Chips. For These Kinds of Reasons! More Cores on One Chip means Less Chips are needed to be implemented - powered - tested - maintained.

an AMD 4 socket Mobo offers 64 cores. A personal Supercomputer. (Just think of how many they'll stuff full of ARM cores).

why Nvidia GPUs ?
a) Error Correction Code
b) CUDA

as to the CPUs...

http://www.newegg.ca/Product/Product.aspx?Item=N82...
$599 for every AMD 6274 chip (obvi they don't pay as much when ordering 300k).

vs

http://www.newegg.ca/Product/Product.aspx?Item=N82...
$1329 for an Intel Sandy Bridge equivalent which isn't really an equivalent considering these do NOT run in 4 socket designs. (obvi a little less when ordering in bulk numbers).

now multiple that price difference (the ratio) in the order of 10's of THOUSANDS!!

COMMON SENSE people.... Less Money for MORE CORES - or - More Money for LESS CORES ?
which road would YOU take? if you were footing the $ Bill.

but the Biggest thing to consider...

ORNL Upgraded from Jaguar <> Titan - which meant they ONLY needed a CHIP upgrade in that regards (( SAME SOCKET )) .. TRY THAT WITH INTEL > :P
phoenicyan - Monday, November 5, 2012 - link
I'd like to see description of logical architecture. I guess it could be 16x16x73 3D Torus.
XyaThir - Saturday, November 10, 2012 - link
Nice article, too bad there is nothing about the storage in this HPC cluster!
logain7997 - Tuesday, November 13, 2012 - link
Imagine the PPD this baby could produce folding. 0.0
hyperblaster - Tuesday, December 4, 2012 - link
In addition to the bit about ECC, nVidia really made headway over AMD primarily because of CUDA. nVidia specially targeted a whole bunch of developers of popular academic software and loaned out free engineers. Experienced devs from nVidia would actually do most of the legwork to port MPI code to CUDA, while AMD did nothing of the sort. Therefore, there is now a large body of well-optimized computational simulation software that supports CUDA (and not OpenCL). However, this is slowly changing and OpenCL is catching on.
Jag128 - Tuesday, January 15, 2013 - link
I wonder if it could play crysis on full?
mikbe - Friday, June 28, 2013 - link
I was actually surprised at how many actual times the word "actually" was actually used. Actually, the way it's actually used in this actual article it's actually meaningless and can actually be dropped, actually, most of the actual time.
