Inside the Titan Supercomputer: 299K AMD x86 Cores and 18.6K NVIDIA GPUs

Author: Anand Lal Shimpi

by Anand Lal Shimpi on October 31, 2012 1:28 AM EST

Physical Architecture

The physical architecture of Titan is just as interesting as the high-level core and transistor counts. I mentioned earlier that Titan is built from 200 cabinets. Inside each cabinet are Cray XK7 boards, each of which has four AMD G34 sockets and four PCIe slots. These aren't standard desktop PCIe slots, but rather much smaller SXM slots. The K20s NVIDIA sells to Cray come on little SXM cards without frivolous features like display outputs. The SXM form factor is similar to the MXM form factor used in some notebooks.

There's no way around it. ORNL techs had to install 18,688 CPUs and GPUs over the past few weeks in order to get Titan up and running. Around 10 of the former Jaguar cabinets already had these new XK boards but were using Fermi GPUs. I got to witness one of the older boards get upgraded to K20. The process isn't all that different from what you'd see in a desktop: remove screws, remove old card, install new card, replace screws. The form factor and scale of installation are obviously very different, but the basic premise remains.
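
The counts quoted so far are easy to cross-check. The following is a minimal Python sketch, not anything ORNL or Cray publishes. It assumes 16 cores per Opteron (an assumption here, picked because it reproduces the 299K figure in the headline) and that every CPU/GPU pair sits on a fully populated four-socket board. All names are made up for illustration.

```python
# Figures quoted in the article.
NODES = 18_688           # CPU/GPU pairs installed
CABINETS = 200           # Titan is built from 200 cabinets
SOCKETS_PER_BOARD = 4    # each XK7 board: four G34 sockets, four SXM slots

# Assumption (not stated in this section): 16 cores per Opteron.
CORES_PER_CPU = 16


def titan_counts(nodes=NODES, cores_per_cpu=CORES_PER_CPU,
                 sockets_per_board=SOCKETS_PER_BOARD, cabinets=CABINETS):
    """Return (total x86 cores, board count, average boards per cabinet)."""
    total_cores = nodes * cores_per_cpu
    # Assumes every compute node lives on a fully populated board.
    boards = nodes // sockets_per_board
    return total_cores, boards, boards / cabinets


cores, boards, boards_per_cabinet = titan_counts()
print(f"{cores:,} cores on {boards:,} boards "
      f"({boards_per_cabinet:.2f} boards per cabinet on average)")
```

Under these assumptions the core total rounds to the 299K in the title. The boards-per-cabinet value is only an average obtained by dividing the article's figures, so it says nothing about how individual cabinets are actually populated.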

As with all computer components, there's no guarantee that every single chip and card is going to work. When you're dealing with over 18,000 computers as a part of a single entity, there are bound to be failures. All of the compute nodes go through testing, and faulty hardware is swapped out, before the upgrade is technically complete.
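
To put "bound to be failures" into numbers: if each part is faulty independently with some small probability p, the chance that at least one of N parts is bad is 1 - (1 - p)^N. The Python sketch below is illustrative only. The failure rates are hypothetical placeholders, not measured ORNL data, and independence between parts is itself an assumption.

```python
def p_any_failure(n_parts: int, p_fail: float) -> float:
    """Probability that at least one of n independent parts is faulty."""
    return 1.0 - (1.0 - p_fail) ** n_parts


def expected_failures(n_parts: int, p_fail: float) -> float:
    """Expected number of faulty parts among n independent parts."""
    return n_parts * p_fail


N_NODES = 18_688  # node count quoted in the article

# Hypothetical per-node defect rates, chosen only to show the trend.
for p in (1e-5, 1e-4, 1e-3):
    print(f"p={p:.0e}: P(at least one bad node)={p_any_failure(N_NODES, p):.4f}, "
          f"expected bad nodes={expected_failures(N_NODES, p):.2f}")
```

Even a defect rate that would be negligible for a single desktop makes at least one bad node close to certain at this scale, which is why every node is tested before the upgrade is signed off.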

OS & Software

Titan runs the Cray Linux Environment, which is based on SUSE 11. The OS has to be hardened and modified for operation on such a large scale. To prevent serialization caused by interrupts, Cray dedicates some of the cores to running all of the OS tasks, so that applications running on the remaining cores aren't interrupted by the OS.

Jobs are batch scheduled on Titan using Moab and Torque.

AMD CPUs and NVIDIA GPUs

If you're curious about why Titan uses Opterons, the explanation is pretty simple. Titan is a large installation of Cray XK7 cabinets, so CPU support is defined by Cray. Back in 2005 when Jaguar made its debut, AMD's Opterons were superior to the Intel Xeon alternative. The evolution of Cray's XT/XK lines simply stemmed from that point, with Opteron being the supported CPU of choice.

The GPU decision was just as simple. NVIDIA has been focusing on non-gaming compute applications for its GPUs for years now. The decision to partner with NVIDIA on the Titan project was made around three years ago. At the time, AMD didn't have a competitive GPU compute roadmap. If you remember our first Fermi architecture article from 2009, I wrote the following:

"By adding support for ECC, enabling C++ and easier Visual Studio integration, NVIDIA believes that Fermi will open its Tesla business up to a group of clients that would previously not so much as speak to NVIDIA. ECC is the killer feature there."

At the time I didn't know it, but ORNL was one of those clients. With almost 19,000 GPUs, errors are bound to happen. ECC support was a must-have for the GPU-enabled Jaguar and Titan compute nodes. The ORNL folks tell me that CUDA was also a big selling point for NVIDIA.

Finally, some of the new features specific to K20/GK110 (e.g. Hyper-Q and GPUDirect) made Kepler the right point to go all-in with GPU compute.

Power Delivery & Cooling

Titan's cabinets take 480V input, which keeps cables thinner than standard 208V cabling would allow at the same power. Total power consumption for Titan should be around 9 megawatts under full load and around 7 megawatts during typical use. The building that Titan is housed in has over 25 megawatts of power delivered to it.
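
The cable-thickness argument comes down to current: for the same power, a higher supply voltage means proportionally less current, and conductor size is driven by current. The Python sketch below works through that with the article's figures. Three-phase delivery, a unity power factor, and an even split of the 9 MW across all 200 cabinets are assumptions made for illustration, not details from ORNL.

```python
import math

TOTAL_POWER_W = 9e6   # article: around 9 MW under full load
CABINETS = 200        # article: Titan is built from 200 cabinets


def line_current(power_w, line_voltage_v, power_factor=1.0):
    """Line current in amps for a balanced three-phase load.

    Three-phase delivery and a unity power factor are assumptions here.
    """
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)


# Assumption: the full-load figure is spread evenly over the cabinets.
per_cabinet_w = TOTAL_POWER_W / CABINETS

amps_208 = line_current(per_cabinet_w, 208)
amps_480 = line_current(per_cabinet_w, 480)

print(f"Per cabinet: {per_cabinet_w / 1000:.0f} kW")
print(f"Line current at 208V: {amps_208:.1f} A")
print(f"Line current at 480V: {amps_480:.1f} A")
print(f"Current ratio (208V / 480V): {amps_208 / amps_480:.2f}")
```

Whatever the real per-cabinet load is, the ratio between the two currents depends only on the voltages (480/208), so the 480V feed carries well under half the current of a 208V feed at the same power.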

In the event of a power failure there's no cost-effective way to keep the compute portion of Titan up and running (remember, 9 megawatts), but you still want IO and networking operational. Flywheel-based UPSes kick in when power is interrupted. They can power Titan's network and IO for long enough to give diesel generators time to come online.

The cabinets themselves are air cooled, but the air itself is chilled using liquid cooling before entering the cabinet. ORNL has over 6,600 tons of cooling capacity just to keep the recirculated air going into these cabinets cool.
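
For readers who don't think in tons of refrigeration, the cooling figure converts to watts directly: one ton is 12,000 BTU/h, or roughly 3.517 kW of heat removal. The Python sketch below does the conversion and compares it against the 9 MW full-load figure quoted above. The function and constant names are illustrative, and the comparison is plain arithmetic on the article's numbers rather than a statement about how ORNL sizes its plant.

```python
KW_PER_TON = 3.517        # 1 ton of refrigeration = 12,000 BTU/h, about 3.517 kW

COOLING_TONS = 6600       # article: over 6,600 tons of cooling capacity
FULL_LOAD_MW = 9.0        # article: around 9 MW under full load


def tons_to_megawatts(tons):
    """Convert tons of refrigeration to megawatts of heat removal."""
    return tons * KW_PER_TON / 1000.0


cooling_mw = tons_to_megawatts(COOLING_TONS)
print(f"{COOLING_TONS:,} tons is about {cooling_mw:.1f} MW of heat removal")
print(f"That is {cooling_mw / FULL_LOAD_MW:.1f}x Titan's full-load power draw")
```

Since essentially all of the electrical power a computer draws ends up as heat, the comparison against the full-load draw is the relevant one.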

130 Comments

just4U - Wednesday, October 31, 2012 - link
"The evolution of Cray's XT/XK lines simply stemmed from that point, with Opteron being the supported CPU of choice."

-----

I would have liked more of an explanation here. Does that mean that Intel's line doesn't work as well? Are there plans by Cray to move to Intel?

Power draw must be key. I wonder what sort of power use they'd be looking at running Intel's processors.

Great to see AMD in that supercomputer though. I just have questions about future plans based on the current situation in the CPU market.

Th-z - Wednesday, October 31, 2012 - link
Very nice article and love your last paragraph, Anand. It's a revelation. It is indeed incredible to think that when we wanted that 3D accelerator to play GLQuake, it actually turned the wheel for great things to come. To think back, something as ordinary or insignificant as gaming actually paved the way to accelerate our knowledge today. This goes to show even ordinary things can morph into great things that one can never imagine. It really humbles you to not look down on anything, to be respectful in this intertwined world, the same way it humbles us as human beings as we know more about the universe.

pman6 - Wednesday, October 31, 2012 - link
so that's where all of AMD's revenue came from.

I was wondering who was buying AMD products

CeriseCogburn - Saturday, November 10, 2012 - link
What AMD revenue?

Just look up and down, left and right here, the AMD fanboys are legion - granted they can barely bone up 10 cents a week, but after a few years they can buy 2 generations back.

lorribot - Wednesday, October 31, 2012 - link
Wonder if PC game piracy will be blamed for the failure of the supercomputer industry?

Braincruser - Saturday, November 3, 2012 - link
Well, you see the more someone pirates games, the more money he has to invest in hardware. So the better the hardware gets. <- nothing beats simple logic.

ClagMaster - Wednesday, October 31, 2012 - link
I have been working with supercomputers for 25 years.

Although parallelism is very important for processing large models, there is one important feature Mr Anand failed to discuss about Titan, choosing instead to obsess about transistor count and CPUs and GPUs.

And that is how much memory per box is available. 96GB? 256GB of DDR3-1333 memory?

The problem is usually memory for those large reactor or coupled neutron-gamma transport problems analyzed with Monte Carlo or Advanced Discrete Ordinates, not the number of processors. You need lots of memory for the geometry, depletable materials, and cross-section data.

And once the computing is done, how much space is available for storing the results? I have seen models so large that they run for 2 weeks with over 2000 processors only to fail because the file storage system ran out of space to store the output files.

garadante - Wednesday, October 31, 2012 - link
You failed to read the entire article. Anand stated there was something like 32 GB of RAM per CPU and 6 GB per GPU (if I remember correctly, going off the top of my head) for a grand total of 710 TB of RAM as well as 1 PB of HDD storage available. Check back through the pages to find what exactly he posted.

chemist1 - Wednesday, October 31, 2012 - link
So Sandy Bridge does ~160 GFlops on the LINPACK benchmark, while Titan should do ~20 PFlops, making it 125K times faster. 125K ~ 2^17, so with 17 doublings a PC will be as fast as Titan. If we assume 1.5 years/doubling, that gives us 25 years. And just imagine the capabilities of a 2037 supercomputer....

pandemonium - Wednesday, October 31, 2012 - link
What a treat, for you, to be able to witness this. Thanks for the adventurous article, Anand! :)
