IBM, NVIDIA and Wistron have introduced their second-generation server for high-performance computing (HPC) applications at the OpenPOWER Summit. The new machine is designed for IBM’s latest POWER8 microprocessors, NVIDIA’s upcoming Tesla P100 compute accelerators (featuring the company’s Pascal architecture) and the company’s NVLink interconnection technology. The new HPC platform will require software makers to port their apps to the new architectures, which is why IBM and NVIDIA plan to help with that.

The new HPC platform developed by IBM, NVIDIA and Wistron (which is one of the major contract makers of servers) is based on several IBM POWER8 processors and several NVIDIA Tesla P100 accelerators. At present, the three companies do not reveal a lot of details about their new HPC platform, but it is safe to assume that it has two IBM POWER8 CPUs and four NVIDIA Tesla P100 accelerators. Assuming that every GP100 chip has four 20 GB/s NVLink interconnections, four GPUs is the maximum number of GPUs per CPU, which makes sense from bandwidth point of view. It is noteworthy that NVIDIA itself managed to install eight Tesla P100 into a 2P server (see the example of the NVIDIA DGX-1).

Correction 4/7/2016: Based on the images released by the OpenPOWER Foundation, the prototype server actually includes four, not eight NVIDIA Tesla P100 cards, as the original story suggested.

IBM’s POWER8 CPUs have 12 cores, each of which can handle eight hardware threads at the same time thanks to 16 execution pipelines. The 12-core POWER8 CPU can run at fairly high clock-rates of 3 – 3.5 GHz and integrate a total of 6 MB of L2 cache (512 KB per core) as well as 96 MB of L3 cache. Each POWER8 processor supports up to 1 TB of DDR3 or DDR4 memory with up to 230 GB/s sustained bandwidth (by comparison, Intel’s Xeon E5 v4 chips “only” support up to 76.8 GB/s of bandwidth with DDR4-2400). Since the POWER8 was designed both for high-end servers and supercomputers in mind, it also integrates a massive amount of PCIe controllers as well as multiple NVIDIA’s NVLinks to connect to special-purpose accelerators as well as the forthcoming Tesla compute processors.

Each NVIDIA Tesla P100 compute accelerator features 3584 stream processors, 4 MB L2 cache and 16 GB of HBM2 memory connected to the GPU using 4096-bit bus. Single-precision performance of the Tesla P100 is around 10.6 TFLOPS, whereas its double precision is approximately 5.3 TFLOPS. A HPC node with eight such accelerators will have a peak 32-bit compute performance of 84.8 TFLOPS, whereas its 64-bit compute capability will be 42.4 TFLOPS. The prototype developed by IBM, NVIDIA and Wistron integrates four Tesla P100 modules, hence, its SP performance is 42.4 TFLOPS, whereas its DP performance is approximately 21.2 TFLOPS. Just for comparison: NEC’s Earth Simulator supercomputer, which was the world’s most powerful system from June 2002 to June 2004, had a performance of 35.86 TFLOPS running the Linpack benchmark. The Earth Simulator consumed 3200 kW of POWER, it consisted of 640 nodes with eight vector processors and 16 GB of memory at each node, for a total of 5120 CPUs and 10 TB of RAM. Thanks to Tesla P100, it is now possible to get performance of the Earth Simulator from just one 2U box (or two 2U boxes if the prototype server by Wistron is used).

IBM, NVIDIA and Wistron expect their early second-generation HPC platforms featuring IBM’s POWER8 processors to become available in Q4 2016. This does not mean that the machines will be deployed widely from the start. At present, the majority of HPC systems are based on Intel’s or AMD’s x86 processors. Developers of high-performance compute apps will have to learn how to better utilize IBM’s POWER8 CPUs, NVIDIA Pascal GPUs, the additional bandwidth provided by the NVLink technology as well as the Page Migration Engine tech (unified memory) developed by IBM and NVIDIA. The two companies intend to create a network of labs to help developers and ISVs port applications on the new HPC platforms. These labs will be very important not only to IBM and NVIDIA, but also to the future of high-performance systems in general. Heterogeneous supercomputers can offer very high performance, but in order to use that, new methods of programming are needed.

The development of the second-gen IBM POWER8-based HPC server is an important step towards building the Summit and the Sierra supercomputers for Oak Ridge National Laboratory and Lawrence Livermore National Laboratory. Both the Summit and the Sierra will be based on the POWER9 processors formally introduced last year as well as NVIDIA’s upcoming code-named Volta processors. The systems will also rely on NVIDIA’s second-generation NVLink interconnect.

Image Sources: NVIDIA, Micron.

Source: OpenPOWER Foundation

Comments Locked

50 Comments

View All Comments

  • protomech - Wednesday, April 6, 2016 - link

    "The Earth Simulator consumed 20 kW of POWER"

    Should be 20 MW, I believe.

    I worked on a contemporary cluster computer, 16 TFLOPS (linpack) and consumed slightly over 1MW. GPGPU was starting to take off; we had some test systems with dual 7870s.

    Pretty impressive that you can get this much performance on your desk (or more likely a rolling rack you can tuck somewhere far away.. these will still be pretty loud).
  • ruthan - Wednesday, April 6, 2016 - link

    Maybe you have more knowledge, but last time i checked, there was some ARM servers ready boards with 48 or 96 CPU cores, but problem was that individual core was relatively weak.. Good for webservers, but not for other on easily parallel computing. Also ARM visualization still looks like something which is not still happening - Vmware dont support ARM at all, there is not big company behind it, like IBM for Lpars virtualisation or HP - vPars.

    As far as i know is strongest cores are in Tegra X1 ARM which is 4 "strong" APU (A57) + 4 weak (A53 )in little big design, even if i would ignore weak ones, its 4 cores in 15W TDP evelope, which means that they are 15/4- 4W cores.. in comparison with x86 were we have 65W package (Broadwell / Skylake) for 4 core - 1 cores uses - 16W.

    So i my eyes, inability to create just 1 beefier ARM core is main issue, why we not using ARM in our desktops, servers or consoles.. Im really curious how hard job is just create this beefy ARM, is angaist ARM architecture fundamentals or just about add some more L2, L3 cache, silicon and increase frequency?
  • Meteor2 - Friday, April 8, 2016 - link

    Apple is creeping towards that with A9X. Not a million miles off Intel m3.
  • tygrus - Wednesday, April 6, 2016 - link

    The whole point of using GPU's for highly parallised workloads is to do the parallel work and leave the single-threaded work to a CPU. It's pointless to have more CPU cores with lower single-threaded performance in conjunction with GPU's. The best system will have fast single-threaded performance in fewer CPU cores and leave the parallel work to the GPU's.
  • Arnulf - Thursday, April 7, 2016 - link

    Who in their right mind thought that putting a flat-chested lady wearing a male shirt next to a rack of switches would make for a good PR image? WTF is it supposed to represent anyway?
  • SaolDan - Thursday, April 7, 2016 - link

    Thats funny. Had to scroll up to look at the picture. She kinda looks like Amy from Big bang theory.
  • lashek37 - Thursday, April 7, 2016 - link

    She a lesbian.Good Pr move.
  • doggface - Thursday, April 7, 2016 - link

    Wow. And who says IT isnt full of sexist male pigs.

    Maybe she is meant to represent a scientist, who is busy sciencing things in a professional environment.

    I am sorry there isnt enough bikini there for you.
  • Michael Bay - Saturday, April 9, 2016 - link

    >muh PC
    >muh workplace sexism
    >muh powerful womyn that need no man

    IT is a male field.
    Scratch that, EVERYTHING is a male field. Deal with it.
  • doncornelius01 - Friday, April 15, 2016 - link

    Sexism defended?

Log in

Don't have an account? Sign up now