Future Visions, Cont: POWERed by NVIDIA

We will have to check for ourselves of course, but IBM claims that compared to a dual K80 setup, a dual P100 achieves a 2.07x speedup on the S822LC HPC, while the same dual P100 on a fast Xeon with PCIe 3.0 only saw a 1.5x speedup. The benchmark used was a rather exotic one: Lattice QCD, a numerical approach to solving quantum chromodynamics.

However, IBM reports that NVLink removes performance bottlenecks in:

  1. FFT (signal processing)
  2. STAC-A2 (risk analysis)
  3. CPMD (computational chemistry)
  4. Hash tables (used in many algorithms, security, and big data)
  5. Spark

Those got our attention, as they are not exotic niche HPC applications but widespread software components/frameworks used in both the HPC and data analytics worlds.

NVIDIA also claims that thanks to NVLink and the improved Page Migration Engine, a new breed of GPU-accelerated applications will be possible. The unified memory space introduced with CUDA 6 on Kepler was a huge step forward for CUDA programmers: they no longer had to explicitly copy data between the CPU and the GPU. The Page Migration Engine would do that for them.
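
To make this concrete, here is a minimal sketch of our own (not IBM's or NVIDIA's code) of what unified memory looks like in CUDA: a single cudaMallocManaged allocation is visible to both processors, and no explicit cudaMemcpy appears anywhere.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial kernel: double every element in place.
    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *data;

        // One allocation, valid on both the CPU and the GPU. The Page
        // Migration Engine moves pages to whichever processor touches them,
        // so there is no explicit cudaMemcpy in this program.
        cudaMallocManaged(&data, n * sizeof(float));

        for (int i = 0; i < n; i++) data[i] = 1.0f;  // CPU writes...

        scale<<<(n + 255) / 256, 256>>>(data, n);    // ...GPU reads and writes...
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]);           // ...CPU reads the result.
        cudaFree(data);
        return 0;
    }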

But that first implementation (on Kepler and Maxwell) also had quite a few limitations. For example, the memory space where the CPU and GPU share data was limited to the size of the GPU memory (typically 8-16 GB). The P100 now gets 49-bit virtual addressing, which means CUDA programs can treat every available byte of RAM as one big virtual space. In the case of the newly launched S822LC, this means up to 1 TB of DRAM, and consequently a 1 TB memory space. Secondly, the whole virtual address space is coherent thanks to the new page fault mechanism: both the CPU and the GPU can access the DRAM together. This requires OS support, and NVIDIA cooperated with the Linux community to make this happen.
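
As a rough sketch of what the new page fault mechanism enables (our own illustration, with a hypothetical 64 GB working set), a managed allocation on the P100 can be far larger than the 16 GB of on-board HBM2; pages are simply faulted in from system DRAM on demand. On Kepler and Maxwell, the same allocation would have been rejected outright.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Touch one byte per 2 MB so the GPU demand-pages the buffer in.
    __global__ void touch(char *buf, size_t n, size_t stride) {
        size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * stride;
        if (i < n) buf[i]++;
    }

    int main() {
        // Hypothetical 64 GB working set: far more than the P100's 16 GB of
        // HBM2. Pre-Pascal unified memory was capped at GPU memory size;
        // Pascal's 49-bit virtual addressing and page faulting let it spill
        // into system DRAM.
        const size_t n = 64ULL << 30;
        char *buf;
        cudaError_t err = cudaMallocManaged(&buf, n);
        if (err != cudaSuccess) {
            printf("allocation failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        const size_t stride = 2ULL << 20;      // one byte every 2 MB
        touch<<<4096, 256>>>(buf, n, stride);  // pages fault onto the GPU on demand
        cudaDeviceSynchronize();

        cudaFree(buf);
        return 0;
    }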

Of course, as the unified memory space gets larger, the amount of data transferred back and forth gets larger too, and that is where NVLink and the extra memory bandwidth of the POWER8 offer a large advantage. Remember that even a POWER8 with only 4 buffer chips delivered twice as much memory bandwidth as the best Xeons; the higher-end POWER8 SKUs have 8 buffer chips, and as a result offer almost twice as much memory bandwidth again.

NVLink, together with the beefy memory subsystem of the POWER8, ensures that CUDA applications using such a unified 1TB memory space can actually work well.


The POWER8 - all heatsinks - looks less hot-headed now that it has the company of four Tesla P100 GPUs...

The S822LC will cost less than $50,000, and it offers a lot of FLOPS per dollar if you ask us. First, consider that a single Tesla P100 SXM2 costs around $9,500, and the S822LC integrates four of them alongside two 10-core POWER8s and 256 GB of RAM. More than 21 TFLOPS of FP64 (four P100s at 5.3 TFLOPS each), connected by the latest and greatest interconnects in a 2U box: the S822LC HPC is going to turn some heads.

Last but not least, note that once you add two or more GPUs which consume 300W each, the biggest disadvantage of the POWER8 almost literally melts away. The fact that each POWER8 CPU may consume 45-100W more than the high-performance Xeons suddenly seems relative, and not such a deal breaker anymore, especially in the HPC world, where performance matters more than watts.

Comments

  • nils_ - Monday, September 26, 2016 - link

    Isn't the limit slightly lower than 32 GiB? At some point the JVM switches to 64-bit pointers, which means you'll lose a lot of the available heap to larger pointers. I think you might want to lower your settings. I'm curious, what kind of GC times are you seeing with your heap size? I don't currently have access to Java running on non-virtualised hardware, so I would like to know if the overhead is significant (mostly running Elasticsearch here).
  • CajunArson - Thursday, September 15, 2016 - link

    All in all the POWER chip isn't terrible, but the power consumption coupled with the sheer amount of tuning required just to get it competitive with the Xeons isn't too encouraging. You could spend far less time tuning the Xeons and still have higher performance, or go ahead with the tuning and get even more performance out of those Xeons.

    On top of that, this isn't even the supposedly "high end" model; the higher-end POWER parts cost more and will burn through even more power, and that's an expense that needs to be considered for the types of real-world applications that use these servers.
  • dgingeri - Thursday, September 15, 2016 - link

    That ad on the last page claiming lower equipment cost of course compares it to an HP DL380, the most overpriced Xeon E5 system out right now. (I know because I shopped them.) Comparing it to a comparable Dell R730 would show less expense, better support, and better expansion options.
  • Morawka - Thursday, September 15, 2016 - link

    you mean a company made a slide that uses the most extreme edge cases to make their product look good?!?! Shocking /s
  • Gondalf - Thursday, September 15, 2016 - link

    Something is wrong in these power consumption data. The platform idles at 221W and under full load draws only 260W?? Has the CPU vanished?? Does a POWER8 at over 3GHz have an active power of only 40W??
    Either 1) the idle value is wrong or 2) the under-load value is wrong. All this is not consistent with IBM's official TDP values.
    IMO the energy consumption page of the article has to be rewritten.
  • JohanAnandtech - Thursday, September 15, 2016 - link

    We have double checked those numbers. It is probably an indication that many of the power saving features do not work well under Linux right now.
    BTW, just to give you an idea: running c-ray (floating point) caused the consumption to go to 361W.
  • Kevin G - Thursday, September 15, 2016 - link

    I presume that c-ray uses the 256 bit vector unit on POWER8?

    Also have you done any energy consumption testing that takes advantage of the hardware decimal unit?
  • mapesdhs - Thursday, September 15, 2016 - link

    C-ray isn't that smart. :D It's very simple code, basically brute force, and the smaller dataset easily fits in a modern cache (the middling-size test probably does too on CPUs like these). Hmm, I suppose one could optimise the compilation a bit to help, but I doubt anything except a full rewrite could make decent use of any vector tech, and I don't want to allow changes to the code, as that would make comparisons to all other test results null and void. Compiler optimisations are ok, but not multi-pass optimisations that feed back info about the target data into the initial compile; that's cheating IMO (some people have done this to obtain what look like really silly run times, but I don't include them on my main C-ray page).

    Ian.
  • Gondalf - Tuesday, September 20, 2016 - link

    Ummm, so in short the software used doesn't stress the CPU at all, not even the hot caches near the memory banks. We need a benchmark with high memory utilization and a balanced mix of integer and FP, more in line with real-world utilization.

    I don't know if this test is enough to say POWER8 is power/perf competitive with Haswell on 22nm.
    In fact POWER's market share is definitely at a historic minimum, and 14nm Broadwell is pretty young, so this disaster is not its fault.
  • jesperfrimann - Wednesday, September 21, 2016 - link

    If you have an OPAL (bare metal) system that cannot run PowerVM, then all the power-saving features are off by default, AFAIR.
    Try to have a look at:
    https://public.dhe.ibm.com/common/ssi/ecm/po/en/po...

    Many of the features do have a performance impact, ranging from negative through neutral to positive for a single one.

    But again, I think your comparison with 'vanilla' software stacks is relevant. This is what people would see out of the box with an existing software stack.
    It is 101% relevant to do that comparison, as this is the market that IBM is trying to break into with these servers.

    But what could be fun to see is some tests where all the bells and whistles are utilized. As many have written here: use of hardware-supported Decimal Floating Point, the vector execution unit, the ability to do hardware-assisted memory compression, etc.

    // Jesper
