Java Performance

The SPECjbb 2015 benchmark has "a usage model based on a world-wide supermarket company with an IT infrastructure that handles a mix of point-of-sale requests, online purchases, and data-mining operations." It uses the latest Java 7 features and makes use of XML, compressed communication, and messaging with security.

We tested with four groups of transaction injectors and backends. The reason why we use the "Multi JVM" test is that it is more realistic: multiple VMs on a server is a very common practice.

The Java version was OpenJDK 1.8.0_91. We applied relatively basic tuning to mimic real-world use, while aiming to fit everything inside a server with 128 GB of RAM:

"-server -Xmx24G -Xms24G -Xmn16G -XX:+AlwaysPreTouch -XX:+UseLargePages"

The graph below shows the maximum throughput numbers for our MultiJVM SPECJbb test.

SPECJBB 2015-Multi Max-jOPS

The Critical-jOPS metric is a throughput metric under response time constraint.

SPECJBB 2013-Multi Critical-jOPS

After checking the other benchmarks and IBM's published benchmarks at spec.org we suspected that there was something suboptimal for the POWER8 server. A Java expert at IBM responded:

Currently OpenJDK 8 performance on power lags behind the IBM JDK for this benchmark, and is not reflective of the hardware. The gap is being closed by driving changes into OpenJDK 9 that will be back ported to OpenJDK 8.

So this is one of those cases where the "standard" open source ecosystem is not optimal. So we tried out IBM's 64 bit SDK version LE 8.0-3.11. Out of the box, throughput got even worse. Only when we used more complicated tuning and more memory...

"-XX:-RuntimeInstrumentation -Xmx32G -Xms32G -Xmn16G -Xlp -Xcompressedrefs -Dcom.ibm.crypto.provider.doAESInHardware=true -XnotlhPrefetch"

...did performance get better. We also had to use static huge pages (16 MB each) instead of transparent huge pages. We gave the Intel Xeons also 32 GB per VM and eight 32 GB DIMMs, but we did not use the IBM JDK as it caused a performance decrease.

SPECJBB 2015-Multi Max-jOPS optimized

Both the Xeon and the IBM POWER8 max throughput got worse. In the case of Intel we can easily explain this: its 32 GB of DDR4-2133 memory offers less bandwidth and higher latency. In case of IBM, we can only say we lack knowledge of the IBM JDK.

SPECJBB 2013-Multi Critical-jOPS optimized

Again, the Xeon E5-2680 v4 loses performance due to the fact that we had to use more but slower memory.

This is not an apples to apples comparison as we would like it, but it gives a good indication that the IBM POWER8 might actually able to beat even more expensive Xeon E5s than our (simulated) Xeon E5-2680v4 with optimized software. A Xeon E5-2690 v4 ($2090) would be beaten by a decent margin and even the 2695v4 ($2400) might be in reach.

Benchmark Configuration and Methodology Database Performance
Comments Locked

49 Comments

View All Comments

  • JohanAnandtech - Sunday, September 25, 2016 - link

    Thanks Jesper. Looks like I will have to spend even more time on that system :-). And indeed, out of the box performance is important if IBM ever wants to get a piece of the x86 market.
  • luminarian - Thursday, September 15, 2016 - link

    It was my understanding that the SMT mode on the power8 could be changed. Depending on the type of work this would make a giant difference, especially with mysql/mariadb that are limited to 1 process/thread per connection.

    With databases the real winner would be with one that supports parallel queries, such as postgresql 9.6, db2, oracle, etc.

    Also yer bench mark very easily could be limiting the power8 if its not opening enough connections to fill out the number of threads that thing can handle, remember mysql/mariaDB are 1 process/thread per connection. Alot of database bench marks default to a small number of connections, this thing has 160 threads with the dual 10 core. I would suggest trying to run that same benchmark again but do it at the same time from multiple client machines. See if the bench takes a larger dip when a second client machine runs the same bench or if the bench shows similar figures(granted this might hit hd io limit on the power8 server).

    So yea, that and try SMT-2 and SMT-4 modes.
  • JohanAnandtech - Friday, September 16, 2016 - link

    Hi, I tried SMT-4, throughput was about 25% worse: 11k instead 14k+. 95th perc response time was better: 3.7 ms.
  • JohanAnandtech - Friday, September 16, 2016 - link

    updated the MySQL graphs with SMT-4 data. Our Spark tests gets worse with SMT-4 and that is also true for SPECjbb.
  • luminarian - Friday, September 16, 2016 - link

    Awesome, Thanks for the response.
  • Meteor2 - Friday, September 16, 2016 - link

    The HPC potential is awesome. You can really see why Oak Ridge chose POWER9 and Volta.
  • Communism - Sunday, September 18, 2016 - link

    Pretty sure most of the reason for that is due to Intel blocking every attempt Nvidia makes at getting a high bandwidth interface bolted onto a Xeon.

    Given that one of the main reasons that Intel blocked Nvidia's chipset business way back in the day was to try to limit the ability of other companies bolting on high bandwidth accelerators onto Intel chips (Presumably to protect their own initiatives in that space).
  • Klimax - Saturday, September 17, 2016 - link

    Not terribly impressive. You have to get SW to paly nice and spend time to fine tune it to outperform Intel and it will cost you in power and cooling. More like "yes, if you get quite bigger TDP you get bit more power". And it won't be terribly good in many cases. (Like public facing service where latency is critical)

    Maybe if you are in USA and can waste admins and devs time and waste a lot on cooling and electricity then maybe. Otherwise why bother...
  • SarahKerrigan - Sunday, September 18, 2016 - link

    I don't see this as a bad result. This is a 22nm processor, over two years old, and it beats Haswell-EP (which is newer) on efficiency. Broadwell-EP is brand new, and P9 should come out well before the end of BDW-EP's lifecycle.
  • Kevin G - Sunday, September 18, 2016 - link

    Some of the POWER9 chips will be out next year though is suspect that the scale-up models maybe an early 2018 part. Considering that those chips go into IBM's big iron Unix servers, they tend to launch a bit later than the low end models so it isn't game changing.

    The real question is when SkyLake-EP/EX will launch and in comparison to the scale-out POWER9 chips. I was expecting a first half of 2017 for the Intel parts but I have no reference as to when to expect the POWER9 SO chips. Thus there is a chance Intel can come out first.

    Intel also wants a quick transition to SkyLake-EP/EX as they unify those to lines to some extent and provide some major platform improvements. I'm thinking Broadwell-EP/EX will have a relatively short life span compared to Haswell-EP/EX. This mimics much of what happened on the desktop and the challenge to move to 14 nm.

Log in

Don't have an account? Sign up now