SPECjbb2005

SPECjbb2005 from SPEC (Standard Performance Evaluation Corporation) evaluates the performance of server-side Java by emulating a three-tier client/server system with emphasis on the middle tier. Instead of relying on a potentially disk-intensive database system, SPECjbb stores its data in tables of objects implemented with Java Collections. The SPECjbb score thus depends on:
  • The JVM (Java Virtual Machine) and the way the JVM is tuned
  • CPU processing power
  • Caching and memory speed
  • Multiprocessing configuration (Scalability)
The latest version, SPECjbb2005, is much more memory intensive than its predecessor and adds XML processing, among other changes. From spec.org:
"SPECjbb2005 is a follow-on release to SPECjbb2000, which was inspired by the TPC-C benchmark and loosely follows the TPC-C specification for its schema, input generation, and transaction profile. SPECjbb2005 runs in a single JVM in which threads represent terminals, where each thread independently generates random input before calling transaction specific logic. There is neither network nor disk IO in SPECjbb2005."
SPECjbb starts up to two threads per logical CPU. For example, with Hyper-Threading enabled on our eight-core, quad-CPU Xeon MP 7030M system, 32 threads were started on the 16 logical CPUs. Each thread corresponds to one warehouse. Again from spec.org:
"A warehouse is a unit of stored data. It contains roughly 25MB of data stored in many objects in several Collections (HashMaps, TreeMaps). A thread represents an active user posting transaction requests within a warehouse. There is a one-to-one mapping between warehouses and threads, plus a few threads for SPECjbb2005 main and various JVM functions. As the number of warehouses increases during the full benchmark run, so does the number of threads. A "point" represents the throughput during the measurement interval at a given number of warehouses. A full benchmark run consists of a sequence of measurement points with an increasing number of warehouses (and thus an increasing number of threads)"
First we tested with some decent but rather generic tuning that we could use on all systems. The JVM was Sun's, version 1.5.0_08.
java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props
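
To put the warehouse and heap figures side by side, here is a minimal Python sketch. It assumes, based on the text above, that the run peaks at two warehouse threads per logical CPU and that each warehouse holds roughly 25MB; the function names are illustrative and not part of SPECjbb.

```python
# Rough sizing of SPECjbb2005 warehouse data against the JVM heap settings.
# Assumptions taken from the article text, not from the SPECjbb sources:
# two warehouses per logical CPU at the peak, ~25 MB of data per warehouse.

MB_PER_WAREHOUSE = 25   # "roughly 25MB" per the SPEC description
HEAP_MB = 3072          # -Xms3072m -Xmx3072m
YOUNG_GEN_MB = 1024     # -Xmn1024m


def peak_warehouses(logical_cpus, threads_per_cpu=2):
    """One warehouse per thread, so this is also the peak worker thread count."""
    return logical_cpus * threads_per_cpu


def warehouse_data_mb(warehouses):
    """Approximate warehouse data in MB for a given warehouse count."""
    return warehouses * MB_PER_WAREHOUSE


def old_gen_mb(heap_mb, young_mb):
    """Heap left for the old generation once the young generation is carved out."""
    return heap_mb - young_mb


# Quad-CPU, eight-core system with Hyper-Threading: 16 logical CPUs.
peak = peak_warehouses(16)
data = warehouse_data_mb(peak)
print(peak, data, old_gen_mb(HEAP_MB, YOUNG_GEN_MB))
```

Under these assumptions the warehouse data at the 32-warehouse point comes to about 800MB, which fits within the 2GB that the flags above leave for the old generation.
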

Our first test uses only one instance, and you might recall from our Xeon MP coverage that this is a setup the Opteron does not like, so let us focus mostly on the Intel results. Interestingly, the "Core based" Xeon 5345 cannot outperform the "Pentium 4 based" Xeon MP 7130. The higher clock speed of the Xeon MP (3.2GHz) helps of course, but it is still a surprise, especially considering that the cache system of the Xeon 5345 is quite competitive (4MB of low latency L2 per two cores) compared to that of the Xeon MP (1MB L2 per core plus a high latency 8MB L3 per two cores). Clovertown also has the better memory subsystem, especially if you compare memory latency (120 versus 195 ns).

Next, we tested SPECjbb with four application instances. Using numactl, a clever utility written by Andi Kleen, we were able to bind each Java instance to one CPU node on the HP DL585. We didn't bind instances to CPUs on the Intel platforms (it is possible with taskset), as doing so gives worse performance.

On the Opteron we used:
numactl --cpubind=(1-4) --membind=(1-4) java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props -id (1-4)
On the Xeons we used:
java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props -id (1 to 4)
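
The per-instance launch can be sketched as a small Python helper that only builds the four command lines and does not run them. The mapping of instance ids 1-4 to NUMA nodes 0-3 and the name `instance_command` are assumptions made for illustration; the JVM flags and the `--cpubind`/`--membind` options are the ones used in this article.

```python
# Build (but do not execute) one SPECjbb2005 command line per instance.
# Assumption: instance ids 1-4 map to NUMA nodes 0-3 on a four-socket Opteron.

JVM_ARGS = [
    "java", "-classpath", "jbb.jar:check.jar",
    "-Xms3072m", "-Xmx3072m", "-Xmn1024m", "-Xss128k",
    "-XX:+AggressiveOpts", "-XX:+UseParallelOldGC", "-XX:+UseParallelGC",
    "spec.jbb.JBBmain", "-propfile", "SPECjbb.props",
]


def instance_command(instance_id, bind_to_node):
    """Return the argv list for one benchmark instance."""
    cmd = JVM_ARGS + ["-id", str(instance_id)]
    if bind_to_node:
        node = instance_id - 1  # hypothetical id-to-node mapping
        # Pin both the CPUs and the memory allocations to the same node.
        cmd = ["numactl", f"--cpubind={node}", f"--membind={node}"] + cmd
    return cmd


for i in range(1, 5):
    # Opteron: bound to a node. Xeon: pass bind_to_node=False instead.
    print(" ".join(instance_command(i, bind_to_node=True)))
```

Binding CPU and memory to the same node keeps each instance's heap in the memory attached to the socket it runs on, which is the point of using numactl on the Opteron system.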

As we have noticed before, the Xeons do not benefit from using more instances, while Opteron performance is boosted significantly. That is quite good news for AMD, as testing with multiple instances is more realistic according to most Java people we talked to. The four dual core 2.4GHz Opterons outperform the 2.33GHz Xeon E5345 by a small margin. This really deserves more attention, as normally the Core based CPUs are capable of outperforming similarly clocked Opterons by a margin of 20% or more. We decided to check out the scaling of the different CPUs by testing with four and eight cores. We also tested the Opteron 880 with DDR-400. Unfortunately, we were not able to test with more than 8GB, so we could only test with two CPUs. The blue numbers are extrapolated numbers. The 2.8GHz Opteron numbers were based on the performance scaling we saw from a 2.2GHz Opteron to a 2.4GHz Opteron.

SPECjbb2005, 4 instances (Sun HotSpot)
Per core performance
CPU                                       Quad core   Octal core   Scaling 4->8
Xeon 7130 3.2 GHz                         39942       72980        83%
Xeon 5345 2.33 GHz                        39781       67447        70%
Opteron 880 2.4 GHz                       37397       71364        91%
Opteron 880 2.4 GHz DDR400                41137       78500        91%
Opteron 890 2.8 GHz                       46073       87920        91%
Xeon 5160 3 GHz                           47743       N/A          N/A
Xeon Scaling 2.33 -> 3 GHz                20%
Opteron 880 vs. Quad core Xeon 2.33 GHz   3%          16%          31%

Here is a first indication that the quad core Xeon does not scale as well as the other systems. Two 2.4GHz Opteron 880 processors are about as fast as one Xeon 5345 (3% ahead with DDR-400), but four Opterons outperform the dual quad core Xeon by 16%. In other words, the Opteron's gain from doubling the core count (91%) is about 31% larger than the Xeon's (70%).
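
For readers who want to check the scaling figures, the arithmetic can be sketched in a few lines of Python. The scores are copied from the Sun HotSpot table; the helper name `scaling_gain` is ours, and we assume the "Opteron 880 vs. Quad core Xeon" row uses the DDR400 scores, since those are the ones that reproduce the 3% and 16% figures.

```python
# Scaling from four to eight cores, using the Sun HotSpot scores in the table.
SCORES = {
    # CPU: (four-core score, eight-core score)
    "Xeon 7130 3.2 GHz": (39942, 72980),
    "Xeon 5345 2.33 GHz": (39781, 67447),
    "Opteron 880 2.4 GHz DDR400": (41137, 78500),
}


def scaling_gain(four_core, eight_core):
    """Fractional throughput gain when the core count doubles."""
    return eight_core / four_core - 1.0


gains = {cpu: scaling_gain(*scores) for cpu, scores in SCORES.items()}
for cpu, gain in gains.items():
    print(f"{cpu}: {gain:.0%}")

# How much larger the Opteron's gain is than the quad core Xeon's gain.
relative = gains["Opteron 880 2.4 GHz DDR400"] / gains["Xeon 5345 2.33 GHz"] - 1.0
print(f"Opteron gain relative to Xeon gain: {relative:.0%}")
```

Note that the 31% compares the two scaling gains with each other; it is not a difference in absolute throughput, which is 16% at eight cores.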

When you are in the market for a new server system, you typically care less about performance per core; instead, you care about the performance per dollar. That is why we should also look at performance per socket. If we look at typical HP systems for example, a two socket system with 8GB RAM can be found in the $6000-$7000 price range, while a similar quad socket system can cost $11000-$14000.

SPECjbb2005, 4 instances
Per socket performance
CPU                                           Dual Socket
Quad core Xeon 2.33 GHz vs. Xeon 5160         41%
Quad core Xeon 2.33 GHz vs. Opteron 880       64%
Quad core Xeon 2.33 GHz vs. Opteron 890       46%
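
The per socket percentages can be reproduced from the per core table with the sketch below. It rests on our reading of the tables: a dual socket score is the "octal core" column for the quad core Xeon and the "quad core" column for the dual core CPUs, and the Opteron 880 entry uses the DDR400 score. The helper name `advantage` is illustrative.

```python
# Per-socket comparison for two-socket servers, Sun HotSpot, 4 instances.
# Assumption: each value is the score of a fully populated dual-socket system.
DUAL_SOCKET = {
    "Xeon 5345 2.33 GHz": 67447,          # 2 x quad core = 8 cores
    "Xeon 5160 3 GHz": 47743,             # 2 x dual core = 4 cores
    "Opteron 880 2.4 GHz DDR400": 41137,  # 2 x dual core = 4 cores
    "Opteron 890 2.8 GHz": 46073,         # extrapolated in the article
}


def advantage(score, baseline):
    """Fractional lead of `score` over `baseline`."""
    return score / baseline - 1.0


quad = DUAL_SOCKET["Xeon 5345 2.33 GHz"]
leads = {
    cpu: advantage(quad, score)
    for cpu, score in DUAL_SOCKET.items()
    if cpu != "Xeon 5345 2.33 GHz"
}
for cpu, lead in leads.items():
    print(f"Quad core Xeon 2.33 GHz vs. {cpu}: {lead:.0%}")
```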

The Xeon 5345 might scale worse than the Opteron, but it offers a remarkable price/performance ratio. The Opteron 890/8220 costs about the same as the Xeon 5345, but to be fair, FB-DIMMs seem to be about 30% more expensive than comparable DDR2 DIMMs. With 8GB of RAM, this might amount to an extra cost of $300, making the Xeon 5345 system more expensive. Still, the Xeon 5345 offers a compelling performance advantage.

SPECjbb2005 - BEA JRockit

We suspected that the Sun JVM is reasonably well optimized for the Opteron and that maybe a little less effort went into the Intel optimizations. After all, Sun sells Opteron and Sparc servers. The BEA JRockit JDK provides a highly optimized JVM for running Java applications on x86-64 and Itanium CPUs, so we also did some testing with the BEA JRockit JVM. JRockit is known for being a rather memory gobbling but highly tunable JVM, so we aggressively tuned our server JVM.

On the Xeons we used the following parameters:
/java/jrockit-jdk1.5.0_06/bin/java -cp jbb.jar:check.jar -Xms2048m -Xmx2048m -XXaggressive -XXthroughputcompaction -XXallocprefetch -XXallocRedoPrefetch -XXcompressedRefs -XXlazyUnlocking -XXtlasize=128k spec.jbb.JBBmain -propfile SPECjbb.props -id 1-4
On the Opterons we used the following parameters:
numactl --cpubind=0-4 --membind=0-4 /java/jrockit-jdk1.5.0_06/bin/java -classpath jbb.jar:check.jar -XXaggressive -XXcompressedRefs -XXthroughputCompaction -XXlazyUnlocking -XXtlasize=64k -Xms1536m -Xmx1536m spec.jbb.JBBmain -propfile SPECjbb.props -id 1-4

As we suspected, JRockit is better optimized for Intel. A single Xeon 5345 outperforms a dual Opteron 880 by a large margin (26-39%). The victory is significant; however, the Clovertown scaling remains quite mediocre.

SPECjbb2005 / BEA
Per core performance
CPU                                       Quad core   Octal core   Scaling 4->8
Xeon 7130 3.2 GHz                         50000       85909        72%
Xeon 5345 2.33 GHz                        70035       103957       48%
Opteron 880 2.4 GHz                       50346       92213        83%
Opteron 880 2.4 GHz DDR400                55381       101434       83%
Xeon 5160 3 GHz                           79154       N/A          N/A
Xeon Scaling 2.33 -> 3 GHz                13%
Opteron 880 vs. Quad core Xeon 2.33 GHz   -28%        -11%         72%

Even with DDR-400, a dual Opteron 880 is not able to come close to a single Xeon E5345. However, the picture changes when we look at the "octal core" numbers. A dual Xeon E5345 is only 48% faster than a single one, while the Opteron increases its performance by 83% when the number of cores doubles.
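
How much each CPU gains from the switch of JVM can be estimated by dividing the four-core JRockit scores by the four-core Sun HotSpot scores from the earlier table. This is only a rough sketch: the two runs used different heap sizes and tuning flags, so the ratios are indicative at best, and the helper name `jvm_gain` is ours.

```python
# JRockit vs. HotSpot on the same hardware, four-core scores from both tables.
FOUR_CORE = {
    # CPU: (Sun HotSpot score, BEA JRockit score)
    "Xeon 7130 3.2 GHz": (39942, 50000),
    "Xeon 5345 2.33 GHz": (39781, 70035),
    "Xeon 5160 3 GHz": (47743, 79154),
    "Opteron 880 2.4 GHz": (37397, 50346),
}


def jvm_gain(hotspot, jrockit):
    """Fractional throughput gain from switching HotSpot to JRockit."""
    return jrockit / hotspot - 1.0


jvm_gains = {cpu: jvm_gain(*scores) for cpu, scores in FOUR_CORE.items()}
for cpu, gain in jvm_gains.items():
    print(f"{cpu}: {gain:.0%}")
```

With these numbers the Core based Xeons gain roughly 66-76% from the JVM switch, versus about 35% for the Opteron 880 and 25% for the Xeon 7130.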

SPECjbb2005 / BEA
Per socket performance
CPU                                           Dual Socket
Quad core Xeon 2.33 GHz vs. Xeon 5160         41%
Quad core Xeon 2.33 GHz vs. Opteron 880       64%

Still, the quad core Xeon remains the champion, offering 41% more performance for the same price as its 3GHz dual core brother. If you are using the BEA JVM, the Xeon is a much better choice than the AMD Opteron.


15 Comments


  • Antinomy - Wednesday, March 07, 2007

    A great review, very interesting.
    But there are a few things to mention. There is a mistake in the Cinebench results: in the overall table the uni Clovertown system got 1272 points, but in the next table (per core performance) it got 1169. The result was swapped with that of the Xeon 7130. Also, a comment about the scalability extrapolation: the results of the 2.33 Clovertown and the 3.0 dual Woodcrest can hardly be compared because the systems are organized differently. These motherboards have two independent FSBs, which means that the two Woodcrests are provided with twice the peak memory bandwidth. This can't have no influence on the result. Also, the 4-channel memory mode provides a 5% increase versus 2-channel in real bandwidth, so we can't say that these applications do not suffer from a lack of memory bandwidth.
    It would be interesting to see a test of a uni Woodcrest system and a test of Woodcrest-based systems (both uni and dual) at the same frequency as Clovertown. Kentsfield/Conroe systems (even though they aren't server parts) would also be nice to look at because of their more efficient usage of memory bandwidth and FSB.
  • afuruhed - Thursday, December 28, 2006

    We are getting more Clovertowns. There is a chart at pantor.com (http://www.pantor.com/software.html) that indicates that some applications benefit a lot. The FIX protocol (http://en.wikipedia.org/wiki/FIX_protocol) is a technical specification for electronic communication of trade-related messages (financial markets).
  • henriks - Thursday, December 28, 2006

    Agree with other responses - good article!

    Some comments on the jbb results page:

    You state that JRockit is (only) available for x86-64 and Itanium. x86 and Sparc should be added to this list.

    The JRockit configuration you're using enables a single-spaced GC. In that configuration, performance is tied to heap size (larger heap means fewer GC events). Increasing the heap size to 3 GB - as for the Sun benchmark results - would increase performance slightly but in particular give much better scalability when you increase the number of warehouses to large numbers.

    It looks like you have not enabled large pages in the OS. Doing this would give a large performance boost and help scalability regardless of chip or JVM vendor.

    Astute readers may note that your results are lower than the published results on www.spec.org. Apart from OS and possibly BIOS tuning, the reason is that the most recent results are using a newer JRockit version (not yet available for public download). This new version improves performance on this benchmark by 20-30% on x86 chips - Intel *and* AMD - with the largest positive effect on high-bin chips from the respective vendors. The effect on other Java applications varies from zero to a lot.

    Cheers!

    Henrik, JRockit team
  • dropadrop - Wednesday, December 27, 2006

    Considering how much we just paid for some DL585s compared to DL380s, I think the performance is pretty impressive. There is still something the DL380s (and most other two socket servers) can't do, and that is hosting 64GB or more RAM.

    I mainly take care of VMware servers, and there the amount of memory becomes a bottleneck long before the processors, at least in most setups. I don't think I'd have a lot of use for octal processors unless I got a minimum of 32GB of RAM, probably 64.
  • rowcroft - Thursday, December 28, 2006

    I've run into the same challenge when planning for the quads. My take is that I'm getting dual quads for half the price of quad dual cores. With ESX 3's HA functionality I can group the host servers and get the 32GB of RAM with double the cores and have host based redundancy for critical VMs.
  • mino - Thursday, December 28, 2006

    There is another thing the DL380 lacks: no drop-in analog to Barcelona on the horizon...
  • Justin Case - Wednesday, December 27, 2006

    Finally a good article at AT, written by someone who knows what he's talking about. Meaningful benchmarks, meaningful comments, and conclusions that make sense. If only some Johanness could rub off on other AT writers...
  • hans007 - Wednesday, December 27, 2006

    I think an alternative to, say, a dual dual core AMD system, even as a server or workstation, is a quad core socket 775 CPU. I know the lower 3xxx series Xeons are made for this (and are exactly the same as Core 2 Duo), so you could do a comparison of two AMD dual cores vs. a single 775 quad with ECC DDR2, etc.
  • mino - Thursday, December 28, 2006

    Check QuadFX vs. Kentsfield reviews.

    With ECC both results will be a bit lower but the comparison remains.

    A small hint: NO ONE tested QuadFX as a DB server against Kentsfield....

    Guess what: Quad FX is cheaper and would rule the roost on server-like tasks.
  • ltcommanderdata - Wednesday, December 27, 2006

    Well it's nice to finally see a review of the 5145, although I was hoping for more detailed power consumption numbers. The performance benchmarks were very detailed though, which was great.

    Thought I would point out a few errors I noticed as I was flipping through. First, on page 2, in the Cache2Cache Latency chart, the 201 for the Xeon DP 5060 that is placed in the "Same die, same package" row should be in the "Different die, same package" row. Dempsey uses a dual die approach like Presler and Clovertown, as opposed to a single die approach like Smithfield and Paxville DP. And on the last page, in the conclusion, you mentioned Clarksboro having "four DIBs", which implies 8 FSBs. I believe that should read two DIBs, or really a Quad Independent Bus (QIB), since I'm pretty sure it only has 4 FSBs. (On a side note, Intel slides showed those 4 FSBs clocked at 1066MHz, which is really disappointing. Hopefully, now that Clovertown turns out to come in 1333MHz versions instead of only the 1066MHz versions that were first announced, Tigerton (and therefore Clarksboro), which is based on Clovertown, will also have 1333MHz versions.)
