vApus Mark I: Performance-Critical Applications Virtualized

Our vApus Mark I benchmark is not a VMmark replacement. It is meant to be complementary: while VMmark runs 60 to 120 light loads, vApus Mark I runs 8 heavy VMs on 24 virtual CPUs (vCPUs). Our vApus Stressclient is being improved to scale to a much larger number of vCPUs, but for now we limit the benchmark to 24 vCPUs.

A vApus Mark I tile combines one OLTP database, one OLAP database and two heavy websites. These are the kind of demanding applications that still got their own dedicated, natively running machine a year ago. vApus Mark I shows what happens if you virtualize them. If you want to fully understand our benchmark methodology: vApus Mark I has been described in great detail here. We have changed only one thing compared to our original benchmarking: we used large pages, as this is generally considered a best practice (with RVI, EPT).

The current vApus Mark I uses two tiles. Per tile we have 4 VMs with 4 server applications:

  • A SQL Server 2008 x64 database running on Windows 2008 64-bit, stress tested by our in-house developed vApus test (4 vCPUs).
  • Two heavy-duty MCS eFMS portals running PHP and IIS on Windows 2003 R2, stress tested by our in-house developed vApus test (2 vCPUs each).
  • One OLTP database, based on the Oracle 10g "Calling Circle" benchmark by Dominic Giles (4 vCPUs).
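
The 24 vCPU total follows directly from the tile layout listed above. A minimal Python sketch of that arithmetic; the dictionary keys are illustrative labels only, not names used by vApus:

```python
# vCPUs per VM in one vApus Mark I tile, taken from the list above.
# The keys are illustrative labels, not vApus identifiers.
TILE_VCPUS = {
    "sql_server_2008": 4,
    "web_portal_1": 2,
    "web_portal_2": 2,
    "oracle_oltp": 4,
}
TILES = 2  # the current vApus Mark I runs two tiles

vms_per_tile = len(TILE_VCPUS)
vcpus_per_tile = sum(TILE_VCPUS.values())
total_vms = TILES * vms_per_tile
total_vcpus = TILES * vcpus_per_tile

print(f"{vms_per_tile} VMs and {vcpus_per_tile} vCPUs per tile")
print(f"{total_vms} VMs and {total_vcpus} vCPUs in total")
```

Two tiles of 12 vCPUs each give the 8 VMs and 24 vCPUs quoted earlier.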

The beauty is that vApus (stress testing software developed by the Sizing Servers Lab) uses actions made by real people (as can be seen in logs) to stress test the VMs, not some benchmarking algorithm.

Update: We have noticed that the CPU load of the Magny-Cours is at 70-85%, while the six-core "Istanbul" is running at 80-95%. As we have noted before, 24 cores is at the limit of our current benchmark until we launch vApus Mark 2. We have reason to believe that the Opteron 6174 has quite a bit of headroom left. The results above are not wrong, but they do not show the full potential of the 6174. We are checking the CPU load numbers of the six-core Xeon X5670 as we speak. Expect an update in the coming days.

vApus Mark I 2 tile test - 24 vCPUs - ESX 4.0

The AMD Opteron 6174 performs well here, but disappoints a bit at the same time. vApus Mark I does not scale as well as VMmark. The reason is simple: as we used 4 virtual CPUs for both the OLTP and the OLAP virtual machine, scaling depends more on the individual applications. One VM with 4 virtual CPUs will not scale as well as 16 VMs sharing the same 4 virtual CPUs. Also, we use heavy database applications that typically like a decent amount of cache. The difference with the Xeon X5670 is small, though. Servers based on both CPUs will make excellent virtualization platforms.

Next, the same test with Hyper-V, the hypervisor beneath Windows 2008 R2. We are testing with Hyper-V R2 6.1.7600.16385 (21st of July 2009).

vApus Mark I 2 tile test - 24 vCPUs - Hyper-V

Based on the excellent results of the Dual Opteron 2435 we expected AMD to take the crown in this benchmark, but that did not happen. We only had one week to get all of the Opteron testing done (AMD didn't have any hardware until the last minute), so we could not analyze this in depth. For some reason, the Opteron 6174 does not scale very well in our vApus benchmark. Compared to a 2.2GHz six-core, we only see a 30% increase in performance, about the same as Intel gets out of adding 2 extra cores to their Xeon. Part of the reason might be our benchmark: at the moment we are limited to 24 vCPUs. We’ll investigate this in more detail in the coming quarter when vApus v2 is available.
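
To put the scaling comparison above in numbers, here is a rough Python sketch that divides the relative performance gain by the relative increase in core count. The per-socket core counts (12 versus 6 for the Opterons, 6 versus 4 for the Xeons) come from the CPU specifications, the 30% gains are the approximate figures quoted above, and `scaling_efficiency` is a hypothetical helper name introduced only for illustration.

```python
def scaling_efficiency(perf_gain: float, cores_before: int, cores_after: int) -> float:
    """Fraction of the added core capacity that shows up as extra performance."""
    core_gain = (cores_after - cores_before) / cores_before
    return perf_gain / core_gain

# Approximate gains quoted in the text; core counts are per socket.
opteron = scaling_efficiency(0.30, cores_before=6, cores_after=12)
xeon = scaling_efficiency(0.30, cores_before=4, cores_after=6)

print(f"Opteron 6174 vs. six-core Opteron: {opteron:.0%} of the added cores")
print(f"Xeon X5670 vs. Xeon X5570: {xeon:.0%} of the added cores")
```

Under these assumptions the Opteron turns roughly a third of its doubled core count into throughput, while the Xeon turns roughly 60% of its 50% core increase into throughput. This is only a back-of-the-envelope view, since the 24 vCPU limit of the benchmark may be holding the 24-core system back.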

The difference with the Xeon X5670 is small though, and the slightly lower price of the Opteron makes up for the slightly lower performance.  

58 Comments

  • zarjad - Friday, April 2, 2010 - link

    I understand that HT can be disabled in BIOS and that some benchmarks don't like HT.
  • elnexus - Wednesday, April 21, 2010 - link

    I can report that one of my customers, performing intensive image processing, found that DISABLING hyper-threading on a Nehalem-based workstation, actually IMPROVED performance considerably.

    It seems that certain applications don't like hyper-threading, while others do. I always recommend that my customers perform sensitivity analyses on their computing tasks with HT on and off, and then use whichever is best.
  • tracerburnout - Wednesday, March 31, 2010 - link

    How is it possible that Intel's Xeon X5670 rig returns 19k+ for a score while AMD's magny-cours returns only 2k+?? I only question the results of this benchmark chart because Intel's Xeon X5570 rig returns only around 1k. How can a X5670 be 19x faster than a X5570?? And I doubt the same is true for the magny-cours by being just 10.5% of what the X5670 can do.

    (is there an extra '0' by accident in there?)



    tracerburnout
    proud supporter of AMD, with a few Intel rigs for Linux only
  • JohanAnandtech - Thursday, April 1, 2010 - link

    No, it is just that Sisoft uses the new AES instructions of Westmere. It is a forward-looking benchmark which tests only a small part of a larger website code base. So that 19x faster will probably result in 10 to 20% of the complete website being 19x faster. So the real performance impact will be a lot smaller. It is interesting though to see how much faster these dedicated SIMD instructions are on these kinds of workloads.
  • alpha754293 - Thursday, April 1, 2010 - link

    If you guys need help with setting up or running the Fluent/LS-DYNA benchmarks let me know.

    I see that you don't really spend as much time writing or tweaking it as you do with some of the other programs, and that to me is a little concerning only because I don't think that it is showing the true potential of these processors if you run it straight out-of-the-box (especially with Fluent).

    Fluent tends to have a LOT of iterations, but it also tends to short-stroke the CPU (i.e. the time required to complete all of the calculations necessary is less than 1 second and therefore doesn't make full use of the computational ability).

    Also, the parallelization method (MPICH2 vs. HP MPI) makes a difference in the results.

    You want to make sure that the CPUs are fully loaded for a period of time such that at each iteration, there should be a noticeable dwell time AT 100% CPU load. Otherwise, it won't really demonstrate the computational ability.

    With LS-DYNA, it also makes a difference whether it's SMP parallelization or MPP parallelization as well.
  • k_sarnath - Friday, April 2, 2010 - link

    The most baffling part is how Linux could engage 12 CPUs much better than Windows. I am obviously curious about the OS platform for other tests. Similarly, MS SQL was able to scale well on multi-cores... In this context, I am not sure how we can look at the performance numbers... A badly scaling app or OS could show the 12-core one in a bad light.
  • OneEng - Saturday, April 3, 2010 - link

    Hi Johan,

    I have followed your articles from the early days at Ace's and have a good respect for the technical accuracy of your articles.

    It appears that the X5570 scaling between 4 and 8 cores has very little gain in the Oracle Calling Circle benchmark. Furthermore, the 24 cores of MC at 2.2Ghz are way behind. Westmere appears to do quite well, but really should not be able to best 8 cores in the X5570 with all else being equal.

    I have heard some state that the benchmark is thread bound to a low number of threads (don't know if I am buying this), but surely something fishy is going on here.

    It appears that there is either a real world application limit to core scaling on certain types of Oracle database applications (if there are, could you please explain what features an app has when these limits appear), or that the benchmark is flawed in some way.

    I have a good amount of experience in Oracle applications and have usually found that more cores and more memory make Oracle happy. My experience seems at odds with your latest benchmarks.

    Any feedback would be appreciated .... Thanks!
  • JohanAnandtech - Tuesday, April 6, 2010 - link

    I am starting to suspect the same. I am going to dissect the benchmark soon to see what is up. It is not disk related, or at least that is surely not our biggest problem. Our benchmark might not be far from the truth though: I think Oracle really likes the big L3 cache of the Westmere CPU.

    If you have other ideas, mail at johanATthiswebsiteP
  • heliosblitz2 - Wednesday, April 7, 2010 - link

    You wrote
    Test-Setup:
    Xeon Server 1: ASUS RS700-E6/RS4 barebone
    Dual Intel Xeon "Gainestown" X5570 2.93GHz, Dual Intel Xeon “Westmere” X5670 2.93 GHz
    6x4GB (24GB) ECC Registered DDR3-1333

    "Also notice that the new Xeon 5600 handles DDR3-1333 a lot more efficiently. We measured 15% higher bandwidth from exactly the same DDR3-1333 DIMMs compared to the older Xeon 5570."

    That is not exactly the reason, I think.
    The reason is that you populated the second memory bank in both setups.
    Intel specification:
    Westmere 1333 MHz CPUs run at 1333 MHz with the second bank populated, while
    Nehalem 1333 MHz CPUs run at 1066 MHz with the second bank populated.

    That could be updated.

    Compare tech docs on Intel site: datasheet Xeon 5500 Part 2 and datasheet Xeon 5600 Part 2

    Arnold.
  • gonerogue - Saturday, April 10, 2010 - link

    The Viper is a V10 and most certainly not a traditional muscle car ;)
