vApus Mark I: the choices we made

vApus Mark I uses only Windows guest OS VMs, but we are also preparing a mixed Linux and Windows scenario. vApus Mark I uses four VMs with four server applications:

  • The Nieuws.be OLAP database, based on SQL Server 2008 x64 running on Windows 2008 64-bit, stress tested by our in-house developed vApus test.
  • Two MCS eFMS portals running PHP and IIS on Windows 2003 R2, stress tested by our in-house developed vApus test.
  • One OLTP database, based on Oracle 10g and stress tested with the Calling Circle benchmark of Dominic Giles.

We took great care to make sure that the benchmarks start, run under full load, and stop at the same moment. vApus is capable of breaking off a test when another is finished, or repeating a stress test until the others have finished.
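One way to picture this synchronization is as a shared time window set by the longest-running test. The sketch below is a simplified model, not vApus code: `plan_runs` and the pass durations are hypothetical, and it assumes a single pass of each test has a fixed, known duration.

```python
def plan_runs(pass_seconds):
    """Plan repeats so every test stays under load for the same window.

    pass_seconds maps a (hypothetical) test name to the duration of one
    pass in whole seconds. The longest pass defines the shared window;
    shorter tests repeat, and their last pass is broken off at the end.
    """
    window = max(pass_seconds.values())
    plan = {}
    for name, seconds in pass_seconds.items():
        full_passes, partial_seconds = divmod(window, seconds)
        plan[name] = {
            "full_passes": full_passes,          # passes that run to completion
            "partial_seconds": partial_seconds,  # length of the broken-off pass
        }
    return window, plan


# Illustrative durations only; these are not measured values.
window, plan = plan_runs({"olap": 600, "web": 250, "oltp": 600})
for name, entry in plan.items():
    print(name, entry)
```

In this model the web test completes two passes and is broken off 100 seconds into the third, so all tests stop at the same moment.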


The OLAP VM is based on the Microsoft SQL Server database of the Flemish/Dutch Nieuws.be site, one of the newest web 2.0 websites launched in 2008. Nieuws.be uses a SQL Server 2008 x64 database on top of Windows 2008 Enterprise RTM (64-bit). It is a typical OLAP database, with more than 100GB of data consisting of a few hundred separate tables. 99% of the load on the database consists of selects, and about 5% of these are stored procedures. Network traffic is 6.5MB/s average and 14MB/s peak, so our Gigabit connection still has a lot of headroom. DQL (Disk Queue Length) is at 2.0 in the first round of tests, but we only record the results of the subsequent rounds where the database is in a steady state. We measured a DQL close to 0 during these tests, so there is no tangible impact from the storage system. The database is warmed up with 50 to 150 users. The results are recorded while 250 to 700 users hit the database.
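The network headroom claim can be checked with a little arithmetic. This sketch assumes decimal units (1 Gbit/s = 125 MB/s) and ignores protocol overhead; the function name is ours.

```python
# 1 Gbit/s expressed in MB/s, assuming decimal units and no protocol overhead.
GIGABIT_MB_PER_S = 1000 / 8


def link_utilization(traffic_mb_per_s, capacity_mb_per_s=GIGABIT_MB_PER_S):
    """Return the fraction of link capacity consumed by the given traffic."""
    return traffic_mb_per_s / capacity_mb_per_s


average = link_utilization(6.5)  # average traffic quoted above
peak = link_utilization(14)      # peak traffic quoted above
print(f"average: {average:.1%}, peak: {peak:.1%}")  # average: 5.2%, peak: 11.2%
```

Even at peak, the OLAP workload uses only about a ninth of the link under these assumptions.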

The MCS eFMS portal, a real-world facility management web application, has been discussed in detail here. It is a complex IIS, PHP, and FastCGI site running on top of Windows 2003 R2 32-bit. Note that these two VMs run in a 32-bit guest OS, which impacts the VM monitor mode.

Since OLTP testing with our own flexible stress testing software is still in beta, our fourth VM uses a freely available test: "Calling Circle" of the Oracle Swingbench Suite. Swingbench is a free load generator designed by Dominic Giles to stress test an Oracle database. We tested the same way as we have tested before, with one difference: we use an OLTP database that is only 2.7GB in size (instead of 9.5GB). We previously used a 9.5GB database to make sure that locking contention didn't kill scaling on systems with up to 16 logical CPUs. In this case, 2.7GB is enough as we deploy the database on a 4 vCPU VM. Keeping the database relatively small allows us to shrink the SGA size (Oracle buffer in RAM) to 3GB (normally it's 10GB) and the PGA size to 350MB (normally it's 1.6GB). Shrinking the database ensures that our VM is content with 4GB of RAM. Remember that we want to keep the amount of memory needed low so we can perform these tests without needing the most expensive RAM modules on the market. A Calling Circle test consists of 83% selects, 7% inserts, and 10% updates. The OLTP test runs on Oracle 10g Release 2 (10.2) 64-bit on top of Windows 2008 Enterprise RTM (64-bit).
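The memory budget behind this choice is easy to verify. The sketch below is a rough calculation only: it assumes 1GB = 1024MB, counts just the SGA and PGA targets, and ignores everything else Oracle and Windows need; the function name is ours.

```python
MB_PER_GB = 1024


def oracle_buffers_mb(sga_gb, pga_mb):
    """Sum the SGA and PGA targets in MB (other Oracle memory is ignored)."""
    return sga_gb * MB_PER_GB + pga_mb


vm_ram_mb = 4 * MB_PER_GB

# Reduced settings used for the 2.7GB database.
reduced_mb = oracle_buffers_mb(sga_gb=3, pga_mb=350)
# Settings quoted as normal for the 9.5GB database.
normal_mb = oracle_buffers_mb(sga_gb=10, pga_mb=1.6 * MB_PER_GB)

print(f"reduced: {reduced_mb} MB, left in a 4GB VM: {vm_ram_mb - reduced_mb} MB")
print(f"normal:  {normal_mb:.0f} MB, which exceeds the 4GB VM")
```

With the reduced settings, 3422MB goes to the Oracle buffers and 674MB remains for the guest OS and other processes, whereas the normal settings alone would need nearly three times the VM's RAM.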

Below is a small table that gives you the "native" characteristics that matter for virtualization in each test. (Page management is still being researched.) By "native" we mean the characteristics measured with perfmon while running on the native OS (Windows 2003 and 2008 server).

Native Performance Characteristics
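| Native Application / VM     | Kernel Time | Typical CPU Load | Interrupts/s | Network | Disk I/O | DQL  |
|-----------------------------|-------------|------------------|--------------|---------|----------|------|
| Nieuws.be / VM1             | 0.65%       | 90-100%          | 3000         | 1.6MB/s | 0.9MB/s  | 0.07 |
| MCS eFMS / VM2 & 3          | 8%          | 50-100%          | 4000         | 3MB/s   | 0.01MB/s | 0    |
| Oracle Calling Circle / VM4 | 17%         | 95-100%          | 11900        | 1.6MB/s | 3.2MB/s  | 0.07 |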

Our OLAP database ("Nieuws.be") is clearly mostly CPU intensive and performs very little I/O besides a bit of network traffic. In contrast, the OLTP test causes an avalanche of interrupts. How much time an application spends in the native kernel gives a first rough indication of how much the hypervisor will have to work. It is not the only determining factor, as we have noticed that a lot of page activity is going on in the MCS eFMS application, which causes it to be even more "hypervisor intensive" than the OLTP VM. From the data we gathered, we suspect that the Nieuws.be VM will be mostly stressing the hypervisor by demanding "time slices" as the VM can absorb all the CPU power it gets. The same is true for the fourth "OLTP VM", but this one will also cause a lot of extra "world switches" (from the VM to hypervisor and back) due to the number of interrupts.
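To make the comparison concrete, the kernel time and interrupt columns of the table above can be ranked programmatically. This is only an illustration of the "first rough indication" described here: the field names and `rank_by` helper are ours, and the ranking deliberately ignores page activity, which is why it understates how hypervisor intensive MCS eFMS is.

```python
# Kernel time and interrupt rate per workload, copied from the table above.
native_profile = [
    {"vm": "Nieuws.be / VM1", "kernel_time_pct": 0.65, "interrupts_per_s": 3000},
    {"vm": "MCS eFMS / VM2 & 3", "kernel_time_pct": 8.0, "interrupts_per_s": 4000},
    {"vm": "Oracle Calling Circle / VM4", "kernel_time_pct": 17.0, "interrupts_per_s": 11900},
]


def rank_by(rows, key):
    """Return VM names ordered from the highest to the lowest value of key."""
    ordered = sorted(rows, key=lambda row: row[key], reverse=True)
    return [row["vm"] for row in ordered]


print("By kernel time:", rank_by(native_profile, "kernel_time_pct"))
print("By interrupts: ", rank_by(native_profile, "interrupts_per_s"))
```

Both rankings put the OLTP VM first and the OLAP VM last, which matches the expectation that the OLTP VM causes the most world switches.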

The two web portal VMs, which sometimes do not demand all available CPU power (4 cores per VM, 8 cores in total), will allow the hypervisor to make room for the other two VMs. However, the web portal (MCS eFMS) will give the hypervisor a lot of work if Hardware Assisted Paging (RVI, NPT, EPT) is not available. If EPT or RVI is available, the TLBs (Translation Lookaside Buffers) of the CPUs will be stressed quite a bit, and TLB misses will be costly.

As the SGA buffer is larger than the database, very little disk activity is measured. It helps of course that the storage system consists of two extremely fast X25-E SSDs. We only measure performance when all VMs are in a "steady" state; there is a warm-up time of about 20 minutes before we actually start recording measurements.
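Discarding the warm-up period amounts to a simple filter over timestamped samples. The sketch below assumes samples arrive as `(elapsed_seconds, value)` pairs and uses a fixed 20-minute cutoff; the sample format, names, and values are illustrative, not taken from vApus.

```python
from statistics import mean

WARMUP_SECONDS = 20 * 60  # warm-up of about 20 minutes, as described above


def steady_state_mean(samples, warmup_seconds=WARMUP_SECONDS):
    """Average only the samples taken at or after the warm-up cutoff.

    samples is an iterable of (elapsed_seconds, value) pairs.
    """
    steady = [value for elapsed, value in samples if elapsed >= warmup_seconds]
    if not steady:
        raise ValueError("no samples recorded after the warm-up period")
    return mean(steady)


# Made-up throughput samples: the first two fall inside the warm-up window.
samples = [(0, 100), (600, 200), (1200, 300), (1800, 500)]
print(steady_state_mean(samples))  # 400
```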

66 Comments

  • GotDiesel - Thursday, May 21, 2009 - link

    "Yes, this article is long overdue, but the Sizing Server Lab proudly presents the AnandTech readers with our newest virtualization benchmark, vApus Mark I, which uses real-world applications in a Windows Server Consolidation scenario."

    spoken with a mouth full of microsoft cock

    where are the Linux reviews ?

    not all of us VM with windows you know..

  • JohanAnandtech - Thursday, May 21, 2009 - link

    A minimum form of politeness would be appreciated, but I am going to assume you were just disappointed.

    The problem is that right now the Calling Circle benchmark runs half as fast on Linux as it does on Windows. What is causing Oracle to run slower on Linux than on Windows is a mystery even to some of the experienced DBAs we have spoken to. We either have to replace that benchmark with an alternative (probably Sysbench) or find out what exactly happened.

    When you construct a virtualized benchmark it is not enough just to throw in a few benchmarks and VMs, you really have to understand the benchmark thoroughly. There are enough half-baked benchmarks already on the internet that look like Swiss cheese because there are so many holes in the methodology.
  • JarredWalton - Thursday, May 21, 2009 - link

    Page 4: vApus Mark I: the choices we made

    "vApus mark I uses only Windows Guest OS VMs, but we are also preparing a mixed Linux and Windows scenario."

    Building tests, verifying tests, running them on all the servers takes a lot of time. That's why the 2-tile and 3-tile results are not yet ready. I suppose Linux will have to wait for Mark II (or Mark I.1).
  • mino - Thursday, May 21, 2009 - link

    What you did so far is great. No more words needed.

    What I would like to see is vApus Mark I "small" where you make the tiles smaller, about 1/3 to 1/4 of your current tiles.
    Tile structure shall remain similar for simplicity, they will just be smaller.

    When you manage to have 2 different tile sizes, you shall be able to consider 1 big + 1 small tile as one "condensed" tile for general score.

    Having 2 reference points will allow for evaluating "VM size scaling" situations.
  • JohanAnandtech - Sunday, May 24, 2009 - link

    Can you elaborate a bit? What do you mean by "1/3 of my current tile"? A tile = 4 VMs. Are you talking about small mem footprint or number of vCPUs?

    Are you saying we should test with a Tile with small VMs and then test afterwards with the large ones? How do you see such "VM scaling" evaluation?
  • mino - Monday, May 25, 2009 - link

    Thanks for response.

    By 1/3 I mean smaller VMs, mostly from the load POV. Probably 1/3 load would go for 1/2 memory footprint.

    The point being that currently there is only a single data point with a specific load size per tile/per VM.

    By "VM scaling" I mean I would like to see what effect smaller loads would have on overall performance.

    I suggest 1/3 or 1/4 the load to get a measurable difference while remaining within reasonable memory/VM scale.

    In the end, if you get similar overall performance from 1/4 tiles, it may not make sense to include this in future.
    Even then the information that your benchmark results can be safely extrapolated to smaller loads would be of great value by itself.
  • mino - Monday, May 25, 2009 - link

    Eh, that last text of mine looks like nice gibberish...
    Clarification needed:

    To be able to run more tiles/box smaller memory footprint is a must.
    With smaller mem footprint, smaller DB's are a must.

    The end results may not be directly comparable, but correctly interpreted they should give some reference point.

    Please let me know if this makes sense to you.
    There are multiple dimensions to this. I may easily be on the imaginary branch :)
  • ibb27 - Thursday, May 21, 2009 - link

    Can we have a chance to see benchmarks for Sun Virtualbox which is Opensource?
  • winterspan - Tuesday, May 26, 2009 - link

    This test is misleading because you are not using the latest version of VMware that supports Intel's EPT. Since AMD's version of this is supported in the older version, the test is not at all a fair representation of their respective performance.
  • Zstream - Thursday, May 21, 2009 - link

    Can someone please perform a Win2008 RC2 Terminal Server benchmark? I have been looking everywhere and no one can provide that.

    If I can take this benchmark and tell my boss this is how the servers will perform in a TS environment please let me know.
