The Virtualization Benchmarking Chaos

There are an incredible number of pitfalls in the world of server application benchmarking, and virtualization just makes the whole situation much worse. In this report, we want to measure how well the CPUs are coping with virtualization. That means we need to choose our applications carefully. If we use a benchmark that spends very little time in the hypervisor, we are mostly testing the integer processing power and not how the CPU copes with virtualization overhead. As we have pointed out before, a benchmark like SPECjbb does not tell you much, as it spends less than one percent of its time in the hypervisor.

How is virtualization different? CPU A may beat CPU B in native situations and still lose to it in virtualized scenarios. There are various reasons why CPU A can still lose. For example, CPU A:

  1. Takes much more time for switching from the VM to hypervisor and vice versa.
  2. Does not support hardware-assisted paging: memory management will cause a lot more hypervisor interventions.
  3. Has smaller TLBs; hardware-assisted paging (EPT, NPT/RVI) places much more pressure on the TLBs.
  4. Has less bandwidth; an application that needs only 20% of the maximum bandwidth will be bottlenecked if you run six VMs of the same application.
  5. Has smaller caches; the more VMs, the more pressure there will be on the caches.
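The bandwidth point in item 4 is simple arithmetic, and a small sketch makes it concrete. The function names below are hypothetical, and the model assumes that demand adds up linearly across VMs and that a saturated host shares its bandwidth equally; real hardware will not behave this neatly.

```python
def aggregate_bandwidth_demand(per_vm_fraction: float, vm_count: int) -> float:
    """Total demand as a fraction of the host's maximum bandwidth.

    Assumption: every VM runs the same application and demand adds up linearly.
    """
    return per_vm_fraction * vm_count


def satisfied_fraction(per_vm_fraction: float, vm_count: int) -> float:
    """Fraction of its requested bandwidth each VM receives.

    Assumption: once the host is saturated, bandwidth is shared equally.
    """
    demand = aggregate_bandwidth_demand(per_vm_fraction, vm_count)
    return 1.0 if demand <= 1.0 else 1.0 / demand


if __name__ == "__main__":
    # The article's example: six VMs, each needing 20% of the maximum bandwidth.
    demand = aggregate_bandwidth_demand(0.20, 6)
    share = satisfied_fraction(0.20, 6)
    print(f"aggregate demand: {demand:.0%} of the maximum")
    print(f"each VM gets about {share:.0%} of what it asks for")
```

With the article's numbers, the six VMs together ask for 120% of what the CPU can deliver, so under this simplified model each VM is throttled even though a single instance would run far below the limit.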

To fully understand this, it helps a lot if you read our "Hardware Virtualization: the nuts and bolts" article. Indeed, some applications run with negligible performance impact inside a virtual machine while others are tangibly slower in a virtualized environment. To get a rough idea of which group your application belongs to, a relatively easy rule of thumb can be used: how much time does your application spend in user mode, and how much time does it spend waiting for help from the kernel? The kernel performs three tasks for user applications:

  • System calls (File system, process creation, etc.)
  • Interrupts (Accessing the disks, NICs, etc.)
  • Memory management (e.g. allocating memory for buffers)

The more work your kernel has to perform for your application, the higher the chance that the hypervisor will need to work hard as well. If your application writes a small log after spending hours crunching numbers, it should be clear it's a typical (almost) "user mode only" application. The prime example of a "kernel intensive" application is an intensively used transactional database server that gets lots of requests from the network (interrupts, system calls), has to access the disks often (interrupts, system calls), and has buffers that grow over time (memory management).
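The rule of thumb above can be tried out with a few lines of Python. This is only a rough sketch: the helper names and the two toy workloads are our own illustrations, it measures the current process only, and it relies on the user and system CPU times reported by the standard `os.times()` call.

```python
import os


def kernel_time_share(user_seconds: float, system_seconds: float) -> float:
    """Share of CPU time spent in the kernel (0.0 = pure user mode)."""
    total = user_seconds + system_seconds
    return 0.0 if total == 0 else system_seconds / total


def measure(workload) -> float:
    """Run workload() and return the kernel share of the CPU time it used."""
    before = os.times()
    workload()
    after = os.times()
    return kernel_time_share(after.user - before.user,
                             after.system - before.system)


def crunch_numbers() -> None:
    # Toy "user mode only" workload: pure computation, no I/O.
    sum(i * i for i in range(2_000_000))


def many_small_writes() -> None:
    # Toy "kernel intensive" workload: one system call per tiny write.
    with open(os.devnull, "wb") as sink:
        for _ in range(200_000):
            os.write(sink.fileno(), b"x")


if __name__ == "__main__":
    print(f"number crunching: {measure(crunch_numbers):.0%} kernel time")
    print(f"small writes:     {measure(many_small_writes):.0%} kernel time")
```

The exact percentages depend on the operating system and hardware, but the second workload should report a clearly higher kernel share than the first. An application that looks like the second case is the kind that is more likely to suffer inside a virtual machine.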

However, a "user mode only" application can still lose a lot of performance in a virtualized environment in some situations:

  • Oversubscribing: you assign more CPUs to the virtual machines than physically available. (This is a very normal and common way to get more out of your virtualized server.)
  • Cache Contention: your application demands a lot of cache and the other virtualized applications do as well.

These kinds of performance losses are relatively easy to minimize. You could buy CPUs with larger caches, and dedicate some of the physical cores (by setting affinity) to certain cache- or CPU-hungry applications. The other, less intensive applications would share the remaining CPU cores. In this article, we will focus on the more sensitive workloads out there that do quite a bit of I/O (and thus interrupts), need large memory buffers, and thus talk to the kernel a lot. This way we can really test the virtualization capabilities of the servers.
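The oversubscription case mentioned earlier is easy to express as a ratio of assigned virtual CPUs to physical cores. The sketch below is an illustration only; the function name and the example host (eight cores, four VMs) are hypothetical and not taken from our test setup.

```python
from typing import Sequence


def oversubscription_ratio(vcpus_per_vm: Sequence[int], physical_cores: int) -> float:
    """Assigned virtual CPUs divided by physical cores.

    A value above 1.0 means the host is oversubscribed.
    """
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


if __name__ == "__main__":
    # Hypothetical host: 8 physical cores, four VMs with 4, 4, 2 and 2 vCPUs.
    ratio = oversubscription_ratio([4, 4, 2, 2], physical_cores=8)
    print(f"vCPU to core ratio: {ratio:.2f}")
```

In this made-up example twelve vCPUs share eight cores, so the VMs only lose performance when more than eight vCPUs are busy at the same time.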

66 Comments

  • GotDiesel - Thursday, May 21, 2009

    "Yes, this article is long overdue, but the Sizing Server Lab proudly presents the AnandTech readers with our newest virtualization benchmark, vApus Mark I, which uses real-world applications in a Windows Server Consolidation scenario."

    spoken with a mouth full of microsoft cock

    where are the Linux reviews ?

    not all of us VM with windows you know..

  • JohanAnandtech - Thursday, May 21, 2009

    A minimum form of politeness would be appreciated, but I am going to assume you were just disappointed.

    The problem is that right now the calling circle benchmark runs half as fast on Linux as it does on Windows. What is causing Oracle to run slower on Linux than on Windows is a mystery even to some of the experienced DBAs we have spoken to. We either have to replace that benchmark with an alternative (probably Sysbench) or find out what exactly happened.

    When you construct a virtualized benchmark it is not enough just to throw in a few benchmarks and VMs, you really have to understand the benchmark thoroughly. There are enough half-baked benchmarks already on the internet that look like Swiss cheese because there are so many holes in the methodology.
  • JarredWalton - Thursday, May 21, 2009

    Page 4: vApus Mark I: the choices we made

    "vApus mark I uses only Windows Guest OS VMs, but we are also preparing a mixed Linux and Windows scenario."

    Building tests, verifying tests, running them on all the servers takes a lot of time. That's why the 2-tile and 3-tile results are not yet ready. I suppose Linux will have to wait for Mark II (or Mark I.1).
  • mino - Thursday, May 21, 2009

    What you did so far is great. No more words needed.

    What I would like to see is vApus Mark I "small" where you make the tiles smaller, about 1/3 to 1/4 of your current tiles.
    Tile structure shall remain similar for simplicity, they will just be smaller.

    When you manage to have 2 different tile sizes, you shall be able to consider 1 big + 1 small tile as one "condensed" tile for general score.

    Having 2 reference points will allow for evaluating "VM size scaling" situations.
  • JohanAnandtech - Sunday, May 24, 2009

    Can you elaborate a bit? What do you mean by "1/3 of my current tile"? A tile = 4 VMs. Are you talking about small mem footprint or number of VCPUs?

    Are you saying we should test with a Tile with small VMs and then test afterwards with the large ones? How do you see such "VM scaling" evaluation?
  • mino - Monday, May 25, 2009

    Thanks for response.

    By 1/3 I mean smaller VMs, mostly from the load POV. Probably 1/3 load would go with 1/2 memory footprint.

    The point being that currently there is only a single data point with a specific load size per tile/per VM.

    By "VM scaling" I mean that I would like to see what effect smaller loads would have on overall performance.

    I suggest 1/3 or 1/4 the load to get a measurable difference while remaining within reasonable memory/VM scale.

    In the end, if you get similar overall performance from 1/4 tiles, it may not make sense to include this in future.
    Even then the information that your benchmark results can be safely extrapolated to smaller loads would be of great value by itself.
  • mino - Monday, May 25, 2009

    Eh, that last text of mine looks like nice gibberish...
    Clarification needed:

    To be able to run more tiles/box, a smaller memory footprint is a must.
    With a smaller mem footprint, smaller DBs are a must.

    The end results may not be directly comparable but shall be able to give some reference point, if correctly interpreted.

    Please let me know if this makes sense to you.
    There are multiple dimensions to this. I may easily be on the imaginary branch :)
  • ibb27 - Thursday, May 21, 2009

    Can we have a chance to see benchmarks for Sun VirtualBox, which is open source?
  • winterspan - Tuesday, May 26, 2009

    This test is misleading because you are not using the latest version of VMware that supports Intel's EPT. Since AMD's version of this is supported in the older version, the test is not at all a fair representation of their respective performance.
  • Zstream - Thursday, May 21, 2009

    Can someone please perform a Win2008 RC2 Terminal Server benchmark? I have been looking everywhere and no one can provide that.

    If I can take this benchmark and tell my boss this is how the servers will perform in a TS environment please let me know.