The Quest for an Independent Real-World Virtualization Benchmark

As we explained in our Xeon Nehalem review, comprehensive real-world server benchmarks that cover the market are beyond what one man can perform. Virtualization benchmarking needs much more manpower, and it is always good to understand the motivation of the group doing the testing. Large OEMs want to show off their latest server platforms, and virtualization software vendors want to show how efficient their hypervisor is. So why did we undertake this odyssey?

This virtualization benchmark was developed by an academic research group called the Sizing Server Lab. (I am also part of this research group.) Part of the lab's work is academic; the other part is research that is immediately applied in the field, in this case for software developers. The main objective of the applied research is to tell developers how their own software behaves and performs in a virtual environment. Therefore, the focus of our efforts was to develop a very flexible stress test that tells us how any real-world application behaves in a virtualized environment. A side effect of all this work is that we came up with a virtualization server benchmark, which we think will be very interesting for the readers of AnandTech.

Although the benchmark was a result of research by an academic lab, the most important objectives in designing our own virtualization benchmarks are that they be:

  • Repeatable
  • Relevant
  • Comparable
  • Heavy

Repeatable is the hardest one. Server benchmarks tend to run into all kinds of non-hardware-related limits such as too few connections, locking contention, and driver latency. The result is a benchmark that rarely runs at 100% CPU utilization, and whose CPU load differs from one CPU to the next. In "Native OS" conditions, this is still quite okay; you can still get a decent idea of how two CPUs perform if one runs at 78% and the other runs at 83% CPU load. However, in virtualization this becomes a complete mess, especially when you have more virtual than physical CPUs. When you compare two servers, some VMs will report significantly lower CPU load and others significantly higher CPU load. As each VM reports a different metric (for example queries per second, transactions per second, or URLs per second), average CPU load does not tell you the whole story either. To remedy this, we went through a careful selection of our applications and decided to keep only those benchmarks that allowed us to push the system close to 95-99% load. Note that this was only possible after a lot of tuning.
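To illustrate why mixed per-VM metrics need some form of normalization before two servers can be compared, here is a minimal Python sketch. It is an assumption for illustration only, not the lab's scoring method: each VM's throughput is divided by the same VM's throughput on a reference system, and the unitless ratios are combined with a geometric mean. The VM names, the numbers, and the helper functions (`normalized_scores`, `geometric_mean`, `cpu_bound`) are all hypothetical.

```python
from math import prod


def normalized_scores(measured, reference):
    """Divide each VM's throughput by the same VM's throughput on a reference system.

    The result is unitless, so queries/s, transactions/s and URLs/s become comparable.
    """
    return {vm: measured[vm] / reference[vm] for vm in measured}


def geometric_mean(values):
    """Geometric mean, so no single VM's metric dominates the combined score."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def cpu_bound(load_samples, low=95.0):
    """True when every sampled host CPU load (in percent) is at or above `low`."""
    return all(sample >= low for sample in load_samples)


# Hypothetical throughput numbers for one tile of four VMs (units differ per VM).
reference = {"oltp": 100.0, "olap": 50.0, "web1": 200.0, "web2": 200.0}
measured = {"oltp": 200.0, "olap": 100.0, "web1": 400.0, "web2": 400.0}

scores = normalized_scores(measured, reference)
overall = geometric_mean(scores.values())
print(scores, overall)
```

A run would only be kept if `cpu_bound(...)` holds for the sampled host load, mirroring the 95-99% target described above. The geometric mean is one common way to combine ratios; other aggregation choices are possible.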

Comparable: our virtualization benchmark can run on Xen, Hyper-V and ESX.

Heavy: While VMmark and others go for the scenario of running many very light virtual machines with extremely small workloads, we go for a scenario with four or eight VMs. The objective is to find out how the CPUs handle "hard to consolidate" applications such as complex dynamic websites, OnLine Transaction Processing (OLTP), and OnLine Analytical Processing (OLAP) databases.

Most importantly: Relevant. We have been working towards benchmarks using applications that people run every day. In this article we had to make one compromise: as we are comparing the virtualization capabilities of different CPUs, we had to push CPU utilization close to 100%. Few virtualized servers will run close to 100% all the time, but it allows us to be sure that the CPU is the bottleneck. We are using real-world applications instead of benchmarks, but the other side of the coin is that this virtualization benchmark is not easily reproducible by third parties. We cannot release the benchmark to third parties, as some of the software used is the intellectual property of other companies. However, we are prepared to fully disclose the details of how we perform the benchmarks to every interested and involved company.


66 Comments


  • GotDiesel - Thursday, May 21, 2009 - link

    "Yes, this article is long overdue, but the Sizing Server Lab proudly presents the AnandTech readers with our newest virtualization benchmark, vApus Mark I, which uses real-world applications in a Windows Server Consolidation scenario."

    spoken with a mouth full of microsoft cock

    where are the Linux reviews ?

    not all of us VM with windows you know..

  • JohanAnandtech - Thursday, May 21, 2009 - link

    A minimum form of politeness would be appreciated, but I am going to assume you were just disappointed.

    The problem is that right now the calling circle benchmark runs half as fast on Linux as it does on Windows. What is causing Oracle to run slower on Linux than on Windows is a mystery even to some of the experienced DBAs we have spoken to. We either have to replace that benchmark with an alternative (probably Sysbench) or find out what exactly happened.

    When you construct a virtualized benchmark it is not enough just to throw in a few benchmarks and VMs; you really have to understand the benchmark thoroughly. There are enough half-baked benchmarks already on the internet that look like a Swiss cheese because there are so many holes in the methodology.
  • JarredWalton - Thursday, May 21, 2009 - link

    Page 4: vApus Mark I: the choices we made

    "vApus mark I uses only Windows Guest OS VMs, but we are also preparing a mixed Linux and Windows scenario."

    Building tests, verifying tests, running them on all the servers takes a lot of time. That's why the 2-tile and 3-tile results are not yet ready. I suppose Linux will have to wait for Mark II (or Mark I.1).
  • mino - Thursday, May 21, 2009 - link

    What you did so far is great. No more words needed.

    What I would like to see is vApus Mark I "small" where you make the tiles smaller, about 1/3 to 1/4 of your current tiles.
    Tile structure shall remain similar for simplicity, they will just be smaller.

    When you manage to have 2 different tile sizes, you shall be able to consider 1 big + 1 small tile as one "condensed" tile for general score.

    Having 2 reference points will allow for evaluating "VM size scaling" situations.
  • JohanAnandtech - Sunday, May 24, 2009 - link

    Can you elaborate a bit? What do you mean by "1/3 of my current tile"? A tile = 4 VMs. Are you talking about small mem footprint or number of VCPUs?

    Are you saying we should test with a Tile with small VMs and then test afterwards with the large ones? How do you see such "VM scaling" evaluation?
  • mino - Monday, May 25, 2009 - link

    Thanks for response.

    1/3 I mean smaller VM's. Mostly from the load POV. Probably 1/3 load would go for 1/2 memory footprint.

    The point being that currently there is only a single data point with a specific load-size per tile/per VM.

    By "VM scaling" I would like to see what effect would smaller loads have on overall performance.

    I suggest 1/3 or 1/4 the load to get a measurable difference while remaining within reasonable memory/VM scale.

    In the end, if you get similar overall performance from 1/4 tiles, it may not make sense to include this in future.
    Even then the information that your benchmark results can be safely extrapolated to smaller loads would be of a great value by itself.
  • mino - Monday, May 25, 2009 - link

    Eh, that last text of mine looks like a nice gibberish...
    Clarification needed:

    To be able to run more tiles/box smaller memory footprint is a must.
    With smaller mem footprint, smaller DB's are a must.

    The end results may not be directly comparable but shall be able to give some reference point, correctly interpreted.

    Please let me know if this makes sense to you.
    There are multiple dimensions to this. I may be easily on the imaginary branch :)
  • ibb27 - Thursday, May 21, 2009 - link

    Can we have a chance to see benchmarks for Sun Virtualbox which is Opensource?
  • winterspan - Tuesday, May 26, 2009 - link

    This test is misleading because you are not using the latest version of VMware that supports Intel's EPT. Since AMD's version of this is supported in the older version, the test is not at all a fair representation of their respective performance.
  • Zstream - Thursday, May 21, 2009 - link

    Can someone please perform a Win2008 RC2 Terminal Server benchmark? I have been looking everywhere and no one can provide that.

    If I can take this benchmark and tell my boss this is how the servers will perform in a TS environment please let me know.
