The Quest for an Independent Real-World Virtualization Benchmark

As we explained in our Xeon Nehalem review, comprehensive real-world server benchmarks that cover the whole market are beyond what one person can perform. Virtualization benchmarking demands even more manpower, and it is always good to understand the motivation of the group doing the testing. Large OEMs want to show off their latest server platforms, and virtualization software vendors want to show how efficient their hypervisors are. So why did we undertake this odyssey?

This virtualization benchmark was developed by an academic research group called the Sizing Server Lab. (I am also part of this research group.) Part of the lab's work is academic; the other part is applied research whose results go straight into the field, in this case to software developers. The main objective of the applied research is to show developers how their own software behaves and performs in a virtual environment. The focus of our efforts was therefore to develop a very flexible stress test that tells us how any real-world application behaves in a virtualized environment. A side effect of all this work is that we came up with a virtualization server benchmark, which we think will be very interesting for the readers of AnandTech.

Although the benchmark was the result of research by an academic lab, the most important objectives in designing our own virtualization benchmarks are that they be:

  • Repeatable
  • Relevant
  • Comparable
  • Heavy

Repeatable is the hardest one. Server benchmarks tend to run into all kinds of non-hardware-related limits, such as insufficient connections, locking contention, and driver latency. The result is a benchmark that rarely runs at 100% CPU utilization, with the CPU load percentage varying from one CPU to another. Under a native OS this is still acceptable; you can get a decent idea of how two CPUs compare even if one runs at 78% and the other at 83% CPU load. In a virtualized setup, however, this becomes a complete mess, especially when you have more virtual than physical CPUs: comparing two servers, some VMs will report significantly lower CPU load and others significantly higher. And since each VM reports its own metric (for example queries per second, transactions per second, or URLs per second), average CPU load does not tell you the whole story either. To remedy this, we went through a careful selection of applications and kept only those benchmarks that allowed us to push the system close to 95-99% load. Note that this was only possible after a lot of tuning.
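Because each VM reports throughput in its own unit, the raw numbers cannot simply be averaged into one figure. Below is a minimal sketch of one common way to fold such heterogeneous results into a single score: normalize each VM against a fixed reference system and take the geometric mean. The VM names and figures are hypothetical, and this is not necessarily the exact formula behind the scores reported here.

```python
# Minimal sketch: combining per-VM throughput numbers that use different
# units. All names and figures are hypothetical, for illustration only.
from math import prod

# Measured throughput per VM, each in its own unit
# (transactions/s, queries/s, URLs/s).
measured = {"oltp_vm": 410.0, "olap_vm": 12.5, "web_vm": 980.0}

# Throughput of the same workloads on a fixed reference system.
reference = {"oltp_vm": 350.0, "olap_vm": 10.0, "web_vm": 900.0}

def aggregate_score(measured, reference):
    """Geometric mean of per-VM throughput ratios versus the reference,
    so that no single unit dominates the combined score."""
    ratios = [measured[vm] / reference[vm] for vm in measured]
    return prod(ratios) ** (1.0 / len(ratios))

print(f"Score relative to reference: {aggregate_score(measured, reference):.3f}")
```

The geometric mean is the usual choice here because it weighs a relative gain on any one VM equally, regardless of the unit or magnitude of that VM's raw throughput.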

Comparable: our virtualization benchmark can run on Xen, Hyper-V, and ESX.

Heavy: While VMmark and others go for the scenario of running many very light virtual machines with extremely small workloads, we go for a scenario with four or eight heavier VMs. The objective is to find out how the CPUs handle "hard to consolidate" applications such as complex dynamic websites, Online Transaction Processing (OLTP), and Online Analytical Processing (OLAP) databases.

Most importantly: Relevant. We have been working towards benchmarks built on applications that people run every day. In this article we had to make one compromise: as we are comparing the virtualization capabilities of different CPUs, we had to push CPU utilization close to 100%. Few virtualized servers run close to 100% all the time, but doing so lets us be sure that the CPU is the bottleneck. We use real-world applications instead of synthetic benchmarks; the other side of the coin is that this virtualization benchmark is not easily reproducible by third parties. We cannot release the benchmark, as some of the software used is the intellectual property of other companies. However, we are prepared to fully disclose the details of how we perform the benchmarks to every interested and involved company.

Comments

  • JohanAnandtech - Friday, May 22, 2009 - link

    Most of the time, the number of sessions on a terminal server is limited by the amount of memory. Can you give some insight into what you are running inside a session? If it is light on CPU or I/O resources, sizing will be based on the amount of memory per session only.
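    (As a back-of-the-envelope illustration of that memory-bound sizing rule, with all figures below being hypothetical assumptions:)

```python
# Rough Terminal Server sizing sketch for the memory-bound case.
# Every figure here is a hypothetical assumption, not a measured value.
total_ram_mb = 32 * 1024      # host RAM: 32 GB
os_overhead_mb = 2 * 1024     # reserved for the OS and services
ram_per_session_mb = 150      # average footprint of one user session

max_sessions = (total_ram_mb - os_overhead_mb) // ram_per_session_mb
print(f"Memory-bound session ceiling: {max_sessions}")  # prints 204
```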
  • dragunover - Thursday, May 21, 2009 - link

    would be interesting if this were done on desktop CPUs with price/performance ratios
  • jmke - Thursday, May 21, 2009 - link

    nope, that would not be interesting at all. You don't want desktop motherboards, RAM, or CPUs in your server room, nor do you run ESX at home. So there's no point in testing the performance of desktop CPUs.
  • simtex - Thursday, May 21, 2009 - link

    Why so harsh? Virtualization will eventually become a part of desktop users' everyday lives.

    Imagine tabbing between different virtual machines the way you tab between pages in your browser. You might have a secure VM for your web applications and a fast VM for your games, with another for streaming music and maybe capturing television. All on one computer, which you seldom have to reboot because everything runs virtualized.
  • Azsen - Monday, May 25, 2009 - link

    Why would you run all those applications on your desktop in VMs? Surely they would just be separate application processes running under the one OS.
  • flipmode - Thursday, May 21, 2009 - link

    Speaking from the perspective of how the article can be most valuable, it is definitely better to stick to true server hardware for the time being.

    For desktop users, it is a curiosity that "may eventually" impart some useful data. The tests are immediately valuable for servers and for current server hardware. They are merely of academic curiosity for desktop users on hardware that will be outdated by the time virtualization truly becomes a mainstream scenario on the desktop.

    And I do not think he was being harsh; I think he was just being as brief as possible.
