Real-world virtualization benchmarking: the best server CPUs compared
by Johan De Gelas on May 21, 2009 3:00 AM EST
The Quest for an Independent Real-World Virtualization Benchmark
As we explained in our Xeon Nehalem review, comprehensive real-world server benchmarks that cover the market are beyond what one man can perform. Virtualization benchmarking needs much more manpower, and it is always good to understand the motivation of the group doing the testing. Large OEMs want to show off their latest server platforms, and virtualization software vendors want to show how efficient their hypervisor is. So why did we undertake this odyssey?
This virtualization benchmark was developed by an academic research group called the Sizing Server Lab. (I am also part of this research group.) Part of the lab's work is academic; the other part is applied research that is put to use in the field right away, in this case for software developers. The main objective of the applied research is to tell developers how their own software behaves and performs in a virtual environment. Therefore, the focus of our efforts was to develop a very flexible stress test that tells us how any real-world application behaves in a virtualized environment. A side effect of all this work is that we came up with a virtualization server benchmark, which we think will be very interesting for the readers of AnandTech.
Although the benchmark is the result of research by an academic lab, the most important objectives in designing our own virtualization benchmarks are that they be repeatable, comparable, heavy, and relevant:
Repeatable is the hardest one. Server benchmarks tend to run into all kinds of non-hardware-related limits such as too few connections, locking contention, and driver latency. The result is a benchmark that rarely runs at 100% CPU utilization, and the CPU load percentage differs from one CPU to the next. In "Native OS" conditions, this is still acceptable; you can still get a decent idea of how two CPUs compare if one runs at 78% and the other at 83% CPU load. In virtualization, however, this becomes a complete mess, especially when you have more virtual than physical CPUs. When you compare two servers, some VMs will report significantly lower CPU load on one server and others significantly higher. As each VM reports a different metric (for example queries per second, transactions per second, or URL/s), average CPU load does not tell you the whole story either. To remedy this, we went through a careful selection of our applications and kept only those benchmarks that allowed us to push the system to 95-99% load. Note that this was only possible after a lot of tuning.
Comparable: our virtualization benchmark can run on Xen, Hyper-V and ESX.
Heavy: While VMmark and others go for the scenario of running many very light virtual machines with extremely small workloads, we go for a scenario with four or eight VMs. The objective is to find out how the CPUs handle "hard to consolidate" applications such as complex dynamic websites, OnLine Transaction Processing (OLTP), and OnLine Analytical Processing (OLAP) databases.
Most importantly: Relevant. We have been working towards benchmarks using applications that people run every day. In this article we had to make one compromise: as we are comparing the virtualization capabilities of different CPUs, we had to push CPU utilization close to 100%. Few virtualized servers will run close to 100% all the time, but it allows us to be sure that the CPU is the bottleneck. We are using real-world applications instead of synthetic benchmarks, but the other side of the coin is that this virtualization benchmark is not easily reproducible by third parties. We cannot release the benchmark to third parties, as some of the software used is the intellectual property of other companies. However, we are prepared to fully disclose the details of how we perform the benchmarks to every interested and involved company.
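To make the repeatability criterion a bit more concrete, here is a minimal Python sketch of the two ideas involved: accepting a run only when the CPU load is high enough to make the CPU the bottleneck, and comparing VMs that each report a different metric. It is an illustration under stated assumptions, not the lab's actual tooling. The function names, the 95% floor used as a default, the sample numbers, and the use of a geometric mean over per-VM ratios are all assumptions made for this example.

```python
from math import prod

# Hypothetical helpers and made-up numbers: an illustration only,
# not the Sizing Server Lab's real scoring method.


def is_cpu_bound(cpu_load_samples, floor=95.0):
    """Return True if the mean CPU load (in percent) reaches the floor."""
    return sum(cpu_load_samples) / len(cpu_load_samples) >= floor


def relative_scores(measured, reference):
    """Divide each VM's throughput by the same VM's throughput on a
    reference server, so that queries/s, transactions/s, and URL/s
    become unitless ratios that can be compared."""
    return {vm: measured[vm] / reference[vm] for vm in reference}


def geometric_mean(values):
    """Combine ratios without letting one large ratio dominate."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


if __name__ == "__main__":
    # Throughput per VM, each in its own unit (invented values).
    reference = {"oltp_tps": 100.0, "olap_qps": 10.0, "web_urls": 1000.0}
    candidate = {"oltp_tps": 150.0, "olap_qps": 12.0, "web_urls": 1800.0}

    # CPU load samples in percent for the candidate run (invented values).
    load = [96.5, 97.0, 98.2, 95.8]

    if is_cpu_bound(load):
        scores = relative_scores(candidate, reference)
        print(scores)
        print(geometric_mean(scores.values()))
    else:
        print("Run rejected: the CPU was not the bottleneck")
```

Normalizing each VM against a reference run before combining the results avoids adding up numbers with different units, which is the problem with looking only at raw throughput or average CPU load.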