Understanding the VMmark Score

Before we try to demystify the published VMmark scores, let me state upfront that the VMmark benchmark has its flaws, but we know from firsthand experience how hard it is to build a decent virtualization benchmark. It would be unfair and arrogant to call VMmark a bad benchmark. The benchmark first arrived back in 2006. The people at VMware were pioneers and solved quite a few problems, such as running many applications simultaneously and distilling one score from many different benchmarks, each reporting in its own unit. The benchmark results are consistent, and the mix of applications more or less reflects the real world.

Let's refresh your memory: VMware VMmark is a benchmark of consolidation. It consolidates several virtual machines performing different tasks, creating a tile. A VMmark tile consists of:

  • MS Exchange VM
  • Java App VM
  • Idle VM
  • Apache web server VM
  • MySQL database VM
  • SAMBA fileserver VM

The first three run on a Windows 2003 guest OS and the last three run on SUSE SLES 10.


Now let's list a few flaws:

  • The six virtual machines in one tile, applications included, need only 5GB of RAM. Most e-mail servers running right now will probably use 4GB or more on their own! The vast majority of MySQL database servers and Java web servers have at least 2GB at their disposal.
  • It uses SPECjbb, which is an "easy to inflate" benchmark. Published SPECjbb scores are obtained with extremely aggressively tuned JVMs, the kind of tuning you won't find on a typical virtualized, consolidated Java web server.
  • SysBench performs its transactions on only one table, which makes it an oversimplified OLTP test.
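To put the first flaw in numbers, here is a minimal sketch that adds up the memory figures quoted in the list. The "realistic" values are simply the lower bounds mentioned in the text, not measurements, and the variable names are our own.

```python
# RAM for all six VMs in one VMmark tile, as stated in the text.
TILE_RAM_GB = 5

# Lower bounds quoted in the text for typical standalone servers
# (assumed values for illustration, not measured data).
realistic_min_gb = {
    "e-mail server": 4,
    "MySQL database server": 2,
    "Java web server": 2,
}

realistic_total = sum(realistic_min_gb.values())

print(f"Three typical servers on their own: {realistic_total} GB")
print(f"Entire six-VM VMmark tile:          {TILE_RAM_GB} GB")
print(f"Ratio: {realistic_total / TILE_RAM_GB:.1f}x")
```

Even with only three of the six workloads counted, and at their minimum sizes, the total already exceeds what VMmark gives the whole tile.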

Regarding our SysBench remark: OLTP benchmarks are very hard to build, so we use SysBench ourselves and are very grateful for the efforts of Alexey Kopytov. In many cases SysBench is close enough for native situations. The problem is that some effects a real-world OLTP database has on a hypervisor (such as network connections and complex buffering that requires a lot more memory management) may not show up if you run a benchmark on such an oversimplified model.

The VMmark benchmark is also starting to show its age with its very low memory requirements per server. To limit development time, the creators of VMmark also went with some industry standard benchmarks, which have started to lose their relevance as vendors have found ways to artificially inflate the scores. VMmark needs an update, but as VMware is involved in the SPEC Virtualization Committee to develop a new industry standard virtualization benchmark, it does not make sense to further develop VMmark.

The easiest way to see that VMmark is showing its age is in the consolidation ratio of the VMmark runs. Dual CPU machines are consolidating 8 to 17 tiles. At 17 tiles of six virtual machines each, a dual CPU system is running 102 virtual machines, of which 85 are actively stressed (one VM per tile is idle)! How many dual CPU machines have you seen that operate even half that many virtual machines?
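The virtual machine counts follow directly from the tile definition given earlier: six VMs per tile, one of them idle. A small sketch of that arithmetic, where the `vm_counts` helper is our own illustrative name and not part of VMmark:

```python
# The six VMs that make up one VMmark tile, as listed in the article.
TILE = [
    "MS Exchange",
    "Java App",
    "Idle",
    "Apache web server",
    "MySQL database",
    "SAMBA fileserver",
]


def vm_counts(tiles):
    """Return (total VMs, actively stressed VMs) for a number of tiles."""
    total = tiles * len(TILE)
    active = tiles * sum(1 for vm in TILE if vm != "Idle")
    return total, active


# The range of tile counts the article reports for dual CPU machines.
for tiles in (8, 17):
    total, active = vm_counts(tiles)
    print(f"{tiles} tiles: {total} VMs in total, {active} actively stressed")
```

At the low end of the range that is 48 virtual machines with 40 under load; at the high end it is the 102 and 85 quoted above.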

That said, we'll have to work with VMmark until something better comes up. That brings up two questions. How can you spot reliable and unreliable VMmark scores? Can you base decisions on the scores?

23 Comments

  • JohanAnandtech - Sunday, May 10, 2009

    Thanks for the compliment and especially the confidence!

    - Johan
  • tynopik - Friday, May 8, 2009

    one problem with VMmark is that it tries to reduce a complex combination of variables to ONE NUMBER

    that's great if everyone has the SAME WORKLOAD, but they don't

    how about more focused benchmarks that stress ONE particular area of vm performance?

    then people can looks at the numbers that most impact them

    also it would be interesting to see where the gains are coming from

    is nehalem better at virtualization simply because it's a faster cpu? or are the vm-specific enhancements making a difference? inquiring minds want to know
  • mlambert - Friday, May 8, 2009

    No one pays attention to this. It's all about whats available from HP in terms of blades & enclosures when it's time for your 3yr hardware refresh.

    Right now it's BL2's the 495's. This is how corp IT works.