Heavy Virtualization Benchmarking

All tests run on ESX 3.5 Update 4 (Build 153875), which has support for AMD's RVI. It also supports the Intel Xeon X55xx Nehalem but has no support yet for EPT.


Getting one score out of a virtualized machine is not straightforward: you cannot add up URL/s, transactions per second, and queries per second. If virtualized system A turns out twice as many web responses but delivers only half of the transactions machine B delivers, which one is faster? Luckily for us, Intel (vConsolidate) and VMware (VMmark) have already solved this problem. We use a very similar approach. First, we test each application on its native operating system with four physical cores. Those four physical cores belong to one Opteron Shanghai 8389 2.9GHz. This becomes our reference score.

Opteron Shanghai 8389 2.9GHz Reference System

Test                       Reference score
OLAP - Nieuws.be           175.3 Queries/s
Web portal - MCS           45.8 URL/s
OLTP - Calling Circle      155.3 Transactions/s

We then divide the score of the first VM by the "native" score. In other words, divide the number of queries per second in the first OLAP VM by the number of queries that one Opteron 8389 2.9GHz gets when it is running the Nieuws.be OLAP Database.
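The normalization step can be sketched in a few lines of Python. The reference scores come from the table above; the function name `relative_score` and the measured VM throughput of 149.0 queries/s are assumptions chosen purely for illustration, not numbers reported in this article.

```python
# Native reference scores from the Opteron 8389 2.9GHz reference system.
REFERENCE = {
    "OLAP": 175.3,        # queries/s
    "Web portal": 45.8,   # URL/s
    "OLTP": 155.3,        # transactions/s
}


def relative_score(test: str, vm_throughput: float) -> float:
    """Return a VM's throughput as a fraction of the native reference score."""
    return vm_throughput / REFERENCE[test]


# Hypothetical measurement: an OLAP VM that sustains 149.0 queries/s.
olap_share = relative_score("OLAP", 149.0)
print(f"OLAP VM: {olap_share:.0%} of reference")
```

Because each result is divided by a score in the same unit, the units cancel, and queries, URLs, and transactions become comparable percentages.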

Performance Relative to Reference System

Server System Processors       OLAP VM   Web portal VM 2   Web portal VM 3   OLTP VM
Dual Xeon X5570 2.93           94%       50%               51%               59%
Dual Xeon X5570 2.93 HT off    92%       43%               43%               43%
Dual Xeon E5450 3.0            82%       36%               36%               45%
Dual Xeon X5365 3.0            79%       35%               35%               32%
Dual Xeon L5350 1.86           54%       24%               24%               20%
Dual Xeon 5080 3.73            47%       12%               12%               7%
Dual Opteron 8389 2.9          85%       39%               39%               51%
Dual Opteron 2222 3.0          50%       17%               17%               12%

So for example, the OLAP VM on the dual Opteron 8389 got a score of 85% of that of the same application running on one Opteron 8389. As you can see, the web portal server only has 39% of the performance of a native machine. This does not mean that the hypervisor is inefficient, however. Don't forget that we gave each VM four virtual CPUs and that we have only eight physical cores. If the cores were perfectly isolated and there were no hypervisor, we would expect each VM to get two physical cores, or about 50% of our reference system. What you see is that the OLAP VM and the OLTP VM "steal" a bit of performance away from the web portal VMs.
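The 50% expectation is simple arithmetic, sketched below in Python. The helper name `ideal_share` is ours, and the model assumes a perfectly even split of the physical cores with zero hypervisor overhead, which is an idealization rather than a description of how ESX schedules.

```python
def ideal_share(physical_cores: int, vm_count: int, reference_cores: int) -> float:
    """Fraction of reference performance per VM if the cores are divided
    evenly and virtualization costs nothing (an idealized assumption)."""
    cores_per_vm = physical_cores / vm_count
    return cores_per_vm / reference_cores


# Dual quad-core server, four VMs, four-core native reference.
expected = ideal_share(physical_cores=8, vm_count=4, reference_cores=4)

# Relative scores of the dual Opteron 8389, taken from the table above.
observed = {"OLAP": 0.85, "Web portal 2": 0.39, "Web portal 3": 0.39, "OLTP": 0.51}

for vm, score in observed.items():
    delta_points = (score - expected) * 100
    print(f"{vm}: {score:.0%} observed, {delta_points:+.0f} points vs. ideal share")
```

VMs that land above the ideal share are the ones taking CPU time from the VMs that land below it.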

Of course, the above table is not very user-friendly. To calculate one vApus Mark I score per physical server we take the geometric mean of all those percentages, and as we want to understand how much work the machine has done, we multiply it by 4. There is a reason why we take the geometric mean and not the arithmetic mean. The geometric mean penalizes systems that score well on one VM and very badly on another VM. Peaks and lows are not as desirable as a good steady increase in performance over all virtual machines, and the geometric mean expresses this. Let's look at the results.
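As a rough sketch of that calculation in Python: the function names `vapus_mark` and `arithmetic_mean` are our own, the percentages for the dual Opteron 8389 come from the table above, and the "steady" and "lopsided" systems are hypothetical inputs invented only to show how the geometric mean penalizes uneven results.

```python
import math
from typing import Sequence


def vapus_mark(relative_scores: Sequence[float], multiplier: int = 4) -> float:
    """Geometric mean of the per-VM relative scores, multiplied by 4."""
    geo_mean = math.prod(relative_scores) ** (1 / len(relative_scores))
    return geo_mean * multiplier


def arithmetic_mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Dual Opteron 8389: OLAP, web portal 2, web portal 3, OLTP.
opteron_8389 = [0.85, 0.39, 0.39, 0.51]
print(f"Dual Opteron 8389: {vapus_mark(opteron_8389):.2f}")

# Hypothetical systems with the same arithmetic mean (0.5) but different balance.
steady = [0.5, 0.5, 0.5, 0.5]
lopsided = [0.9, 0.1, 0.5, 0.5]
for name, scores in (("steady", steady), ("lopsided", lopsided)):
    print(f"{name}: arithmetic {arithmetic_mean(scores):.2f}, "
          f"score {vapus_mark(scores):.2f}")
```

Both hypothetical systems have the same arithmetic mean, but the lopsided one ends up with a clearly lower score, which is the behavior the geometric mean was chosen for.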

[Chart: Sizing Servers vApus Mark I]

After seeing so many VMmark scores, the result of vApus Mark I really surprised us. The Nehalem-based Xeons are still the fastest servers, but they do not crush the competition as we have witnessed in VMmark and vConsolidate. Just to refresh your memory, here's a quick comparison:

VMmark vs. vApus Mark I Summary

Comparison                              VMmark                    vApus Mark I
Xeon X5570 2.93 vs. Xeon 5450 3.0       133-184% faster (*)       31% faster
Xeon X5570 2.93 vs. Opteron 8389 2.9    +/- 100% faster (*)(**)   21% faster
Opteron 8389 2.9 vs. Xeon 5450 3.0      +/- 42%                   9% faster

(*) Xeon X5570 results are measured on ESX 4.0; the others are on ESX 3.5.
(**) Xeon X5570 best score is 108% faster than Opteron at 2.7GHz. We have extrapolated the 2.7GHz scores to get the 2.9GHz ones.

Our first virtualization benchmark disagrees strongly with the perception that the large OEMs and recent press releases have created with the VMmark scores. "Xeon 54xx and anything older are hopelessly outdated virtualization platforms, and the Xeon X55xx make any other virtualization platform including the latest Opteron 'Shanghai' look silly". That is the impression you get when you quickly glance over the VMmark scores.

However, vApus Mark I tells you that you should not pull your older Xeons and newer Opterons out of your rack just yet if you are planning to continue to run your VMs on ESX 3.5. This does not mean that either vApus Mark I or VMmark is wrong, as they are very different benchmarks, and vApus Mark I was run exclusively on ESX 3.5 Update 4 while some of the VMmark scores were obtained on vSphere 4.0. What it does show us is how important it is to have a second data point and a second independent "opinion". That said, the results are still weird. In vApus Mark I, Nehalem is no longer the ultimate, far superior virtualization platform; at the same time, the Shanghai Opteron does not run any circles around the Xeon 54xx. There is so much to discuss that a few lines will not do the job. Let's break things up a bit more.

66 Comments

  • binaryguru - Monday, June 1, 2009

    It seems to me that x86-based virtualization software is getting more and more complicated. Not only that, it is getting more and more difficult to get reliable performance from it.

    Let me explain my point.

    The industry is clearly trying to do more with less hardware these days. Getting raw VM performance on commodity hardware is getting to a point where there is no predictable way to plan for an efficient VM environment.

    Current VM technology is trying to simulate the flexibility and performance of mainframes. To me, this is clearly an impossible goal to achieve with the current or future x86 platform model.

    All of the problems the industry is experiencing with VM consolidation do not exist on the mainframe. Running 4 'large' VMs for 'raw' performance? How about running 40 'large' VMs for 'raw' performance? Clearly, we all know that is impossible to achieve with current VM setups.

    Now I'm not saying that virtualization is a bad idea; it clearly is the ONLY solution for the future of computing. However, I think that the industry is going about it the wrong way. Server farms are becoming increasingly difficult to manage, never mind the challenge of getting 100s of blade servers to play nice with each other while providing good processing throughput.

    This problem was solved about 20 years ago; and yet, here we are, struggling again with the "how can I get MORE from my technology investment" scenario.

    In conclusion, I think we need to go back to utilizing huge monolithic computing designs; not computing clusters.
  • mikidutzaa2 - Friday, May 29, 2009

    Hello,

    It would be useful (if possible) to have latency numbers/response times on the tests as well, because we are rarely interested in throughput on our servers. What we usually care more about is how long it takes the server to respond to user actions.

    What is your opinion?
  • JohanAnandtech - Friday, May 29, 2009

    I agree. I admit it is easier for us or any benchmark person to use throughput, as it is immediately comparable (X is 10% faster than Y) and you have only one data point. That is why almost

    Response time, however, can only be understood by drawing curves relative to the current throughput / user concurrency. So yes, we are taking this excellent suggestion into consideration. The trade-off might be that articles get harder to read :-).
  • mikidutzaa2 - Friday, May 29, 2009

    Looking forward to your new articles then, glad to hear :).

    The articles don't necessarily have to be harder to read: you could put the detailed graphs on a separate page and maybe show only one response time for a "decent"/medium user concurrency.

    Also, I would find it interesting (if you have time) to have the same benchmarks with 2-vCPU machines; I think this is a more common setup for virtualization. Very few people, I think, virtualize their most critical/highly used platforms - at least that's how we do it. We need virtualization for lightly used platforms (i.e. not very many users), but we are still very much interested in response time because the users perceive latency, not throughput.

    So the important question is: if you have a virtual server (as opposed to a physical one) will the users notice? If so, by how much is it slower?

    Thank you.
  • RobAm - Tuesday, May 26, 2009

    It's good to see some unbiased analysis with respect to virtualization. It's also especially interesting that your workloads (which look much more like the real-world apps my company runs, as opposed to SPECjbb, VMmark, vConsolidate) show a much more competitive landscape than VMware and Intel portray. Also, doesn't VMware prohibit benchmarking without their permission? Did they give you permission? Has VMware called offering to re-educate you? :-)
  • Brovane - Tuesday, May 26, 2009

    I was hoping for some benchmarks on the Xeon x7xxx CPUs for the quad-socket Intel boxes. We currently have Dell R900s and we were looking at adding to our ESX cluster. We were debating between the R900 with hex-cores or Xeon x55xx series CPUs in the R710. I see the x55xx series was benchmarked, but nothing on the Xeon MP series, unless I am missing that part of the article.
  • JohanAnandtech - Tuesday, May 26, 2009

    Expect a 24-core CPU comparison soon :-).
  • Brovane - Tuesday, May 26, 2009

    You might also want to do a 12-core comparison. We have found that with a 4-socket box you usually run out of memory before you run out of CPU power. The R900 has 32 DIMM sockets; the R900s we purchased last year have 64GB of RAM and just 2x 2.93GHz CPUs, and we easily max out memory before CPU in our environment. Since VMware licensing and Data Center licensing is done per socket, we only populate 2 of the sockets with CPUs, and this seems to do great for us. You basically double your licensing costs if you go with all 4 sockets occupied. Just a thought as to how virtualization is sometimes done in the real world. There is such a price premium for 8GB DIMMs that it isn't worth it to put 256GB in one box with all 4 sockets occupied. The 4GB DIMMs did reach price parity this year, so we were looking at going for 128GB of memory on our new R900s; however, Intel also released hex-core CPUs, so we still don't see much reason to occupy all 4 sockets.
  • yasbane - Tuesday, May 26, 2009

    I know positive feedback is always appreciated for the hard work put in, but it seems very rare that we see any non-Microsoft benchmarks for server stuff these days on Anandtech. Is there any particular reason for this...? I don't mean to carp, but I recall the days when non-Microsoft technologies actually got a mention on Anandtech. Sadly, we don't seem to see that anymore :(

    Cheers
  • JohanAnandtech - Tuesday, May 26, 2009

    Yasbane, my first server testing articles (DB2, MySQL) were all pure Linux benches. However, we have moved on to a new kind of real-world benchmark, and it takes a while to master the new benchmarks we have introduced. Running Calling Circle and Dell DVD Store posed more problems on Linux than on Windows: we have lower performance, a few weird error messages and so on. In our lab, about 50% of the servers are running Linux (one odd machine is running OS X and another Solaris :-), and we definitely would love to see some serious Linux benchmarking again. But it will take time.

    Xen benchmarks are happening as I write this BTW.