VMmark

Before we take a look at our own virtualization benchmarking, let us look at the VMmark scores available at the time of writing (beginning of August 2010).

VMware VMmark

It is interesting to note that most AMD “Istanbul” Opteron servers benchmarked were using DDR2-667. That somewhat limited their VMmark scores, as consolidated virtualized servers have higher bandwidth demands than most “native running” servers. The dual Opteron 6176 has the same number of cores as the quad Opteron 8439. Those cores are also identical; only the uncore part has changed. So from a pure processing power point of view, the dual Opteron 6176 should be about 15% slower, as it runs at a lower clock speed. In reality, the dual-socket server is 3% faster than the older quad-socket server. This shows that VMmark really benefits from the improved memory subsystem, as the support for DDR3-1333 memory essentially doubles the bandwidth and lowers latency. That is still not enough to beat the Intel armada, as the fastest “Westmere” Xeon is about 16% faster than the best “Magny-Cours” Opteron.
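To put the “essentially doubles the bandwidth” claim in numbers, here is a minimal sketch of the peak theoretical bandwidth of a single memory channel. It assumes a standard 64-bit (8-byte) DDR channel and compares one channel only; how many channels each platform populates is not covered here, and sustained bandwidth in practice is lower than these upper bounds.

```python
# Peak theoretical bandwidth of one 64-bit DDR memory channel.
# Sustained bandwidth is lower in practice; this is only the upper bound.

BYTES_PER_TRANSFER = 8  # assumption: standard 64-bit wide DDR channel


def peak_bandwidth_gb_s(mega_transfers_per_s: float) -> float:
    """Return the peak bandwidth in GB/s of one channel at the given MT/s."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


ddr2_667 = peak_bandwidth_gb_s(667)
ddr3_1333 = peak_bandwidth_gb_s(1333)

print(f"DDR2-667 : {ddr2_667:.1f} GB/s per channel")
print(f"DDR3-1333: {ddr3_1333:.1f} GB/s per channel")
print(f"ratio    : {ddr3_1333 / ddr2_667:.2f}x")
```

Per channel, the move from DDR2-667 to DDR3-1333 works out to roughly twice the peak bandwidth, which is in line with the claim above.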

The Quad Xeon X7560 leaves everything behind in VMmark, by offering more than twice the performance of all dual configurations. Virtualization favors high core counts: you are running many different applications which do not have to exchange data most of the time. This reduces the thread synchronization overhead. Nonetheless, the scores that the Xeon X7560 gets are impressive. But of course, this is VMmark, an industry benchmark. The results also depend on how much time and effort is spent on tuning the benchmark. Since the introduction of the Xeon X7500 series, the VMmark scores have already improved by 7% (from 70.78 to 75.77). Let us check out vApus Mark II where each platform is treated the same.

vApus Mark II

vApus Mark II—VMware ESX 4.0

The overall picture remains the same, although there are some clear differences. First of all, the “Magny-Cours Opteron” and “Westmere Xeon” are closer to each other. The difference between the two best server CPUs with a “decent” TDP is only 4%. But the surprise is the landslide victory of the X7560. Let us analyze the results in more detail.

For the OLAP test, we took a dual Xeon X5570 without Hyper-Threading as reference. The reason for this is that the VM got eight vCPUs, and we compare this with a native server that has eight cores. For the web test, we used two Xeon X5570 cores as reference, that is, half of one quad-core Xeon X5570. The OLTP scores, obtained in a VM with four virtual CPUs, use the Swingbench scores of one Xeon X5570 as reference. We chose the Xeon “Nehalem” as reference because this server CPU is the natural yardstick for all new server CPUs: it outperformed all contemporary server CPUs by a large margin at its launch (March 2009).

Let us take a look at the more detailed results per VM. The vApus Mark II score is a geometric mean of the different VMs.

CPU config   Tiles   OLAP (1 VM)   Web (3 VMs)   OLTP (1 VM)   vApus Mark II score
Dual 6174    2       57%           30%           22%           67.5
Dual 6136    2       45%           23%           14%           48.6
Dual 7560    2       58%           51%           32%           91.8
Dual X5670   2       53%           43%           19%           70.0
Dual L5640   2       48%           33%           15%           57.6
Quad 7560    2       73%           73%           39%           118.6
Quad 7560    4       47%           50%           29%           162.7
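The exact scoring formula is not spelled out here, but the numbers in the table appear consistent with one reading: the score is the geometric mean of the three per-VM percentages, multiplied by the number of tiles. The sketch below checks that reading against the table. The formula itself is an assumption inferred from the published figures, and the function names are ours.

```python
from math import prod

# (CPU config, tiles, OLAP %, Web %, OLTP %, published score), from the table.
RESULTS = [
    ("Dual 6174", 2, 57, 30, 22, 67.5),
    ("Dual 6136", 2, 45, 23, 14, 48.6),
    ("Dual 7560", 2, 58, 51, 32, 91.8),
    ("Dual X5670", 2, 53, 43, 19, 70.0),
    ("Dual L5640", 2, 48, 33, 15, 57.6),
    ("Quad 7560", 2, 73, 73, 39, 118.6),
    ("Quad 7560", 4, 47, 50, 29, 162.7),
]


def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    return prod(values) ** (1.0 / len(values))


def estimated_score(tiles, olap, web, oltp):
    # Assumption: score = tiles x geometric mean of the per-VM percentages.
    return tiles * geometric_mean([olap, web, oltp])


estimates = []
for name, tiles, olap, web, oltp, published in RESULTS:
    est = estimated_score(tiles, olap, web, oltp)
    estimates.append((est, published))
    print(f"{name:<11} tiles={tiles}  estimated {est:6.1f}  published {published:6.1f}")
```

With the rounded percentages from the table, every estimate lands within about one point of the published score. The small gaps are what you would expect from rounding the percentages to whole numbers.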

The ESX scheduler works with Hardware Execution Contexts (HECs), each of which maps to one logical (Hyper-Threading) or physical core. In our current test, more HECs are demanded than are available, so this test is quite hard on the ESX scheduler. We still have to investigate why the OLTP scores are quite a bit lower than those of the other VMs. This VM is the most disk intensive and as such requires more VMkernel time than the others. This might explain why there is less processing power left for the application running inside the VM. Another reason is that this application requires more “co-scheduling”. In OLTP applications, threads rarely run independently, but have to synchronize frequently. In that case it is important that each virtual CPU gets equal processing power. If one vCPU gets ahead of the others, this may result in a thread waiting longer than necessary for another thread to release a spinlock.
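To make “more HECs demanded than available” concrete, here is a rough sketch of the vCPU-to-HEC ratio with two tiles. The OLAP (8) and OLTP (4) vCPU counts and the core and thread counts come from the text; the 2 vCPUs per web VM is an assumption based on the two-core web reference, so treat the totals as illustrative.

```python
# vCPUs demanded by one tile. OLAP (8) and OLTP (4) are stated in the text;
# 2 vCPUs for each of the 3 web VMs is an ASSUMPTION, not a stated figure.
VCPUS_PER_TILE = {"OLAP": 8, "Web": 3 * 2, "OLTP": 4}

# Hardware execution contexts per platform, as quoted in the article.
HECS = {
    "Dual Opteron 6136": 16,  # 16 cores
    "Dual Opteron 6174": 24,  # 24 cores
    "Dual Xeon X5670": 24,    # 24 threads (Hyper-Threading)
    "Dual Xeon X7560": 32,    # 32 threads (Hyper-Threading)
}


def overcommit_ratio(tiles: int, hecs: int) -> float:
    """Return demanded vCPUs divided by available HECs."""
    demanded = tiles * sum(VCPUS_PER_TILE.values())
    return demanded / hecs


ratios = {name: overcommit_ratio(2, hecs) for name, hecs in HECS.items()}
for name, ratio in ratios.items():
    print(f"{name:<18} {ratio:.2f} vCPUs per HEC")
```

Under these assumptions, every dual configuration has more vCPUs than HECs with two tiles, so the scheduler cannot run all vCPUs at the same time.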

Although ESX 3.5 and 4.0 feature “relaxed co-scheduling”, the best performance for this kind of application is achieved when the scheduler can “co-schedule” the syncing threads. The fact that the system with the highest logical core count gets the best percentages in the OLTP VM is another indication that the co-scheduling issue may play an important role. Notice how the dual Xeon X7560 with 32 threads does significantly better than the higher clocked dual Xeon X5670 (24 threads) when running the OLTP VM. While the overall performance of the dual Xeon X7560 is 31% better than that of the dual Xeon X5670 (91.8 vs 70.0), the OLTP performance is almost 70% (!) better. Another indication is consistency: the differences between the VMs are much smaller on the dual Xeon X7560.

The AMD systems show a similar picture. The dual 6136, with 16 cores in total, offers the lowest OLTP performance despite its decent 2.4GHz clock speed, as it has the fewest threads to offer the scheduler. The dual 6174 runs at a 9% lower clock speed but has 24 cores to offer. The result is that the OLTP VM performs a lot better (more “perfect” co-scheduling is possible): we noticed 57% better OLTP performance. The OLTP VM was even faster on the dual 6174 with its 24 “real” cores than on the dual Xeon X5670. Although this is only circumstantial evidence, we have strong indications that transactional workloads favor high core and thread counts.

Our measurements show that the quad Xeon X7560 is about 2.3 times as fast as the best dual-socket platform (162.7 vs 70.0 for the dual Xeon X5670). That makes one quad Xeon X7560 a very interesting alternative for every two dual-CPU servers you wish to buy for virtualization consolidation.
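The relative figures quoted in the last three paragraphs can be reproduced from the table with a few lines of arithmetic. This is only a sketch; the helper name is ours, and taking the dual Xeon X5670 as the “best dual platform” baseline for the 2.3x figure is our reading of the numbers.

```python
def percent_faster(a: float, b: float) -> float:
    """Return how much faster a is than b, in percent."""
    return (a / b - 1.0) * 100.0


# Overall vApus Mark II score: dual X7560 (91.8) vs dual X5670 (70.0).
overall_7560_vs_5670 = percent_faster(91.8, 70.0)

# OLTP VM percentages: dual X7560 (32%) vs dual X5670 (19%).
oltp_7560_vs_5670 = percent_faster(32, 19)

# OLTP VM percentages: dual 6174 (22%) vs dual 6136 (14%).
oltp_6174_vs_6136 = percent_faster(22, 14)

# Quad X7560 with four tiles (162.7) vs dual X5670 (70.0).
# Assumption: the dual X5670 is the baseline behind the "2.3 times" claim.
quad_vs_best_dual = 162.7 / 70.0

print(f"Overall, dual X7560 vs dual X5670: {overall_7560_vs_5670:.0f}% faster")
print(f"OLTP, dual X7560 vs dual X5670:    {oltp_7560_vs_5670:.0f}% faster")
print(f"OLTP, dual 6174 vs dual 6136:      {oltp_6174_vs_6136:.0f}% faster")
print(f"Quad X7560 vs dual X5670:          {quad_vs_best_dual:.1f}x")
```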


51 Comments


  • blue_falcon - Tuesday, August 10, 2010

    The R715 is an AMD box.
  • webdev511 - Tuesday, August 10, 2010

    Yes, and the R715 has 2x AMD Opteron™ 6176SE, 2.3GHz with 12 cores per socket with an approx price of $8,000
  • fic2 - Tuesday, August 10, 2010

    4. Part of the Anandtech 13 year anniversary giveaway?!! ;o)
  • mino - Wednesday, August 11, 2010

    Big Thanks for that !
  • Etern205 - Tuesday, August 10, 2010

    *stares at cpu graph*
    ~Drrroooollllliiiieeeeeeee~~~~
  • yuhong - Tuesday, August 10, 2010

    The incorrect references to Xeon 7200 should be Xeon 7100.
    "Other reasons include the fact that some decision makers never really bothered to read the benchmarks carefully"
    You didn't even need to do that. Knowing the difference between NetBurst vs Core 2 vs Nehalem would have made it obvious.
  • ELC - Tuesday, August 10, 2010

    Isn't the price of software licenses a major factor in the choice of optimum server size?
  • webdev511 - Tuesday, August 10, 2010

    So does the NUMA barrier.

    I'd go for less sockets with more cores any day of the week and as a result Intel= second string.
  • Ratman6161 - Wednesday, August 11, 2010

    For the software licensing reasons I mentioned above, there is a distinct advantage to fewer sockets with more cores.
  • davegraham - Wednesday, August 11, 2010

    so NUMA is an interesting one. Intel's QPI bus is actually quite good and worth spending some time to get to know.

    dave
