VMmark

Before we take a look at our own virtualization benchmarking, let us look at the VMmark scores available at the time of writing (beginning of August 2010).

VMware VMmark

It is interesting to note that most AMD “Istanbul” Opteron servers benchmarked were using DDR2-667. That somewhat limited their VMmark scores, as consolidated virtualized servers have higher bandwidth demands than most “natively running” servers. The dual Opteron 6176 has the same number of cores as the quad Opteron 8439, and those cores are identical; only the uncore part has changed. So from a pure processing power point of view, the dual Opteron 6176 should be about 15% slower. In reality, the dual-socket server is 3% faster than the older quad-socket one. This shows that VMmark really benefits from the improved memory subsystem: the support for DDR3-1333 memory essentially doubles the bandwidth and lowers latency. That is still not enough to beat the Intel armada, as the fastest “Westmere” Xeon is about 16% faster than the best Opteron “Magny-Cours”.

The quad Xeon X7560 leaves everything else behind in VMmark, offering more than twice the performance of any dual-socket configuration. Virtualization favors high core counts: you are running many different applications that rarely have to exchange data, which reduces the thread synchronization overhead. Nonetheless, the scores the Xeon X7560 achieves are impressive. But of course, this is VMmark, an industry benchmark, and the results also depend on how much time and effort is spent on tuning it. Since the introduction of the Xeon X7500 series, the VMmark scores have already improved by 7% (from 70.78 to 75.77). Let us check out vApus Mark II, where each platform is treated the same.

vApus Mark II

vApus Mark II—VMware ESX 4.0

The overall picture remains the same, although there are some clear differences. First of all, the “Magny-Cours” Opteron and “Westmere” Xeon are closer together: the difference between the two best server CPUs with a “decent” TDP is only 4%. The surprise is the landslide victory of the X7560. Let us analyze the results in more detail.

For the OLAP test, we took a dual Xeon X5570 without Hyper-Threading as the reference. The reason is that the VM got eight vCPUs, and we compare this with a native server that has eight cores. For the web test, we used two Xeon X5570 cores as the reference, or a Xeon X5570 cut in two. The OLTP scores, obtained in a VM with four virtual CPUs, use the Swingbench scores of one Xeon X5570 as the reference. The reason we chose the Xeon “Nehalem” as the reference is that this server CPU is the natural yardstick for all new server CPUs: it outperformed all contemporary server CPUs by a large margin at its launch (March 2009).
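
To make the scoring concrete, here is a minimal sketch of that normalization, assuming each VM's measured throughput is simply divided by the throughput of its native Xeon X5570 reference configuration. The throughput figures and function below are illustrative assumptions, not the actual vApus Mark II tooling.

    # Illustrative sketch of the reference normalization described above.
    # Only the idea (VM throughput divided by the native Xeon X5570 reference)
    # comes from the article; the throughput figures are hypothetical.
    REFERENCE_THROUGHPUT = {
        "OLAP": 250.0,   # dual X5570 without Hyper-Threading (8 cores), queries/s
        "Web":  1800.0,  # two X5570 cores (half a quad-core X5570), requests/s
        "OLTP": 120.0,   # one Xeon X5570 running Swingbench, transactions/s
    }

    def normalized_score(vm_type: str, measured_throughput: float) -> float:
        """Return a VM's result as a percentage of its native X5570 reference."""
        return 100.0 * measured_throughput / REFERENCE_THROUGHPUT[vm_type]

    # Example: an OLAP VM reaching 142.5 queries/s would score 57% of the reference.
    print(f"{normalized_score('OLAP', 142.5):.0f}%")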

Let us take a look at the more detailed results per VM. The vApus Mark II score is a geometric mean of the different VMs.

CPU config    Tiles   OLAP (1 VM)   Web (3 VMs)   OLTP (1 VM)   vApus Mark II score
Dual 6174     2       57%           30%           22%           67.5
Dual 6136     2       45%           23%           14%           48.6
Dual 7560     2       58%           51%           32%           91.8
Dual X5670    2       53%           43%           19%           70.0
Dual L5640    2       48%           33%           15%           57.6
Quad 7560     2       73%           73%           39%           118.6
Quad 7560     4       47%           50%           29%           162.7
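
As noted above, the vApus Mark II score is a geometric mean of the per-VM results. Below is a minimal sketch of how such a composite could be computed from one tile's percentages, assuming the five VMs of a tile (one OLAP, three web, one OLTP) are weighted equally and the result is multiplied by the number of tiles. The exact weighting and scaling of vApus Mark II may differ, which is why the sketch lands near, but not exactly on, the published 67.5 for the dual 6174.

    # Minimal sketch of a geometric-mean composite (assumed weighting; the real
    # vApus Mark II scaling may differ).
    from math import prod

    def tile_geomean(olap: float, web: float, oltp: float) -> float:
        """Geometric mean over one tile's five VMs: 1x OLAP, 3x web, 1x OLTP."""
        vms = [olap] + [web] * 3 + [oltp]   # percentages vs. the X5570 reference
        return prod(vms) ** (1.0 / len(vms))

    def composite_score(olap: float, web: float, oltp: float, tiles: int) -> float:
        """Hypothetical composite: per-tile geometric mean scaled by the tile count."""
        return tiles * tile_geomean(olap, web, oltp)

    # Dual Opteron 6174 from the table: 57% OLAP, 30% web, 22% OLTP over 2 tiles.
    print(round(composite_score(57, 30, 22, tiles=2), 1))   # 64.1 vs. the published 67.5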

The ESX scheduler works with Hardware Execution Contexts (HECs), each of which maps to one logical (Hyper-Threading) or physical core. In our current test, more HECs are demanded than are available, so this test is quite hard on the ESX scheduler. We still have to investigate why the OLTP scores are quite a bit lower than those of the other VMs. This VM is the most disk intensive and as such requires more VMkernel time than the others, which might explain why there is less processing power left for the application running inside the VM. Another reason is that this application requires more “co-scheduling”. In OLTP applications, threads rarely run independently; they have to synchronize frequently. In that case it is important that each virtual CPU gets equal processing power. If one vCPU gets ahead of the others, a thread may end up waiting longer than necessary for another to release a spinlock.
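
A toy model makes the point (this is purely an illustration with made-up numbers, not ESX code): if every transaction batch ends at a synchronization point, throughput is gated by the vCPU that received the least CPU time, even when the total amount of CPU time handed out is identical.

    # Toy model of the co-scheduling effect described above (illustrative only,
    # not ESX's scheduler): every batch ends at a synchronization point, so all
    # vCPUs wait for the slowest one before the next batch can start.
    def batches_per_second(cpu_share_per_vcpu, work_per_batch=1.0):
        """cpu_share_per_vcpu: fraction of a physical core each vCPU actually received."""
        return min(cpu_share_per_vcpu) / work_per_batch   # the laggard gates progress

    # Four vCPUs co-scheduled fairly: each gets 80% of a core.
    print(batches_per_second([0.8, 0.8, 0.8, 0.8]))   # 0.8

    # Same total CPU time (3.2 cores' worth), but one vCPU lags: throughput drops to 0.4.
    print(batches_per_second([1.0, 1.0, 0.8, 0.4]))   # 0.4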

Although ESX 3.5 and 4.0 feature “relaxed co-scheduling”, the best performance for these kinds of applications is achieved when the scheduler can co-schedule the syncing threads. The fact that the system with the highest logical core count gets the best percentages in the OLTP VM is another indication that the co-scheduling issue plays an important role. Notice how the dual Xeon X7560 with 32 threads does significantly better than the higher-clocked Xeon X5670 (24 threads) when running the OLTP VM. While the overall performance of the dual Xeon X7560 is 31% better than that of the Xeon X5670 (91.8 vs 70), the OLTP performance is almost 70% (!) better. Another indication is consistency: the differences between the VMs are much smaller on the dual Xeon X7560.

The AMD systems show a similar picture. The 16-core dual Opteron 6136, despite its decent 2.4GHz clock speed, offers the lowest OLTP performance, as it has the fewest threads to offer the scheduler. The dual 6174 runs at a 9% lower clock speed but has 24 cores to offer. The result is that the OLTP VM performs a lot better (more “perfect” co-scheduling is possible): we noticed 57% better OLTP performance. The OLTP VM was even faster on the dual 6174 with its 24 “real” cores than on the Xeon X5670. Although this is only circumstantial evidence, we have strong indications that transactional workloads favor high core and thread counts.

Our measurements show that the quad Xeon X7560 is about 2.3 times faster than the best dual-socket platforms. That makes one quad Xeon X7560 a very interesting alternative for every two dual-CPU servers you plan to buy for virtualization consolidation.
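
As a quick sanity check, the ratios quoted in the last few paragraphs follow directly from the table above (small rounding differences aside):

    # Deriving the quoted ratios from the table above.
    scores = {
        "Dual 6136":            {"overall": 48.6,  "oltp": 14},
        "Dual 6174":            {"overall": 67.5,  "oltp": 22},
        "Dual X5670":           {"overall": 70.0,  "oltp": 19},
        "Dual 7560":            {"overall": 91.8,  "oltp": 32},
        "Quad 7560 (4 tiles)":  {"overall": 162.7, "oltp": 29},
    }

    # Dual X7560 vs. dual X5670: ~31% better overall, ~68% better in the OLTP VM.
    print(scores["Dual 7560"]["overall"] / scores["Dual X5670"]["overall"])   # ~1.31
    print(scores["Dual 7560"]["oltp"] / scores["Dual X5670"]["oltp"])         # ~1.68

    # Dual 6174 vs. dual 6136 in the OLTP VM: ~57% better.
    print(scores["Dual 6174"]["oltp"] / scores["Dual 6136"]["oltp"])          # ~1.57

    # Quad X7560 (4 tiles) vs. the best "classic" dual platform (dual X5670): ~2.3x.
    print(scores["Quad 7560 (4 tiles)"]["overall"] / scores["Dual X5670"]["overall"])   # ~2.32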

Comments

  • Ratman6161 - Wednesday, August 11, 2010 - link

    Many products are licensed on a per-CPU basis. For Microsoft, anyway, what they actually count is the number of sockets. For example, SQL Server Enterprise retails for $25K per CPU. So an old 4-socket system with single cores would be 4 x $25K = $100K. A quad-socket system with quad-core CPUs would be a total of 16 cores, but the pricing would still be 4 sockets x $25K = $100K. It used to be that Oracle had a complex formula for figuring this, but I think they have now also gone to the simpler method of just counting sockets (though their enterprise edition is $47.5K).

    If you are using VMware, they also charge per socket (last I knew), so two dual-socket systems would cost the same as a single 4-socket system. The thing is, though, you need at least two boxes in order to enable the high availability (i.e. automatic failover) functionality.
  • Stuka87 - Wednesday, August 11, 2010 - link

    For VMware there are a few pricing structures. You can be charged per physical socket, or you can get an unlimited-socket license (which is what we have, running on seven R910's). You just need to figure out whether you really need the top-tier license.
  • semo - Tuesday, August 10, 2010 - link

    "Did I mention that there is more than 72GHz of computing power in there?"

    Is this ebay?
  • Devo2007 - Tuesday, August 10, 2010 - link

    I was going to comment on the same thing.

    1) A dual core 2GHz CPU does not equal "4GHz of computing power" - unless somehow you were achieving an exact doubling of performance (which is extremely rare if it exists at all).

    2) Even if there was a workload that did show a full doubling of performance, performance isn't measured in MHz & GHz. A dual-core 2GHz Intel processor does not perform the same as a 2GHz AMD CPU.

    More proof that the quality of content on AT is dropping. :(
  • mino - Wednesday, August 11, 2010 - link

    You seem to know very little about the (40-year-old!) virtualization market.
    It flourishes by *commoditizing* processing power.

    While clearly meant as a joke, that statement of Johan's is much closer to the truth than most market "research" reports on x86.
  • JohanAnandtech - Wednesday, August 11, 2010 - link

    Exactly. ESX resource management lets you reserve CPU power in GHz. So for ESX, two 2.26 GHz cores are indeed a 4.5 GHz resource.
  • duploxxx - Thursday, August 12, 2010 - link

    Sure, you can pool resources together as much as you want... virtually. But in the end a single process can still only use the maximum GHz a single CPU offers; a higher clock just lets it finish the request faster. That is exactly why those Nehalem and Gulftown parts still hold up against the huge core count of Magny-Cours.
  • maeveth - Tuesday, August 10, 2010 - link

    I have nothing at all against AnandTech's recent articles on virtualization; however, so far all of them have only looked at virtualization from a compute density point of view.

    I am currently the administrator of a VMware environment used for development work, and I run into I/O bottlenecks FAR before I ever run into a compute bottleneck. In fact, computational power is pretty much the LAST bottleneck I run into. My environment currently holds just short of 300 VMs, with a mix of operating systems. We peak at approximately 10-12K IOPS.

    From my experience, you always have to look at potential performance in a virtual environment from a much larger perspective. Every bottleneck affects the others in subtle ways. For example, if you have a memory bottleneck, either host or guest based, you will further impact your I/O subsystem, though you should aim to not have to swap at all. In my opinion, your storage backend is the single most important factor in determining large-scale-out performance in a virtualized environment.

    My environment has never once run into a CPU bottleneck. I use IBM x3650/x3650M2 with Dual Quad Xeons. The M2s use X5570s specifically.

    While I agree that having impressive magnitudes of "GHz" in your environment is kinda fun, it hardly says anything about how that environment will perform in the real world. Granted, it is all highly subject to workload patterns.

    I also want to make it clear that I understand that testing on such a scale is extremely cost prohibitive. As such, I am sure AnandTech, and Johan specifically, is doing the best he can with the resources he is given. I just wanted to throw my knowledge out there.

    @ELC
    Yes, software licensing is a huge factor when purchasing ESX servers. ESX is licensed per socket. It's a balancing act that depends on your workload, however. A top-end ESX license costs about $5,500 per socket per year.
  • mino - Wednesday, August 11, 2010 - link

    However, IMO storage performance analysis is pretty much beyond AT's budget ballpark by an order of magnitude (or two).

    There is a reason this space is so happily "virtualized" by storage vendors AND customers into a "simple" IOPS number.
    It is a science of its own, often closer to black (empirical) magic than to deterministic rules...

    Johan,
    on the other hand, nothing prevents you from mentioning this sad fact:

    Except for edge cases, a good virtualization solution is built from the ground up with:
    1. SLAs
    2. the storage solution
    3. licensing considerations
    4. everything else (like the processing architecture) dictated by the previous three
  • JohanAnandtech - Wednesday, August 11, 2010 - link

    I can only agree, of course: in most cases the storage solution is the main bottleneck. However, this is also a result of the fact that most storage solutions out there are not exactly speed demons. Many of them consist of overengineered (and overpriced) software running on outdated hardware. But things are changing quickly now. HP, for example, seems to recognize that a storage solution is very similar to a server running specialized software. What's more, with a bit of luck, Hitachi and Intel will bring some real competition to the table (currently STEC has almost a monopoly on enterprise SSDs). So your number 2 is going to tumble down :-).
