vApus Mark I: Performance-Critical Applications Virtualized

You might remember from our previous article that vApus Mark I, our in-house developed virtualization benchmark, is designed to measure the performance of "heavy" performance-critical applications. Virtualization vendors are actively promoting the idea that you should virtualize these OLTP databases and heavy websites too, so that the virtualization software can manage them dynamically. In other words, if you want high availability, load balancing, and low power (by shutting down servers that are not in use), everything should be virtualized.

That is where vApus Mark I comes in: one OLAP database, one OLTP database, and two heavy websites are combined in one tile. These are the kind of demanding applications that still received their own dedicated, natively running machine a year ago. vApus Mark I shows what happens if you virtualize them. If you want to fully understand our benchmark methodology, vApus Mark I has been described in great detail here. We enabled large pages, as this is generally considered a best practice with AMD's RVI and Intel's EPT.

vApus Mark I uses four VMs with four server applications:

 

  • An SQL Server 2008 x64 database running on Windows 2008 64-bit, stress tested by our in-house developed vApus test.
  • Two heavy-duty MCS eFMS portals running PHP and IIS on Windows 2003 R2, stress tested by our in-house developed vApus test.
  • One OLTP database, based on the Oracle 10g "Calling Circle" benchmark by Dominic Giles.

 

The beauty is that vApus (stress testing software developed by the Sizing Servers Lab) uses actions made by real people (as can be seen in logs) to stress test the VMs, not some benchmarking algorithm.

To make things more interesting, we enabled and disabled HT-assist on the quad Opteron 8435 platform. HT-assist (described here in detail) steals 1MB from the L3 cache, reducing the size of the L3 cache to 5MB. The 1MB of cache is used as a very fast directory which eliminates a lot of snoop traffic. Eliminating snoop traffic reduces the "bandwidth pressure" on the CPU interconnects (hence the name HyperTransport Assist), but more importantly it reduces the latency of a cache request.
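
To get a feel for why a directory cuts snoop traffic, here is a minimal toy model. It is a sketch under simplifying assumptions, not AMD's actual coherence protocol: only reads are modeled, every first touch of a cache line by a socket counts as a miss, and the function name `count_probes` and the access trace are hypothetical. Without a directory, each miss broadcasts a probe to the three other sockets; with a directory, only the sockets known to hold the line are probed.

```python
# Toy model only: this is NOT AMD's actual coherence protocol.
SOCKETS = 4  # quad-socket system, as in the article


def count_probes(accesses, use_directory):
    """Count probe messages for a sequence of (socket, cache_line) reads."""
    holders_of = {}  # cache_line -> set of sockets holding a copy
    probes = 0
    for socket, line in accesses:
        holders = holders_of.setdefault(line, set())
        if socket in holders:
            continue  # already cached locally: no probe needed
        if use_directory:
            # The directory knows which sockets hold the line: probe only those.
            probes += len(holders)
        else:
            # No directory: broadcast a probe to every other socket.
            probes += SOCKETS - 1
        holders.add(socket)
    return probes


# Hypothetical trace: each socket reads 100 private lines, then three
# sockets read one shared line.
trace = [(s, (s, i)) for s in range(SOCKETS) for i in range(100)]
trace += [(0, "shared"), (1, "shared"), (2, "shared")]

broadcast = count_probes(trace, use_directory=False)
filtered = count_probes(trace, use_directory=True)
print(f"broadcast probes: {broadcast}, with directory: {filtered}")
```

In this trace almost all data is private to one socket, so the directory removes nearly every probe. Real workloads share more data, and the real hardware pays for the directory with 1MB of L3 cache, which is why the measured gain is far smaller than this toy ratio suggests.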

 

Chart: vApus Mark I 2 tile test - ESX 4.0

 

Thanks to HT-assist, the 24 Opteron cores communicate more efficiently and perform about 9% faster. That is not huge, but it widens the gap with the dual Xeon somewhat. The dual Xeon X5570 keeps up with the much more expensive quad socket Intel server: eight cores are just as fast as 24.

With two tiles, four VMs per tile, and four vCPUs per VM, a total of 32 vCPUs were active in the previous test. 32 vCPUs are hard to schedule on hex-core CPUs, especially on 24 cores in total. So let us see what happens if we reduce the total number of vCPUs to 24.


8 VMs, 2 tiles of vApus Mark I, 24 vCPUs

We reduced the number of vCPUs on the web portal VMs from 4 to 2. That means that we have:

 

  • Two times 4 vCPUs for the OLAP test
  • Two times 4 vCPUs for the OLTP test
  • Two times two VMs with 2 vCPUs each for the web test

 

That makes a total of 24 vCPUs. The 32 vCPU test is somewhat biased towards the quad-core CPUs such as the Xeon X5570 while the test below favors the hex-cores.
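
The vCPU arithmetic can be checked with a short sketch. The dictionary keys and function names below are illustrative, and the code assumes each tile holds one OLAP VM, one OLTP VM, and two web portal VMs, as described earlier. The hardware thread counts are the ones quoted in this article.

```python
TILES = 2

# vCPUs per VM in one tile: OLAP, OLTP, and two web portal VMs.
TILE_32 = {"olap": 4, "oltp": 4, "web_a": 4, "web_b": 4}
TILE_24 = {"olap": 4, "oltp": 4, "web_a": 2, "web_b": 2}

# Hardware threads per server, as quoted in the article.
HW_THREADS = {
    "Quad Opteron 8435": 24,
    "Quad Xeon X7460": 24,
    "Dual Xeon X5570": 16,  # 8 cores, 16 threads
}


def total_vcpus(tile, tiles=TILES):
    """Total active vCPUs across all tiles."""
    return tiles * sum(tile.values())


def vcpus_per_thread(tile):
    """vCPUs competing for each hardware thread, per server."""
    total = total_vcpus(tile)
    return {server: total / threads for server, threads in HW_THREADS.items()}


for name, tile in (("32 vCPU test", TILE_32), ("24 vCPU test", TILE_24)):
    print(name, total_vcpus(tile), vcpus_per_thread(tile))
```

With 24 vCPUs, the 24-core machines get exactly one vCPU per core, while the 32 vCPU configuration maps evenly onto the 16 hardware threads of the dual Xeon X5570.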

 

Chart: vApus Mark I 2 tile test - 24 vCPUs - ESX 4.0

 

The "Dunnington" platform beats the 16 thread, 8 core Nehalem server but it is nothing to write home about: the 24 core machine outperforms Intel's latest dual socket by 6%. The advantage of the Opteron 8435 compared to the Xeon X7460 shrinks from 28 to 21%, but that is still a tangible performance advantage. Our understanding of virtualization performance is growing. Take a look at the table below.

Virtualization Testing Results

| Server System Comparison | vApus Mark I - 24 vCPUs | vApus Mark I - 32 vCPUs | VMmark |
|---|---|---|---|
| Quad Xeon X7460 vs. Dual Xeon X5570 2.93 | 6% | -2% | -15% |
| Quad Opteron 8435 vs. Dual Xeon X5570 2.93 | 29% | 26% | 21% |
| Quad Opteron 8435 vs. Quad X7460 | 21% | 28% | 42% |
| Dual Xeon X5570 2.93 vs. Dual Opteron 2435 | 11% | 30% | 54% |

Notice how the VMmark benchmark absolutely prefers the new "Nehalem" platform: the dual Xeon X5570 is 54% faster than the dual Opteron 2435, while its lead is only 11-30% on vApus Mark I. The quad Opteron 8435 is up to 29% faster than Intel's speed demon on vApus Mark I, while VMmark indicates only a 21% lead. But notice that vApus Mark I is also friendlier towards the Intel hex-core: VMmark tells us that eight Nehalem cores are 15% faster than 24 Dunnington cores, whereas vApus Mark I tells us that the quad X7460 is about as fast as the dual Xeon X5570. So why is VMmark so much happier on the Xeon X5570 server? The answer might be found in the table below.

One VMmark tile generates about 21,000 interrupts per second, 22 MB/s of storage I/O, and 55 Mbit/s of network traffic. We have profiled vApus Mark I in depth before. The table below compares both benchmarks from the hypervisor's point of view.

Virtualization Benchmarks Profiling

| Metric | vApus Mark I (Dual Xeon X5570) | VMmark (Dual Xeon X5570) |
|---|---|---|
| Total interrupts per second | 2 x 19 K/s = 38 K/s | 17 x 21 K/s = 357 K/s |
| Storage | 2 x 4.1 MB/s = 8.2 MB/s | 17 x 22 MB/s = 374 MB/s |
| Network | 2 x 50 Mbit/s = 100 Mbit/s | 17 x 55 Mbit/s = 935 Mbit/s |

VMmark places a lot more stress on the hypervisor and the way it handles I/O. It produces almost 10 times more interrupts and roughly 45 times more storage I/O (374 MB/s versus 8.2 MB/s). We know from our profiling that vApus Mark I does a lot of page management, which is a logical result of the application choice (databases that open and close connections) and the amount of memory per VM.
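
The ratios between the two profiles follow directly from the per-tile figures and tile counts in the profiling table. The sketch below simply redoes that arithmetic; the dictionary layout and function names are illustrative choices, not part of either benchmark.

```python
# Tile counts and per-tile figures taken from the profiling table.
PROFILES = {
    "vApus Mark I": {"tiles": 2, "interrupts_k": 19, "storage_mb": 4.1, "network_mbit": 50},
    "VMmark": {"tiles": 17, "interrupts_k": 21, "storage_mb": 22, "network_mbit": 55},
}
METRICS = ("interrupts_k", "storage_mb", "network_mbit")


def totals(profile):
    """Scale the per-tile figures by the number of tiles."""
    return {metric: profile["tiles"] * profile[metric] for metric in METRICS}


def ratios(a, b):
    """How many times larger each total of profile a is compared to profile b."""
    total_a, total_b = totals(a), totals(b)
    return {metric: total_a[metric] / total_b[metric] for metric in METRICS}


vmmark_vs_vapus = ratios(PROFILES["VMmark"], PROFILES["vApus Mark I"])
for metric, ratio in vmmark_vs_vapus.items():
    print(f"{metric}: VMmark generates {ratio:.1f}x the vApus Mark I load")
```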

The result is that VMmark, with its huge number of VMs per server (up to 102 VMs!), places a lot of stress on the I/O systems. The Intel Xeon X5570's crushing VMmark results cannot be explained by the processor architecture alone. One possible explanation is that the VMDq implementation (multiple queues and offloading of the virtual switch to the hardware) of the Intel NICs is better than that of the Broadcom NICs typically found in AMD-based servers.

32 Comments

  • Photubias - Wednesday, October 7, 2009

    This is surely to be tested, but the Fiorano platform (as this AMD Chipset is called), is yet to be released.
  • solori - Wednesday, October 7, 2009

    Fiorano (SR5690/SP5100, et al) are out now for Socket-F and really require an Istanbul to show their stuff (like IOV, etc). With a minor tweak on HT bus speeds, don't expect to see much improvement in memory bandwidth for Fiorano/Socket-F pairings. Where you should see improvement is in power consumption - pairing HE/EE Istanbul parts with Fiorano/Kroner should create a better performance/watt result in virtualization.

    Collin C. MacMillan
    http://blog.solori.net
  • bpdski - Tuesday, October 6, 2009

    It is pretty amazing how fast the new 55xx chips are. Personally, I am holding out on any new server purchases and deployments until the EX systems come out next year. I am pretty excited about the performance potential of a dual or quad octal-core system. I feel for AMD, but if the EX systems scale as well as they should, they are really going to crush the Opterons.
  • duploxxx - Wednesday, October 7, 2009

    2 answers to that. First of all, looking at the design, EX will be way more expensive, creating a gap between the 2-socket and 4-socket platforms; even deploying only 2 octa-cores will be a very expensive baseline due to the motherboard layout. Too expensive actually, and a lot of the focus is on trying to get RISC/SPARC market share.

    Second, don't you think AMD knows this? The C32/G34 platform launch is much closer than people think. AMD made a clear roadmap, and since 45nm everything looks to be in good shape. Keep in mind the CPU for the new platform is almost ready since it is based on Istanbul, and the new platform chipset was also released a few weeks ago for the Socket F platform. You will also see much more OEM activity with this platform due to a single-brand supplier, with no more need for the old NVIDIA/Broadcom chipsets.

    EX was delayed-delayed-delayed; if it continues like this it will be launched more or less at the same time, so keep the feeling. BTW, even if the 55xx series were again a badly performing server part (which it finally is not, thank you Intel), 75% of the market would still be buying it just for the brand name.....:)
  • cosminliteanu - Tuesday, October 6, 2009

    Many thanks for this article !
    :)
  • BrightCandle - Tuesday, October 6, 2009

    A dual socket will easily fit in a 1U. But 1.25A is some serious extra cost within a colo.

    The 2U quad sockets on the other hand are a busting 500W+, again serious extra money in a colo.

    The colos want you using 0.5A per 1U, so there is a major mismatch between these machines and the reality of the power you can actually get. Love the speed, not liking the cost of running them.
  • sonicdeth - Tuesday, October 6, 2009

    Thanks for this. Personally I can't recommend any of the quad socket systems until we see Intel's Nehalem-EX early next year. The dual core 55xx series is just fantastic for the price (especially with VMware). We've deployed several HP 380G6's and couldn't be happier.
  • Bazili - Tuesday, October 6, 2009

    Great article. Congrats!!!

    Could you please include a software price analysis? I guess it can show huge differences between a 24 core box and an 8 core box.


  • tobrien - Tuesday, October 6, 2009

    these are amazing articles, you guys do such an awesome job with these.

    thanks a ton!
  • JohanAnandtech - Wednesday, October 7, 2009

    Thanks for the kudos! much appreciated :-)
