VMmark

Before we move on to our own virtualization benchmarking, let us look at the VMmark scores available at the time of writing (early August 2010).

VMware VMmark

It is interesting to note that most AMD “Istanbul” Opteron servers benchmarked were using DDR2-667. That somewhat limited their VMmark scores, as consolidated virtualized servers have higher bandwidth demands than most servers running natively. The dual Opteron 6176 has the same number of cores as the quad Opteron 8439, and those cores are identical; only the uncore part has changed. So from a pure processing power point of view, the dual Opteron 6176 should be about 15% slower, as it runs its cores at a lower clock speed. In reality, the dual-socket machine is 3% faster than the older quad-socket server. This shows that VMmark really benefits from the improved memory subsystem: support for DDR3-1333 essentially doubles the bandwidth and lowers latency. That is still not enough to beat the Intel armada, as the fastest “Westmere” Xeon is about 16% faster than the best “Magny-Cours” Opteron.
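
As a back-of-the-envelope check on that bandwidth claim: peak bandwidth per memory channel is the transfer rate times the 8-byte channel width. The sketch below is plain arithmetic, not a measurement:

```python
# Back-of-the-envelope peak bandwidth per memory channel:
# transfer rate (transfers/s) x 8 bytes per transfer.
ddr2_667 = 667e6 * 8 / 1e9    # ~5.3 GB/s per channel
ddr3_1333 = 1333e6 * 8 / 1e9  # ~10.7 GB/s per channel
print(f"DDR2-667:  {ddr2_667:.1f} GB/s")
print(f"DDR3-1333: {ddr3_1333:.1f} GB/s ({ddr3_1333 / ddr2_667:.1f}x)")
```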

The quad Xeon X7560 leaves everything else behind in VMmark, offering more than twice the performance of any dual-socket configuration. Virtualization favors high core counts: you are running many different applications that rarely have to exchange data, which reduces the thread synchronization overhead. Even so, the scores the Xeon X7560 achieves are impressive. But of course, this is VMmark, an industry benchmark, and the results also depend on how much time and effort is spent tuning for it. Since the introduction of the Xeon X7500 series, the VMmark scores have already improved by 7% (from 70.78 to 75.77). Let us check out vApus Mark II, where each platform is treated the same.

vApus Mark II

vApus Mark II—VMware ESX 4.0

The overall picture remains the same, although there are some clear differences. First of all, the “Magny-Cours” Opteron and “Westmere” Xeon are closer to each other: the difference between the two best server CPUs with a “decent” TDP is only 4%. But the surprise is the landslide victory of the X7560. Let us analyze the results in more detail.

For the OLAP test, we took a dual Xeon X5570 without Hyper-Threading as the reference. The reason is that this VM gets eight vCPUs, so we compare it with a native server that has eight cores. For the web test, we used two Xeon X5570 cores as the reference, in other words a Xeon X5570 cut in two. The OLTP score, obtained in a VM with four virtual CPUs, uses the Swingbench score of one Xeon X5570 as the reference. We chose the “Nehalem” Xeon as the reference because this server CPU is the natural yardstick for all new server CPUs: at its launch (March 2009), it outperformed all contemporary server CPUs by a large margin.
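
To make the normalization concrete: each VM's result is simply its throughput divided by the throughput of its reference system. A minimal sketch with made-up throughput numbers (chosen to land on the dual Opteron 6174 percentages in the table below; only the normalization scheme reflects the text):

```python
# Hypothetical throughputs; only the normalization scheme follows the text.
references = {"OLAP": 520.0, "Web": 180.0, "OLTP": 95.0}  # native X5570-based references (made up)
vm_results = {"OLAP": 296.0, "Web": 54.0, "OLTP": 20.9}   # measured inside the VMs (made up)

for vm, result in vm_results.items():
    print(f"{vm}: {result / references[vm]:.0%} of the reference")
```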

Let us take a look at the more detailed results per VM. The vApus Mark II score is a geometric mean of the per-VM results; a short sketch of that calculation follows the table.

| CPU config | Tiles | OLAP (1 VM) | Web (3 VMs) | OLTP (1 VM) | vApus Mark II score |
|---|---|---|---|---|---|
| Dual Opteron 6174 | 2 | 57% | 30% | 22% | 67.5 |
| Dual Opteron 6136 | 2 | 45% | 23% | 14% | 48.6 |
| Dual Xeon X7560 | 2 | 58% | 51% | 32% | 91.8 |
| Dual Xeon X5670 | 2 | 53% | 43% | 19% | 70.0 |
| Dual Xeon L5640 | 2 | 48% | 33% | 15% | 57.6 |
| Quad Xeon X7560 | 2 | 73% | 73% | 39% | 118.6 |
| Quad Xeon X7560 | 4 | 47% | 50% | 29% | 162.7 |
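
One plausible scoring formula, which approximately reproduces the table, is 100 × (number of tiles) × the geometric mean of the three per-VM percentages. A minimal sketch under that assumption:

```python
from math import prod

def vapus_score(tiles, olap, web, oltp):
    """Assumed formula: 100 x tiles x geometric mean of the per-VM fractions."""
    vms = (olap, web, oltp)
    return 100 * tiles * prod(vms) ** (1 / len(vms))

print(round(vapus_score(2, 0.57, 0.30, 0.22), 1))  # Dual 6174  -> ~67.0 (table: 67.5)
print(round(vapus_score(2, 0.53, 0.43, 0.19), 1))  # Dual X5670 -> ~70.2 (table: 70.0)
print(round(vapus_score(4, 0.47, 0.50, 0.29), 1))  # Quad X7560 -> ~163.4 (table: 162.7)
```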

The ESX scheduler works with Hardware Execution Contexts (HECs), each of which maps to one logical (Hyper-Threading) or physical core. In our current test, more HECs are demanded than are available, so this test is quite hard on the ESX scheduler. We still have to investigate why the OLTP scores are quite a bit lower than those of the other VMs. This VM is the most disk-intensive and as such requires more VMkernel time than the others, which might explain why less processing power is left for the application running inside the VM. Another reason is that this application requires more “co-scheduling”: in OLTP applications, threads rarely run independently, but have to synchronize frequently. In that case it is important that each virtual CPU gets equal processing power. If one vCPU gets ahead of the others, a thread may end up waiting longer than necessary for another to release a spinlock.
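
A toy model (in no way the real ESX scheduler, just an illustration of the compounding effect) shows why skew hurts synchronizing workloads: assume the workload can only advance in a given scheduler tick when none of its vCPUs was preempted, because the threads meet at a spinlock-guarded section.

```python
import random

def synced_throughput(vcpus, p_preempt, ticks=100_000):
    """Toy model: each tick the workload advances only if no vCPU
    was preempted, because the threads sync on a spinlock."""
    done = sum(
        all(random.random() > p_preempt for _ in range(vcpus))
        for _ in range(ticks)
    )
    return done / ticks

random.seed(1)
for n in (1, 4, 8):
    # An independent thread loses only its own time; syncing threads compound.
    print(f"{n} syncing vCPU(s): {synced_throughput(n, 0.10):.2f} of ideal")
# Roughly 0.90, 0.66, 0.43: the stall grows with the vCPU count.
```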

Although ESX 3.5 and 4.0 feature “relaxed co-scheduling”, the best performance for this kind of application is achieved when the scheduler can co-schedule the syncing threads. The fact that the system with the highest logical core count gets the best percentages in the OLTP VM is another indication that co-scheduling plays an important role. Notice how the dual Xeon X7560 with 32 threads does significantly better than the higher-clocked Xeon X5670 (24 threads) when running the OLTP VM. While the overall performance of the dual Xeon X7560 is 31% better than that of the Xeon X5670 (91.8 vs. 70.0), its OLTP performance is almost 70% (!) better. Another indication is consistency: the differences between the VMs are much smaller on the dual Xeon X7560.

The AMD systems show a similar picture. The dual Opteron 6136, despite its decent 2.4GHz clock speed, offers the lowest OLTP performance, as its 16 cores give the scheduler the fewest threads to work with. The dual 6174 runs at a 9% lower clock speed but offers 24 cores. The result is that the OLTP VM performs a lot better (more nearly “perfect” co-scheduling is possible): we measured 57% higher OLTP performance. The OLTP VM was even faster on the dual 6174, with its 24 “real” cores, than on the Xeon X5670. Although this is only circumstantial evidence, we have strong indications that transactional workloads favor high core and thread counts.

Our measurements show that the quad Xeon X7560 is about 2.3 times faster than the best dual-socket platforms. That makes one quad Xeon X7560 a very interesting alternative for every two dual-CPU servers you plan to buy for virtualization consolidation.
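
That factor follows directly from the vApus Mark II table above; a quick sanity check on the arithmetic:

```python
# Scores taken from the vApus Mark II table above.
quad_x7560 = 162.7  # quad Xeon X7560, 4 tiles
dual_x5670 = 70.0   # best dual Xeon
dual_6174 = 67.5    # best dual Opteron
print(f"vs. dual X5670: {quad_x7560 / dual_x5670:.2f}x")  # ~2.32x
print(f"vs. dual 6174:  {quad_x7560 / dual_6174:.2f}x")   # ~2.41x
```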

vApus Mark II Conclusion
Comments

  • fynamo - Wednesday, August 11, 2010 - link

    WHERE ARE THE POWER CONSUMPTION CHARTS??????

    Awesome article, but complete FAIL because of lack of power consumption charts. This is only half the picture -- and I dare to say it's the less important half.
  • davegraham - Wednesday, August 11, 2010 - link

    +1 on this.
  • JohanAnandtech - Thursday, August 12, 2010 - link

    Agreed. But we only got a comparable system a few days before this article was due to be posted, so I kept the power consumption numbers for the next article.
  • watersb - Wednesday, August 11, 2010 - link

    Wow, you IT Guys are a cranky bunch! :-)

    I am impressed with the vApus client-simulation testing, and I'm humbled by the complexity of enterprise-server testing.

    A former sysadmin, I've been an ignorant programmer for lo these past 10 years. Reading all these comments makes me feel like I'm hanging out on the bench in front of the general store.

    Yeah, I'm getting off your lawn now...
  • Scy7ale - Wednesday, August 11, 2010 - link

    Does this also apply to consumer HDDs? If so, is it a bad idea to have an intake fan in front of the drives to cool them, as many consumer/gaming cases have now?
  • JohanAnandtech - Thursday, August 12, 2010 - link

    Cold air comes from the bottom of the server aisle, sometimes as low as 20°C (68F), and gets blown at high speed over the disks. Several studies now show that this is not optimal for a HDD. In your desktop, the air blown over the HDD should be warmer, as the fans normally spin more slowly. But yes, it is not good to keep your hard disk at temperatures lower than 30°C; use HDSentinel or SpeedFan to check on this. 30-45°C is acceptable.
  • Scy7ale - Monday, August 16, 2010 - link

    Good to know, thanks! I don't think this is widely understood.
  • brenozan - Thursday, August 12, 2010 - link

    http://en.wikipedia.org/wiki/UltraSPARC_T2
    2 sockets =~ 153GHz
    4 sockets =~ 306GHz
    Like the T1, the T2 supports the Hyper-Privileged execution mode. The SPARC Hypervisor runs in this mode and can partition a T2 system into 64 Logical Domains, and a two-way SMP T2 Plus system into 128 Logical Domains, each of which can run an independent operating system instance.

    why SUN did not dominate the world in 2007 when it launched the T2? Besides the two built-in 10G Ethernet controllers, they had the most advanced architecture that I know of; see
    http://www.opensparc.net/opensparc-t2/download.htm...
  • don_k - Thursday, August 12, 2010 - link

    "why SUN did not dominate the world in 2007 when it launched the T2?"

    Because it's not actually that good :) My company bought a few T2s, and after about a week of benchmarking and testing it was obvious that they are very, very slow. Sure, you get lots and lots of threads, but each of those threads is oh so very slow. You would not _want_ to run 128 instances of Solaris, one on each thread, because each of those instances would be virtually unusable.

    We used them as webservers... good for that. Or as file servers where you don't need to do any CPU-intensive work.

    The theory is fine and all, but you obviously have never used a T2, or you would not be wondering why it failed.
  • JohanAnandtech - Thursday, August 12, 2010 - link

    "http://en.wikipedia.org/wiki/UltraSPARC_T2
    2 sockets =~ 153GHz
    4 sockets =~ 306GHz"

    You are multiplying threads by clock speed. IIRC, the T2 is a fine-grained multithreaded CPU where 8 (!!) threads share the two pipelines of *one* core.

    Compare that with the Nehalem core, where 2 threads share 4 "pipelines" (sustained decode/issue/execute/retire per cycle). So basically, a dual-socket T2 is nothing more than 16 relatively weak cores which can execute at most 2 instructions per clock cycle, or 32 instructions per cycle in total. The only advantage of having 8 threads per core is that (with enough independent software threads) the T2 is able to come relatively close to that kind of throughput.

    A dual six-core Xeon has a maximum throughput of 12 cores x 4 instructions, or 48 instructions per cycle. As the Xeon has only 2 threads per core, it is less likely that the CPU will ever come close to that kind of output (in business apps). On the other hand, it performs excellently when you have some dependent threads, or simply not enough threads running in parallel. The T2 will only perform well if you have enough independent threads.
