Expensive Quad Sockets vs. Ubiquitous Dual Sockets

Author: Johan De Gelas

by Johan De Gelas on October 6, 2009 1:00 AM EST

Posted in
IT Computing

AMD's dual and quad platform: consistency

AMD's PR is making a lot of noise about consistency, and rightly so. The quad socket and dual socket processors are - besides the obviously different multiprocessor capabilities - exactly the same. In the case of virtualization, this allows you to optimize your virtual machines and hypervisors once and then clone them as much as you like. There are fewer worries when moving virtual machines around, and there is no fiddling with masking processor capabilities. This is also well illustrated when you check which mode the VMware ESX virtual machines run in. The table is pretty simple when you look at VMs running on top of an AMD processor: virtual machines running on dual-core Opterons will use software virtualization, while those on the quad-cores will almost always run in the fastest mode (hardware virtualization combined with hardware-assisted paging). The same is true for Hyper-V: it won't run on the dual-core Opterons and it will run at full speed on the quad-cores. It is remarkably simple compared to the complete mess Intel made: some of the old Pentium 4 based CPUs support VT-x, some don't; some of the lower-end Xeons launched in 2007 and 2008 don't either, and so on.

There is some inconsistency on HyperTransport and L3 cache speeds, but those will only cause small performance variations and no software management troubles. Of course, AMD's very consistent dual and quad socket platform is not without flaws either. The NVIDIA MCP55 Pro chipset was at times pretty quirky when installing new virtualization software. Most of the time, a patch took care of that, and the Opteron servers were running rock solid afterwards, but in the meantime a lot of valuable time was wasted. Also, the current platform has not evolved for years and is starting to show its age: we found out that the motherboards consume a bit more power than they should. In 2010, all Opteron server platforms will use AMD chipsets only.

The core part of the new hex-core Opteron is identical to that of the quad-core, but the "uncore" part has some improvements. With the exception of the 2.8GHz 2387/8387 and 2.9GHz 2389/8389, most quad-core Opterons still connect with 1GHz HyperTransport links. The hex-core Opteron runs with speeds between 2 and 2.4GHz. The hex-core Opteron always connects to the other CPUs in the server via 2.4GHz HyperTransport links. That makes little difference in a 2P server, but in a 4P server performance quickly becomes limited by interconnect speed. Even at 2.4GHz (9.6GB/s interconnect), probe broadcasting can limit performance, and that is why you can reserve up to 1MB of cache for a snoop filter. These improvements make the hex-core Opteron a more interesting choice than the quad-core Opterons - even at lower clock speeds - for quad socket servers.
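
As a rough check on the interconnect figures, the peak one-way bandwidth of a HyperTransport link can be estimated from its clock. This is only a sketch: it assumes a 16-bit link that transfers data on both clock edges, neither of which is stated in the article, and the function name is purely illustrative.

```python
def ht_link_bandwidth_gbs(clock_ghz, link_width_bits=16, transfers_per_clock=2):
    """Estimate the peak one-way bandwidth of a HyperTransport link in GB/s.

    Assumptions (not from the article): a 16-bit wide link and two
    transfers per clock cycle (double data rate).
    """
    bytes_per_transfer = link_width_bits / 8
    return clock_ghz * transfers_per_clock * bytes_per_transfer


# Compare the 1GHz links of most quad-core Opterons with the
# 2.4GHz links of the hex-core Opteron.
for clock_ghz in (1.0, 2.4):
    bandwidth = ht_link_bandwidth_gbs(clock_ghz)
    print(f"{clock_ghz:.1f}GHz link: {bandwidth:.1f} GB/s per direction")
```

Under those assumptions the 2.4GHz link works out to 9.6GB/s, in line with the figure quoted for the hex-core Opteron, against 4GB/s for a 1GHz link.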

In fact, we feel that besides the very low power Opteron 2377 EE, the quad-core Opterons are of little use. If your application scales relatively badly, there is the X55xx series which offers much better "per thread" performance. If your application scales well, two 2.6GHz Opteron 2435 will offer 15% better (and sometimes more) performance than a 2.9GHz Opteron 2389 with the same power consumption. Using relatively "old" technology such as DDR2, the hex-core Opteron based servers are very affordable, especially if you compare them with similar Xeon servers.

The Intel Dual socket platform: pricey performance and performance/watt champion

We have already tested the new dual socket "Nehalem" Xeon platform. It is the platform with the fastest interconnects, the most threads per socket (thanks to Hyper-Threading), the most bandwidth (triple-channel) and the most modern virtualization features (Intel VT-d). Even the top models are far from power hogs: at full load, the X5570 offers an excellent performance/watt ratio. The low-power L5520 at 2.26GHz was a real champion in our performance per watt tests and is available at reasonable prices.

The relatively new platform (chipset, DDR3) is still on the expensive side: a similarly configured Dell R710 (two Xeon 5550 2.66GHz, 8 x 4GB 1066MHz DDR3) costs about one third more than a Dell R805 (two Opteron 2435, 8 x 4GB 800MHz DDR2): $5047 versus $3838 (pricing at the end of September 2009). If you choose the Xeon platform, you should be aware that Intel's low end is much less interesting: the best Xeon 55xx CPUs have a clock speed between 2.26 and 2.93GHz. The low-end models, the 5504 and 5506, are pretty crippled, with no Hyper-Threading, no Turbo Boost, and only half as much L3 cache (4MB). These crippled CPUs can keep up with the quad-core Opterons at about 2.5GHz, but they are the worst Xeons when you look at idle and full load power. The performance per watt of the Xeon E550x is pretty bad compared to the more expensive parts.
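
To make the "about one third more" figure concrete, the premium can be worked out directly from the two quoted prices. This is just a sketch of the arithmetic; the function and variable names are illustrative, and the prices are the end of September 2009 figures given in the article.

```python
def price_premium(price, baseline):
    """Return how much more `price` is than `baseline`, as a fraction of `baseline`."""
    return (price - baseline) / baseline


# Prices quoted in the article (end of September 2009).
r710_usd = 5047  # Dell R710, two Xeon 5550, 8 x 4GB DDR3
r805_usd = 3838  # Dell R805, two Opteron 2435, 8 x 4GB DDR2

premium = price_premium(r710_usd, r805_usd)
print(f"R710 premium over R805: ${r710_usd - r805_usd} ({premium:.1%})")
```

The difference comes to $1209, or roughly 31.5% on top of the R805 price.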

The Intel Quad socket platform

There is no quad socket version of Intel's excellent "Xeon Nehalem" platform. We will have to wait until the Nehalem-EX servers ship in the beginning of 2010. At that time, servers with the octal-core 24MB L3 cache CPU will almost certainly end up in a higher price class than the current quad socket servers. One indication is that Intel positions the Nehalem-EX as a RISC market killer. Then again, Intel might as well bring out quad-core versions too. We will have to wait and see.

So there's no Hyper-Threading, Turbo Boost, EPT, NUMA, or fast interconnects for the current Xeon "Dunnington" platform, which is still based on a "multi independent FSB" topology. It has massive amounts of bandwidth in theory (up to 21GB/s), but unfortunately less than 10GB/s is really available. Snooping traffic consumes lots of bandwidth and increases the latency of cache accesses. The 16MB L3 cache should lessen the impact of the relatively slow memory subsystem, but it is only clocked at half the clock speed of the core. A painful 100 cycle latency is the result, but luckily every two cores also share a fast 3MB L2 cache.

When it was first launched, the Xeon MP defeated the AMD alternatives by a good margin in ERP and heavy database loads. It reigned supreme in TPC-C and broke a few new records. More importantly, it took back 9% of market share in the quad socket market according to the IDC Worldwide Server Tracker. But at that time, the 2.66GHz hex-core had to compete with a 2.5GHz quad-core Opteron with a paltry 2MB of shared L3, and AMD has been working hard on a comeback. The massive Intel chip (503 mm²) now has to face a competitor that has three times as much L3 cache and 50% more cores than that older Opteron, at higher clock speeds, and that is not all: the DDR2-800 DIMMs deliver up to 42GB/s or four times as much bandwidth to the four AMD chips. At the same time, the Xeon behemoth has to outpace the ultra modern Dual Xeon platform by a decent margin to justify its much higher price.

32 Comments

blasterrr - Thursday, January 28, 2010 - link
How about Itanium 2 benchmarks?
We use Itanium 2 in our company for our SAP systems. I'd like to compare Itanium 2 performance with x86 performance.
does anyone know which architecture is better for most sap applications?
joekraska - Thursday, October 8, 2009 - link
Gentlemen,

I run a large virtualization enterprise for a Fortune 500 company. The platform of choice for virtualization is two socket systems. There are several reasons for this. First, VMware charges roughly $2600 per socket. Second, 4 socket systems don't generally double the performance of two 2 socket systems. Third, 4 socket systems cost significantly more than two socket systems. Finally, the best 2 socket systems for virtualization have a large number of DIMM slots per CPU (e.g., our choice: Dell M/R710, 9 DIMM slots per CPU, or theoretically Cisco UCS 250, 24 slots per CPU), and virtualization enterprises want memory. VMware doesn't charge you for the amount of memory you install, and that's what you need: memory.

As an aside I favor 2 socket systems categorically. If and only if someone has a high-count single system SMP need would I consider or permit anything else. 4 socket systems cost too much for what you get. It requires a problem that can not be solved without one to justify the investment.

Joe Kraska
San Diego CA
USA
solori - Wednesday, October 7, 2009 - link
The vAPUS tile graphics marked as 2345 are really 2435... What happened to the 2389 in those tests?
solori - Wednesday, October 7, 2009 - link
Johan,

Good follow-up to your earlier comparisons. A lot of work goes into these things and your team has done well compiling the information here. I have just a few comments:

With respect to the VMmark reference, you've taken a vector value (X@Y) and made a scalar out of it. The performance number (X) is granted across a number of VMs running (Y, in tiles) which, in turn, helps to increase the scalar part you refer to as "speed" (i.e. 13% slower, etc.). In fact, your speed component could be determined by taking X/Y and looking at the "tile ratio" to determine unit performance per tile. In doing so, you should see the "performance" gap close a bit.

This evaluation method also lends itself to what VMmark was created to achieve - a determination of performance as the platform scales across VMs. In other words, the implication of VMmark is that a system cannot scale due to its constituent applications being thread bound. By employing virtualization, the net number of active threads is maximized with little degradation on the per-application performance. When resource availability is impacted, the number of application groups (tiles) is at its maximum.

Perhaps a significant reason VMmark and vAPUS differ so widely is that VMmark creates a case for resource exhaustion, while vAPUS' use of resources is more arbitrary. Fitting a benchmark to the available resources for one system seems very hard to avoid, and your attention to hex-core versus quad-core scheduling is right on point, hence the significant difference in vAPUS.1 results. Kudos again for taking that into account - it is something that systems architects need to be more aware of and a lot of benchmarks step around.

One interesting result of the 24 vCPU case is that the difference between 2P,12-core opteron and 2P,16-thread Xeon is down to the ratio of their clock speeds. Likewise, the difference between the 4P and 2P cases would indicate that the number of vAPUS tiles could have been increased for those systems.

The issue still puzzling me about vAPUS is the sizing of the OLTP VMs. On the AMD machines you have in the lab, you could easily use the full database size with memory to spare and increase the size of RAM to the VMs accordingly. Doing so in the memory-cramped EP box would likely cripple its performance, but produce an admittedly more "real world" result. We don't see databases getting smaller out there anytime soon, nor do we see them being split-up to fit nominal hardware... The 24GB Xeon is kind of base-model compared to the 64GB Opterons - you might want to reconsider your testing policy where that's concerned.

On the virtualization use case, you cannot divorce the CAPEX economics of "right-sizing" your memory component. Too little memory and the Xeon has more threads than you can practically use. Too much memory, and you either out-strip the thread capacity (AMD or Intel) or get into higher $/VM due to memory costs. With 8GB/DDR2 about 1/2 the cost of 8GB/DDR3 (reg, ECC for both), you are looking at memory being the largest single factor between 5500's and 2400's where $/VM is concerned. Mixing consolidation and performance workloads across a VMM cluster (i.e. DRS in VMware) makes the value of additional GB/core per platform important.

Likewise, if you look at mature virtualization market approaches - rack or blade systems - you will not see many 2P systems force-fitted into 4P use cases. Likewise, you will not see 4P systems used where 2P systems would suffice. Therein, the advantage to having a 2P and 4P eco-system that supports seamless migration (i.e. vMotion or Live Migration) requires (today) coming down in one camp or the other. In this case, the advantage lies with AMD (for now), and your report shows that to be a decent choice.

I agree that EX will create a significant price gulf with EP and likely not help the Intel case in the 4P virtualization use case. With AMD's Magny-Cours on track for Q1/2010 in 2P and later 4P (same basic platform), use cases today that are solid 4P Istanbul contenders have a drop-in in 2P Magny-Cours, with solid enhanced migration capabilities. This can't do anything but put pressure on Intel to create a 4P competitor in both capability and price for AMD's offering.

We've done significant research in terms of CPU/memory pairings to find the "sweet spot" in $/VM which points to a lag in the market for consolidation utilization (or at least market intelligence). If the "typical" utilization scenario is 12-18 VM's, it is clear from your results that a significant amount of potential is wasted in either Nehalem or Istanbul platform. To maximize return, $/VM and Watt/VM must be considered in the deployment, pushing those numbers up by at least 50% per host. That said, memory re-enters the equation as a limiting factor - well beyond the 20GB in today's vAPUS test case.

As for the hex-core Xeon, the writing was on the wall in the virtualization use case as 8P/quad-core Opterons have proven all but equal on performance (about 95%) to 8P/hex-core Xeons. Dunnington's power use did not help its cause either...

Like you indicated in your piece, specialized systems like Twin2 and blades create a better performance/watt opportunity for both 5500 and 2400 platforms (especially with Fiorano and SR5600 socket-F options). Perhaps a great follow-up for this series would be a Twin2 comparison of the 5500 and 2400 variants...

Collin C. MacMillan
Solution Oriented LLC
http://blog.solori.net
JohanAnandtech - Thursday, October 8, 2009 - link
Hi Collin,

There is too much interesting stuff in your reaction to address every good point you make, so I will take a bit more time to digest this and send you an e-mail.

A few things off the top of my head. Yes, a 64 GB Quad Opteron machine using only 20 GB or so is not optimal. At the same time we verify DQL (Disk Queue Length), so we are pretty sure that you are not going to gain much from making the cache larger. I'll check, maybe we simplified too much there. The reason for doing this is keeping things simple, as it is already hard enough to control the complexity of virtualized benchmarking. It is a good suggestion to increase the cache size of the OLTP component for systems with larger amounts of memory, I'll think about it.

The resource exhaustion as done by VMmark is not perfect either, as you might be going for maximum throughput at the cost of the response time of individual applications. It is a pretty hard exercise. I guess we'll have to set a certain SLA: a max response time for each app, and then measure total throughput.
skrewler2 - Wednesday, October 7, 2009 - link
Why do you never use a Sun box for your benchmarks?
JohanAnandtech - Wednesday, October 7, 2009 - link
Which Sun box do you have in mind? And of course, like everyone, we are waiting to see what Oracle will do with Sun. While the Sun people used to send us test servers quite a few times a year or two ago, it has been very quiet the past year.
duploxxx - Wednesday, October 7, 2009 - link
Great article. However, it might have been a bit more interesting if you also started adding price ranges. Comparing the best parts all the time is nice, but many people start to think that whatever version they buy will always be a better choice for them because they saw the highest benchmarks.
duploxxx - Wednesday, October 7, 2009 - link
edit, wasn't finished yet :)

knowing that you can buy a 2s E5530 2.4ghz system at the same price as a 2s 2435 2.6GHZ might bring already a whole different perspective.

Also comparing 2s against 4s is nice and I really like your virtual benchmark; it gives much more realistic results, just as we have seen in our own software benchmarking that VMmark is no longer representative of the real world. You still can't compare power consumption. First of all, LP DIMMs now cost as much as normal DIMMs, and secondly you only require 20GB RAM in your test as you mentioned, so 44GB is wasted but is still consuming a lot of power.

Choosing between 2s and 4s is a difficult choice. We deploy about 400-500 2s servers a year on VM, preferring more availability over bigger servers. 4s also needs a lot more fine tuning on IO than a 2s does, for sure if you use DRS, and hits the farm much harder on HA failure, etc. We started on AMD in 2004 and are not moving back to Intel just because they now have 1 decent server platform. As mentioned, check price/performance and think again about whether the 55xx are so far superior to the 24xx series if you buy mid-level servers like 90% of the server market does. Oh, and we like Enhanced vMotion of course.
SLIM - Tuesday, October 6, 2009 - link
Any thoughts on the effect of AMD's new server chipset on VMmark performance (http://www.amd.com/us/products/server/platforms/Pa...)? They claim it will help with I/O and particularly virtualized I/O performance.
