Original Link: http://www.anandtech.com/show/3846/quad-xeon-7500-the-best-virtualized-datacenter-building-block-



21st Century Server Choices

Lots of people base their server form factor choice on what they are used to buying. Critical database applications equal a high-end server. Less critical applications: midrange server. High-end machines used to find a home at larger companies and cheaper servers would typically be attractive to SMEs. I am oversimplifying but those are the clichés that pop up when you speak of server choices.

Dividing the market into who should or should not buy high-end servers is so... 20th century. Server buying decisions today are a lot more flexible and exciting for those who keep an open mind. In the world of virtualization your servers are just resource pools of networking, storage and processing. Do you buy ten cheap 1U servers, four higher performance 2U, one “low cable count” blade chassis, or two high-end servers to satisfy the needs of your services?

A highly available service can be set up with cheap and simple server nodes, as Google and many others show us every day. On the flip side, you might be able to consolidate all your services on just a few high-end machines, reducing the management costs while at the same time taking advantage of the advanced RAS features this kind of machine offers. It takes a detailed study to determine which strategy is the best one for your particular situation, so we are not saying that one strategy is better than all the others. The point is that the choice between cheap clustered nodes and only a few high-end machines cannot be made by simply looking at the size of the company you are working for or the "mission critical level" of your service. There are corner cases where the choice is clear, but that is not the case for the majority of virtualized datacenters.

So is buying high-end servers as opposed to buying two or three times more 2-socket systems an interesting strategy for your virtualized cluster if you are not willing to pay a premium for RAS features? Until very recently, the answer was simple: no. High-end quad socket systems were easily three or more times as expensive and never offered twice the performance of dual socket systems. There are many reasons for that. If we focus on Intel, the MP series was always based on mature rather than cutting-edge technology. Also, quad socket systems have more cache coherency overhead, and the engineering choices favor reliability and expandability over performance. That results in slower but larger memory subsystems and sometimes lower clock speeds too. The result was that the performance advantage of the quad system was in many cases minimal.

At the end of 2006, dual Xeon X5300 systems were more than a match for the quad Xeon X7200 systems. And recently, dual Xeon 5500 servers made the massive Xeon 7400 servers look slow. The most important reason why these high-end systems were still bought was their superior RAS features. Other reasons include the fact that some decision makers never really bothered to read the benchmarks carefully and simply assumed that a quad system would automatically be faster since that is what the OEM account manager told them. You cannot even blame them: a modern CIO has to bury his head in financial documents, must solve HR problems, and is constantly trying to explain to upper management why the complex IT systems are not aligned with the business goals. Getting the CIO down from the “management penthouse” to the “cave down under”, also called the datacenter, is no easy task. But I digress.

Virtualization can shatter the old boundaries between midrange and high-end servers. High-end servers can become interesting for the rest of us, the people who do not normally consider these expensive systems. The condition is that the high-end systems can consolidate more services than the dual socket systems, so performance must be much better. How much better? If we just focus on capital investment, we get the figures below.

Type | Server | CPUs | Memory | Approx. Price
Midrange | Dell R710 | 2x X5670 | 18 x 4GB = 72GB | $9000
Midrange | Dell R710 | 2x X5670 | 16 x 8GB = 128GB | $13000
High-end | Dell R910 | 4x X7550 | 64 x 4GB = 256GB | $32000

So these numbers seem to suggest that we need roughly 2.5 to 3.5 times better performance. In reality, that does not need to be the case. The TCO of two high-end servers is most likely a bit better than that of four midrange servers. The individual components like the PSU, fans, and motherboard should be more reliable and thus result in less downtime and less time spent on replacing those components. Even if that is not the case, it is statistically more likely that a component fails in a cluster with more servers, and thus more components. Fewer cables and fewer hypervisor updates should also help. Of course, the time spent in managing the VMs is probably more or less the same.
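To make the capital investment comparison concrete, here is a minimal Python sketch that derives the price ratios from the approximate prices in the table above. The dictionary keys and the helper names `price_ratio` and `price_per_gb` are our own illustrative choices; the sketch looks only at purchase price and ignores every TCO factor discussed in the text.

```python
# Approximate prices and memory sizes copied from the table above.
SERVERS = {
    "R710 72GB": {"price": 9000, "memory_gb": 72},
    "R710 128GB": {"price": 13000, "memory_gb": 128},
    "R910 256GB": {"price": 32000, "memory_gb": 256},
}


def price_ratio(high_end, midrange):
    """How many times more expensive the high-end server is (illustrative helper)."""
    return SERVERS[high_end]["price"] / SERVERS[midrange]["price"]


def price_per_gb(name):
    """Purchase price divided by installed memory, in dollars per GB."""
    server = SERVERS[name]
    return server["price"] / server["memory_gb"]


if __name__ == "__main__":
    for midrange in ("R710 72GB", "R710 128GB"):
        ratio = price_ratio("R910 256GB", midrange)
        print(f"R910 256GB vs {midrange}: {ratio:.2f}x the price")
    for name in SERVERS:
        print(f"{name}: ${price_per_gb(name):.2f} per GB of RAM")
```

On purchase price alone, the ratio is the performance factor the high-end machine would have to deliver to break even against that midrange configuration.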

While a full TCO calculation is not the goal of this article, it is pretty clear to us that a high-end system should outperform the midrange dual socket systems by at least a factor of two to be an economical choice in a virtualization cluster where hardware RAS capabilities are not the only priority. There is a strong trend toward guaranteeing the availability of the (virtual) machine with easy to configure and relatively cheap software techniques such as VMware’s HA and fault tolerance. The availability of your service is then guaranteed by using application level high availability such as Microsoft’s clustering services, load balanced web servers, Oracle fail-over, and other similar (but still affordable) techniques.

The ultimate goal is not keeping individual hardware running but keeping your services running. Of course hardware that fails too frequently will place a lot of stress on the rest of your cluster, so that is another reason to consider this high-end hardware... if it delivers price/performance wise. Let us take a closer look at the hardware.



The 32-Core, 64-Thread Beast: QSSC-S4R

The heavy—50kg—QSSC-S4R server found its way to our lab. The ODM (Original Design Manufacturer) is the Taiwanese firm Quanta, who designed the server jointly with Intel. The 4U server is equipped for maximum expandability with 10 PCIe slots, quad gigabit Ethernet onboard, and 64 DIMM slots.

The enormous number of DIMM slots is a result of the use of eight separate memory boards. Each memory board has two memory buffers and eight DIMMs onboard.

A 7+1 hot-swap, redundant fan module setup cools this system down. Notice that the disk system is not in front of the cooling as in most server systems. That is a plus, as the disks should not get the coldest air: disks perform best with medium temperatures (30-40°C, 86-104°F) as the lower viscosity of the grease in the rotation motor puts less stress on the rotating components. Google’s study also suggests that disks should be kept at a higher temperature than the rest of the server.

The CPUs and DIMMs however should be kept as cool as possible to reduce the leakage power. The fans are well positioned: the memory boards and the heatsinks of the CPUs right behind them get the coolest air. In the back of the server you find the motherboard. You can see that the heatsinks on the 7500 chipset receive extra airflow.

Four 850W high efficiency power supplies feed this massive machine in a 2+2 or 3+1 configuration. You can find more detailed information about this QSSC-S4R server here. The other benchmarked configurations are identical to those described on this page.



Nehalem EX Confusion

One of the reasons that the Xeon X7560 did not show its full potential at launch was a small error in the firmware of the Dell R810 testing platform. This caused the memory subsystem to underperform. As a result some of the bandwidth sensitive benchmarks, including many HPC applications, were not performing optimally. Intel claimed that a dual CPU config should be able to reach 39GB/s, and a quad CPU configuration should reach up to 70GB/s. We could not reach those STREAM numbers because we test with our somewhat older STREAM binary, as described here. Using the same STREAM binary as before allows us to compare our findings with all our previous measurements.

We reran our stream benchmarks on the new QSSC-S4R server system.

Stream TRIAD on 64 bit Linux—maximum threads
* New measurements.

The new results tell us that available memory bandwidth is about 21% higher (29GB/s) than what we previously measured on the Dell R810 (24GB/s). That means that many benchmark results published at the launch of the Xeon 7500 and obtained on the Dell R810 were too low, especially the HPC ones. The Xeon X7560 will not be able to beat the quad Opteron 6174 when it comes to raw bandwidth, but it is far from a bandwidth starved platform.



Stress Testing the High End

Our previous vApus Mark I gave an idea of how well systems perform when running several virtualized “heavy duty applications”: complex network bandwidth gobbling web servers, large OLAP databases, and write intensive OLTP databases. Our benchmark was mostly based on vApus, a software client that fires off requests as if real users were stressing the server. Several client machines each run a vApus “slave” instance, and a “master” vApus instance manages them (for example, starting tests in sync) and collects the end results.

The first version of vApus had several limitations: it could simulate a maximum of about 1500 users per client (a limit of 32-bit Windows based software) and the number of clients that could be kept in sync was also limited. In the meantime, the core count of the servers that we test has been increasing at an almost ridiculous pace. When the first lines of vApus were written (at the end of 2006), eight-core servers were considered the high end. Only four years later we are now looking at 64-thread and 48-core monsters. Our ambitious way of benchmarking (simulating real-world users, not scripting benchmarks) resulted in scalability problems.

The lead developer of vApus, Dieter Vandroemme, decided to take all the lessons learned from 2.5 years of vApus development and apply them to a new vApus, built from scratch. Based on a new .NET 4.0 and 64-bit Windows foundation, and after spending a lot of time on software tuning, Dieter came up with a new vApus client that was capable of producing 10,000 threads in about 3.5 seconds; up to 15,000 threads can be active on one client. If you know that every simulated user needs one thread, you’ll understand why this is very cool: we can now test extremely strong servers with only one humble client. A Core i7-750 (2.66GHz) needs only 20% CPU load to sustain 15,000 “users” sending off SQL statements to the server. Our mighty 64-thread, 32-core quad Xeon X7560 at 2.26GHz was brought to its knees, as you can see below.

We were excited to see this happen: finally we tamed the beast with 64 threads. Yes, you can easily stress out a server with HPC benchmarks such as Linpack or SPECfp, but measuring the potential of a server using popular business software is no easy feat. We had to deal with severe thread contention on the client side, for example. With several vApus instances, we are now ready to test the strongest servers including those coming out in the next few years. We are even able to stress test complete clusters of modern servers with just a few clients.
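The one-thread-per-simulated-user model can be illustrated with a small Python sketch. This is not vApus code (vApus is a .NET application and its internals are not shown in this article); `fake_request`, `run_load_test`, and the sleep-based stub are hypothetical stand-ins that only show the general shape of such a client: start all users in sync, let each one fire requests, and collect response times.

```python
import threading
import time


def fake_request(user_id):
    """Hypothetical stand-in for one SQL statement or HTTP request."""
    time.sleep(0.001)  # pretend the server needs about 1 ms to answer
    return user_id


def run_load_test(user_count, requests_per_user, request_fn=fake_request):
    """Run one thread per simulated user; return all response times in seconds."""
    response_times = []
    lock = threading.Lock()
    # The barrier releases every user thread at the same moment.
    start_together = threading.Barrier(user_count)

    def user(user_id):
        start_together.wait()
        local_times = []
        for _ in range(requests_per_user):
            started = time.perf_counter()
            request_fn(user_id)
            local_times.append(time.perf_counter() - started)
        # Merge once per user to keep lock contention on the client low.
        with lock:
            response_times.extend(local_times)

    threads = [threading.Thread(target=user, args=(i,)) for i in range(user_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return response_times


if __name__ == "__main__":
    times = run_load_test(user_count=50, requests_per_user=5)
    mean_ms = sum(times) / len(times) * 1000
    print(f"{len(times)} requests, mean response time {mean_ms:.2f} ms")
```

Each user collects its timings locally and merges them once at the end, which is one simple way to limit the kind of client-side thread contention mentioned above.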

vApus' ultimate goal is not to stress servers to their maximum; we use it mostly for measuring response time at a given workload and to test stability of applications. But of course, we could not resist the chance to use it as a benchmark too. It was time to build a new benchmark, and vApus Mark II was born.



vApus Mark II

vApus Mark II uses the same applications as vApus Mark I, but they have been updated to newer versions. vApus Mark II uses five VMs with three server applications:

  • One VM with the Nieuws.be OLAP database, based on SQL Server 2008 x64 running on Windows 2008 R2 64-bit, stress tested by our in-house developed vApus test.
  • Three MCS eFMS portals running PHP and IIS on Windows 2003 R2, stress tested by our in-house developed vApus test.
  • One OLTP database, based on the Swingbench 2.2 “Calling Circle” benchmark of Dominic Giles. We updated the Oracle database to version 11g R2 running on Windows 2008 R2.

All VMs are tested with several sequential user concurrencies. All VMs are “warmed up” with lower user counts. We measure only at the higher concurrencies, later in the test. At that point, results are repeatable as the databases are using their caches and buffers optimally.

The OLAP VM is based on the Microsoft SQL Server database of the Dutch Nieuws.be site, one of the newest web 2.0 websites launched in 2008. We updated to SQL Server 2008 R2. This VM now gets eight virtual CPUs (vCPUs), a feature that is supported by the newest hypervisors such as VMware ESX 4.0 and Xen 4.0. This kind of high vCPU count is one of the conditions that needs to be met before administrators will virtualize this kind of “heavy duty” application. The application hardly touches the disk, as the vast majority of activity is in memory during the test cycle. About 135GB of disk space is necessary, but the most used data is cached in about 4GB of RAM.

The MCS eFMS portal, a real-world facility management web application, has been discussed in detail here. It is a complex IIS, PHP, and FastCGI site running on top of Windows 2003 R2 32-bit. Note that these VMs run in a 32-bit guest OS, which impacts the VM monitor mode. We left this application running on Windows 2003, as virtualization allows you to minimize costs by avoiding unnecessary upgrades. We use three MCS VMs, as web servers are more numerous than database servers in most setups. Each VM gets two vCPUs and 2GB of RAM.

Since OLTP testing with our own vApus stress testing software is still in beta, our fifth VM uses a freely available test: "Calling Circle" of the Oracle Swingbench Suite. Swingbench is a free load generator designed by Dominic Giles to stress test an Oracle database. We tested the same way as we have tested before, with one difference: we use an OLTP database that is only 2.7GB (instead of 9.5GB). The OLTP test runs on Oracle 11g R2 64-bit on top of Windows 2008 Enterprise R2 (64-bit). Data is placed on an Intel X25-E SLC SSD, with logs on a separate SSD. This is done for each Calling Circle VM to avoid storage bottlenecks. The OLTP VM gets four vCPUs.

Notice that our total vCPU count per tile is 18 (8 + 3 x 2 + 4). The advantage of using 18 vCPUs per tile is that scheduling the virtual CPUs is not straightforward on almost any CPU configuration. You might remember from our previous testing that if the number of virtual CPUs is a multiple of the number of physical cores, the server gets a performance advantage over other systems.

Careful monitoring (ESXtop) showed us that four tiles of vApus Mark II (72 vCPUs) were enough to keep the fastest system at an average of 96.5% CPU utilization during performance measurements.
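The vCPU arithmetic above can be sketched in a few lines of Python. The logical CPU counts are the thread and core counts quoted elsewhere in this article; `overcommit_ratio` is a hypothetical helper name of our own and not a metric that ESX reports under that name.

```python
# vCPUs assigned to each VM in one vApus Mark II tile (from the text above).
TILE_VCPUS = {"OLAP": 8, "Web 1": 2, "Web 2": 2, "Web 3": 2, "OLTP": 4}

# Logical CPUs (hardware threads) per platform, as quoted in this article.
LOGICAL_CPUS = {
    "Quad Xeon X7560": 64,
    "Dual Xeon X7560": 32,
    "Dual Xeon X5670": 24,
    "Dual Opteron 6174": 24,
    "Dual Opteron 6136": 16,
}


def vcpus_per_tile():
    """Total vCPUs that one tile asks the scheduler for."""
    return sum(TILE_VCPUS.values())


def overcommit_ratio(tiles, logical_cpus):
    """vCPUs demanded per logical CPU available (hypothetical helper)."""
    return tiles * vcpus_per_tile() / logical_cpus


if __name__ == "__main__":
    print("vCPUs per tile:", vcpus_per_tile())
    for name, cpus in LOGICAL_CPUS.items():
        # The quad system ran up to four tiles, the dual systems two.
        tiles = 4 if name.startswith("Quad") else 2
        # Remainder when the logical CPU count is divided by 18.
        remainder = cpus % vcpus_per_tile()
        print(f"{name}: {tiles} tiles, "
              f"{overcommit_ratio(tiles, cpus):.3f} vCPUs per logical CPU, "
              f"remainder {remainder}")
```

A ratio above 1 means the tiles demand more vCPUs than there are logical CPUs, and a nonzero remainder shows that 18 does not divide evenly into any of these CPU counts.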



VMmark

Before we take a look at our own virtualization benchmarking, let us look at the currently (beginning of August 2010) available VMmark scores.

VMware VMmark

It is interesting to note that most AMD “Istanbul” Opteron servers benchmarked were using DDR2-667. That somewhat limited their VMmark scores, as consolidated virtualized servers have higher bandwidth demands than most “native running” servers. The dual Opteron 6176 has the same number of cores as the quad Opteron 8439. Those cores are also identical; only the uncore part has changed. So from a pure processing power point of view, the dual Opteron 6176 should be about 15% slower. In reality, the dual socket server is 3% faster than the older quad socket server. This shows that VMmark really benefits from the improved memory subsystem, as the support for DDR3-1333 memory essentially doubles the bandwidth and lowers latency. That still is not enough to beat the Intel armada, as the fastest “Westmere” Xeon is about 16% faster than the best Opteron “Magny-Cours”.

The quad Xeon X7560 leaves everything behind in VMmark, offering more than twice the performance of all dual configurations. Virtualization favors high core counts: you are running many different applications which do not have to exchange data most of the time. This reduces the thread synchronization overhead. Nonetheless, the scores that the Xeon X7560 gets are impressive. But of course, this is VMmark, an industry benchmark. The results also depend on how much time and effort is spent on tuning the benchmark. Since the introduction of the Xeon 7500 series, the VMmark scores have already improved by 7% (from 70.78 to 75.77). Let us check out vApus Mark II, where each platform is treated the same.

vApus Mark II

vApus Mark II—VMware ESX 4.0

The overall picture remains the same, although there are some clear differences. First of all, the “Magny-Cours Opteron” and “Westmere Xeon” are closer to each other. The difference between the two best server CPUs with a “decent” TDP is only 4%. But the surprise is the landslide victory of the X7560. Let us analyze the results in more detail.

For the OLAP test, we took a dual Xeon X5570 without Hyper-Threading as reference. The reason for this is that the VM got eight vCPUs, and we compare this with a native server that has eight cores. For the web test, we used two Xeon X5570 cores as reference, or a Xeon X5570 cut in two. The OLTP scores, obtained in a VM with four virtual CPUs, use the Swingbench scores of one Xeon X5570 as reference. The reason why we chose the Xeon “Nehalem” as reference is that this server CPU is the natural yardstick for all new server CPUs: it outperformed all contemporary server CPUs by a large margin at its launch (March 2009).

Let us take a look at the more detailed results per VM. The vApus Mark II score is a geometric mean of the different VMs.

CPU config | Tiles | OLAP (1 VM) | Web (3 VMs) | OLTP (1 VM) | vApus Mark II score
Dual 6174 | 2 | 57% | 30% | 22% | 67.5
Dual 6136 | 2 | 45% | 23% | 14% | 48.6
Dual 7560 | 2 | 58% | 51% | 32% | 91.8
Dual X5670 | 2 | 53% | 43% | 19% | 70.0
Dual L5640 | 2 | 48% | 33% | 15% | 57.6
Quad 7560 | 2 | 73% | 73% | 39% | 118.6
Quad 7560 | 4 | 47% | 50% | 29% | 162.7
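The text only says that the score is a geometric mean of the different VMs. As a rough sanity check, the Python sketch below assumes that the published score equals the number of tiles multiplied by the geometric mean of the three per-VM percentages. That formula is our own reading of the table, not something the article states, and it reproduces the published scores only to within about one point, presumably because the percentages shown are rounded.

```python
import math

# (tiles, [OLAP %, Web %, OLTP %], published score), copied from the table above.
RESULTS = {
    "Dual 6174": (2, [57, 30, 22], 67.5),
    "Dual 6136": (2, [45, 23, 14], 48.6),
    "Dual 7560": (2, [58, 51, 32], 91.8),
    "Dual X5670": (2, [53, 43, 19], 70.0),
    "Dual L5640": (2, [48, 33, 15], 57.6),
    "Quad 7560 (2 tiles)": (2, [73, 73, 39], 118.6),
    "Quad 7560 (4 tiles)": (4, [47, 50, 29], 162.7),
}


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1.0 / len(values))


def estimated_score(tiles, percentages):
    """ASSUMPTION: score = tiles x geometric mean of the per-VM percentages."""
    return tiles * geometric_mean(percentages)


if __name__ == "__main__":
    for name, (tiles, percentages, published) in RESULTS.items():
        estimate = estimated_score(tiles, percentages)
        print(f"{name}: estimated {estimate:.1f}, published {published}")
```

A geometric mean keeps one very high VM result from hiding a weak one: a platform only scores well if all three workloads run well.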

The ESX scheduler works with Hardware Execution Contexts (HECs), each of which maps to one logical (Hyper-Threading) or physical core. In our current test, more HECs are demanded than available, so this test is quite hard on the ESX scheduler. We still have to investigate why the OLTP scores are quite a bit lower than those of the other VMs. This VM is the most disk intensive and as such requires more VMkernel time than the others. This might explain why there is less processing power left for the application running inside the VM. Another reason is that this application requires more “co-scheduling”. In OLTP applications, threads rarely run independently, but have to synchronize frequently. In that case it is important that each virtual CPU gets equal processing power. If one vCPU gets ahead of the others, this may result in a thread waiting longer than necessary for another to release a spinlock.

Although ESX 3.5 and 4.0 feature “relaxed co-scheduling”, the best performance for this kind of application is achieved when the scheduler can “co-schedule” the syncing threads. The fact that the system with the highest logical core count gets the best percentages in the OLTP VM is another indication that the co-scheduling issue may play an important role. Notice how the dual Xeon X7560 with 32 threads does significantly better than the higher clocked dual Xeon X5670 (24 threads) when running the OLTP VM. While the overall performance of the dual Xeon X7560 is 31% better than that of the dual Xeon X5670 (91.8 vs 70), the OLTP performance is almost 70% (!) better. Another indication is consistency: the differences between the VMs are much smaller on the dual Xeon X7560.

The AMD systems show a similar picture. The 16-core 6136, despite the decent 2.4GHz clock speed, offers the lowest OLTP performance to its users as it has the fewest threads to offer the scheduler. The dual 6174 runs at a 9% lower clock speed but has 24 cores to offer. The result is that the OLTP VM performs a lot better (more “perfect” co-scheduling possible): we noticed 57% better OLTP performance. The OLTP VM was even faster on the Dual 6174 with its 24 “real” cores than on the Xeon X5670. Although this is only circumstantial evidence, we have strong indications that transactional workloads favor high core and thread counts.

Our measurements show that the quad Xeon X7560 is about 2.3 times faster than the best dual platforms. That makes one quad Xeon X7560 a very interesting alternative for each two dual CPU servers you wish to buy for virtualization consolidation.
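The relative figures quoted in this section follow directly from the results table. The short Python sketch below recomputes them; `percent_gain` and `speedup` are illustrative helper names of our own.

```python
def percent_gain(new, old):
    """How much higher `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


def speedup(new, old):
    """How many times faster `new` is than `old`."""
    return new / old


if __name__ == "__main__":
    # Dual X7560 vs dual X5670, overall vApus Mark II score (91.8 vs 70.0).
    print(f"Overall: {percent_gain(91.8, 70.0):.0f}% better")
    # Dual X7560 vs dual X5670, OLTP VM (32% vs 19%).
    print(f"OLTP, Xeon: {percent_gain(32, 19):.0f}% better")
    # Dual 6174 vs dual 6136, OLTP VM (22% vs 14%).
    print(f"OLTP, Opteron: {percent_gain(22, 14):.0f}% better")
    # Quad X7560 with four tiles vs dual X5670 (162.7 vs 70.0).
    print(f"Quad X7560 vs dual X5670: {speedup(162.7, 70.0):.1f}x")
```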



Conclusion

The first impression that the Xeon 7500 series made on the world was seriously blurred. Part of the reason is that the testing platform had a firmware bug that decreased the memory bandwidth by 20% and more. Another reason was the weird benchmarking choices of reviewers. Lightwave, Folding@home and Cinebench were somehow popular measuring sticks, portraying the Xeon X7560 as the more expensive and at the same time slower brother of the Xeon X5670. That kind of software is run mostly on sub $4000 workstations and cheap 1U server farms, and we seriously doubt that anyone in their right mind would spend $30,000 on a server to run this kind of workload.

Our own benchmarking was not complete either, as our virtualization benchmarking fell short of giving 32—let alone 64—threads enough work. Still, the impressive SAP S&D benchmark numbers, one of the most reliable and most relevant industry standard benchmarks out there, made it clear to us that we should give the Xeon X7560 another chance to prove itself.

Our new virtualization benchmark vApus Mark II shows that we should give credit where it is due: servers based on the X7560 are really impressive when consolidating services using virtualization. A quad Xeon X7560 can offer 2.3 times better performance than the best dual socket systems today! You might even call the performance numbers historic: for the first time, Intel’s multi-socket servers run circles around the dual socket servers. Remember how the quad Xeon 7200 hardly outperformed the dual Xeon 5300 at the end of 2006, and how the quad 7400 was humiliated by the dual Xeon X5500 in 2009? And if we go even further back in history, the Xeon MP never outperformed the dual socket offerings by a large margin. Memory capacity and RAS features were almost always the main selling points. For the first time, scalability is more than just a hollow phrase; a Xeon X7560 server can replace two or more smaller servers in terms of memory capacity and processing power.

The end result is that these servers can be attractive for people who are not the traditional high-end server buyers. Using a few quad Xeon X7560 servers instead of a lot of dual socket servers to consolidate your software services may turn out to be a very healthy strategy. Based on our current data, two quad Xeon X7560 servers ($65k-$70k) are worth about five Xeon 5600 servers ($50k-$65k). The acquisition costs are slightly higher, but you need fewer physical servers and that lowers the management costs somewhat. There are two questions that remain:

1) How bad or good is the power/performance ratio?

2) If RAS is not your top priority, does a quad Opteron 6174 make more sense?

A Dell R815 with four twelve-core Opteron 6174 processors has arrived in our labs. So our search for the best virtualization building block continues.

 

A big thanks to Tijl Deneut and Dieter Vandroemme.
