Original Link: http://www.anandtech.com/show/3894/server-clash-dellr815



The Quad Opteron Alternative

Servers with the newest Intel six-core Xeon hit the market in April. The fastest six-core Xeons were able to offer up to twice the performance of the six-core Opteron "Istanbul". The reason for this was that the age of the integer core in AMD's Opteron was starting to show. While the floating point part got a significant overhaul in 2007 with the AMD "Barcelona" quad-core chip, the integer part was a tuned version of the K8, launched back in 2003. This was partly compensated by large improvements in the multi-core performance scaling department: HT Assist, faster CPU interconnects, larger L3 caches, and so on.

To counter this lower per-core performance, AMD's efforts focused on the "Magny-Cours" MCMs that scaled even better thanks to HT 3.0 and four DDR3 memory controllers. AMD's twelve-core processors were launched at the end of March 2010, but servers based on these "Magny-Cours" Opterons were hard to find. So for a few months, Intel dominated the midrange and high-end server market. HP and Dell informed us that they would launch their "Magny-Cours" servers in June 2010. That is history now, and server buyers once again have an alternative to the ubiquitous Xeon servers.

AMD's strategy to make its newest platform attractive is pretty simple: be very generous with cores. For example, you get 12 Opteron cores at 2.1GHz for the price of a six-core Xeon at 2.66GHz (see our overview of SKUs). In our previous article, we measured that on average, a dual socket twelve-core Opteron is competitive with a similar Xeon server. It is a pretty muddy picture though: the Opteron wins in some applications, the Xeon in others. The extra DDR3 memory channel and the resulting higher bandwidth make the Opteron the choice for most HPC applications. The Opteron has a small advantage in OLAP databases, and the virtualization benchmarks are a neck-and-neck race. The Xeon wins in applications like rendering, OLTP and ERP, although again with a small margin.

But if the AMD platform really wants to lure away significant numbers of customers, AMD will have to do better than being slightly faster or slightly slower. There are many more Xeon based servers out there, so AMD Opteron based servers have to rise above the crowd. And they did: the “core generosity” didn’t end with offering more cores per socket. All 6100 Opterons are quad socket capable: the price per core stays the same whether you want 12, 24 or 48 cores in your machine. AMD says they have “shattered the 4P tax, making 2P and 4P processors the same price.”

So dual socket Opteron servers are OK, offering competitive performance at a slightly lower price most of the time. Nice, but not a head turner. The really interesting servers of the AMD platform should be the quad socket ones. For a small price premium you get twice as many DIMM slots and processors as a dual socket Xeon server. That means that a quad socket Opteron 6100 positions itself as a high-end alternative to a dual Xeon 5600 server. If we take a quick look at the actual pricing of the large OEMs, the picture becomes very clear.

Compared to the DL380 G7 (72GB) specced above, the Dell R815 offers twice the amount of RAM while offering—theoretically—twice as much performance. The extra DIMM slots pay off: if you want 128GB, the dual Xeon servers have to use the more expensive 8GB DIMMs.



Quad Opteron, Dell Style

Offering an interesting platform is one thing. The next challenge is to have an OEM partner that makes the right trade-offs between scalability, expandability, power efficiency and rack space. And that is where the Dell R815 makes a few heads turn: the R815 is a 2U server, just like the dual Xeon servers. So you get almost twice the number of DIMM slots (32) and twice the theoretical performance in the same rack space. Dell also limited the R815 to four 115W Opteron 6100 CPUs (a quad 137W TDP Opteron SE configuration is not possible). This trade-off should lower the demands on the fans and the PSU, thus benefiting the power efficiency of this server.

Compared to its most important rival, the HP DL585, it has fewer DIMM slots (32 vs. 48) and fewer PCIe slots. But it is again a balanced trade-off: the HP DL585 is twice as large (4U) and quite a bit pricier, 30 to 40% more expensive depending on the specific model. HP positions the quad Opteron DL585 right in the middle between the HP DL380 G7 (dual Xeon 5600) and the HP DL580 (quad Xeon 7500). The HP DL585 seems to be targeted at people who need a very scalable and expandable server but are not willing to pay the much higher price that comes with the RAS-focused Xeon 7500 platform.

Dell’s R815 is more in line with the “shattering the 4P tax” strategy: it really is a slightly more expensive, more scalable alternative to the Dual Xeon 5600 servers. Admittedly, that analysis is based on the paper specs. But if the performance is right and the power consumption is not too high, the Dell R815 may appeal to a lot of people that have not considered a quad socket machine before.

Most HPC people care little about RAS, as one node more or less in a large HPC cluster does not matter. Performance, rack space and power efficiency are the concerns, in that order of importance. The HPC crowd typically goes for 1U or 2U dual socket servers. But in the search for the highest performance per dollar, twice the processing power for a 30% higher price must look extremely attractive. So these dual socket buyers might consider the quad socket R815 anyway.



As a building block for a virtualized datacenter, the R815 makes a good impression on paper too: virtualized servers are mostly RAM limited. So if you do not want to pay the huge premium for 16GB DIMMs or Quad Xeon 7500 servers with their high DIMM slot counts, the R815 must look tempting.

In short, the quad Opteron 6100 based Dell R815 could persuade a lot of people on two conditions: the two extra CPUs must offer a tangible performance advantage, and they must do so with only a minor power increase. So can the Dell R815 offer a superior performance/watt ratio compared to the dual Xeon 5600 competition? That is what this article will try to find out. Let us take a closer look at the benchmarked configurations of the three competitors: the Dell PowerEdge R815, the HP Proliant DL380 G7 (dual Xeon X5670) and the Quanta QSSC-4R / SGI Altix UV 10.



Dell PowerEdge R815 Benchmarked Configuration

CPU: Four Opteron 6174 at 2.2GHz
RAM: 16 x 4GB Samsung 1333MHz CH9
Motherboard: Dell Inc 06JC9T
Chipset: AMD SR5650
BIOS version: v1.1.9
PSU: 2 x Dell L1100A-S0 1100W

The R815 is a very compact design: six fans in the middle of the chassis pull cool air across the four Opteron sockets and 32 DDR3 DIMM slots. Two risers offer two full height PCIe x8 slots each.



Two half-height PCIe x4 slots are also available. The server contains six drive bays, all in 2.5 inch format.



The Dell server line distinguishes itself with an LCD display that lets you read system alerts and boot-up options. The dual internal SD modules are unfortunately still only 1GB and thus only suited for ESXi. The 1100W hot-pluggable PSUs are the only PSUs available. The entire front with the disk bays can slide forward to give easy access to the first row of CPU sockets and DIMM slots.

AMD and Dell also confirmed that you will be able to upgrade this server with the next generation "Bulldozer" CPUs.



HP Proliant DL380 G7

CPU: Two Xeon X5670 at 2.93GHz
RAM: 15 x 4GB Samsung 1333MHz CH9
Motherboard: HP proprietary?
Chipset: Intel 5520
BIOS version: P67
PSU: 2 x HP PS-2461-1C-LF 460W HE

The 15 x 4GB is not a typo. We wanted to give each server the same amount of memory while making sure that each system was working at its highest performance. In other words, each memory channel had to be populated. In the case of the HP DL380 G7 we populated all nine DIMM slots of the first CPU, while the second CPU got only six DIMMs. This way each channel was populated, and the amount of memory (60GB) was close enough to the other systems (64GB). The extra power that one more DIMM would add is taken into account in the energy measurements: a DDR3 DIMM adds about 4W on average while active.

The HP DL380 line is probably the most popular server in the world. It comes standard with four fans and one CPU. If you buy a second CPU, two fans are added to the design. The DL380 has eight 2.5 inch drive bays.



HP's engineers have implemented quite a few great ideas: the number of sensors and the integration with the management software (iLO) is great. Lots of LEDs at the front panel give feedback to the administrator. The HP server ships with a PSU that is 92 to 94% efficient, and thus qualifies as an "80 PLUS Gold" PSU. The second, redundant PSU can be configured as "cold redundant", not consuming a single watt when it is not necessary.



The CPU heatsinks can simply be placed on the CPUs; you then close a metal "heatsink cage" to make sure the heatsinks apply the proper pressure. That makes replacing CPUs effortless and very safe.

But we are less enthusiastic about some of the "product differentiation" choices. For some weird reason, HP's servers always ship with a few 1GB DIMMs even if you have customized the server with several tens of gigabytes. The server is delivered with eight dummy drive bays, and you only get a functional drive bay for each disk that you order. The I/O cage is only fitted with one riser card: the second riser card must be ordered separately. While this makes sense for HP as a vendor, in our opinion it is not customer friendly. In many cases it leads to extra deployment delays, as buyers have to order something extra after the server has arrived.



Server number 3: the Quanta QSSC-4R or SGI Altix UV 10

CPU: Four Xeon X7560 at 2.26GHz
RAM: 16 x 4GB Samsung 1333MHz CH9
Motherboard: QCI QSSC-S4R 31S4RMB0000
Chipset: Intel 7500
BIOS version: QSSC-S4R.QCI.01.00.0026.04052010655
PSU: 4 x Delta DPS-850FB A S3F E62433-004 850W

The 50kg, 4U beast from Quanta that we reviewed a month ago represents the quad Xeon 7500 platform. The interesting thing about this server is that the hardware is identical to the SGI Altix UV 10: SGI confirms that the motherboard inside was designed by QSSC and Intel, and SGI's own product pictures indeed show an identical server.

Maximum expandability and scalability is the focus of this server: 10 PCIe slots, quad gigabit Ethernet onboard, and 64 DIMM slots. The downside of the enormous number of DIMM slots is the use of eight separate memory boards.



Each easily accessible memory board has two memory buffers. All these buffers require power, as shown by the heatsinks on top of them. The PSUs use a 2+2 configuration, but that is not necessarily a disadvantage. The PSU management logic is smart enough to make sure that the redundant PSUs do not waste any power at all: "Cold Redundancy".

The QSSC-4R server uses 130W TDP processors, and the X7560 is probably the best, albeit most expensive, choice for this server. The lower power 105W TDP Xeon E7540 has only six cores, less L3 cache (18MB instead of 24MB), and runs at 2GHz. So it is definitely questionable whether the performance/watt of the E7540 is better than that of the Xeon X7560, which has 33% more cores and runs at a 13% higher clock.
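Our hunch can be expressed as a back-of-envelope calculation. This is only a sketch: it assumes throughput scales linearly with cores times clock, which ignores cache and memory effects.

```python
# Back-of-envelope estimate: is the 105W Xeon E7540 or the 130W X7560 the
# better performance/watt choice? Assumes throughput scales linearly with
# cores x clock, a rough and optimistic simplification.
x7560 = {"cores": 8, "clock_ghz": 2.26, "tdp_w": 130}
e7540 = {"cores": 6, "clock_ghz": 2.00, "tdp_w": 105}

def rel_throughput(cpu):
    # crude proxy for integer throughput
    return cpu["cores"] * cpu["clock_ghz"]

perf_ratio = rel_throughput(x7560) / rel_throughput(e7540)  # ~1.51x
tdp_ratio = x7560["tdp_w"] / e7540["tdp_w"]                 # ~1.24x
perf_per_watt_ratio = perf_ratio / tdp_ratio                # ~1.22x

print(f"X7560 offers {perf_ratio:.2f}x the throughput for "
      f"{tdp_ratio:.2f}x the TDP: {perf_per_watt_ratio:.2f}x perf/TDP")
```

Even on this naive model the X7560 comes out ahead per watt, which is why we stick with the 130W parts.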



The Storage Setup

Some of our readers commented that we should give more insight into the storage configuration of vApus Mark II. We'll gladly provide more information.

All servers are equipped with an LSI SAS3442E-R PCIe x8 SAS controller connected via a 12Gbit/s (4 x 3Gbit/s) external SAS cable to the Promise J300s. The Promise J300s is equipped with nine 15000RPM 300GB Seagate Cheetah (SAS) drives. These drives contain all the VMs (and thus also the 135GB OLAP databases). The VMDKs are configured as thick provisioned, independent, and persistent.

The OLTP Oracle databases however are located on four internal 2.5 inch bays, which contain Intel X25-E SLC 64GB drives. You can see one sticking out of our Dell R815 above.



Stream: Memory Bandwidth

Bandwidth is an important factor for quite a few HPC and virtualization workloads. So although Stream is a synthetic benchmark, we feel it is interesting to include it: it gives a rough idea of which systems will shine in multi-threaded, bandwidth-limited applications.

 

Stream TRIAD on 64-bit Linux - maximum threads

Four CPUs, each with four DDR3 channels at 1333MHz, combined with excellent HT 3.0 CPU interconnects (2 x 6.4 GT/s + 2 x 3.2 GT/s links), give the highest Stream score we have encountered so far. The maximum theoretical bandwidth is limited by the 1.8GHz clock of the 64-bit memory controllers: with two dies per package, that is 2 x 8 bytes x 1.8GHz, or 28.8GB/s per socket. Four sockets should thus achieve about 115GB/s. So we get about 71% of the theoretical bandwidth with a decently but not extremely optimized binary. AMD tested with such a "benchmark" binary and achieved 110GB/s.
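The bandwidth arithmetic above can be laid out in a few lines. The topology numbers (two dies per package, one 64-bit controller per die at 1.8GHz) are as described in the text; the 71% efficiency figure is our measured result.

```python
# Theoretical bandwidth ceiling of the quad Opteron 6174, as set by the
# on-die memory controllers rather than the DDR3-1333 channels themselves.
sockets = 4
dies_per_socket = 2        # Magny-Cours: two six-core dies per package
controller_ghz = 1.8       # memory controller (northbridge) clock
bytes_per_cycle = 8        # 64-bit controller

per_socket_gbs = dies_per_socket * bytes_per_cycle * controller_ghz  # 28.8
theoretical_gbs = sockets * per_socket_gbs                           # 115.2

our_gbs = 0.71 * theoretical_gbs   # ~71% with our binary, ~82 GB/s
amd_gbs = 110                      # AMD's heavily tuned "benchmark" binary

print(f"theoretical ceiling: {theoretical_gbs:.1f} GB/s")
print(f"our binary: ~{our_gbs:.0f} GB/s; AMD's binary reaches "
      f"{amd_gbs / theoretical_gbs:.0%} of theoretical")
```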
 
The result is that AMD's twelve-core Opteron scores extremely well in typically bandwidth hungry HPC applications such as Ansys Fluent and LS-DYNA. Although the HPC server market is relatively small (about 5%), it is an important one for AMD, which has dominated it since the introduction of the Opteron back in 2003. The low clock speeds and delayed introduction of the AMD "Barcelona" in 2007 caused a lot of trouble in most server markets, but in a lot of HPC applications Barcelona was still faster than Intel's Xeon 5400. AMD did not lose significant market share in this niche until the introduction of the Xeon 5500 "Nehalem". The Magny-Cours Opteron has put an end to that period.



SAP S&D

The SAP S&D (Sales and Distribution, 2-tier internet configuration) benchmark is an interesting benchmark as it is a real world client-server application. We decided to look at SAP's benchmark database. The results below were all run on Windows Server 2008 EE (R2) and the MS SQL Server 2008 database (both 64-bit). Every "2-tier Sales & Distribution" benchmark was performed with SAP's latest ERP 6 enhancement package 4. These results are not comparable with any benchmark performed before 2009.

The new "2009" version of the benchmark obtains scores that are 25% lower. We analyzed the SAP Benchmark in-depth in one of our previous server oriented article. In a nutshell, SAP S&D scales well, on the condition that the processors have large caches and fast inter-cache coherency traffic.

SAP Sales & Distribution 2 Tier benchmark

The quad socket champion of SAP S&D is the IBM Power System 750 server, which scores 85220 SAPS (15600 users). In the x86 world, the 130W TDP eight-core quad Xeons rule the roost with a 20% advantage over the nearest competitor. The 95W TDP six-core, twelve-thread versions are a lot less impressive: the quad E7540 at 2GHz has a healthy but still somewhat underwhelming 30% performance advantage over the much cheaper dual socket configurations.

That does not matter too much in this kind of market: SAP software is costly to license and is even more costly to deploy and adapt to your needs. The hardware budget is a fraction of the total project budget, and as such SAP buyers are willing to pay a premium for the RAS features and top performance. This is Xeon X7560 territory.

Nevertheless, the quad Opteron offers very competitive performance against both the quad Xeon and the dual Xeon: it is 85% faster than the dual Xeons and offers 83% of the performance of the best quad Xeon system, which uses CPUs with a higher power envelope (130W versus 115W).



VMware's VMmark

Before we take a look at our own virtualization benchmarking, let us look at the VMmark scores available at the time of writing (end of August 2010).

VMware VMmark

According to VMmark, the quad Xeon X7560 is about 25% faster than the quad Opteron 6174. VMmark gives a rough idea, but only a rough one. We already wrote down our doubts about VMmark, but here is another one. The score of 75.34 is achieved with 300 VMs (50 tiles) and 512GB of RAM. That means that each physical Xeon 7500 core is shared by 9.4 VMs! To be honest, this in itself is not the real problem, as quite a few servers out there have lots of virtual CPUs mapped onto one CPU. The problem is that VMmark scores measure throughput, so the OEM benchmarking experts focus completely on throughput and not on response times. The result is that you get very low (slow), possibly even unacceptable, performance per VM.

Let us make this clearer. If you look at the first pages of the VMmark result disclosures of the Dell R815 or Dell R910, you'll see that the geometric mean score of one tile is around 1.5 (the number at the far right). To refresh your memory, a tile consists of five active workloads and one idle VM:

  • MS Exchange (2 CPUs)
  • SpecJBB (Java Server, 2 CPUs)
  • Apache web server VM (2 CPUs)
  • MySQL database VM (2 CPUs)
  • SAMBA file server VM (1 CPU)
  • Idle VM

If one tile gets a score of 1.5, it means that it is 50% faster than the reference system, which ran only one tile. However, the reference system was an old HP Proliant DL580 G2. This system contained two 2.2GHz single-core Intel Xeon CPUs with Hyper-Threading support, and had 16GB of memory. That is a 130nm Xeon "Gallatin", a CPU very similar to the Pentium 4 "Northwood" desktop CPU. This is a pretty old Xeon: it was introduced in March 2004. Gallatin had a 512KB L2 cache like "Northwood", but a 2MB L3 cache was added to improve scalability, as this was a Xeon MP processor made for quad socket configurations.
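To illustrate how a tile score condenses the five active workloads, here is a small sketch. The per-workload ratios below are hypothetical, picked only to land near the ~1.5 seen in the disclosures; VMmark normalizes each workload against the reference system and takes the geometric mean.

```python
import math

# One VMmark tile: five active workloads, each normalized against the 2004
# reference system. The ratios below are hypothetical, for illustration only.
workload_ratios = {
    "exchange": 1.60,   # mail server
    "specjbb": 1.40,    # Java server
    "apache": 1.50,     # web server
    "mysql": 1.55,      # database
    "samba": 1.45,      # file server
}

# Tile score = geometric mean of the per-workload ratios
tile_score = math.prod(workload_ratios.values()) ** (1 / len(workload_ratios))

# The headline score is roughly the sum over all tiles: fifty tiles at
# ~1.5 each is how a throughput-focused run reaches a score like 75.34.
headline = 50 * tile_score
print(f"tile score: {tile_score:.2f}, 50 tiles -> ~{headline:.1f}")
```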

Now Gallatin was a pretty decent CPU when it came out, but it was neither designed for nor suited to a virtualized consolidation scenario. It had no hardware virtualization support whatsoever, and the VM exit and entry overhead was no less than six times (and more!) worse than on the Xeon "Nehalem". You can imagine that running five applications on two of those single-core CPUs is not exactly a speedy experience. The file server achieved a "blazing" 10MB/s and the website (the e-commerce site of SPECweb2005) could keep up with about 17 hits per second. Achieving 50% more than that with an ultra modern system will not please many users. Imagine the surprise of the tens of users that have to share a 15MB/s stream while they connect via their gigabit Ethernet ports to the spanking new "state-of-the-art" server with 10Gbit Ethernet available.

So the trouble with VMmark is that the highest scores are only a measure of total throughput; the throughput of the individual applications is pretty miserable. It is not just a server running at 100%; it is a server that is completely overutilized. The benchmark favors throughput to the extreme, which may well exaggerate the differences between competing systems.



vApus Mark II

vApus Mark II is our newest benchmark suite that tests how well servers cope with virtualizing "heavy duty" applications. We explained the benchmark methodology in a previous article.

vApus Mark II score - VMware ESX 4.1
* 2 tiles instead of 4 tiles test
** 128GB instead of 64GB

Before we can even start analyzing these numbers, we must elaborate on some benchmark nuances. We had to test several platforms in two different setups to make sure the comparison was as fair as possible. First, let's look at the Xeon X7560.

The Xeon X7560 has two memory controllers, and each controller has two serial memory interfaces (SMIs). Each SMI connects to a memory buffer, and each buffer drives two DDR3 channels. To populate every channel, each CPU thus needs eight DIMMs, so our quad Xeon X7560 needs 32 DIMMs for maximum bandwidth. Now, we also want to do a performance/watt comparison of these servers, so we decided to test with 16 DIMMs (64GB) in all servers. With only 16 DIMMs, bandwidth goes down from 58GB/s to 38GB/s, and bandwidth has a tangible impact in a virtualized environment. Therefore, we tested with both 128GB and 64GB: the 128GB number represents the best performance of the quad Xeon X7560; the 64GB number allows us to determine performance/watt.
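The DIMM arithmetic can be summarized in a few lines. The fan-out follows the description above; the one-DIMM-per-channel minimum is our simplification for "enough to populate every channel".

```python
# Minimum DIMM count to populate every memory channel of the quad X7560
# (per-socket fan-out as described in the text).
controllers_per_cpu = 2       # memory controllers per Xeon 7500
smi_per_controller = 2        # serial memory interfaces per controller
buffers_per_smi = 1           # one memory buffer per SMI link
channels_per_buffer = 2       # DDR3 channels behind each buffer
dimms_per_channel = 1         # one DIMM per channel suffices for bandwidth

dimms_per_cpu = (controllers_per_cpu * smi_per_controller *
                 buffers_per_smi * channels_per_buffer * dimms_per_channel)
total_dimms = 4 * dimms_per_cpu

print(f"{dimms_per_cpu} DIMMs per CPU, {total_dimms} for the quad config")
# With only 16 DIMMs in the whole server, half the channels stay empty,
# which is why bandwidth drops from 58GB/s to 38GB/s.
```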

Next, the dual Opteron and dual Xeon numbers. We tested with both two- and four-tile virtualization scenarios. With two tiles we demand 36 virtual CPUs, which is more than enough to stress the dual socket servers. As these dual socket servers will be limited by memory capacity, we feel that the two-tile numbers are more representative. By comparing the two-tile numbers with the four-tile numbers, we take into account that the quad socket systems can leverage their higher number of DIMM slots. So comparing the two-tile (dual socket) with the four-tile (quad socket) results is closest to the real world. However, if you feel that keeping the load the same is more important, we added the four-tile numbers as well. Four-tile runs result in slightly higher scores for the dual socket systems, which is similar to how high VMmark scores are achieved. But if you look at the table below, you'll see that there is another reason why this is not the best way to benchmark:

The four-tile benchmark achieves higher throughput, but the individual tiles perform very badly. Remember that our reference scores (100%) are based on the quad-core Xeon X5570 at 2.93GHz. You can see that the four-tile runs achieve only 13% (Opteron) or 11% (Xeon) of a quad Xeon 5500 on the Oracle OLTP test. That means the OLTP VM gets less than the equivalent of a 1.5GHz Xeon X5570 (half a Xeon X5570 core). In the two-tile test, the OLTP VM gets the performance of a full Xeon X5570 core (in the case of AMD, probably 1.5 Opteron "Istanbul" cores).

In the real world, getting much more throughput at the expense of the response times of individual applications is acceptable for applications such as underutilized file servers and authentication servers (an Active Directory server might only see a spike at 9 AM). But vApus has always had the objective of measuring the performance of virtualized performance-critical applications such as important web services, OLAP, and OLTP databases. Since performance matters, we feel that the individual response times of the VMs are more important than pure throughput. For our further performance analysis we will use the two-tile numbers of the dual Xeon and dual Opteron.

The quad Xeon has a 15% advantage over the quad Magny-Cours. In our last article, we noted that the quad Xeon X7560 might make sense even to people who don't feel that RAS is their top priority: its performance advantage over the dual socket servers was compelling enough to consider buying a few quad Xeons instead of two to three times as many dual Xeons. However, the Dell R815 and the 48 AMD cores inside block the way downwards for the quad Intel platform. The price/performance of the Opteron platform is extremely attractive: you can almost buy two Dell R815s for the price of one quad Xeon server, and you get 85% of the performance.

The performance advantage over the dual Xeon X5670 is almost 80% for a price premium of about 30%. You would need about twice as many dual Intel servers, so this is excellent value. Only power can spoil AMD's value party; we'll look into that later in this article.

Although the quad Opteron 6136 may not enjoy the same fame as its twelve-core 6174 sibling, it is worth checking out. A Dell R815 equipped with four Opteron 6136s and 128GB costs about $12,000. Compared to a dual Xeon X5670 machine with 128GB, you save about $1000 and get essentially 40% more performance for free. Not bad at all. But won't that $1000 dissipate in the heat of extra power? Let us find out!
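The value argument can be made concrete. This sketch uses the rough street prices quoted above and normalizes performance to the dual Xeon X5670 system, with the quad 6136 at the ~40% advantage we describe; treat the exact figures as approximations.

```python
# Price/performance of the two configs discussed above.
# Performance is normalized to the dual Xeon X5670 system (1.0).
r815_6136 = {"price_usd": 12000, "rel_perf": 1.40}
dual_x5670 = {"price_usd": 13000, "rel_perf": 1.00}

def perf_per_kusd(cfg):
    # relative performance per $1000 spent
    return cfg["rel_perf"] / (cfg["price_usd"] / 1000)

advantage = perf_per_kusd(r815_6136) / perf_per_kusd(dual_x5670)
print(f"Quad 6136 R815 delivers {advantage:.2f}x the performance per dollar")
```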



Power Extremes: Idle and Full Load

Now that we have real OEM servers in the lab for all platforms, we can finally perform a decent power consumption comparison. All servers have 64GB and the disk configuration is also exactly the same (four SSDs). In the first test we report the power consumption running vApus Mark II, which means that the servers are working at 95-99% CPU load. Please note that although the CPU load is high, we are not measuring maximum power: you can attain higher power consumption numbers by using a floating point intensive benchmark such as Linpack. But for virtualized—mostly integer—workloads, this should be more or less the maximum power draw.

We test with the redundant power supplies working. So the Dell R815 uses 1+1 1100W PSUs, the SGI Altix UV 10 uses 2+2 850W PSUs, and the HP uses 1+1 460W PSUs.

vApus Mark II—VMware ESX 4.1—full load

You might think that the four 850W PSUs (2+2) are a serious disadvantage for the SGI server, but they are actually an advantage. The Dell and HP servers split their load over two PSUs, resulting in somewhat lower efficiency, while the redundant PSUs of the SGI server consume exactly... 0W. The power distribution board of the SGI Altix UV 10/QSSC-4R has a very "cool" feature called cold redundancy: although the redundancy is fully functional, the two redundant PSUs do not consume anything until you pull the active PSUs out.

The Dell R815 consumes less than two HP DL380 G7s, so the performance/watt ratio is competitive with the dual Xeon platform and without any doubt superior to the quad Xeon platform. If you compare the R815 with two Opterons to the HP DL380 G7, you will notice that the R815 is very efficient: the dual Opteron configuration hardly consumes more than the HP dual Xeon server, even though it has a 1100W PSU (not ideal when you are consuming only 360W) and, of course, a slightly more complex quad socket board. The quad socket R815 is thus very efficient, as the difference with a dual socket Xeon server is minimal.

Comparing the dual with the quad Opteron 6174 power numbers, we notice a relatively large increase in power: 244W. So for each Opteron that we add, we measure 122W at the wall. This 122W includes a few watts of PSU, VRM and DIMM wiring losses, so the real power consumed by the processor is probably somewhere between 100 and 110W. That is much closer to the TDP (115W) than to the ACP (80W) of this CPU.
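The extrapolation above, spelled out. The 90% combined conversion efficiency for PSU, VRM and DIMM wiring is our assumption, not a measurement.

```python
# From wall measurements to per-CPU power: the quad 6174 config draws 244W
# more than the dual config under vApus Mark II load.
delta_w = 244
per_cpu_at_wall = delta_w / 2          # 122W per added Opteron 6174

# Strip out PSU, VRM and DIMM-wiring losses, assuming ~90% combined
# conversion efficiency (our assumption):
est_cpu_w = per_cpu_at_wall * 0.90     # ~110W

print(f"~{est_cpu_w:.0f}W per CPU, versus 115W TDP and 80W ACP")
```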

Idle power measurements are hardly relevant for consolidated virtualized environments but they are a very interesting point of reference.

vApus Mark II—VMware ESX 4.1—idle power

As you can see, it is not only the 130W TDP Xeon X7560 that makes the SGI Altix UV 10/QSSC-4R consume so much. We measure a 372W difference between idle and full load, which is about 93W per CPU. That is not a huge difference if you consider that the delta is 350W for the four Opterons, and 170W for the two Xeon X5670s. The Xeon 7500 is capable of power gating its cores and will not consume much at idle. So we may say that the difference is not made by the CPUs: all the CPUs consume in the range of 90-110W under load.

The problem is the complexity of the server. The QSSC-4R/SGI Altix UV 10 pays a price for its scalability and serviceability: the memory riser boards alone consume almost 20W per board, so eight memory boards can add up to 160W. Being able to power many PCIe cards means the power budget grows even more, as the complexity of the I/O board is higher and the engineers have to size the power supply for many more I/O cards and DIMMs. The result is that the performance/watt ratio of the quad Xeon 7500 is rather mediocre: you need three times the power of an HP DL380 G7 and you only get twice the performance. At idle, it is even worse.

The Opteron 6174 needs a bit more power than its 80W ACP tag promises, but the performance/watt ratio is very good, on par with the HP DL380 G7. You need almost two HP DL380 G7s to achieve the same performance, yet the Dell R815 needs 10% less power than two DL380 G7s. So the Dell R815 is definitely a match for two DL380 G7s in the performance/watt category. And it beats two HP DL380 G7s by a healthy margin in other departments: CAPEX ($14,000 with 128GB versus 2 x $9,000 with 64GB), OPEX (only one machine to set up and manage), and rack space (2U vs. 4U).

But… maximum power and minimum power are not very realistic. How about a real world scenario?



Real World Power

In the real world you do not run your virtualized servers at their maximum just to measure the potential performance. Neither do they run idle. The user base will create a certain workload and expect this workload to be performed with the lowest response times. The service provider (that is you!) wants the server to finish the job with the least amount of energy consumed. So the general idea behind this new benchmark scenario is that each server runs exactly the same workload and that we then measure the amount of energy consumed. It is similar to our previous article about server power consumption, but the methodology has been enhanced.

We made a new benchmark scenario. In this scenario, we changed three things compared to the vApus Mark II scenario:

  1. The number of users or concurrency per VM was lowered significantly to throttle the load
  2. The OLTP VMs are omitted
  3. We ran with two tiles

vApus Mark II loads the server with up to 800 users per second on the OLAP test, up to 50 users per second on the website, and the OLTP test is performing transactions as fast as it can. The idea is to give the server so much work that it is constantly running at 95-99% CPU load, allowing us to measure throughput performance quite well. vApus Mark II is designed as a CPU/memory benchmark.

To create a real world "equal load" scenario, we throttle the number of users to a point where you typically get somewhere between 30% and 60% CPU load on modern servers. As we cannot throttle our OLTP VM (Swingbench), as far as we know, we discarded the OLTP VM in this test. If we let the OLTP test run at maximum speed, the OLTP VM would completely dominate the measurements.

We run two tiles with 14 vCPUs each (eight vCPUs for OLAP, three web servers with two vCPUs each per tile), so in total 28 virtual CPUs are active. There are some minor tasks in the background: a very lightly loaded Oracle database that feeds the three websites (one per tile), the VMware console (which idles most of the time), and of course the ESX hypervisor kernel. So all in all, there is load on about 30-31 vCPUs. That means that some of the cores of the server will be idling, just like in the real world. On the HP DL380 G7, this "equal workload" benchmark gives the following CPU load graph:

On the Y-axis is CPU load; on the X-axis is time. ESXtop was set up to measure CPU load every five seconds. Each test was performed three times: twice to measure performance and energy consumption, and a third time with extensive ESXtop monitoring. To avoid inflating the CPU load of the third run compared to the first two, we kept the sample interval at five seconds. We measure the energy consumption over 15 minutes.

vApus Mark II

Again, the dual Opteron numbers are somewhat high, as we are running them in a quad socket machine. A Dell R715 would probably consume about 5% less; if we get the chance, we'll verify this. But even if the dual Opteron numbers are not an ideal comparison with the dual Xeon, they give us interesting info.

Two Opteron CPUs consume 26.5 Wh (96.7 - 70.2). If we extrapolate, this means roughly 55% (53 Wh out of 97 Wh) of the total energy in our quad Opteron server is consumed by the four processors. Notice also that despite the small power handicap of the Opteron (a dual socket server would consume less), it was able to stay close to the Xeon X5670 based server at maximum power (360W vs. 330W). But once we introduce a 30-50% load, the gap between the dual Opteron and dual Xeon setups widens. In other words, the Opteron and Xeon are comparable at high loads, but the Xeon is able to save more power at lower loads. So there is still quite a bit of room for improvement: power gating should help the "Bulldozer" Opteron drive power consumption down at lower loads.
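The energy-share extrapolation above looks like this, assuming the two extra CPUs behave like the first two:

```python
# Energy share of the processors in the quad Opteron R815 over the
# 15-minute "equal workload" run, extrapolated from dual vs. quad.
quad_wh = 96.7
dual_wh = 70.2

two_cpus_wh = quad_wh - dual_wh     # 26.5 Wh for two Opteron 6174s
four_cpus_wh = 2 * two_cpus_wh      # ~53 Wh for all four (linear scaling)
cpu_share = four_cpus_wh / quad_wh  # ~55% of the server's total energy

print(f"{four_cpus_wh:.0f} Wh of {quad_wh} Wh -> {cpu_share:.0%} in the CPUs")
```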

OK, enough interesting tidbits: who has the best performance per watt ratio?



Response Times

At low to moderate utilization (30% to 60%), we cannot compare throughput: it is more or less the same on all machines. Response times make the difference here. It is important to interpret the numbers carefully though.

This might come as a surprise, but the dual Xeon X5670 inside the HP DL380 G7 comes out as the best (fastest) server here. The Xeon X5670 extracts more parallelism out of the code of a single thread and clocks one core quite a bit higher than the others. Response times are measured per URL/query, so single-threaded performance is the determining factor until all cores are working as hard as they can.

We are working with about 30 virtual CPUs, or “worlds” in the eyes of the ESX scheduler. The dual Xeon X5670 offers 24 Hardware Execution Contexts (HECs); the quad Opteron 6174 offers 48. However, the Opteron cannot leverage its HEC advantage enough in this scenario. The Xeon X7560 has more or less the same core as the X5670 at a lower clock, but it does not suffer from the small scheduling overhead the Xeon X5670 incurs by having fewer HECs than running worlds. That is why the 2.26GHz Xeon X7560 offers only 10-15% higher response times.
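The scheduling pressure can be sketched as a simple vCPU-to-HEC ratio, using the figures from the text (~30 active worlds, 24 HECs for the dual X5670, 48 for the quad 6174):

```python
# Rough sketch of the oversubscription ratio the ESX scheduler sees in this
# test. World and HEC counts are the figures quoted in the text.
active_worlds = 30

for server, hecs in [("dual Xeon X5670", 24), ("quad Opteron 6174", 48)]:
    ratio = active_worlds / hecs
    note = "worlds must time-share HECs" if ratio > 1 else "every world gets a HEC"
    print(f"{server}: {active_worlds}/{hecs} = {ratio:.2f} -> {note}")
```

Once the ratio rises above 1.0, the scheduler has to time-slice worlds across execution contexts, which adds the small latency overhead described above.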

So how important is this? Is the Xeon twice as fast as the Opteron? Not really. Remember that we measured this over a low latency LAN. A typical web request sent from Europe to the AnandTech server in North Carolina will take up to 400 ms. In that scenario, the extra 100 ms difference between the Xeon and Opteron starts to fade.
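A simple way to see the effect, using the 100 ms gap and the 400 ms transatlantic round trip from the text (the ~1 ms LAN latency is our own illustrative assumption):

```python
# Illustrative only: how a fixed server-side response time gap shrinks as a
# share of the total once network round-trip time is added on top.
def perceived_gap(server_gap_ms: float, network_rtt_ms: float) -> float:
    """Fraction of the total user-visible latency caused by the server gap."""
    return server_gap_ms / (network_rtt_ms + server_gap_ms)

print(f"LAN (~1 ms RTT):  {perceived_gap(100, 1):.0%}")   # the gap dominates
print(f"WAN (400 ms RTT): {perceived_gap(100, 400):.0%}") # 20% of total
```

On the LAN the 100 ms difference is nearly all of what the client measures; behind a 400 ms WAN round trip it is only a fifth of the total latency.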

The higher the load, the more the Opteron will narrow the gap as it starts to leverage the higher throughput.

The difference in user experience is hardly as dramatic as the numbers indicate. Whether you care will also depend on the application. Some web requests can take up to 2 seconds (220 ms is only an average), so it really depends on how complex your application is. If you run at a light load and the heaviest requests are answered within half a second, nobody will notice whether that is 300 or 180 ms. But if some of your requests take more than a second even under "normal" load, this difference will be noticeable.

So response time under "normal" load might not be as important as under heavy load, but the numbers above also show that throughput is not everything. Single-threaded performance is still important, and we definitely feel that the UltraSPARC T2 approach is the wrong one for most business applications out there. A good balance between single-threaded performance and core count is still advisable for web applications that get heavier as we build upon feature-rich Content Management Systems.

Once we load the systems close to their maximum, a totally different picture emerges. Below you can see the response times with much higher concurrencies and the four tiles of the full vApus Mark II test. Remember that the concurrencies are 10 times higher and the OLTP test is included.

The quad Xeon wins in the web tests while the quad Opteron leads in the OLAP tests. The OLAP test is more bandwidth sensitive and that is one of the reasons that the quad Opteron configurations excel there.

The dual Xeon X5670 has only 24 HECs to offer while 72 worlds are constantly demanding CPU power. No wonder the dual Xeon is completely swamped and, as a result, has the worst response times.



Putting It All Together

Finally, we have been able to offer you a comparison of real OEM servers. In this article we tried some new approaches in our testing methods: we measured and compared response times and energy consumption, instead of the usual focus on throughput and maximum/idle power. It is important to take a step back and look at all our benchmark data from the point of view of a server buyer.

Let's start with the quad Xeon 7500 server: the SGI Altix UV 10 or QSSC-4R. Based on our performance numbers alone, we felt that one quad Xeon 7500 server could replace two or more dual Xeon servers, as the performance/price was right. The price is about 2.5x higher than a dual Xeon, but you get twice the performance, more expandability (PCIe and DIMM slots), and superior RAS as a bonus. Remember, a Xeon MP with a price/performance ratio that could rival that of a dual Xeon was a first.

But the appearance of the Dell R815 and the high energy consumption make the SGI/QSSC server retreat to its typical (and very profitable) target markets: ERP and databases with large memory footprints, where RAS is not a bonus but the top priority. The performance was a pleasant surprise and the power consumption of the CPUs was decent. Make sure you populate at least 32 DIMMs, as bandwidth takes a dive at lower DIMM counts.

The power consumption of the platform, especially looking at the idle numbers, remains a weak spot. We know that scalability and availability come with a price, but three times higher energy consumption than a dual socket server is too much to convince us that the quad Xeon platform is an attractive virtualization building block.

The HP ProLiant DL380 G7 surprised with better than expected energy consumption and some really clever engineering (CPU cage, cold redundancy, energy management...). The high single-threaded performance of the Xeon X5670 leads to low response times in many real world circumstances. At high loads, however, it is outperformed by the Dell R815, which is hardly more expensive.

With 80% higher DIMM counts and 80% to 85% higher throughput, the Dell PowerEdge R815 surpasses the rival HP DL380 G7 by a large margin, while at the same time costing only 20-30% more and needing just as much rack space. That is amazing value. While the price/performance ratio blew us away, we were also hoping that a single R815 could beat the performance/watt ratio of two HP DL380 G7s by a significant margin. That would have been the cherry on the cake, but it did not happen.
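The value claim above can be put in rough numbers. Taking the midpoints of the quoted ranges (our own simplification) gives the R815 a clear price/performance edge:

```python
# Back-of-the-envelope price/performance comparison from the ranges above.
# Midpoints of the quoted ranges are our own assumption, not measured values.
throughput_gain = 1.825   # midpoint of "80% to 85% higher throughput"
price_premium = 1.25      # midpoint of "costing only 20-30% more"

perf_per_dollar = throughput_gain / price_premium
print(f"R815 perf/$ vs DL380 G7: {perf_per_dollar:.2f}x")  # 1.46x
```

Even at the pessimistic ends of both ranges (1.80x throughput at 1.30x price), the R815 still comes out well ahead per dollar.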

The server is not to blame; rather, the CPUs consume more than the ACP ratings that AMD mentions everywhere. The truth is that at high load the CPUs consume much closer to their TDP numbers than their ACP ones. However, the performance per watt ratio of the complete server is still competitive. The lower single-threaded performance per core is a disadvantage in applications with complex webpages. We would avoid the low end Opteron 6100s.

The bottom line is that Dell's R815 can replace two HP DL380 G7s at a much lower investment cost, with about the same energy costs and lower management costs. Having to manage half as many physical servers should, after all, also lower operating costs. Dell's PowerEdge R815 delivers on AMD's promise of the "value 4P server".

 

My special thanks goes out to Tijl Deneut for his benchmarking assistance.



In Summary

Below we have a quick summary of the pros and cons for each of the three servers we talked about today. As always, the full story can't be distilled down to a few lines but if you're looking for a recap hopefully this helps.

SGI Altix UV 10/ QSSC-4R

Pro:

  • Extreme expandability: 64 DIMMs, 10 PCIe slots
  • Cold Redundancy and 2 + 2 PSU
  • RAS features
  • Low response times in all kinds of circumstances (X7560)

 

Con:

  • High power consumption in general, especially at low utilization
  • Only the X7560 delivers really high end performance

 

Dell PowerEdge R815

Pro:

  • 2U rack space saving form factor and...
  • Performance and expandability of a 4U
  • Best price/performance ratio of the market
  • Competitive performance/watt

Con:

  • Higher response times for the individual VMs in some real world situations (20-40% load)

 

HP ProLiant DL380 G7

Pro:

  • Low response times in most "non-peak" situations
  • Cold redundancy
  • Competitive performance/watt
  • Very low energy use when underutilized

Con:

  • Worse price/performance ratio than the Dell R815
  • Too many "options" (risers, unnecessary DIMMs, drive bays)
