Original Link: http://www.anandtech.com/show/5553/the-xeon-e52600-dual-sandybridge-for-servers



Intel's Sandy Bridge architecture was introduced to desktop users more than a year ago. Server parts, however, have been much slower to arrive, as it has taken Intel that long to transpose this new engine into a Xeon processor. Although the core architecture is the same, the system architecture is significantly different from the LGA-1155 CPUs, making this CPU quite a challenge, even for Intel. Completing their work late last year, Intel first introduced the resulting design as the six-core high-end Sandy Bridge-E desktop CPU, and since then has been preparing SNB-E for use in Xeon processors. This has taken a few more months, but the wait for Xeon users is finally over: today Intel is launching its first SNB-E based Xeons.

Compared to its predecessor, the Xeon X5600, the Xeon E5-2600 offers a number of improvements:

A completely improved core, as described here in Anand's article. For example, the µop cache lowers the pressure on the decoding stages and lowers power consumption, killing two birds with one stone. Other core improvements include an improved branch prediction unit and a more efficient Out-of-Order backend with larger buffers.

A vastly improved Turbo 2.0. The CPU can briefly go beyond the TDP limits, and when returning to the TDP limit, the CPU can sustain a higher "steady-state" clockspeed. According to Intel, enabling turbo allows the Xeon E5 to perform 14% better in the SAP S&D 2 tier test. This compares well with the Turbo inside the Xeon 5600, which could only boost performance by 4% in the SAP benchmark.

Support for AVX Instructions combined with doubling the load bandwidth should allow the Xeon to double the peak floating point performance compared to the Xeon "Westmere" 5600.

A bi-directional 32 byte ring interconnect that connects the 8 cores, the L3-cache, the QPI agent and the integrated memory controller. The ring replaces the individual wires from each core to the L3-cache. One of the advantages is that the wiring to the L3-cache can be simplified and it is easier to make the bandwidth scale with the number of cores. The disadvantage is that the latency is variable: it depends on how many hops a certain piece of data inside the L3-cache must cross before it ends up at the right core.

A faster QPI: revision 1.1, which delivers up to 8 GT/s instead of 6.4 GT/s (Westmere).

Lower latency to PCI-e devices. Intel integrated a PCIe 3.0 I/O subsystem inside the die which sits on the same bi-directional 32 byte ring as the cores. PCIe 3.0 runs at 8 GT/s (PCIe 2.0: 5 GT/s), but the encoding has less overhead. As a result, PCIe 3.0 can deliver up to 1 GB/s per lane in each direction (full duplex), which is twice as much as PCIe 2.0.

Removing the separate I/O hub lowered PCIe latency by 25% on average according to Intel. If you only access the local memory, Intel measured 32% lower read latency.

The access latency to PCIe I/O devices is not only significantly lower, but Intel's Data Direct I/O Technology allows the PCIe NICs to read and write directly to the L3-cache instead of to the main memory. In extremely bandwidth constrained situations (using 4 infiniband controllers or similar), this lowers power consumption and reduces latency by another 18%, which is a boon to HPC users with 10G Ethernet or Infiniband NICs.

The new Xeon also supports faster DDR3-1600: up to 2 DIMMs per channel can run at 1600 MHz.

Last but certainly not least: 2 additional cores and up to 66% more L3 cache (20 MB instead of 12 MB). Even with 8 cores and a PCIe agent (40 lanes), the Xeon E5 still runs at 2.2 GHz within a 95W TDP power envelope. Pretty impressive when compared with both the Opteron 6200 and Xeon 5600.
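As a quick sanity check on the PCIe figures in the list above, here is a small Python sketch that derives per-lane bandwidth from the transfer rate and the encoding overhead. The encoding schemes (8b/10b for PCIe 2.0, 128b/130b for PCIe 3.0) come from the PCIe specifications rather than from Intel's material, and the function name is ours, so treat it as an illustration only.

```python
# Per-lane PCIe throughput from transfer rate and line-encoding efficiency.
# Assumption: 8b/10b encoding for PCIe 2.0 and 128b/130b for PCIe 3.0
# (from the PCIe specifications, not from the article).

def lane_bandwidth_mb_s(gigatransfers_per_s, payload_bits, encoded_bits):
    """Usable bandwidth of one lane in one direction, in MB/s."""
    raw_bits_per_s = gigatransfers_per_s * 1e9            # one bit per transfer per lane
    usable_bits_per_s = raw_bits_per_s * payload_bits / encoded_bits
    return usable_bits_per_s / 8 / 1e6                    # bits -> bytes -> MB

pcie2 = lane_bandwidth_mb_s(5.0, 8, 10)       # 500 MB/s
pcie3 = lane_bandwidth_mb_s(8.0, 128, 130)    # a little under 1 GB/s

print(f"PCIe 2.0: {pcie2:.0f} MB/s per lane per direction")
print(f"PCIe 3.0: {pcie3:.0f} MB/s per lane per direction")
print(f"Ratio:    {pcie3 / pcie2:.2f}x")
```

The raw transfer rate only goes up by 60%, so most of the remaining gain comes from the leaner encoding. That is how PCIe 3.0 ends up at roughly twice the per-lane bandwidth of PCIe 2.0.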



The massive 416 mm² chip contains no less than 2263 million transistors. Each generation of Intel and AMD server CPUs seems to get a bit larger, as you can see below.

The Xeon 5400, 5500/5600 and E5-2600 package on top, the Opteron 2300/8300 and 6100/6200 below.

So how does the new Xeon compare to the older Xeons and the latest Opterons? Let's take a look at the paper specs:

|  | Xeon E5-2600 "Sandy Bridge EP" | Opteron 6200 "Interlagos" | Opteron 6100 "Magny-cours" | Xeon 5600 "Westmere" |
| --- | --- | --- | --- | --- |
| Cores (Modules)/Threads | 8/16 | 8/16 | 12/12 | 6/12 |
| L1 Instruction | 8x 32 KB 4-way | 8x 64 KB 2-way | 12x 64 KB 2-way | 6x 32 KB 4-way |
| L1 Data | 8x 32 KB 8-way | 16x 16 KB 4-way | 12x 64 KB 2-way | 6x 32 KB 8-way |
| L2 Cache | 8x 256 KB | 4x 2 MB | 12x 0.5 MB | 6x 256 KB |
| L3 Cache | 20 MB | 2x 8 MB | 2x 6 MB | 12 MB |
| Max. Memory Bandwidth (per socket) | 51.2 GB/s | 51.2 GB/s | 42.6 GB/s | 32 GB/s |
| IMC Clock Speed | = core speed | 2 GHz | 1.8 GHz | 2 GHz |
| Interconnect | 2x QPI 2.0 (8 GT/s) | 4x HT 3.1 (6.4 GT/s) | 4x HT 3.1 (6.4 GT/s) | 2x QPI (4.8-6.4 GT/s) |
| Transistors (billion) | 2.26 | 2x 1.2 | 2x 0.904 | 1.17 |
| Die Size (mm²) | 416 | 2x 315 | 2x 346 | 248 |
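The memory bandwidth row in the spec table can be reproduced with a few lines of Python. This is only a sketch: the channel counts are the ones mentioned in this article, while the DDR3-1333 speeds for the Opteron 6100 and Xeon 5600 are our assumption, as is the usual 64-bit (8 byte) data bus per channel.

```python
# Peak theoretical memory bandwidth per socket:
# channels x transfers per second x bytes per transfer.

def peak_memory_bandwidth_gb_s(channels, megatransfers_per_s, bus_bytes=8):
    """Peak bandwidth in GB/s for a given channel count and DDR3 speed grade."""
    return channels * megatransfers_per_s * 1e6 * bus_bytes / 1e9

# (channels, MT/s). The 1333 MT/s entries are assumptions, not article data.
PLATFORMS = {
    "Xeon E5-2600": (4, 1600),
    "Opteron 6200": (4, 1600),
    "Opteron 6100": (4, 1333),
    "Xeon 5600":    (3, 1333),
}

for name, (channels, speed) in PLATFORMS.items():
    print(f"{name:13s} {peak_memory_bandwidth_gb_s(channels, speed):5.1f} GB/s")
```

Under those assumptions the results land within rounding of the table (51.2, 51.2, about 42.7 and about 32 GB/s). Keep in mind these are theoretical peaks, not measured throughput.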

The new Xeon comes with a huge die, and with its ring interconnect and improved RAS, it starts to look more like a successor of the Westmere-EX than the Westmere-EP Xeon. In fact the ring of the Xeon E5 is more advanced: it has a PCIe agent, PCU and IMC on the same ring as the 8 cores.

The massive die, the two extra cores, the integration of the PCIe controller and no competition in the high-end have made it easier for Intel to justify a price increase. The Sandy Bridge EP is somewhat more expensive than its predecessor, as you can see in the table below. The first clockspeed mentioned is the regular clock, the second the turbo clock with all cores active (most realistic one) and the last the maximum turbo clock.

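Intel new vs. Intel 2-socket SKU Comparison

| Xeon 5600 | Cores/Threads | TDP | Clock (GHz) | Price | Xeon E5 | Cores/Threads | TDP | Clock (GHz) | Price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| High Performance |  |  |  |  | High Performance |  |  |  |  |
|  |  |  |  |  | 2690 | 8/16 | 135W | 2.9/3.3/3.8 | $2057 |
| X5690 | 6/12 | 130W | 3.46/3.6/3.73 | $1663 | 2680 | 8/16 | 130W | 2.7/3.1/3.5 | $1723 |
|  |  |  |  |  | 2670 | 8/16 | 115W | 2.6/3/3.3 | $1552 |
|  |  |  |  |  | 2665 | 8/16 | 115W | 2.4/2.8/3.1 | $1440 |
| X5675 | 6/12 | 95W | 3.06/3.33/3.46 | $1440 |  |  |  |  |  |
| X5660 | 6/12 | 95W | 2.8/3.06/3.2 | $1219 | 2660 | 8/16 | 95W | 2.2/2.6/3.0 | $1329 |
| X5650 | 6/12 | 95W | 2.66/2.93/3.06 | $996 | 2650 | 8/16 | 95W | 2/2.4/2.8 | $1107 |
| Midrange |  |  |  |  | Midrange |  |  |  |  |
| E5649 | 6/12 | 80W | 2.53/2.66/2.8 | $774 | 2640 | 6/12 | 95W | 2.5/2.5/3 | $885 |
|  |  |  |  |  | 2630 | 6/12 | 95W | 2.3/2.3/2.8 | $612 |
| E5645 | 6/12 | 80W | 2.4/2.53/2.66 | $551 |  |  |  |  |  |
|  |  |  |  |  | 2620 | 6/12 | 95W | 2/2/2.5 | $406 |
| E5620 | 4/8 | 80W | 2.4/2.53/2.66 | $387 |  |  |  |  |  |
| High clock / budget |  |  |  |  | High clock / budget |  |  |  |  |
| X5647 | 4/8 | 130W | 2.93/3.06/3.2 | $774 | 2643 | 4/8 | 130W | 3.3/3.3/3.5 | $885 |
| E5630 | 4/8 | 80W | 2.53/2.66/2.8 | $551 |  |  |  |  |  |
| E5607 | 4/4 | 80W | 2.26 | $276 | 2609 | 4/4 | 80W | 2.4 | $294 |
| Power Optimized |  |  |  |  | Power Optimized |  |  |  |  |
| L5640 | 6/12 | 60W | 2.26/2.4/2.66 | $996 | 2650L | 8/16 | 70W | 1.8/2/2.3 | $1107 |
| L5630 | 4/8 | 40W | 2.13/2.26/2.4 | $551 | 2630L | 8/16 | 60W | 2/2/2.5 | $662 |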

The Xeon E5-2690's somewhat out of the ordinary TDP (135W) is easy to explain. With a very small TDP increase (+5W), Intel's engineers noticed they could raise the clock of the best SKU by another 200 MHz, from 2.7 GHz (130W) to 2.9 GHz. The E5-2690 was more or less a safeguard in the event that the Interlagos Opteron turned out to be a real "Bulldozer". As the Opteron could not meet these expectations, the high performance of the 135W chip allows Intel to ask more than $2000 for its best Xeon EP, which is quite a bit more than what the best Xeon EP has sold for so far ($1500-1600).

Since the new Xeon has two extra cores and integrates the I/O hub (IOH), it is understandable that the TDP values are a bit higher compared to the older Xeon.

How do these new Xeon SKUs compare to the Opteron? See below.

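AMD vs. Intel 2-socket SKU Comparison

| Xeon E5 | Cores/Threads | TDP | Clock (GHz) | Price | Opteron | Modules/Integer cores | TDP | Clock (GHz) | Price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| High Performance |  |  |  |  | High Performance |  |  |  |  |
| 2665 | 8/16 | 115W | 2.4/2.8/3.1 | $1440 |  |  |  |  |  |
| 2650 | 8/16 | 95W | 2/2.4/2.8 | $1107 | 6282 SE | 8/16 | 140W | 2.6/3.0/3.3 | $1019 |
| Midrange |  |  |  |  | Midrange |  |  |  |  |
| 2640 | 6/12 | 95W | 2.5/2.5/3 | $885 | 6276 | 8/16 | 115W | 2.3/2.6/3.2 | $788 |
| 2630 | 6/12 | 95W | 2.3/2.3/2.8 | $639 | 6274 | 8/16 | 115W | 2.2/2.5/3.1 | $639 |
|  |  |  |  |  | 6272 | 8/16 | 115W | 2.0/2.4/3.0 | $523 |
| 2620 | 6/12 | 95W | 2/2/2.5 | $406 | 6238 | 6/12 | 115W | 2.6/2.9/3.2 | $455 |
|  |  |  |  |  | 6234 | 6/12 | 115W | 2.4/2.7/3.0 | $377 |
| High clock / budget |  |  |  |  | High clock / budget |  |  |  |  |
| 2643 | 4/8 | 130W | 3.3/3.3/3.5 | $885 |  |  |  |  |  |
|  |  |  |  |  | 6220 | 4/8 | 115W | 3.0/3.3/3.6 | $455 |
| 2609 | 4/4 | 80W | 2.4 | $294 | 6212 | 4/8 | 115W | 2.6/2.9/3.2 | $266 |
| Power Optimized |  |  |  |  | Power Optimized |  |  |  |  |
| 2630L | 8/16 | 60W | 2/2/2.5 | $662 | 6262HE | 8/16 | 85W | 1.6/2.1/2.9 | $523 |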

Let's start with the midrange first, as the competition is the fiercest there and these SKUs are among the most popular on the market. Based on the paper specs, AMD's 6276 and 6274 and Intel's 2640 and 2630 are in a neck-and-neck race. AMD offers 16 smaller integer clusters, while Intel offers 6 or 8 heavy, slightly higher clocked cores with SMT. And while we did not receive a Xeon E5-2630 for benchmarking purposes, we were able to quickly simulate one by disabling 2 cores of our Xeon 2660, which gave us a six-core processor at 2.2 GHz with 20 MB L3-cache. This six-core 2660 should perform very similarly to the real Xeon 2630, which is clocked 4.5% higher but has 5 MB less L3-cache.

Meanwhile, in the high performance segment we'll be comparing our six-core 2660 with the Opteron 6276. The CPUs in this comparison aren't going to be in the same price bracket, but as the AMD platform is typically a bit cheaper, the 2660 and the Opteron 6276 end up having similar total platform costs. For a more straightforward comparison based solely on CPU prices, the 2660's closest competitor would be the Opteron 6274. We don't have one of those on hand, but you can get a pretty good idea of how that would compare by knocking 4% off of the performance of the 6276.

Finally, for the "Power Optimized" market there seems to be little contest over who is going to win there. Intel's chip is a bit more expensive, but it offers a much lower TDP, just as many threads, and a higher clockspeed. Considering that the Intel chip also integrates the PCIe controller, it looks like Intel will have no trouble winning this battle by a landslide. Fortunately for AMD, this review is mostly about the more popular midrange market.



The Intel S2600GZ board

Our Intel R2208GZ4GSSPP had the Intel S2600GZ "Grizzly Pass" board inside. The board has been qualified for all major virtualization solutions: Citrix XenServer, Hyper-V, SLES 11, Oracle's VM server, RHEL and VMware vSphere of course. It can also be used as the basis for software from almost every independent storage software vendor: DataCore, Falconstore, Gluster, Microsoft's iSCSI target, Nexenta, Open-E and Stormagic.

The board has a cooling and power-friendly spread core design: the airflow of one heatsink does not get used to cool another heatsink. The board features up to 24 DIMMs, which support Low Power, Unregistered and Registered DDR3 (up to 1600 MHz) and LR-DIMMs. Four GbE interfaces are on board and an optional I/O module can add dual 10 GbE (Base-T or optical) or QDR InfiniBand. Meanwhile the C600 chipset offers 8 SAS/SATA ports (2x 6G) and a PCIe 3.0 x8 module slot for storage purposes. This slot can be used for setups such as the LSI 2208 dual core ROC controller based RAID card with two 8087 SAS connectors and 1 GB of 1033 MHz DDR3 cache.

Last but not least: the board has two PCIe x24 "super slots" which allow for the use of two risers. Each riser contains 3 PCIe 3.0 x8 slots: two half height slots and a full height slot. Finally, powering the system is a small 750 Watt PSU rated for 80 PLUS Platinum.



Supermicro's Latest Twin

We got a sneak peek at Supermicro's brand new Twin 2U server: the SYS-6027TR-D71FRF. The 2U chassis has two dual Xeon E5 based servers inside that are fed by two fully redundant 1280W PSUs (at 180-230V, 1000W at 100-140V).

The two servers are held in place using screwless clips.

You get the density of a 1U server without needing four PSUs for redundancy and without the very power hungry 40 mm fans. Indeed, using only 2 PSUs and 80 mm fans should save quite a bit of power compared to two 1U servers. Last time we measured, the Twin servers consumed 6% less power than the best 1U servers on the market.

At the same time, the expansion capabilities are better: you get two full height and one half height PCIe 3.0 (!) x16 (x8 electrical) slots. The only disadvantage is that you only get 4 DIMM slots per CPU, which generally limits each server to about 128 GB of RAM (8 x 16 GB) unless you go with expensive 32 GB LR-DIMMs for a total of 256 GB. Therefore this server is probably better for HPC workloads than for memory intensive virtualization and database applications.

This new Twin server also features FDR InfiniBand interconnect technology, good for 56Gb/s (!) low latency network connections with an X4 cable. This should work especially well in tandem with Intel Data Direct I/O technology, where packets are directly transferred into the Last Level Cache (LLC) instead of being DMAed to the memory. This is something we'll be investigating in a later article.



Benchmark Configuration

Unfortunately, the Intel R2208GZ4GSSPP is a 2U server, which makes it hard to compare it with the 1U Opteron "Interlagos" and 1U "Westmere EP" servers we have tested in the past. We will be showing you a few power consumption numbers, but since a direct comparison isn't possible, please take them with a grain of salt.

Intel's Xeon E5 server R2208GZ4GSSPP (2U Chassis)

CPU Intel Xeon processor E5-2690 (2.9 GHz, 8c, 20MB L3, 135W)
Intel Xeon processor E5-2660 (2.2 GHz, 8c, 20MB L3, 95W)
RAM 64 GB (8x8GB) DDR3-1600 Samsung M393B1K70DH0-CK0
Motherboard Intel Server Board S2600GZ "Grizzly Pass"
Chipset Intel C600
BIOS version SE5C600.86B (01/06/2012)
PSU Intel 750W DPS-750XB A (80+ Platinum)

The Xeon E5 CPUs have four memory channels per CPU and support DDR3-1600, and thus our dual CPU configuration gets eight DIMMs for maximum bandwidth. The typical BIOS settings can be found below.

Not shown: all prefetchers were enabled in all tests.

Supermicro A+ Opteron server 1022G-URG (1U Chassis)

CPU Two AMD Opteron "Bulldozer" 6276 at 2.3GHz
Two AMD Opteron "Magny-Cours" 6174 at 2.2GHz
RAM 64GB (8x8GB) DDR3-1600 Samsung M393B1K70DH0-CK0
Motherboard SuperMicro H8DGU-F
Internal Disks 2 x Intel SLC X25-E 32GB or
1 x Intel MLC SSD510 120GB
Chipset AMD Chipset SR5670 + SP5100
BIOS version v2.81 (10/28/2011)
PSU SuperMicro PWS-704P-1R 750Watt

The same is true for the latest AMD Opterons: eight DDR3-1600 DIMMs for maximum bandwidth. You can find the BIOS settings of our Opteron machine here. C6 was enabled.

Asus RS700-E6/RS4 1U Server

CPU Two Intel Xeon X5670 at 2.93GHz - 6 cores
Two Intel Xeon X5650 at 2.66GHz - 6 cores
RAM 48GB (12x4GB) Kingston DDR3-1333 FB372D3D4P13C9ED1
Motherboard Asus Z8PS-D12-1U
Chipset Intel 5520
BIOS version 1102 (08/25/2011)
PSU 770W Delta Electronics DPS-770AB

To speed up testing, we tested the Intel Xeon and AMD Opteron systems in parallel. As we didn't have more than eight 8GB DIMMs, we used our 4GB DDR3-1333 DIMMs. The Xeon system only gets 48GB, but this isn't a disadvantage as our highest memory footprint benchmark (vApus FOS, 5 tiles) uses no more than 40GB of RAM.

Finally, we measured the difference between 12x4GB and 8x8GB of RAM and recalculated the power consumption for our power measurements (note that the differences were very small). There is no alternative as our Xeon has three memory channels and cannot be outfitted with the same amount of RAM as our Opteron system (four channels).

Common Storage System

For the virtualization tests, each server gets an Adaptec 5085 PCIe x8 card (driver aacraid v1.1-5.1[2459] b 469512) connected to six Cheetah 300GB 15000 RPM SAS disks (RAID-0) inside a Promise JBOD J300. The virtualization testing requires more storage IOPS than our standard Promise JBOD with six SAS drives can provide. To counter this, we added internal SSDs:

  • We installed the Oracle Swingbench VMs (vApus Mark II) on two internal X25-E SSDs (no RAID). The Oracle database is only 6GB big. We test with two tiles. On each SSD, each OLTP VM accesses its own database data. All other VMs (web, SQL Server OLAP) are stored on the Promise JBOD (see above).
  • With vApus FOS, Zimbra is the I/O intensive VM. We spread the Zimbra data over the two Intel X25-E SSDs (no RAID). All other VMs (web, MySQL OLAP) get their data from the Promise JBOD (see above).

We monitored disk activity and measured the physical disk adapter latency (as reported by VMware vSphere) at between 0.5 and 2.5 ms.

Software Configuration

All vApus testing was done on ESXi vSphere 5, or VMware ESXi 5.0.0 (b 469512 - VMkernel SMP build-348481 Jan-12-2011 x86_64) to be more specific. All vmdks use thick provisioning and are independent and persistent. The power policy is "Balanced Power" unless indicated otherwise. All other testing was done on Windows 2008 Enterprise R2 SP1. Unless noted otherwise, we used the "High Performance" setting on Windows 2008 R2 SP1.

Other Notes

Both servers were fed by a standard European 230V (16 Amps max.) powerline. The room temperature was monitored and kept at 23°C by our Airwell CRACs.

We used the Racktivity ES1008 Energy Switch PDU to measure power consumption. Using a PDU for accurate power measurements might seem pretty insane, but this is not your average PDU. Measurement circuits of most PDUs assume that the incoming AC is a perfect sine wave, but it never is. The Racktivity PDU, however, measures true RMS current and voltage at a very high sample rate: up to 20,000 measurements per second for the complete PDU.
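To illustrate why true RMS measurement matters, here is a minimal Python sketch that compares a true RMS calculation with a meter that assumes a perfect sine wave. We model the "sine assumption" as an average-responding meter (rectified mean multiplied by the sine form factor of about 1.11), which is a common simplification and our assumption, not a description of any particular PDU. The waveforms and function names are purely illustrative.

```python
import math

def true_rms(samples):
    """Root of the mean of the squared samples: valid for any waveform."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def sine_assumed_rms(samples):
    """Average-responding estimate: rectified mean scaled by the sine form
    factor pi / (2 * sqrt(2)). Only correct for a pure sine wave."""
    rectified_mean = sum(abs(s) for s in samples) / len(samples)
    return rectified_mean * math.pi / (2 * math.sqrt(2))

def one_cycle(waveform, n=2000):
    """Sample one full period of a waveform given as a function of phase."""
    return [waveform(2 * math.pi * k / n) for k in range(n)]

sine = one_cycle(math.sin)
# A square wave stands in for a heavily distorted current waveform.
square = one_cycle(lambda x: 1.0 if math.sin(x) >= 0 else -1.0)

for name, wave in (("sine", sine), ("square", square)):
    print(f"{name:6s} true RMS = {true_rms(wave):.4f}, "
          f"sine-assumed = {sine_assumed_rms(wave):.4f}")
```

On the clean sine wave both methods agree, but on the distorted waveform the sine-assumed reading is about 11% too high. Real server power supplies draw current that is somewhere in between, which is why we prefer a true RMS measurement.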



Virtualization Performance: Linux VMs on ESXi

We introduced our new vApus FOS (For Open Source) server workloads in our review of the Facebook "Open Compute" servers. In a nutshell, it is a mix of four VMs with open source workloads: two PhpBB websites (Apache2, MySQL), one OLAP MySQL "Community server 5.1.37" database, and one VM with VMware's open source groupware Zimbra 7.1.0. Zimbra is quite a complex application as it contains the following components:

  • Jetty, the web application server
  • Postfix, an open source mail transfer agent
  • OpenLDAP software, user authentication
  • MySQL, the database
  • Lucene, a full-featured text search engine
  • ClamAV, an anti-virus scanner
  • SpamAssassin, a mail filter
  • James/Sieve filtering (mail)

All VMs are based on a minimal CentOS 6 setup with VMware Tools installed. All our current virtualization testing is on top of the hypervisor which we know best: ESXi (5.0). We have changed two things in our vApusMark FOS setup: we upgraded the guest OS from CentOS 5.6 to 6.0 and increased the number of vCPUs of the OLAP VM from 2 to 4. This small upgrade means that our latest results should not be compared to the results in our older articles.

We (Tijl Deneut and myself) tested with four tiles (one tile = four VMs). Each tile needs nine vCPUs, so the test requires 36 vCPUs.
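To show what 36 vCPUs mean for each test system, the following Python sketch computes the vCPU-to-hardware-thread ratio. The thread counts are derived from the spec table earlier in this article (two sockets per server); the helper name is ours.

```python
# vCPU oversubscription: how many vCPUs compete for each hardware thread.

VCPUS_PER_TILE = 9
TILES = 4
TOTAL_VCPUS = VCPUS_PER_TILE * TILES   # 36 vCPUs, as stated in the text

# Hardware threads per dual-socket server (threads per CPU x 2 sockets).
HOST_THREADS = {
    "2x Xeon E5-2690/2660": 2 * 16,
    "2x Opteron 6276":      2 * 16,
    "2x Opteron 6174":      2 * 12,
    "2x Xeon X5650":        2 * 12,
}

def oversubscription(total_vcpus, hardware_threads):
    """Ratio of scheduled vCPUs to available hardware threads."""
    return total_vcpus / hardware_threads

for host, threads in HOST_THREADS.items():
    ratio = oversubscription(TOTAL_VCPUS, threads)
    print(f"{host:22s} {threads} threads -> {ratio:.3f} vCPUs per thread")
```

With four tiles, the 32-thread servers are only slightly oversubscribed (1.125 vCPUs per thread), while the 24-thread servers have to juggle 1.5 vCPUs per thread.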

vApusMark FOS

The benchmark above measures throughput. As for response times, let's take a look at the table below, which gives you the average response time per VM:

vApus FOS Average Response Times (ms), lower is better!

| CPU | PhpBB1 | PhpBB2 | MySQL OLAP | Zimbra |
| --- | --- | --- | --- | --- |
| AMD Opteron 6276 2.3 GHz | 671 | 514 | 1410 | 758 |
| AMD Opteron 6174 2.2 GHz | 674 | 524 | 1210 | 861 |
| Intel Xeon E5-2660 2.2 GHz | 645 | 394 | 160 | 631 |
| Intel Xeon E5-2690 2.9 GHz | 362 | 288 | 40 | 483 |
| Intel Xeon X5650 2.66 GHz | 745 | 569 | 821 | 866 |

Considering that we may assume that the Xeon E5-2690 consumes considerably more than the E5-2660, it looks like the Xeon E5-2660 is the new virtualization champ. Let us check out the power consumption numbers under a realistic load.



Measuring Real-World Power Consumption

The Equal Workload (EWL) version of vApus FOS is very similar to our previous vApus Mark II "Real-world Power" test. To create a real-world "equal workload" scenario, we throttle the number of users in each VM to a point where you typically get somewhere between 20% and 80% CPU load on a modern dual CPU server. The number of requests is the same for each system, hence "equal workload". The CPU load is typically around 30-50%, with peaks up to 65% (for more info see here). At the end of the test, the load drops to a low 10%, which is ideal for the machine to boost to higher CPU clocks (Turbo) and race to idle.

We used the "Balanced" power policy and enabled C-states as the current ESXi settings make poor use of the C6 capabilities of the latest Opterons and Xeons.

First let's check out the response times.

vApus FOS Response times (ms)

| CPU | PhpBB1 | PhpBB2 | MySQL OLAP | Zimbra |
| --- | --- | --- | --- | --- |
| AMD Opteron 6276 | 101 | 30 | 3.8 | 41 |
| AMD Opteron 6174 | 118 | 41 | 3.8 | 45 |
| Intel Xeon X5650 | 45 | 18 | 2.4 | 29 |
| Intel Xeon E5-2660 | 41 | 18 | 2.5 | 25 |
| Intel Xeon E5-2690 | 27 | 14 | 2.3 | 23 |

It's worth noting that enabling the C-states in ESXi improves the performance/watt ratio of the Opteron 6276 quite a bit. Not only is the power consumption lower (see below), but enabling C6 allows higher turbo clocks, which in turn benefits response times. Compared to our previous test (standard out of the box "Balanced") all response times improve by 10% except for MySQL (which is already very low).

Even that improvement, however, is not enough to beat the Xeon E5. The Xeon E5 delivers extremely low response times...

vApus FOS EWL Power consumption

... while sipping very little power, despite being run inside a feature rich server. Kudos to Intel for a job very well done.



SQL Server 2008 Enterprise R2

We have been using the Flemish/Dutch Web 2.0 website Nieuws.be as a benchmark for some time. 99% of the loads on the database are selects and about 5% of them are stored procedures. You can find a more detailed description here.

We have improved our testing methodology (read more about it here) and updated the SQL Server, so the results are only comparable to our last Opteron 6276 review (and not to older ones).

MS SQL Server 2008

Since performance/watt is an extremely important metric, we follow up with a power measurement:

MS SQL Server 2008

The Xeon E5-2690 is by far the fastest in this discipline, but the difference in power consumption compared to the rest of the pack is significant. The Xeon E5-2690 needs 140W more than its slower brother, the 95W TDP Xeon E5-2660. That is 70W extra per CPU. This clearly indicates that the fastest Xeon is running closer to its TDP than the 2.2 GHz version. The Xeon E5-2660 offers more than 20% better performance per Watt than the 135W TDP Xeon.

The Xeon E5-2660 is especially impressive if you compare it with the older Xeon. Despite the lower clockspeed, the new Xeon is capable of outperforming the Xeon 5650 by 30%.

Clock for clock, core for core the Xeon E5 is 23% more efficient at SQL Server workloads than its older brother. Considering that it is pretty hard to extract higher IPC out of server workloads, we can say that the Sandy-Bridge architecture is a winner when it comes to SQL databases.

Finally, let's check out the response times with 600 users sending off a query every second (on average):

MS SQL Server 2008

Response times are more or less linear (and low!) when the server is not yet saturated. Once the server is closer to or over its maximum throughput, response times tend to increase almost exponentially. Since the Xeon E5-2690 is capable of sustaining more than 600 users, it can still offer a very low response time. The other CPUs are saturated at this point.
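The way response times explode near saturation can be illustrated with a very simple queueing model. The Python sketch below uses the textbook M/M/1 formula (one server, random arrivals), which is a deliberate oversimplification of a multi-core database server and not how our benchmark works. The capacity of 700 queries per second is a made-up number chosen only to show the shape of the curve.

```python
# Toy M/M/1 queue: mean response time R = 1 / (capacity - arrival rate).
# Assumption: a single-queue, single-server model; real servers are more complex.

def mm1_response_time_ms(arrival_rate_per_s, capacity_per_s):
    """Mean response time in milliseconds, or infinity once saturated."""
    if arrival_rate_per_s >= capacity_per_s:
        return float("inf")
    return 1000.0 / (capacity_per_s - arrival_rate_per_s)

HYPOTHETICAL_CAPACITY = 700.0   # queries per second, illustrative only

for users in (125, 300, 500, 600, 650, 690, 700):
    # Each user sends one query per second on average, so users == arrival rate.
    r = mm1_response_time_ms(users, HYPOTHETICAL_CAPACITY)
    print(f"{users:4d} users -> {r:8.2f} ms")
```

In this model the response time barely moves between 125 and 500 users, then shoots up as the load approaches the capacity. A CPU with a bit more headroom at 600 users therefore shows a disproportionately lower response time.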

But as we pointed out in our previous article, server benchmarks at 100% are just one datapoint and we should test at lower concurrencies as well. Most people try to make sure that their database server almost never runs at 100% CPU load.



Since you can save quite a bit of power when running at 50% CPU load and lower by enabling the "Balanced" power policy, we test our medium load (125 users) benchmark with both the "Balanced" and the "High Performance" settings.

MS SQL Server 2008

No real surprises, besides a small one: the Xeon 5650 manages to keep up with the best Xeon E5. The Xeon E5 seems to favor the lower p-states in the "Balanced" mode, as the response times double compared to high performance mode. In the case of the Xeon E5, this is not really a problem: a 2.2 GHz Xeon E5 still manages to respond as fast as a 3 GHz Opteron.

MS SQL Server 2008

Despite the fact that our server was equipped with lots of expansion capabilities, the Xeon E5 manages to keep the power consumption very low. Even the 135W TDP Xeon E5-2690 consumes 6% less than the previous generation of 95W Xeons and up to 27% less than the Opterons with the balanced power policy. The new Xeon E5s offer an unbeatable performance/watt ratio when running SQL databases.



Rendering Performance: Cinebench

Cinebench, based on MAXON's CINEMA 4D software, is probably one of the most popular benchmarks around as it is pretty easy to perform this benchmark on your own home machine. The benchmark supports 64 threads, more than enough for our 24- and 32-thread test servers. First we tested single-threaded performance, to evaluate the performance of each core.

Cinebench 11.5 Single threaded

Cinebench achieves an IPC between 1.4 and 1.8 and is mostly dominated by SSE2 code. The Sandy Bridge core offers about 33% better single-threaded SSE performance. We checked: the 33% can be split up into 21% gains from architectural improvements and 12% from the improved Turbo capabilities.

Let's check out the multi-threaded score.

Cinebench R11.5

Prior to the launch of the Xeon E5 series, the Opteron 6276 offered a better performance per dollar ratio than comparable Xeon 5600s due to their similar performance at a much lower pricepoint. Now that the Xeon E5 has arrived, the tables have turned. If Xeon E5 servers are in the same price range as Xeon 5600 servers, the Xeon E5-2630 will offer the best performance/price ratio.

And if you want top performance, Intel is the only option. Case in point: a dual Xeon E5-2690 comes close to what a Quad Opteron 6276 can deliver, with the dual Xeon scoring 24.7 while the quad Opteron scores 26.4.



Rendering: Blender 2.6.0

Blender is a very popular open source renderer with a large and active community. We tested the 64-bit Windows edition, using version 2.6.0a. If you like, you can perform this benchmark very easily too. We used the metallic robot, a scene with rather complex lighting (reflections) and raytracing. Furthermore, to make the benchmark more repeatable, we changed the following parameters:

  1. The resolution was set to 2560x1600
  2. Antialiasing was set to 16
  3. We disabled compositing in post processing
  4. Tiles were set to 16x16 (X=16, Y=16)
  5. Threads was set to auto (one thread per CPU is set).

As we have explained, the current 24 and 32 core CPUs benefit from using a much larger number of tiles than we have previously used (64, 8x8). That is why we raised the number of tiles to 256 (16x16), though all CPUs perform better at this setting.

To make the results easier to read, we again converted the reported render time into images rendered per hour, so higher is better.

Blender 2.6.0

Blender is Xeon territory for sure, as Blender mostly runs in the L1 and L2 cache. Therefore an E5-2630 (2.3 GHz, 15 MB L3, $612) will probably perform about 4% faster than the six-core Xeon E5-2660 in this test. Our six-core Xeon E5-2660 is about 26% faster than the best Opteron. We estimate that the Xeon E5-2630 will offer more or less the same performance at an almost 30% lower pricepoint than the Opteron 6276. Whether you have a lot or little to spend, the Xeon E5 is your best bet for Blender.

Rendering Performance: 3DSMax 2012

As requested, we're reintroducing our 3DS Max benchmark. We used the "architecture" scene which is included in the SPEC APC 3DS Max test. As the Scanline renderer is limited to 16 threads, we're using the iray render engine, which is basically a self-configuring Mental Ray render engine.

We rendered at 720p (1280x720) resolution. We measured the time it takes to render 10 frames (from 20 to 29) with SSE enabled. We recorded the time and then calculated (3600 seconds * 10 frames / time recorded) how many frames a certain CPU configuration could render in one hour. All results are reported as rendered images per hour; higher is thus better. We used the 64-bit version of 3ds Max 2012 on 64-bit Windows 2008 R2 SP1.
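The conversion from render time to images per hour is simple enough to express in a few lines of Python. The function name and the example render times below are hypothetical; only the formula (3600 seconds * frames / time recorded) comes from our methodology.

```python
# Convert a measured render time into frames (images) rendered per hour.

def frames_per_hour(render_seconds, frames=10):
    """3600 seconds * number of frames / recorded time in seconds."""
    if render_seconds <= 0:
        raise ValueError("render time must be positive")
    return 3600.0 * frames / render_seconds

# Hypothetical example: 10 frames rendered in 12 minutes.
print(frames_per_hour(720))            # 50.0 frames per hour

# The same formula covers a single-image render (frames=1),
# for example a hypothetical 90 second render.
print(frames_per_hour(90, frames=1))   # 40.0 images per hour
```

Reporting a rate instead of a time means higher is better, which keeps the rendering charts consistent with the throughput charts elsewhere in this review.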

3DSMax 2012 Architecture

Even with the advanced iray renderer, 3DS Max rendering reaches our scaling limits. The 32-thread Xeons do not come close to 100% CPU load (more like 90%) and in between the frames there are small periods of single threaded processing. Amdahl's law is the most likely reason here. We suspect that highly clocked lower core count models can pass the 53 fps barrier we're seeing here.
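For readers who want to play with Amdahl's law themselves, here is a minimal Python sketch. The 95% parallel fraction is an assumption picked for illustration; we did not measure the actual serial portion of the 3DS Max workload.

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n)
# p = fraction of the work that runs in parallel, n = number of threads.

def amdahl_speedup(parallel_fraction, threads):
    """Theoretical speedup over a single thread."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)

ASSUMED_PARALLEL_FRACTION = 0.95   # hypothetical, not measured

for threads in (8, 12, 16, 24, 32):
    speedup = amdahl_speedup(ASSUMED_PARALLEL_FRACTION, threads)
    print(f"{threads:2d} threads -> {speedup:5.2f}x")
```

Even with only 5% serial work, going from 16 to 32 threads adds roughly 37% instead of 100% in this model. That is why a higher clocked CPU with fewer cores can come out ahead once the serial parts start to dominate.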



LS-DYNA

LS-DYNA is a "general purpose structural and fluid analysis simulation software package capable of simulating complex real world problems", developed by the Livermore Software Technology Corporation (LSTC). It is used by the automobile, aerospace, construction, military, manufacturing and bioengineering industries. Even simple simulations take hours to complete, so even a small performance increase results in tangible savings. In addition, many of our readers have been asking for benchmarks with HPC workloads, which is reason enough to include our own LS-DYNA benchmarking.

These numbers are not directly comparable with AMD's and Intel's benchmarks, as we did not perform any special tuning beyond using the message passing interface (MPI) version of LS-DYNA (ls971_mpp_hpmpi) to run the LS-DYNA solver with maximum scalability. This is the HP-MPI version of LS-DYNA 9.71.

Our first test is a refined revised Neon crash test simulation.

LS-Dyna Neon-Refined Revised

This is one of the few benchmarks (besides SAP) where the Opteron 6276 outperforms the older Opteron 6174 by a tangible margin (about 20% faster) and is significantly faster than the Xeon 5600, by 40% to be more precise. However, the direct competitor of the 6276, the Xeon E5-2630, will do a bit better (see the E5-2660 6C score). When you are aiming for the best performance, it is impossible to beat the best Xeons: the Xeon E5-2660 offers 26% better performance, the 2690 is 46% faster. It is interesting to note that LS-Dyna does not scale well with clockspeed: the 32% higher clockspeed of the Xeon E5-2690 results in only a 15% speed increase.

A few other interesting things to note: we saw only a very small performance increase (+5%) due to Hyperthreading. Memory bandwidth does not seem to be critical either, as performance increased by only 6% when we replaced DDR3-1333 with DDR3-1600. If LS-Dyna was bottlenecked severely by the memory speed we should have seen a performance increase close to 20% (1600 vs 1333).
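The scaling observations above boil down to comparing a performance gain with the resource increase that caused it. A minimal Python sketch, using only the percentages quoted in the text (the helper name is ours):

```python
# How much of a resource increase (clock or memory speed) shows up as performance?

def scaling_efficiency(perf_gain, resource_gain):
    """1.0 means perfect scaling, 0.0 means the extra resource did nothing."""
    return perf_gain / resource_gain

# Clockspeed: Xeon E5-2690 (2.9 GHz) vs Xeon E5-2660 (2.2 GHz), +15% performance.
clock_gain = 2.9 / 2.2 - 1            # about 0.32
clock_eff = scaling_efficiency(0.15, clock_gain)

# Memory speed: DDR3-1600 vs DDR3-1333, +6% performance.
memory_gain = 1600 / 1333 - 1         # about 0.20
memory_eff = scaling_efficiency(0.06, memory_gain)

print(f"Clock:  +{clock_gain:.0%} resource -> efficiency {clock_eff:.2f}")
print(f"Memory: +{memory_gain:.0%} resource -> efficiency {memory_eff:.2f}")
```

Less than half of the extra clockspeed and less than a third of the extra memory speed turn into performance. That suggests neither factor alone is the main bottleneck in this LS-DYNA test.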

CMT boosted the Opteron 6276's performance by up to 33%, which seems weird at first since LS-DYNA is a typical floating point intensive application. As the shared floating point unit "outsources" loads and stores to the integer cores, the most logical explanation is that LS-DYNA is limited by the load/store bandwidth. This is in sharp contrast with, for example, 3DS Max, where the additional overhead of 16 extra threads slowed the shared FP down instead of speeding it up.

Also, the CPUs from both vendors seem to have made good use of their turbo capabilities. The AMD Opteron was running at 2.6 GHz most of the time, the Xeon 2690 at 3.3 GHz and the Xeon 2660 at 2.6 GHz.

The second test is the "Three Vehicle Collision Test" simulation, which runs a lot longer.

LS-Dyna Three Vehicle Collision Test

The three vehicle collision test does not change the benchmarking picture; it confirms our earlier findings. The Opteron Interlagos does well, but the Xeon E5 is the new HPC champion.



TrueCrypt 7.1 Benchmark

TrueCrypt is a software application used for on-the-fly encryption (OTFE). It is free, open source and offers full AES-NI support. The application also features a built-in encryption benchmark that we can use to measure CPU performance. First we test with the AES algorithm (256-bit key, symmetric).

TrueCrypt AES

Core for core, clock for clock, the Xeon E5 - which also supports AES-NI - is about 30% faster than the best Opteron (Xeon E5-2660 vs Opteron 6276). At a similar price point (Opteron 6276 vs Xeon E5-2660 6C), however, the Opteron and Xeon E5 perform more or less the same, with a small advantage for the latter.

We also test with the heaviest combination of the cascaded algorithms available: Serpent-Twofish-AES.

TrueCrypt AES-Twofish-Serpent

The combination benchmark is limited by the slowest algorithms: Twofish and Serpent. This is one of the few benchmarks where the Opteron 6276 is able to keep up with the Xeon E5.

It is important to realize that these benchmarks are not real-world but rather are synthetic. It would be better to test a website that does some encrypting in the background or a fileserver with encrypted partitions. In that case the encryption software is only a small part of the total code being run. A large performance (dis)advantage might translate into a much smaller performance (dis)advantage in that real-world situation. For example, eight times faster encryption resulted in a website with 23% higher throughput and a 40% faster file encryption (see here).
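To illustrate why a large encryption speedup shrinks at the application level, here is a minimal sketch based on Amdahl's law. Using Amdahl's law is our assumption; the tests referenced above do not state a model. We also read "23% higher throughput" and "40% faster" as overall speedups of 1.23x and 1.40x. The function names are hypothetical.

```python
def amdahl_speedup(fraction: float, component_speedup: float) -> float:
    """Overall speedup when `fraction` of the run time gets `component_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def implied_fraction(component_speedup: float, overall: float) -> float:
    """Invert Amdahl's law: the fraction of run time the accelerated part must have used.

    overall = 1 / ((1 - p) + p / s)  =>  p = (1 - 1/overall) / (1 - 1/s)
    """
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / component_speedup)


# Figures quoted in the text: encryption is 8x faster in both cases.
website_fraction = implied_fraction(8.0, 1.23)  # website throughput +23%
file_fraction = implied_fraction(8.0, 1.40)     # file encryption +40% (assumed to mean 1.40x)

print(f"Website: encryption was about {website_fraction:.0%} of the run time")
print(f"File encryption: encryption was about {file_fraction:.0%} of the run time")
```

Under this model, encryption accounted for only about a fifth of the website's run time and about a third of the file encryption job, so even an eightfold speedup of that part yields a modest overall gain.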

7-Zip 9.2

7-Zip is a file archiver with a high compression ratio. It is open source software, with most of the source code available under the GNU LGPL license.

7-zip

Compression is more CPU intensive than decompression, while the latter depends a little more on memory bandwidth. When it comes to load/stores and memory bandwidth, the Xeon E5-2660 is about 13% faster than AMD's flagship. Compression performance is partly determined by the quality of the branch predictor. The new and improved Sandy Bridge branch predictor is one of the reasons why a 2.2 GHz 6-core 2660 is able to keep up with a 2.93 GHz (!) Xeon 5670, which is also a six-core processor. The Opterons get blown away in the compression benchmark: each core of the Xeon E5 is about twice as efficient in this task. The overall winner is thus once again the Xeon E5.



Conclusions

Our conclusion about the Xeon E5-2690 2.9 GHz is short and simple: it is the fastest server CPU you can get in a reasonably priced server and it blows the competition and the previous Xeon generation away. If performance is your first and foremost priority, this is the CPU to get. It consumes a lot of power if you push it to its limits, but make no mistake: this beast sips little energy when running at low and medium loads. The price tag is the only real disadvantage. In many cases this pricetag will be dwarfed by other IT costs. It is simply a top notch processor, no doubt about it.

For those that prioritize performance/watt or performance/dollar, we've summarized our findings in a comparison table. We made 3 columns for easy comparison:

  • In the first column, we compare Intel's newest generation with the previous one. We compare the CPUs with midrange TDP (95W).
  • In the second column, we compare Intel's and AMD's midrange offerings.
  • In the third column, we compare CPUs at a similar price point, as we believe that a six-core E5-2660 will be very close to the performance of the 2.3 GHz Xeon E5-2630.

We also group our benchmarks in different software groups and indicate the importance of this software group in the server market (we motivated this here).

| Software: importance in the market | Xeon E5-2660 vs Xeon X5650 | Xeon E5-2660 vs Opteron 6276 | Xeon E5-2660 6C vs Opteron 6276 |
|---|---|---|---|
| Virtualisation: 20-50% | | | |
| ESXi + Linux | +40% | +40% | +7% |
| OLAP Databases: 10-15% | | | |
| MS SQL Server 2008 R2 | +30% | +34% | +8% |
| HPC: 5-7% | | | |
| LS Dyna | +77% | +26% | +15% |
| Rendering software: 2-3% | | | |
| Cinebench | +50% | +37% | +9% |
| 3DS Max 2012 (iRay) | 2% | +12% | +18% |
| Blender | +9% | +32% | +26% |
| Other: N/A | | | |
| Encryption/Decryption AES | +42/41% | +38/32% | +8/4% |
| Encryption/Decryption Twofish/Serpent | +37/49% | +5/2% | -19/-19% |
| Compression/decompression | +35/37% | +105/13% | +66/-11% |

It is pretty amazing that with the exception of two rendering applications with relatively mediocre scaling, the new Xeon is able to outperform the previous Xeons by a large margin (from 30% up to 60%) in a wide range of applications. All that performance comes with lower energy consumption and a very fast I/O interface. Whether you want high performance per dollar or performance per watt, the Xeon E5-2660 is simply a home run. End of story.

For those who are more price sensitive, the Xeon E5-2630 costs less than the Opteron 6276 and performs (very likely) better in every real world situation we could test.

And what about the Opteron? Unless the actual Xeon-E5 servers are much more expensive than expected, it looks like it will be hard to recommend the current Opteron 6200. However if Xeon E5 servers end up being quite a bit more expensive than similar Xeon 5600 servers, the Opteron 6200 might still have a chance as a low end virtualization server. After all, quite a few virtualization servers are bottlenecked by memory capacity and not by raw processing power. The Opteron can then leverage the fact that it can offer the same memory capacity at a lower price point.

The Opteron might also have a role in the low end, price sensitive HPC market, where it still performs very well. It won't have much of a chance in the high end clustered market, as Intel has the faster and more power efficient PCIe interface.

Ultimately, our hope for stiffer competition lies with the newest Opteron "Abu Dhabi", which is based upon the "Piledriver" core. The new Opteron was after all made to operate at 3 GHz and higher clockspeeds, as opposed to the meager 2.3/2.6 GHz we have seen so far. Apparently AMD will not only be able to boost IPC a bit (by 10% or more), but they may also significantly boost the clockspeed, as we have learned from this ISSCC paper: "AMD’s 4+ GHz x86-64 core code-named “Piledriver” employs resonant clocking to reduce clock distribution power up to 24% while maintaining a low clock-skew target."

This should allow AMD to get higher clockspeeds within the same power envelope. Until then, it is the Xeon E5-2600 that rules the server world.
