Original Link: https://www.anandtech.com/show/2118



Introduction

There has been a relentless assault on the Server CPU market. How else could you describe Intel's impressive amount of new server CPU launches, and aggressive pricing during the past several months? At the end of May 2006, Intel released the dual core Xeon DP 5080 "Dempsey" at 3.73 GHz, still based on the same architecture as the latest Pentium 4 ("Presler"). As shown by the Dell DVD store benchmark, Dempsey made the performance gap with the best AMD Opterons smaller, but it still wasn't very competitive in the performance/Watt league.

Only one month later, we reviewed a new Xeon DP 5160 based on Intel's brand spanking new Core architecture, codenamed Woodcrest. With the exception of SSL Encryption and the MySQL database tests, the new Xeon DP simply annihilated the competition. Our most recent data shows that the Xeon 5160 outperforms the best Opteron (2.8 GHz) by 10% (MySQL) to 60% (LAMP), while presenting 33% lower TDP numbers (80W versus 119W, 65W versus 95W). AMD launched the new Socket F in August, but the current Opterons are not capable of extracting higher performance out of the faster DDR2 DIMMs, leaving AMD no other option than severely reducing the price of their server CPU flagship in the dual socket market.


The Xeon MP machine on top of the HP DL585 in our rack...
but can it really overpower the quad Opteron?

But Intel wasn't satisfied. The lucrative 4 socket market was and is still dominated by the 8xx Opteron, which managed to capture up to 50% of the market share in only a few years. In September the 3.4 GHz Xeon MP 7140M, codename Tulsa, was born. With up to 16 MB of L3-cache, can the new Xeon MP stop AMD's Quad Opteron from grabbing even more market share? Or do we have to wait for Tigerton to arrive? Let us find out....



The Xeon 70xx

Tulsa, or the Xeon MP 71xx, is the last Mohican of the "NetBurst / Pentium 4" tribe. It is the successor of the Xeon MP 70xx, also known as the infamous Paxville CPU. The Xeon MP 70xx was one of the worst CPUs in history from a performance/Watt view. The max TDP of Paxville was no less than 173W, and the CPU was limited to "only" 3 GHz, which is low for a NetBurst CPU as NetBurst CPUs were initially built for 4 GHz and more. According to Intel's own graphs, the fastest Opteron beats the best Xeon MP 7041 by no less than 30% in integer benchmarks....


... and by no less than 76% in Java Server benchmarks!


Needless to say, the Xeon 70xx is and was a small disaster and one of the reasons why AMD's Opteron gained so much support so quickly. With that kind of heritage, the expectations for the Xeon MP 71xx, aka Tulsa, are not high. Is Tulsa yet another power gobbling CPU which can't outperform the competition? Although the CPU is sitting completely in the shadow of Intel's newest Core based Xeons, Intel engineering did spend a lot of time on trying to make the last NetBurst CPU perform well and consume less.

Tulsa is a dual core Xeon built on Intel's very successful 65 nm process. It is a true dual core, with both cores sharing some control logic and a large L3 cache which can be 4, 8 or 16 MB in size. Tulsa can scale up to 3.4 GHz, but we tested the more affordable 3.2 GHz version with 8 MB cache.


The Tulsa Die

The biggest Tulsa die weighs in at 435 mm², a result of containing 1.3 Billion transistors. By using slower but 3 times less "leaky" transistors, and letting the parts of the caches that are not accessed "sleep", the caches consume less than 1 W/MB. Tulsa can be used as an upgrade for Paxville and uses the same "Truland platform" with the Twin Castle chipset. If that sounds like gibberish, the Truland platform has been tested and explained here at AnandTech by Jason.


Two independent 800 MHz FSBs give each pair of sockets (4 cores) a 6.4GB/s pipe to the Northbridge. By using four XMBs (eXternal Memory Bridges), capacity and bandwidth are maximized. The XMBs find a place on a hot swappable memory board, and each XMB drives 4 memory slots. Below you can see the memory board; the XMB is under the heatsink.
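The 6.4GB/s figure follows from simple bus arithmetic. The minimal sketch below assumes a 64-bit (8-byte) wide front-side bus data path, which the article does not state explicitly; the variable names are purely illustrative.

```python
# Peak front-side bus bandwidth = transfers per second x bytes per transfer.
# Assumption: a 64-bit (8-byte) data path, not stated in the article.
FSB_MEGATRANSFERS_PER_S = 800   # "800 MHz" effective rate (200 MHz quad-pumped)
BUS_WIDTH_BYTES = 8             # assumed 64-bit data bus


def fsb_bandwidth_gb_s(megatransfers_per_s: float, width_bytes: int) -> float:
    """Return peak bus bandwidth in GB/s (1 GB = 1000 MB)."""
    return megatransfers_per_s * width_bytes / 1000


per_bus = fsb_bandwidth_gb_s(FSB_MEGATRANSFERS_PER_S, BUS_WIDTH_BYTES)
per_core = per_bus / 4          # two dual-core sockets share one bus

print(f"per bus:  {per_bus:.1f} GB/s")
print(f"per core: {per_core:.1f} GB/s")
```

Under these assumptions, each of the four cores on a bus gets at best a quarter of the 6.4GB/s when all of them are busy.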


The big performance booster is Tulsa's L3 cache. Tulsa's massive L3 is protected by Pellston technology. As caches get bigger, the possibility of getting a data error also increases. Pellston can disable a faulty cache line (128 byte) during BIOS initialization when all cache lines are checked, or it can even do so while the CPU is processing. The Pellston technology is in fact an algorithm that checks if a cache line error is the result of a hard error or a soft error. The actual "checking" whether a cache line is bad or not is done by an ECC algorithm on the 32 ECC bits which protect the L3 cache lines. In other words, Pellston makes the ECC-protected cache a little smarter, allowing it to act on ECC errors rather than only reporting ECC errors.
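To get a feel for the scale Pellston works at, the sketch below counts the 128-byte lines in each L3 size and the storage overhead of the ECC bits. It assumes that the 32 ECC bits apply to each 128-byte line and that 1 MB means 2^20 bytes; neither detail is spelled out above, so treat the output as a rough estimate.

```python
LINE_BYTES = 128         # cache line size that Pellston can disable
ECC_BITS_PER_LINE = 32   # assumption: the 32 ECC bits are per cache line


def l3_lines(cache_mb: int) -> int:
    """Number of cache lines in an L3 of the given size (1 MB = 2**20 bytes)."""
    return cache_mb * 1024 * 1024 // LINE_BYTES


def ecc_overhead() -> float:
    """ECC bits as a fraction of the data bits they protect."""
    return ECC_BITS_PER_LINE / (LINE_BYTES * 8)


for size_mb in (4, 8, 16):
    print(f"{size_mb:2d} MB L3: {l3_lines(size_mb):6d} lines")
print(f"ECC overhead: {ecc_overhead():.3%}")
```

With well over a hundred thousand lines in the largest cache, losing a handful of disabled lines costs practically no capacity.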

The L3 cache is inclusive: it also contains the contents of the L2-cache. Thanks to the shared and inclusive nature of the L3-cache, coherency traffic between the four CPUs is significantly reduced. Too much coherency traffic can slow down multithreaded applications that share variables among their threads, such as OLTP databases and web servers.

So higher clock speeds, the newer 65 nm process, much less leaky transistors, and an extra shared L3 should allow the Xeon 71xx "Tulsa" to perform much better than the Xeon 70xx "Paxville" and consume quite a bit less. Considering that Xeon 71xx has a TDP of 95W at 3 GHz while the Xeon 70xx needed 165 W at the same speed, it appears that Intel engineers have been very successful in reducing power consumption.


Intel's own benchmarks indicate 42% higher Integer throughput while the clock speed has increased by 13%. The most spectacular graph is the SPECjbb one: according to Intel, the Xeon 7140 is no less than 2.5 times faster than the old Xeon 7041. However, the benchmark is rather vague, as Intel does not reveal if the JVMs were completely the same. A different JVM can make a big difference. Tulsa also supports EM64T, the XD bit, HW Virtualization Technology and EIST as you can see from our BIOS setup screenshot.
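The quoted numbers can be separated into a clock speed effect and a per-clock effect. The sketch below only uses figures from the text (3.0 versus 3.4 GHz, 42% higher integer throughput, 165W versus 95W at 3 GHz); the assumption that throughput scales linearly with clock speed is ours and is optimistic for a NetBurst CPU.

```python
OLD_CLOCK_GHZ, NEW_CLOCK_GHZ = 3.0, 3.4   # Xeon MP 7041 vs Xeon MP 7140M
INT_THROUGHPUT_GAIN = 0.42                # Intel's claimed integer gain
OLD_TDP_W, NEW_TDP_W = 165, 95            # both at 3 GHz

# Part of the gain explained by the higher clock (assumes linear scaling).
clock_gain = NEW_CLOCK_GHZ / OLD_CLOCK_GHZ - 1

# What is left must come from the L3 cache and other per-clock improvements.
per_clock_gain = (1 + INT_THROUGHPUT_GAIN) / (1 + clock_gain) - 1

# Reduction in TDP at the same 3 GHz clock speed.
tdp_reduction = 1 - NEW_TDP_W / OLD_TDP_W

print(f"clock gain:     {clock_gain:.1%}")
print(f"per-clock gain: {per_clock_gain:.1%}")
print(f"TDP reduction:  {tdp_reduction:.1%}")
```

In other words, roughly a quarter more integer work per clock cycle would be needed to explain Intel's claim, on top of the clock speed bump.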



Intel SR6850HW4 Server

Our server was not a simple demo system, but a complete Intel Server system solution: the SR6850HW4 server. It is an enormous 6U rack/pedestal server, also available in a slimmer 4U form as the SR4850HW4. The main difference is that the 4U server has 5 (6U:10) disk bays, a bit less cooling, and smaller and slightly less powerful power supplies.


Below you can see the massive heatsinks used to cool the Xeon 71xx down.


As the new quad socket Xeon 71xx CPU fits in a 4U server, we can easily compare it with one of the most popular quad Opteron servers: the HP DL585.

HP ProLiant DL585 version 2006

The HP ProLiant DL585 available in the labs was not the recently introduced DL585G2 which features DDR2, the new AMD Opteron socket F and 2.5 inch SAS drives. It is a small evolution of the HP DL585 which we reviewed back in 2004. Back then, each of the four CPU boards had 8 memory slots and supported up to 16GB (8x2GB) of DDR-266, for a total of 64GB RAM. The original HP was able to use up to 48GB of DDR-333.

The latest DL585 can use 4GB DIMMs, allowing it to access no less than 128GB of DDR-266. It can also use 32GB of fast DDR-400 or 64GB of DDR-333, which is very impressive. Below you can find a schematic overview of the latest DL585 technology.


The Opteron keeps a few very important advantages over the Xeon MP, not the least of which is a very elegant platform design. The quad dual core configuration generates more cache coherency traffic, as the 8 cores of the Opteron have to keep 8 L2 caches coherent while the Xeon MP has to keep track of 4 L3 caches. However four 4GB/s full duplex point to point connections make this traffic flow very smoothly while each pair of Xeon MPs have to share a 6.4GB/s half-duplex bus. The Opterons also have a fast 4.8GB/s full duplex point-to-point connection to the I/O chips.

Supermicro SS6015B-8+ server

For the price of a 4U quad socket server you can get several dual socket 1U/2U servers. There are several reasons why you would still prefer to buy the more expensive 4U server: more disk bays, more RAS features, more full height expansion slots, and of course more performance. Depending on your needs, one of these factors might be the decisive one. In the case of the 4U servers reviewed here, direct attached storage will not be the deciding factor, as you can get 4 or 5 disk bays in a 1 or 2U server too. If performance is the critical factor, it is clear that we must include a 1U or 2U server to see how much more performance you gain if you choose the quad socket machine.

Enter the Supermicro SS6015B-8+, which is equipped with Supermicro's X7DBR-8+ dual Xeon "Bensley platform" server board, based on the Intel 5000P chipset. While it doesn't have the RAS features of the 4U machines (for example, it doesn't have redundant PSUs), it can compete on all other points. In terms of RAM capacity for example, the Supermicro motherboard has sixteen (!) 240-pin DIMM slots that can support up to 64GB of ECC DDR2-667/533 FB-DIMMs. You can also trade in some RAM capacity for RAS: the similar SS6015B-3RV features "only" 8 DIMM slots but two 650W PSUs. The SS6015B-8+ is also equipped with a SCSI or SATA backplane that offers 4 drive bays.


Due to the 1U form factor, you have to sacrifice a bit of expandability. One 133 MHz PCI-X slot can support two different riser cards, but not simultaneously. You have to choose between the PCIe riser card and the PCI-X one.

We are well aware that this particular Supermicro server is not a direct competitor. However, a similar 2U server like the 6025B will give almost the same performance numbers. The 6025B server offers 8 drive bays, redundant power supplies and 6 expansion slots. So we are basically using the SS6015B-8+ as a "performance reference". We will try to answer the question: when do four Xeon MPs or Opterons make sense, and how do they compare to the best dual Xeon available, the Xeon DP 5160?



Server CPUs overview

As the CPU is still one of the most important cost factors in a server, we want to give an overview of the currently available server CPUs. We'll start with the Intel CPUs.
The biggest advantage of Intel's newest Bensley platform is the longevity: the Dempsey, Woodcrest, and quad core Clovertown Xeon all use the same socket and "Bensley" platform. Even the successor of Clovertown, the 45nm Harpertown, is confirmed to be compatible with the Bensley platform.

Intel Xeon Overview
Intel CPU | Clock | Codename | L2 | L3 | FSB | Mem bandwidth | TDP | In test? | Price
Xeon MP 7140M | 3.4GHz | Tulsa | 2x1MB | 16MB | 200 MHz Quad | 6.4 GB/s | 150W | No | $1,980
Xeon MP 7130M | 3.2GHz | Tulsa | 2x1MB | 8MB | 200 MHz Quad | 6.4 GB/s | 150W | Yes | $1,391
Xeon MP 7120M | 3GHz | Tulsa | 2x1MB | 4MB | 200 MHz Quad | 6.4 GB/s | 95W | No | $1,117

Xeon DP 5160 | 3GHz | Woodcrest | 4MB | - | 333 MHz Quad | 21 GB/s | 80W | Yes | $851
Xeon DP 5150 | 2.66GHz | Woodcrest | 4MB | - | 333 MHz Quad | 21 GB/s | 65W | No | $690
Xeon DP 5148 | 2.33GHz | Woodcrest | 4MB | - | 333 MHz Quad | 21 GB/s | 40W | No | $519
Xeon DP 5140 | 2.33GHz | Woodcrest | 4MB | - | 333 MHz Quad | 21 GB/s | 65W | No | $455
Xeon DP 5130 | 2GHz | Woodcrest | 4MB | - | 333 MHz Quad | 21 GB/s | 65W | No | $316
Xeon DP 5120 | 1.86GHz | Woodcrest | 4MB | - | 266 MHz Quad | 17 GB/s | 65W | No | $256

Xeon DP 5080 | 3.73GHz | Dempsey | 2x2MB | - | 266 MHz Quad | 8.5 GB/s | 130W | No | $851
Xeon DP 5063 | 3.2GHz | Dempsey | 2x2MB | - | 266 MHz Quad | 8.5 GB/s | 95W | No | $369
Xeon DP 5060 | 3.2GHz | Dempsey | 2x2MB | - | 266 MHz Quad | 8.5 GB/s | 130W | No | $316

The Opteron CPU comes in two forms: one for DDR and one for DDR-2. The DDR-2 version uses four-digit model numbers, the DDR version three-digit ones.

AMD Opteron Overview
AMD CPU | Clock | Codename | L2 | L3 | HT | Mem bandwidth | TDP | In test? | Price
Opteron 8220 SE | 2.8GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 119W | No | $2,149
Opteron 8218 | 2.6GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 95W | No | $1,514
Opteron 8216 | 2.4GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 95W | No | $1,165
Opteron 8214 | 2.2GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 95W | No | $873
Opteron 8216 HE | 2.4GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 68W | No | $1,340

Opteron 885 | 2.6GHz | Egypt | 2x1MB | - | 1000 MHz DDR | 6.4 GB/s | 95W | No | $1,514
Opteron 880 | 2.4GHz | Egypt | 2x1MB | - | 1000 MHz DDR | 6.4 GB/s | 95W | Yes | $1,165
Opteron 875 | 2.2GHz | Egypt | 2x1MB | - | 1000 MHz DDR | 6.4 GB/s | 95W | No | $873
Opteron 875 HE | 2.2GHz | Egypt | 2x1MB | - | 1000 MHz DDR | 6.4 GB/s | 55W | No | $1,514

Opteron 2220 SE | 2.8GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 119W | No | $786
Opteron 2218 | 2.6GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 95W | No | $611
Opteron 2216 | 2.4GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 95W | No | $450
Opteron 2214 | 2.2GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 95W | No | $377
Opteron 2216 HE | 2.4GHz | Santa Rosa | 2x1MB | - | 1000 MHz DDR | 8.5 GB/s | 68W | No | $531

Opteron 285 | 2.6GHz | Italy | 2x1MB | - | 1000 MHz DDR | 6.4 GB/s | 95W | No | $611
Opteron 280 | 2.4GHz | Italy | 2x1MB | - | 1000 MHz DDR | 6.4 GB/s | 95W | No | $450
Opteron 275 | 2.2GHz | Italy | 2x1MB | - | 1000 MHz DDR | 6.4 GB/s | 95W | No | $377
Opteron 275 HE | 2.2GHz | Italy | 2x1MB | - | 1000 MHz DDR | 6.4 GB/s | 55W | Yes | $611

The Opteron's TDP numbers are the maximum power consumption numbers, while Intel's numbers are "thermal solution design targets". In practice, this means that you should subtract about 5% from AMD's TDP numbers to compare the two brands. AMD is not doing too well in the dual CPU arena: it needs about 30W more power per dual core CPU and CPU clock speed has hardly increased the past two years. Luckily for AMD, the power disadvantage is negated by the use of FB-DIMMs instead of DDR2 on the Intel platform.
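As a rough illustration of that adjustment, the sketch below applies a flat 5% reduction to AMD's TDP before comparing it with Intel's figure. Treating "about 5%" as an exact 0.95 factor is our simplification, and the pairings are simply the TDP pairs quoted earlier in the article.

```python
# Assumption: "about 5%" is applied as a flat 0.95 factor to AMD's TDP.
AMD_TDP_SCALE = 0.95


def comparable_amd_tdp(amd_tdp_w: float) -> float:
    """Scale an AMD maximum-power TDP so it can be set against an Intel TDP."""
    return amd_tdp_w * AMD_TDP_SCALE


# (AMD TDP, Intel TDP) in watts, as paired in the introduction.
pairs = {
    "119W Opteron vs 80W Xeon DP": (119, 80),
    "95W Opteron vs 65W Xeon DP": (95, 65),
}

gaps = {
    name: comparable_amd_tdp(amd) - intel
    for name, (amd, intel) in pairs.items()
}

for name, gap in gaps.items():
    print(f"{name}: AMD needs about {gap:.0f}W more")
```

Both pairs land in the neighbourhood of the "about 30W" quoted above.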



Words of thanks

A lot of people gave us assistance with this project, and we would of course like to thank them.

Trevor E. Lawless, Intel US
Matty Bakkeren, Intel Netherlands
Markus Weingartner, Intel Germany
(www.intel.com)

Damon Muzny, AMD US
(www.amd.com)

Angela Rosario, Supermicro US
Michael Kalodrich, Supermicro US
Peter Yang, Supermicro US
(http://www.supermicro.com)

Peter Zaitsev, Elite MySQL Guru
(www.mysql.com)

Bob Cramblitt and Larry D. Gray
(www.spec.org)

Brecht Kets, MySQL patching and tuning
Pieter Beel, SPECjbb benchmarking
Anja Gheldof, MySQL benchmarking
Tijl Deneut, Linux support

Benchmark configuration

In case you're wondering why we chose to use the fastest Xeon DP, the second fastest Xeon MP, and the second fastest Opteron, the reason is simple: those were the CPUs that were made available to us. As always, both AMD and Intel were contacted for this test. If a manufacturer has questions about any of our benchmarks, they are discussed, and if necessary the manufacturer is allowed to log in to our servers and monitor our benchmarking. This allows us to use our own benchmarks and not only industry standard benchmarks, which easily fall victim to "extreme" and in some cases "non-realistic" tuning....

Hardware configurations

Here is the list of the different configurations:

Xeon Server 1: Dual Xeon DP Supermicro SS6015B-8+
Dual Xeon DP 5160 3 GHz
Intel 5000P chipset
Supermicro's X7DBR-8+
8GB (8x1024 MB) Micron FB-DIMM Registered DDR-II 533 MHz CAS 4, ECC enabled
NIC: Dual Intel PRO/1000 Server NIC
2 Seagate Cheetah 73GB - 15000 rpm - SCSI 320 MB/s

Xeon Server 2: Quad Xeon MP Intel SR6850HW4
Quad Xeon MP 7130M 3.2 GHz 8MB L3
Intel 8501 chipset
16GB (8x2048 MB) Micron Registered DDR-II PC2-3200R, 400 MHz CAS 3, ECC enabled
NIC: Dual Intel PRO/1000 Server NIC
2 Seagate Cheetah 73GB - 15000 rpm - SCSI 320 MB/s

Opteron Server 1: Quad Opteron HP DL585
Quad Opteron 880 2.4 GHz
AMD8000 Chipset
16GB (16x1024 MB) Crucial DDR333 CAS 2.5, ECC enabled
NIC: NC7782 Dual PCI-X Gigabit
2 Seagate Cheetah 73GB - 15000 rpm - SCSI 320 MB/s

Client Configuration: Dual Opteron 850
Dual Opteron 850 1.8 GHz
MSI K8T Master1-FAR
4x512 MB Infineon PC2700 Registered, ECC
NIC: Broadcom 5705

Software
Ubuntu 6.06 LTS Server Edition (2.6.15-26-amd64-server SMP)
MySQL 5.0.26 with Peter Zaitsev Mutex Patch
SPECjbb2005
Sun Hotspot Java JVM 1.5.0_08



The Official SPEC Numbers

We checked the SPEC FP and Int 2000 rates to get a first idea of what to expect. The SPEC rates are nothing more than a measure of the performance of running multiple copies of the SPEC CPU benchmarks simultaneously. Typically, the number of copies is the same as the number of cores. Again, it is important to note that these benchmark numbers are highly dependent on the compiler. SPEC FP and Int show the best case performance as the CPU runs aggressively compiled and highly optimized code. In reality, real world code is typically compiled in a more conservative and less optimized fashion.

SPEC Int 2000 Performance
(CPU/cores) | Server / CPU | Clock Speed (MHz) | SPEC Int 2000
(4/8) | IBM POWER5+ 36MB L3 | 2200 | 196
(4/8) | HP Opteron AM2 | 2800 | 160
(4/8) | HP Xeon MP 7140M 16MB L3 | 3400 | 159
(4/8) | FSC Xeon MP 7130M 8MB L3 | 3200 | 143
(8/8) | Hitachi Itanium 2 | 1666 | 138
(4/8) | HP Proliant DL585 Opteron | 2400 | 136
(2/4) | Dell Xeon 5160 | 3000 | 123
(4/8) | IBM Xeon MP 7041 | 3000 | 108

Digging into the SPEC database, some very interesting results surface. The Fujitsu Siemens PRIMERGY RX600 S3 with the Intel Xeon processor 7130M at 3.20 GHz is specced very similarly to the Intel server in this test, and the HP DL585 machine is identical to ours and about 5% slower than the Xeon 7130M machine. The massive L3 cache is definitely helping the Xeon here.

Note the foolish figure that the previous Xeon MP 7041 cuts: almost 30% slower, 3 times more expensive and consuming twice as much compared to the Opteron 880 in the HP DL585. On top of that, Intel's newest Xeon 5160 makes the Quad Xeon MP 7041 look completely ridiculous as it performs 14% better with only two CPUs.
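The percentages quoted for the integer results can be recomputed directly from the SPEC Int 2000 table. The snippet below is only a convenience sketch; the dictionary keys are shorthand labels of our own, not official SPEC submission names.

```python
# SPEC Int 2000 rate scores copied from the table above.
spec_int_rate = {
    "Xeon MP 7130M (4 CPUs)": 143,
    "Opteron 880, HP DL585 (4 CPUs)": 136,
    "Xeon DP 5160 (2 CPUs)": 123,
    "Xeon MP 7041 (4 CPUs)": 108,
}


def percent_faster(a: float, b: float) -> float:
    """How much faster score a is than score b, in percent."""
    return (a / b - 1) * 100


tulsa_vs_opteron = percent_faster(
    spec_int_rate["Xeon MP 7130M (4 CPUs)"],
    spec_int_rate["Opteron 880, HP DL585 (4 CPUs)"],
)
woodcrest_vs_paxville = percent_faster(
    spec_int_rate["Xeon DP 5160 (2 CPUs)"],
    spec_int_rate["Xeon MP 7041 (4 CPUs)"],
)

print(f"Xeon MP 7130M over Opteron 880:   {tulsa_vs_opteron:.1f}%")
print(f"Dual Xeon 5160 over quad Xeon 7041: {woodcrest_vs_paxville:.1f}%")
```

The same helper can be pointed at any other pair of rows in the integer or floating point tables.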

SPEC FP 2000 Performance
(CPU/cores) | Server / CPU | Clock Speed (MHz) | SPEC FP 2000
(4/8) | IBM POWER5+ 36MB L3 | 2200 | 355
(4/8) | SGI Itanium Montecito 12 MB L3 | 1600 | 244
(4/8) | AMD Opteron 8220 SE | 2800 | 163
(4/8) | Sun Opteron 880 | 2400 | 140
(4/8) | HP Xeon MP 7140M 16 MB L3 | 3400 | 105
(4/8) | FSC Xeon MP 7130M 8 MB L3 | 3200 | 97
(2/4) | Dell Xeon 5160 | 3000 | 81
(4/8) | IBM Xeon MP 7041 | 3000 | 64

Floating point tests paint a different picture. The Xeon MP is no longer competitive. The best FP monsters are clearly the IBM Power 5+, Intel's Itanium, and AMD's Opteron. The AMD Opteron 880 is 43% faster than the Xeon MP 7130M.



SPECjbb2005

SPECjbb2005 from SPEC (Standard Performance Evaluation Corporation) evaluates the performance of server side Java by emulating a three-tier client/server system with emphasis on the middle tier. Instead of testing with a possible disk intensive database system, SPECjbb uses tables of objects, implemented by Java Collections, rather than a separate database. The SPECjbb score thus depends on:
  • The JVM (Java Virtual Machine) and the way the JVM is tuned
  • CPU processing power
  • Caching and memory speed
  • Multiprocessing configuration (Scalability)
The latest version SPECjbb2005 is much more memory intensive and uses XML processing among other changes. From spec.org:
"SPECjbb2005 is a follow-on release to SPECjbb2000, which was inspired by the TPC-C benchmark and loosely follows the TPC-C specification for its schema, input generation, and transaction profile. SPECjbb2005 runs in a single JVM in which threads represent terminals, where each thread independently generates random input before calling transaction specific logic. There is neither network nor disk IO in SPECjbb2005."
SPECjbb starts up to two threads per core. For example, with Hyper-Threading enabled on our 8 core quad CPU Xeon MP 7130M system, 32 threads were started on the 16 logical CPUs. Each thread is a warehouse. Again from SPEC.org:
"A warehouse is a unit of stored data. It contains roughly 25MB of data stored in many objects in several Collections (HashMaps, TreeMaps). A thread represents an active user posting transaction requests within a warehouse. There is a one-to-one mapping between warehouses and threads, plus a few threads for SPECjbb2005 main and various JVM functions. As the number of warehouses increases during the full benchmark run, so does the number of threads. A "point" represents the throughput during the measurement interval at a given number of warehouses. A full benchmark run consists of a sequence of measurement points with an increasing number of warehouses (and thus an increasing number of threads)"
First we tested with some decent but rather generic tuning that we could use on all systems. The JVM was Sun's, version 1.5.0_08.
java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props
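To put the thread and memory numbers above in perspective, the sketch below works out how many warehouses the quad Xeon MP run reaches and how much of the 3072 MB heap their baseline data occupies. It uses the "roughly 25MB" per warehouse from the SPEC description; real heap usage is higher because transactions allocate short-lived objects on top of that, so this is a lower bound rather than a measurement.

```python
LOGICAL_CPUS = 16             # 8 cores with Hyper-Threading enabled
THREADS_PER_LOGICAL_CPU = 2   # from the example: 32 threads on 16 logical CPUs
WAREHOUSE_MB = 25             # "roughly 25MB" per warehouse, per SPEC
HEAP_MB = 3072                # from -Xms3072m -Xmx3072m

# One warehouse per thread at the highest measurement point.
max_warehouses = LOGICAL_CPUS * THREADS_PER_LOGICAL_CPU

# Baseline data held in the warehouses, ignoring transient allocations.
warehouse_data_mb = max_warehouses * WAREHOUSE_MB
heap_share = warehouse_data_mb / HEAP_MB

print(f"warehouses at peak: {max_warehouses}")
print(f"warehouse data:     {warehouse_data_mb} MB ({heap_share:.0%} of heap)")
```

Even the baseline data set is far larger than any of the caches in this test, which is why memory latency and bandwidth matter so much in this benchmark.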

Performance on the quad Opteron machine is absolutely horrible: the dual Xeon DP 5160 is only a few percent slower than our quad Opteron. As SPECjbb is very memory sensitive, we suspected that the NUMA architecture of the Opteron might be influencing the result. The scaling numbers confirmed our assumption: the quad Opteron scored only 48% higher than the dual Opteron, while we expected an increase of about 70% from the two extra CPUs.

In many cases you would like to run several Java applications on one server with or without virtualization, especially on quad socket machines. Therefore we also tested SPECjbb with four application instances. Using numactl, a clever utility written by Andi Kleen, we were able to bind each Java application to one CPU node on the HP DL585.

On the Opteron we used (one command per instance, each bound to its own node):
numactl --cpubind=(0-3) --membind=(0-3) java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props -id (1-4)
On the Xeon MP we used:
java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props -id (1-4)

If we let Linux manage the four instances, performance increases about 16% compared to using one instance. If we force each instance to stay on one node (one CPU + memory), performance increases spectacularly by 56%! So it seems that it is rather hard for the Linux kernel to keep the instances where they should be. This is good and bad news for AMD: it means that the Opteron 880 can compete with the more expensive Xeon MP, but it also means that the Opteron requires more "manual" optimization than the Xeon MP. The Xeon MP performs at the same level with 4 instances as it does with one.
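The per-node binding can be spelled out as four concrete command lines, one per instance. The helper below is a hypothetical sketch, not the script we used: it assumes the four NUMA nodes are numbered 0 to 3 (the usual Linux convention) and uses numactl's long options `--cpubind` and `--membind`; newer numactl releases prefer `--cpunodebind` for the former.

```python
import shlex
from typing import List

# JVM arguments copied from the command line used in this test.
JVM_ARGS = [
    "-classpath", "jbb.jar:check.jar",
    "-Xms3072m", "-Xmx3072m", "-Xmn1024m", "-Xss128k",
    "-XX:+AggressiveOpts", "-XX:+UseParallelOldGC", "-XX:+UseParallelGC",
    "spec.jbb.JBBmain", "-propfile", "SPECjbb.props",
]


def bound_command(node: int, instance_id: int) -> List[str]:
    """Build the argv that pins one SPECjbb instance to one NUMA node."""
    return [
        "numactl", f"--cpubind={node}", f"--membind={node}",
        "java", *JVM_ARGS, "-id", str(instance_id),
    ]


# Assumption: nodes are numbered 0-3, instances 1-4.
commands = [bound_command(node, node + 1) for node in range(4)]

for argv in commands:
    print(shlex.join(argv))
```

Binding both the CPUs and the memory of an instance to the same node is the point: each JVM then only touches the DIMMs attached to its own Opteron instead of reaching across HyperTransport links.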

We suspect that the Sun JVM is reasonably well optimized for the Opteron, and maybe a little bit less effort went into the Intel optimizations as Sun features mostly Opteron and SPARC servers. The BEA JRockit JDK provides a highly optimized JVM for running Java applications on the x86-64 and Itanium CPUs. We are still in the process of testing with this JVM, but it seems that the HP DL585 is capable of attaining 110,000 bops, the Supermicro Dual Xeon 5160 about 70 to 75,000 bops and the Tulsa system about 140,000 bops so far. We are trying to find out which tuning parameters are realistic and which ones are maybe a little too extreme. We'll report back soon with our findings, as we have another new server CPU to show you in the near future.



Secure Sockets Layer RSA Performance

Secure web communication is possible through the utilization of the Secure Sockets Layer (SSL). Using "openssl speed rsa" we can measure the number of RSA private key (sign) and public key (verify) operations that a system can perform per second using OpenSSL 0.9.8a. Both verifies/s and signs/s benchmarks are rather synthetic, but give an idea of the "pure" encrypting and decrypting speed.

Note that this time we did not compile OpenSSL with specific flags for each architecture (-march=xxx) but used the same flags on each CPU. We feel that this better reflects the real world use of SSL as most people do not know the specific CPU architecture they are running on. So we compiled with the following on all x86 systems:
gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -O3 -Wa,-noexecstack -g -Wall -DMD32_REG_T=int -DMD5_ASM
We also included the T2000 numbers with MAU acceleration via the Solaris Cryptographic Framework from our previous server CPU shootout. One thread of OpenSSL Signing per core is optimal so we tested the quad Xeon MP 7130 with a maximum of 16 threads, as there are 8 physical but 16 logical cores.


Compared to our previous findings, the 2.4 GHz Opteron no longer (slightly) beats the 3 GHz Xeon DP 5160. This is the result of replacing a "compiled specifically for each architecture" binary with a binary that is compiled with the more generic -O3 optimization, which as stated is more realistic. Still, our previous conclusion stands: clock for clock, the Opteron is quite a bit better at this than the Xeon "Core" architecture (Xeon 5160) and a lot better than the Xeon "NetBurst" architecture (Xeon MP 7130). Despite being clocked 20% lower than the Xeon 5160, the Opteron is only 9% slower at 4 threads. The 8 MAUs of the Sun T1 still give the 1 GHz Sun the edge when we fire off 32 "SSL RSA Signing" threads.
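The "clock for clock" claim can be made concrete by dividing the performance ratio by the clock speed ratio. This is a minimal sketch using only the two figures quoted above (9% slower, 2.4 GHz versus 3 GHz); the helper name is ours.

```python
def per_clock_ratio(perf_ratio: float, clock_a_ghz: float, clock_b_ghz: float) -> float:
    """Performance of CPU A relative to CPU B, normalised to equal clock speed."""
    return perf_ratio / (clock_a_ghz / clock_b_ghz)


# Opteron 880 (2.4 GHz) delivers 91% of the Xeon DP 5160 (3 GHz) at 4 threads.
opteron_vs_xeon = per_clock_ratio(0.91, 2.4, 3.0)

print(f"Opteron RSA signing per clock: {opteron_vs_xeon:.2f}x the Xeon 5160")
```

So at equal clock speed the Opteron would sign roughly 14% more keys per second than the Core based Xeon in this test.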

In the case of doing verifies, the server has to authenticate the identity of the client. This is a lot less intensive, and we show you the verifies/s numbers at 2048 bits. At 1024 bits length, both the Woodcrest and Opteron were able to verify more than 50,000 keys per core, and that is a hard limit of the OpenSSL benchmark.


Again, the Opteron takes the lead. Encrypting or signing will slow down a server much quicker than verifying keys, so this benchmark is of smaller importance than the sign/s benchmark.



MySQL Configuration

As our loyal readers know from our previous MySQL adventures, the MySQL database is a highly tweakable but somewhat badly scaling database. Most workloads scale well from one to two cores, but from two to four cores scaling is very mediocre, and in the "SELECT intensive" workload that we benchmark even negative. This has surprised quite a few people, but it is an issue that the InnoDB team is well aware of, and the issue will be resolved in one of the next releases of InnoDB. Until then, we compiled version 5.0.26 with Peter Zaitsev's Mutex patch. This patch gives much better scaling and performance. Scaling is no longer negative, and we saw a 20% to 40% increase going from two to four cores. However, our workload still doesn't scale beyond four cores, so we tested every system with two CPUs (four cores). That way we at least get an impression of how the different server CPUs compare.

All testing was thus done with InnoDB as our storage engine in MySQL 5.0.26. We optimized for a server with 4GB of RAM. Here is our MySQL configuration:

MySQL Configuration
default-storage-engine InnoDB
skip-external-locking
skip-locking
key_buffer 256M
.
table_cache 64
max_allowed_packet 1M
thread_stack 128K
.
sort_buffer_size 2M
read_buffer_size 2M
innodb_buffer_pool_size 1G
.
thread_concurrency 16
innodb_thread_concurrency 16
innodb_additional_mem_pool_size 8MB
read_rnd_buffer_size 8MB
thread_cache 64
max_heap_table 256MB
tmp_table 128MB
.
innodb_log_file_size 250MB
innodb_table_locks 0
innodb_flush_log_at_trx_commit 0
max_user_connections 2000
max_connections 2000

The "query cache" was off, as we wanted to test worst case performance. Our test database is still the same ~1GB database. The workload consists of more than 90% selects, mostly a "read intensive" workload.

MySQL results

All numbers are expressed in queries per second (Y-axis), and the X-axis shows the number of concurrent accesses.


On average, the Xeon DP 5160 is about 22% faster than the Opteron. That means that the Opteron is clock for clock as fast as the Xeon 5160, which is not bad news for AMD at all, although Woodcrest currently has the raw clock speed advantage. Considering the HP DL585 can only use DDR-333 with the Opteron 880, the picture might even get better with the DL885 which can use DDR-400.

There is little doubt that MySQL is not the favorite application of the Xeon MP: the Opteron 880 beats Xeon MP by 20% to 30%. We have seen this before as the Opteron has always outrun "NetBurst" based CPUs in MySQL. The good news for Intel is that the new Core architecture is no less than 52% faster in MySQL when we compare the 3 GHz Xeon DP with the 3.2 GHz Xeon MP.
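The three MySQL ratios quoted here should be consistent with each other, and a quick check shows that they are. The sketch below chains the Xeon DP over Opteron figure (22%) with the Opteron over Xeon MP range (20% to 30%) and tests whether the quoted 52% falls inside the implied range; all inputs are the rounded averages from the text, so the result is only indicative.

```python
XEON_DP_OVER_OPTERON = 1.22          # Xeon DP 5160 vs Opteron 880
OPTERON_OVER_XEON_MP = (1.20, 1.30)  # Opteron 880 vs Xeon MP 7130M, low and high
XEON_DP_OVER_XEON_MP_QUOTED = 1.52   # figure quoted in the text

# Chaining the two ratios gives the implied Xeon DP vs Xeon MP range.
implied_low, implied_high = (
    XEON_DP_OVER_OPTERON * ratio for ratio in OPTERON_OVER_XEON_MP
)

consistent = implied_low <= XEON_DP_OVER_XEON_MP_QUOTED <= implied_high

print(f"implied range: {implied_low - 1:.0%} to {implied_high - 1:.0%}")
print(f"quoted 52% is inside the range: {consistent}")
```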

We also noted something strange: the Xeon MP performs better with hardware prefetch disabled. Below you can see our findings. All numbers are expressed in queries per second served by the server (Y-axis); and the X-axis shows the number of concurrent accesses.


Hardware prefetch lowers performance by about 1% to 4%, while Hyper-Threading allows the Xeon MP to make better use of its potential and increases performance by 7% to 9% at the higher concurrencies.



Analysis: the Xeon MP and Opteron Server

A CPU is only one aspect of choosing a server; at the end of the day it is the server that you can afford that makes you decide for one platform or another. The 4U Intel SR4850HW4 isn't very different from the SR6850HW4, so we can compare our Xeon MP test machine to the HP Opteron server.

Server Feature Comparison
Feature | SR6850HW4 (Intel SR4850HW4) | HP DL585 Model 2006

Hardware
CPU | 4x Intel Xeon 70xx and 71xx | Opteron 8xx
Fastest CPU | Xeon MP 3.4 GHz /16MB L3 | Opteron 885 2.6 GHz
Max Mem Capacity | 64 GB DDR2 400 FB Dimms (16 x 4 GB) | 128 GB DDR266, 32 GB DDR400
Mem Type | ECC DDR2 400 | DDR400/333/266
Chipset | E8501 | AMD 8000 chipset

RAS
ECC Memory | Yes | Yes
Memory RAID | Yes | No
Hot plug memory | Yes | No
Memory Sparing | Yes | No
Memory Mirroring | Yes | No
Hotswappable PCI | Yes on PCI-X 133 and PCIe | No
Hotswappable Fans | 6 (4) | 8
Hotswappable PSU | Yes, 1+1 | Yes, 1+1

Integrated Onboard
Video Chip | ATI RADEON 7000 VGA PCI | ATI Rage XL
Video RAM | 16 MB SDRAM | 8 MB SDRAM
Max. Resolution | 1600x1200 | 1280x1024
PCIe x16/x8 | 0/1 | 0/0
PCIe x4/x1 | 4/0 | 0/0
PCI-X (133/100) | 1/2 | 2/6
PCI | 0 | 0
USB Front | 3 | 0
USB Rear | 2 | 2
LAN | Intel Dual Gigabit | NC7782 Dual PCI-X Gigabit
Server management | Intel Server management | HP iLO
Serial Ports | 1 | 1

Storage
Controller | LSI Logic LSI53C1030 | HP Smart Array 5i Plus Ultra 3
Cache | Optional | 64 MB BBU
Interface | Dual-Channel Ultra320 SCSI SCA | Dual-Channel Ultra320 SCSI SCA
Disks | 10 (5) | 4
RAID | 0,1,1E | 0,1,1+0,5
5.25 bays | 2 | 1

Dimensions & Power
Form Factor | 6U (4U) | 4U
Weight (kg) | 60 (40) | 30
PowerSupply | 2x1570W | 2x 870W

The Xeon MP offers much more in the way of RAS features than the Opteron machine. The HP DL585 also has a few shortcomings: it does not offer any PCIe expansion slots, the SCSI controller is an old SCSI 160 model, and there are no USB ports on the front of the machine. Being able to quickly load some network drivers from a USB stick is very convenient compared to tinkering in the back of your rack.

However, the HP is the winner for memory intensive HPC applications: it can use DDR1-400 DIMMs which are quite a bit faster than the DDR2-400 FB DIMMs Intel uses. We were disappointed that both 4U designs do not offer more than 4-5 disk bays. If you are a medium sized enterprise and you have only one or a few heavy duty database applications, you can save a lot of money if you don't have to buy an external storage rack. With a RAID-1 setup for the operating system and programs, you only have two disks left to install your database on a second RAID-1 partition. Both the HP DL585 and the Intel SR4850HW4 basically force you to invest in an external storage rack in this case. Some 3U solutions like Supermicro's offer 16(!) disk bays and might be a better fit for a compute intensive transactional database. The HP and Intel machines are more suited for an HPC machine or as the host of a SAN storage rack to house a massive database/ERP system.

To make a fair comparison between the Xeon MP and AMD Opteron 8xx platforms, we decided to compare the costs of similar HP Xeon MP and HP Opteron machines, configuring them as similarly as possible.

Price Comparison
Server | HP ProLiant DL580 G4 3.20GHz | HP ProLiant DL585 G2 2.4GHz - Rack Server
CPUs | 4x Intel Xeon MP 7130 M | 4x AMD Opteron 8216 DC
Memory | 4 Memory boards x 2 x 1 GB DDR2-400 | 8x 1 GB DDR2-667
Storage | HP Smart Array P400/256 PCIe Controller | HP Smart Array P400/512 Controller with battery
NIC | HP Dual embedded NC371i Gigabit | HP Dual embedded NC371i Gigabit
PSU | Dual 910/1300W power supplies | Dual 910/1300W hot plug power supplies
DVD | SlimLine DVD-ROM Drive (8x/24x) Option Kit | SlimLine DVD-ROM Drive (8x/24x) Option Kit
Price | $15,343 | $13,184

The price disadvantage of the Xeon MP is more than $2000, which is not huge but still tangible. It is the result of the fact that you have to pay an extra $400 per Xeon CPU and $1000 for two extra memory boards. It is possible to save $1000 if you only get two memory boards, but that is not advisable. As 4GB DIMMs are extremely expensive, this means that you limit your server to 16GB (8x2GB) and that you cannot use the more advanced RAS features such as memory mirroring.

Power

How much power can we save by choosing the 95W TDP Opteron over the 150W TDP Xeon MP? We tested all machines with only one power supply running. DBS and PowerNow! were not enabled.

Power Requirements
System | Configuration | Max / Idle Power Usage (100% / <1% CPU load, W)
HP DL585 | 4 CPUs - 16 GB RAM | 657 / 520
Intel Xeon MP 7130M | 4 CPUs - 16 GB RAM | 885 / 460 (620)

Both machines use huge, fast-spinning fans which consume a lot of energy. To give you an idea of what this means, while idling the power consumption of the Xeon MP machine fluctuated between 460W and 620W. The 620W figure was generated when all the fans were turned on, while the 460W result was measured when the fans were silent. The HP DL585 did not use this on/off fan system, and consumed 520W while running idle. Once running at 100% load under SPECjbb2005, the Xeon MP machine consumed about 230W more than the Opteron machine (885W versus 657W). For your information, our Supermicro system consumed 310 W with 4 GB and about 360 W with 12 GB of RAM.
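To translate the full-load gap into energy, the sketch below multiplies the measured difference by the hours in a year. It assumes the server runs at 100% load around the clock, which is a deliberately pessimistic simplification; a real server spends part of its time idle, where the picture depends on the Xeon MP's fan state.

```python
# Full-load power draw measured in the table above, in watts.
LOAD_W = {"HP DL585": 657, "Intel SR6850HW4": 885}
HOURS_PER_YEAR = 24 * 365   # assumption: 100% load, 24/7, non-leap year

load_delta_w = LOAD_W["Intel SR6850HW4"] - LOAD_W["HP DL585"]

# Extra energy the Xeon MP machine would draw per year under that assumption.
extra_kwh_per_year = load_delta_w * HOURS_PER_YEAR / 1000

print(f"difference at full load: {load_delta_w} W")
print(f"extra energy per year:   {extra_kwh_per_year:.0f} kWh")
```

Cooling costs in the datacenter come on top of this and are not included.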

Conclusion so far

Yes, our testing is not done. We still have to test other databases, and we are running benchmarks with BEA's JVM while you are reading this. Those benchmarks will be presented in our Clovertown - Intel's new quad core server CPU - review. In this review we focused a little more on the actual servers. So what can we conclude so far?

The Xeon MP 7140M "Tulsa" is nothing less than a massive improvement over the previous Xeon 7041: it consumes less, performs a lot better (see the SPEC int/fp numbers) and is much less expensive. The new Xeon MP needs fewer optimizations than the Opteron to perform well in Java applications. Or if we look at our preliminary BEA JRockit numbers, it performs better than the quad Opteron with a highly optimized JVM in applications with a big memory footprint (like SPECjbb2005) thanks to its massive L3 cache. In applications where the large L3 cache doesn't play a big role, the relatively poor server performance of the "NetBurst" architecture becomes visible again: our MySQL benchmark runs a lot better on the AMD Opteron and Intel's newest Core architecture Xeons. Power consumption is still rather high though, and the HP Opteron server consumed about 230W less.

In a nutshell, the new Xeon MP will have a hard time convincing people who are leaning towards an Opteron server or want the best performance/watt. But on the other hand, the decent performance and superior RAS features will keep the customers who desire high availability in the Intel camp, while the previous Xeon MP was such a poor performer that many people had no other choice than the AMD Opteron in the quad socket market.

When "High-end RAS" is less important, the excellent performance of the Xeon 5160 based Supermicro 6015 server shows how much potential the Xeon DP "Clovertown" has. Clovertown is nothing more than two Xeon DP 51xx on one chip, but it could give our quad monsters a hard time. You will find out more very soon....
