Original Link: http://www.anandtech.com/show/2743




Challenging. That is the least you can say about the economic climate for the launch of Intel's newest "Nehalem EP Xeon" platform. However, challenges must be met and they certainly make things more interesting. The server vendors won't convince a lot of people to buy a new Intel Nehalem (or AMD Shanghai) based server just because "performance is higher". That will only work in the processing hungry HPC and render worlds, where less time per task results in time and cost savings. Hence, the challenge for AMD and Intel is to convince the rest of the market - that is 95% or so - that the new platforms provide a compelling ROI (Return On Investment).
 
The most productive or intensively used servers in general get replaced every 3 to 5 years. Based on Intel's own inquiries, Intel estimates that the current installed base consists of 40% dual-core CPU servers and 40% servers with single-core CPUs.
 

That means that Intel's Nehalem platform (and AMD's Shanghai/Opteron 23xx platform) has to convince people to replace their dual-core Opteron, dual-core Xeon 50xx ("Dempsey"), and Xeon "Irwindale" servers. There are two great ways to turn a much more powerful server into a moneymaking and cost saving machine. One is to use fewer servers in a cluster, which is not applicable to all companies. The other more popular approach is to consolidate more servers on the same physical machine by using virtualization. The most important arguments for upgrading your servers are performance/watt and support for virtualization.

Intel's newest platform promises better virtualization support by adding EPT and lowering world switch times. However, probably the largest bottleneck in the past was the amount of available bandwidth. Bandwidth is frequently an overrated performance factor, as few applications outside the HPC world get a boost from, for example, using three instead of two memory channels. That changes dramatically when you are running tens of virtual machines on top of a physical machine: many applications with medium bandwidth demands morph into one big bandwidth-hogging monster. The challenge is thus to provide faster access to memory, lower energy consumption, and better support for virtualization. On paper, the Nehalem architecture can definitely play all those trump cards. Anand has provided a detailed description of the Nehalem architecture. The most important improvements for business applications are:

  • The integrated memory controller talks to its own local memory or remote memory (NUMA). Memory access takes between 27 and 54 ns (80 to 161 cycles). Compare this to the Xeon 5450 at the same clock speed where memory access via the MC in the chipset can take up to 123 ns! The closest competitor (Opteron "Shanghai") needs between 32 and 71 ns.
  • A native quad-core design with a fast 33 cycle L3 cache makes it easy for the L2 caches to exchange cache coherency information.
  • Fast CPU interconnects make sure that the rest of the snoops happen very fast and do not interfere with other traffic.
  • The memory controller has up to three channels. A dual CPU configuration has access to 35GB/s of memory bandwidth (measured with Stream) if you use DDR3-1333. The latest dual Opteron achieves 19.4GB/s with DDR2-800.
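
To put the Stream numbers in the list above into perspective, here is a rough sketch that compares them with the theoretical peak bandwidth of each platform. The 8-byte channel width and the two DDR2 channels per Opteron socket are our assumptions, not figures taken from the measurements.

```python
# Theoretical peak memory bandwidth versus the measured Stream figures.
# Assumptions (not from the benchmark data): every channel is 64 bits
# (8 bytes) wide, and each Opteron socket drives two DDR2 channels.

BYTES_PER_TRANSFER = 8  # one 64-bit channel moves 8 bytes per transfer


def peak_bandwidth_gbs(mega_transfers: float, channels: int, sockets: int) -> float:
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) for a multi-socket server."""
    return mega_transfers * 1e6 * BYTES_PER_TRANSFER * channels * sockets / 1e9


def efficiency(measured_gbs: float, peak_gbs: float) -> float:
    """Fraction of the theoretical peak that was actually measured."""
    return measured_gbs / peak_gbs


nehalem_peak = peak_bandwidth_gbs(1333, channels=3, sockets=2)  # DDR3-1333
opteron_peak = peak_bandwidth_gbs(800, channels=2, sockets=2)   # DDR2-800

print(f"Dual Nehalem peak: {nehalem_peak:.1f} GB/s, "
      f"measured share: {efficiency(35.0, nehalem_peak):.0%}")
print(f"Dual Opteron peak: {opteron_peak:.1f} GB/s, "
      f"measured share: {efficiency(19.4, opteron_peak):.0%}")
```

Under these assumptions the Opteron gets closer to its theoretical peak, while the Nehalem platform delivers far more bandwidth in absolute terms.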

Basically, Nehalem is Intel's version of the improvements found in the AMD Barcelona platform, only better (or at least that's the goal). Let's see what it can do in reality.



What Intel is Offering

So what are Intel's newest offerings and how do they compare to AMD? First, since power consumption is more important in servers than in high-end desktops, Intel selects the 2.93GHz Nehalems with the lowest power consumption (less than or equal to 95W TDP) and sells them in the server market. The 95W-130W TDP parts are for the desktop market. There is a 3.2GHz Xeon W5580 at 130W, but it's only targeted at the workstation market.

Processor Speed and Cache Comparison
Xeon model  Speed (GHz)  Max. Turbo (GHz)  Max. Turbo, 4 cores busy (GHz)  L3 Cache (MB)  TDP (W)
X5570       2.93         3.33              3.2                             8              95
X5560       2.8          3.2               3.066                           8              95
X5550       2.66         3.066             2.93                            8              95
E5540       2.53         2.8               2.66                            8              80
E5530       2.4          2.66              2.53                            8              80
L5520       2.26         2.4               2.33                            8              60
L5510       2.13         No Turbo          No Turbo                        4              60
E5520       2.26         2.4               2.33                            8              80
E5506       2.13         No Turbo          No Turbo                        4              80
E5504       2            No Turbo          No Turbo                        4              80
E5502       1.86         No Turbo          No Turbo                        4              80

Notice that the fastest 95W parts are able to boost their frequency with two 133MHz increments even if all four cores are busy. In reality, we have noticed that with most business workloads a 2.93GHz Xeon X5570 is running at 3.066 most of the time and from time to time even at 3.2GHz, but relatively rarely at 2.93GHz. In other words, you get a bit more clock speed than advertised. In rendering we noticed that peaking at 3.2GHz was rather rare, so the workload really determines how high the CPU will clock.
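
The Turbo arithmetic can be sketched in a few lines. We assume here that each increment is exactly 400/3 MHz (the "133MHz" step) and that the X5570 runs at a base multiplier of 22; both values are our own assumptions used for illustration.

```python
# Turbo clock as a base multiplier plus a number of ~133 MHz steps.
# Assumptions: the step is exactly 400/3 MHz, and the X5570 uses a
# base multiplier of 22 (22 * 133.33 MHz = 2.93 GHz).

STEP_MHZ = 400 / 3


def turbo_clock_ghz(base_multiplier: int, steps: int) -> float:
    """Clock speed in GHz after adding `steps` Turbo increments."""
    return (base_multiplier + steps) * STEP_MHZ / 1000


X5570_MULTIPLIER = 22

for steps in range(4):
    print(f"{steps} step(s): {turbo_clock_ghz(X5570_MULTIPLIER, steps):.3f} GHz")
```

Two steps take the X5570 to 3.2GHz (the all-cores-busy limit in the table) and three steps to 3.33GHz (the maximum Turbo speed).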

 


1366 pads make contact with the new Xeon motherboards

 

The E5520 to E5540 Xeons boost their clock speed by only one increment if all cores are busy. The E550x versions are really the low end: they get no Hyper-Threading (SMT) nor are they able to boost their clock speed (Turbo mode).



The buyer's market approach: our newest testing methods

Astute readers have probably understood what we'll change in this newest server CPU evaluation, but we will let one of the professionals among our readers provide his excellent feedback on the question of improving our evaluations at it.anandtech.com:

"Increase your time horizon. Knowing the performance of the latest and greatest may be important, but most shops are sitting on stuff that's 2-3 years old. An important data point is how the new compares to the old. (Or to answer management's question: what does the additional money get us vs. what we have now? Why should we spend money to upgrade?)"

To help answer this question, we will include a 3 year old system in this review: a dual Dempsey system, which was introduced in the spring of 2006. The Dempsey or Xeon 5080 server might even be "too young", but as it is based on the "Blackford" chipset, it allows us to use the same FB-DIMMs as can be found in new Harpertown (Xeon 54xx) systems. That is important as most of our tests require quite large amounts of memory.

A 3.73GHz Xeon 5080 Dempsey performed roughly equal to a 2.3GHz Xeon 51xx Woodcrest and a 2.6GHz dual-core Opteron in SAP and TPC-C. That should give you a few points of comparison, even though none of them are very meaningful. After all, we are using this old reference system to find out if the newest CPU is 2, 5, or 10 times faster; a few percent more or less does not matter in that case.

In our Shanghai review, we radically changed our benchmark methodology. Instead of throwing every software box we happen to have on the shelf and know very well at our servers, we decided that the "buyers" should dictate our benchmark mix. Basically, every software type that is really important should have at least one and preferably two representatives in the benchmark suite. In the table below, you will find an overview of the software types servers are bought for and the benchmarks you can find in this review. If you want more detail about each of these software packages, please refer to this page.

Benchmark Overview
Server Software         Market Importance  Benchmarks used
ERP, OLTP               10-14%             SAP SD 2-tier (Industry Standard benchmark)
                                           Oracle Charbench (Freely available benchmark)
                                           Dell DVD Store (Open Source benchmark tool)
Reporting, OLAP         10-17%             MS SQL Server (Real world + vApus)
Collaborative           14-18%             MS Exchange LoadGen (Microsoft's own load generator for MS Exchange)
Software Dev.           7%                 Not yet
e-mail, DC, file/print  32-37%             MS Exchange LoadGen
Web                     10-14%             MCS eFMS (Real world + vApus)
HPC                     4-6%               LS-DYNA, LINPACK (Industry Standard)
Other                   2%?                3DSMax (Our own bench)
Virtualization          33-50%             VMmark (Industry Standard)
                                           vApus test (in a later review)

The combination of an older reference system and real world benchmarks that closely match the software that servers are bought for should offer you a new and better way of comparing server CPUs. We complement our own benchmarks with the more reliable industry standard benchmarks (SAP, VMmark) to reach this goal.

A look inside the lab

We had two weeks to test Nehalem, and tests like the Exchange tests and the OLTP tests take more than half a day to set up and perform - not to mention that it sometimes takes months to master them. Understanding how to properly configure a mail server like Exchange is completely different from configuring a database server. Our testing now clearly goes beyond what one person can know to perform all these tests. I would like to thank my colleagues at the Sizing Servers Lab for helping to perform all this complicated testing: Tijl Deneut, Liz Van Dijk, Thomas Hofkens, Joeri Solie, and Hannes Fostie. The Sizing Servers Lab is part of Howest, which is part of the Ghent University in Belgium. The most popular parts of our research are published here at it.anandtech.com.

 


Liz proudly showing that she was first to get the MS SQL Server testing done. Notice the missing parts: the Shanghai at 2.9GHz (still in the air) and the Linux Oracle OLTP test that we are still trying to get right.

 

The SQL Server and website testing was performed with vApus, or "Virtual Application Unique Stress testing" tool. This tool took our team led by Dieter Vandroemme two years of research and programming, but it was well worth it. It allows us to stress test real world databases, websites, and other applications with the real logs that applications produce. vApus simulates the behavior not just by replaying the logs, but by intelligently choosing the actions that real users would perform using the different statistical distributions.


You can see vApus in action in the picture above. Note that the errors are time-outs. For each selection of concurrent users we see the number of responses and the average response time. It is possible to dig deeper to examine the response time of each individual action. An action is one or more queries (Databases) or a number of URLs that for example are necessary to open one webpage.
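
The idea of choosing actions from a statistical distribution instead of replaying a log verbatim can be illustrated with a small sketch. This is not vApus code: the action names, the tiny log, and the simple frequency-based weighting are all hypothetical and only show the general principle.

```python
import random
from collections import Counter

# Illustrative only: derive an empirical distribution of actions from a
# recorded log, then let a simulated user draw actions from it.


def build_action_weights(log_actions):
    """Return the relative frequency of each action found in the log."""
    counts = Counter(log_actions)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}


def simulate_user(weights, n_actions, rng):
    """Draw `n_actions` actions according to the empirical distribution."""
    actions = list(weights)
    probabilities = [weights[a] for a in actions]
    return rng.choices(actions, weights=probabilities, k=n_actions)


# Hypothetical log excerpt; a real log would hold many thousands of entries.
recorded_log = ["open_page", "search", "open_page", "login", "open_page", "search"]

weights = build_action_weights(recorded_log)
session = simulate_user(weights, n_actions=10, rng=random.Random(42))
print(weights)
print(session)
```

A real stress test would also have to model think times and the order in which actions depend on each other, which this sketch leaves out.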

The reason why we feel that it is important to use real world applications of lesser-known companies is that these kinds of benchmarks are impossible to optimize for. Manufacturers sometimes include special optimizations in their JVM, compilers, and other developer tools with the sole purpose of gaining a few points in well-known benchmarks. These benchmarks allow us to perform a real world sanity check.



Benchmark Configuration

None of our benchmarks required more than 16GB RAM.

Each server had an Adaptec 5805 connected to the Promise 300js DAS. Database files were placed on a six drive RAID 0 set of Intel X25-E SLC 32GB SSDs, and log files on a four drive RAID 0 set of 15000RPM Seagate Cheetah 300GB hard disks.

We used AMD 8356 and 8384 CPUs in dual CPU configurations. Performance-wise they are identical to the Opteron 2356 and 2384. So to avoid confusion, we list the Opterons 83xx as Opteron 2356 and Opteron 2384.

Xeon Server 1: ASUS RS700-E6/RS4 barebone
CPU: Dual Xeon "Gainestown" X5570 2.93GHz
MB: ASUS Z8PS-D12-1U
RAM: 6x4GB (24GB) ECC Registered DDR3-1333
NIC: Intel 82574L PCI-E Gbit LAN


Xeon Server 2: Intel "Stoakley" platform server
CPU: Dual Xeon E5450 at 3GHz
MB: Supermicro X7DWE+/X7DWN+
RAM: 16GB (8x2GB) Crucial Registered FB-DIMM DDR2-667 CL5 ECC
NIC: Dual Intel PRO/1000 Server NIC

Xeon Server 3: Intel "Bensley" platform server
CPU: Dual Xeon X5365 at 3GHz, Dual Xeon L5320 at 1.86 GHz and Dual Xeon 5080 at 3.73 GHz
MB: Supermicro X7DBE+
RAM: 16GB (8x2GB) Crucial Registered FB-DIMM DDR2-667 CL5 ECC
NIC: Dual Intel PRO/1000 Server NIC

Opteron Server: Supermicro SC828TQ-R1200LPB 2U Chassis
CPU: Dual AMD Opteron 8384 at 2.7GHz or Dual AMD Opteron 8356 at 2.3GHz
MB: Supermicro H8QMi-2+
RAM: 24GB (12x2GB) DDR2-800
NIC: Dual Intel PRO/1000 Server NIC
PSU: Supermicro 1200W w/PFC (Model PWS-1K22-1R)

vApus/DVD Store/Oracle Calling Circle Client Configuration
CPU: Intel Core 2 Quad Q6600 2.4GHz
MB: Foxconn P35AX-S
RAM: 4GB (2x2GB) Kingston DDR2-667
NIC: Intel Pro/1000

The Platform: ASUS RS700-E6/RS4

We were quite surprised to see that Intel chose the ASUS RS700-E6/RS4 barebone, but it became clear that ASUS is really gearing up to compete with companies like Supermicro and Tyan. This ASUS 1U barebone has a new Tylersburg-36D (Intel 5520) chipset and ICH10R Southbridge.

The ASUS RS700-E6 is a completely cable-less design, which is quite rare. According to ASUS, the gold finger mating mechanism delivers a more reliable signal quality. That is hard to verify, but it is clear that a loose connection is much less likely than with cables. We have only had the server in the labs a few weeks, so it is too early to talk about the reliability, but we can say that the build quality of the server is excellent. The 6-phase power regulation that feeds each CPU comes from very high quality solid capacitors that are guaranteed to survive 5 years of working at 86°C (typically this is only 2 years). The same is true for the 3-phase memory power regulation. A special energy process unit (EPU) steers the VRMs to obtain higher power efficiency.


A rather unique feature is that this 1U server also supports two full height PCI-E expansion slots and one half-height slot (close to the PSU). The two full height slots are PCI-E x16 slots and the low profile slot is PCI-E x8. In addition, you can add a proprietary PIKE card, which allows you to add a SAS controller. This can be an LSI 1064E Software RAID solution (RAID 0 or 1) or a real hardware RAID card (the LSI 1078) with support for RAID 0, 1, 10, 5 and even 6.


The expandability is thus excellent, especially if you consider that the ASUS RS700 has room for two (1+1) redundant PSUs. We still have a few items on our wish list, though. We would like a less exotic video card with slightly more video RAM; ASUS uses the AST2050 with only 8MB. While many people will never use the onboard video, some of us do need to use it from time to time. The card comes with decent Windows and Linux drivers. Our distribution (SUSE SLES10SP2) would only work well at 1024x768 and refused to work in text mode until we installed the video driver, so it took a bit of tinkering before we were even capable of installing the right driver.

ESX 3.5 Update 3 does not recognize the new Intel SATA controller well, but luckily the ASUS server can be equipped with an ESX3i USB stick. ASUS offers a special USB port inside the server to attach the stick. We are currently circumventing the SATA-ESX issue with an install via ftp.

Overall, this is one of the finest 1U barebones that we have seen to date. We are pleased with the expandability, the excellent fabrication quality, and the 3-year warranty that ASUS provides.



ERP benchmark 1: SAP SD

The SAP SD (sales and distribution, 2-tier internet configuration) benchmark is an interesting benchmark as it is a real world client-server application. We decided to take a look at SAP's benchmark database. The results below are 2-tier benchmarks, so the database and the underlying OS can make a difference. It is best to keep those parameters the same, although the type of database (Oracle, MS SQL server, MaxDb or DB2) only makes a small difference. The results below all run on Windows 2003 Enterprise Edition and MS SQL Server 2005 database (both 64-bit). Every "2-tier Sales & Distribution" benchmark was performed on SAP's "ERP release 2005".

In our previous server oriented article, we summed up a rough profile of SAP S&D:

  • Very parallel resulting in excellent scaling
  • Low to medium IPC, mostly due to "branchy" code
  • Somewhat limited by memory bandwidth
  • Likes large caches (memory latency!)
  • Very sensitive to sync ("cache coherency") latency

SAP Sales & Distribution 2 Tier benchmark
(*) Estimate based on Intel's internal testing

If you focus on the cores only, the differences between the Xeon 55xx "Nehalem" and the previous generation Xeon 54xx "Harpertown" and Xeon 53xx "Clovertown" are relatively small. The enormous differences in SAP scores are solely a result of Hyper-Threading, the "uncore", and the NUMA platform. According to SAP benchmark specialist Tuan Bui (Intel), enabling Hyper-Threading accounts for a 31% performance boost. Using somewhat higher clocked DDR3 (1066 instead of 800 or 1333 instead of 1066) is good for another 2-3%. Enabling the prefetcher provides another 3% and the Turbo mode increased performance by almost 5%. As this SAP benchmark scales almost perfectly with clock speed, that means that the Xeon X5570 2.93GHz was in fact running at 3.07GHz on average.
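
The average clock speed quoted above follows from a simple calculation, sketched below. It assumes that SAP performance scales perfectly linearly with clock speed, which the benchmark only approximates.

```python
# Average effective clock derived from the Turbo performance gain.
# Assumption: performance scales linearly with clock speed.


def effective_clock_ghz(nominal_ghz: float, turbo_gain: float) -> float:
    """Clock speed that would explain a given relative performance gain."""
    return nominal_ghz * (1 + turbo_gain)


average_clock = effective_clock_ghz(2.93, 0.05)  # "almost 5%" from Turbo mode
print(f"Average effective clock: {average_clock:.4f} GHz")
```

With a gain of exactly 5% the result is just under 3.08GHz; since the gain was "almost" 5%, 3.07GHz is a fair summary.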

Consider the following facts:

  • The quad-core AMD Opteron 8384 at 2.7GHz has no problem beating the higher clocked 5470 at 3.3GHz.
  • It is well known that the Xeon 54xx raw integer power is a lot higher than any of the Opterons (just take a look at SPECint2006).
  • Faster memory and thus bandwidth plays only a minor role in the SAP benchmark.
  • SAP threads share a lot of data (as is typical for this kind of database driven application).

It is clear that the synchronization between L2 caches that happens in the L3 cache, together with the fast inter-CPU synchronization that happens via dedicated interconnects, is what made the "native quad-cores" of AMD winners in this benchmark. Slow cache synchronization is probably the main reason why the integer crunching power hidden deep inside the "Harpertown" cores did not result in better performance.

Take the same (slightly improved) core and give it the right cache architecture (L3 as a quick syncing point for the L2s) and a NUMA platform with fast CPU interconnects, and all that integer power is unleashed. The result is that the Nehalem X5570 Xeon is clock for clock about 66% faster than its predecessor (19000 vs. 11420). Add SMT (Simultaneous Multi-Threading) and you allow the integer core to process a second thread when it is bogged down by one of those pesky branches. The last hurdle for supreme SAP performance is taken: the eight core "Nehalem" server is just as fast as a 24 core "Dunnington" and 80% faster than the competition.
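
For readers who want to check the percentages, here is a minimal sketch of how a "percent faster" figure is derived from two benchmark scores. It uses only the two scores quoted above.

```python
# Relative advantage of one benchmark score over another.


def percent_faster(new_score: float, old_score: float) -> float:
    """How much faster `new_score` is than `old_score`, in percent."""
    return (new_score / old_score - 1) * 100


gain = percent_faster(19000, 11420)
print(f"Nehalem without SMT vs. predecessor: {gain:.1f}% faster")
```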

AMD has just launched the Opteron 2389 at 2.9GHz. We estimate that this will bring AMD's best SAP score to about 14800, so Nehalem's advantage will be lowered to ~70%. Unfortunately for AMD, that is still a very large advantage!



OLTP: Dell DVD Store on MS SQL Server 2005
Operating System: Windows 2008 Enterprise RTM (64-bit)
Software: SQL Server 2005 Enterprise x64 SP3 (64-bit)
Benchmark software: Dell DVDStore 2
Database Size: 3.5 GB
Typical error margin: 2-4%

DVD Store is a project that the Linux department of Dell developed in 2005 as a test for its internal server laboratory. The DVD Store database schema consists of only eight tables, but it does include stored procedures and transactions. The beauty is that it is available as open source software. This allowed us to turn this into a custom benchmark. With the default settings, the database can only be three sizes: 10MB, 1GB, or 100GB. A 10MB database size is simply too small. A 1GB size does not allow us to scale well, as too much locking contention happens. Those two options are out, but to run DVD Store with a 100GB database as a CPU benchmark, we need to take out a second mortgage on all the houses of our team to pay for the necessary storage racks.

We decided to recompile the test, allowing us to use a 3.5GB database. A 3.5GB database proved to be a good compromise between not needing too much storage speed and making the database scale well to eight cores and beyond. As you can read on our benchmark configuration page, we used a RAID 0 set of six SSDs for the data, and four 15000RPM SAS disks for the logs. We monitored the DQL (Disk Queue Length) to ensure our test was not bottlenecked by the storage subsystem.


Most of the time, our storage subsystem copes well with the transactions (DQL <1), but there are a few brief spikes where the disks are limiting throughput. This means that our fastest CPUs are running at a slightly lower CPU load (Xeon X5570 is at slightly less than 80%) than the slowest CPU (85%). Giving our fastest CPUs an even faster storage system hardly improved performance despite somewhat higher CPU load levels. In reality, it is very unlikely that you will add a few drives because you wish to run your CPU at an 82% instead of 78% CPU load, so we feel this small variation in CPU load is acceptable. This is especially true as the resulting variation in performance is much smaller: we are talking about 2-3% performance variations, well within the error margin of our test.

You can test this OLTP database via a very thin web tier or directly. As the web tier only added noise (it uses the slow ODBC driver!) to our results, we tested directly. All servers were tested in a dual CPU configuration.

Dell DVD Store on MS SQL Server 2005 SP3 (64-bit)

The AMD Shanghai has no trouble leaving the older Xeons behind, even at a lower clock speed. The Xeon 5570, however, plays in a different class altogether. Thanks in part to SMT, it is capable of outperforming its older brother by 78% and the competition by 66%. Hyper-Threading alone gives the Xeon 5570 a 21% performance boost.

One Xeon 5570 server is capable of replacing 3 to 4 older server systems based on the Xeon 50xx series.



OLTP: Oracle Charbench "Calling Circle"
Operating System: Windows 2008 Enterprise RTM (64-bit)
Software: Oracle 10g Release 2 (10.2) for 64-bit Windows
Benchmark software: Swingbench/Charbench 2.2
Database Size: 9 GB
Typical error margin: 2-2.5%

In our last review, we included our first Oracle benchmark. In this review, we are happy to announce that we finally tamed the Oracle beast … somewhat. The first benchmark we tried (see our AMD Opteron 8384 2.7GHz review) was "Order Entry", but this benchmark is designed for Oracle Real Application Clusters and right now we could not make it scale above eight cores. Even the gains from four to eight cores were pretty small, despite many experiments (increasing the number of users and so on). With Calling Circle, we increased the database size to 9.5GB to make sure that once again locking contention was not completely killing off multi-core performance. To reduce the pressure on our humble storage system, we increased the SGA size (Oracle buffer in RAM) to 10GB and the PGA size was set at 1.6GB. A calling circle test consists of 83% selects, 7% inserts, and 10% updates.

The "calling circle" test runs for 10 minutes. The test is repeated six times and the results of the first run are discarded. The reason for discarding the first run is that the disk queue length is sometimes close to 1, while the second run and later have a DQL of 0.2 or lower. In this case it was rather easy to run the CPUs at 99% load. All configurations below are dual CPUs.
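
The run handling described above (repeat the test, drop the warm-up run, average the rest) can be sketched as follows. The throughput numbers are made up purely for illustration and are not our measurements.

```python
from statistics import mean

# Average the steady-state runs of a repeated benchmark and ignore the
# warm-up run(s), during which the disks are still being hit.


def steady_state_mean(run_results, warmup_runs=1):
    """Mean of all runs after discarding the first `warmup_runs` results."""
    if len(run_results) <= warmup_runs:
        raise ValueError("need at least one run after the warm-up runs")
    return mean(run_results[warmup_runs:])


# Hypothetical transactions-per-minute values for six 10-minute runs.
hypothetical_runs = [900.0, 1000.0, 1010.0, 990.0, 1005.0, 995.0]

print(f"Reported score: {steady_state_mean(hypothetical_runs):.1f}")
```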

Oracle Calling Circle

We have seen this picture before: the latest Opteron has no problem leaving the older generations of Xeons behind. However, the newest Xeon is simply running circles around the rest of the pack. It is a mind blowing 95% faster than the Xeon 5472 and 85% faster than the Opteron 8384 2.7GHz. SMT is a formidable weapon, as Oracle makes good use of the extra threads, and it provides a 35% performance increase.



Decision Support: Nieuws.be
Operating System: Windows 2008 Enterprise RTM (64-bit)
Software: SQL Server 2005 Enterprise x64 SP3 (64-bit)
Benchmark software: vApus + real world "Nieuws.be" Database
Database Size: > 100 GB
Typical error margin: 1-2%

The Flemish/Dutch Nieuws.be site is one of the newest web 2.0 websites, launched in 2008. It gathers news from many different sources and allows readers to personalize their view of all this news. The Nieuws.be site sits on top of a large database - more than 100GB and growing. This database consists of a few hundred separate tables, which have been carefully optimized by our lab (the Sizing Servers Lab).

Nieuws.be allowed us to test the MS SQL 2005 database for CPU benchmarking. We used a log taken between 10:00 and 11:00, when traffic is at its peak. vApus, the stress testing software developed by the Sizing Servers Lab, analyzes this log and simulates real users by performing the actions they performed on the website. In this case, we used the resulting load on the database for our test. 99% of the load on the database consists of selects, and about 5% of them are stored procedures. Network traffic is 6.5MB/s average and 14MB/s peak, so our Gigabit connection still has a lot of headroom. DQL (Disk Queue Length) is at 2 in the first round of tests, but we only report the results of the subsequent rounds where the database is in a steady state. We measured a DQL close to 0 during these tests, so there is no tangible impact from the hard disks. This test is as real world as it gets! All servers were tested in a dual CPU configuration.
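
The network headroom claim is easy to verify with a quick calculation. The sketch below assumes 1MB equals 10^6 bytes and ignores protocol overhead, so the real utilization will be slightly higher.

```python
# Share of a Gigabit link used by the measured network traffic.
# Assumptions: 1 MB = 10**6 bytes, no protocol overhead.


def link_utilization(megabytes_per_s: float, link_mbit_per_s: float = 1000.0) -> float:
    """Fraction of the link capacity consumed by the given traffic."""
    return megabytes_per_s * 8 / link_mbit_per_s


print(f"Average: {link_utilization(6.5):.1%} of the Gigabit link")
print(f"Peak:    {link_utilization(14.0):.1%} of the Gigabit link")
```

Even at peak, the traffic uses only a little over a tenth of the link.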

Nieuws.be MS SQL Server 2005

Seven times faster than a 3-year old CPU and 76% faster than an adversary that used to outperform almost every Intel CPU! Nehalem is like a CPU that used a time machine and teleported to 2009 from 2011. To put this kind of performance into perspective: it would take a 4.7GHz Opteron to keep up with Nehalem at 3.03GHz (that's the average clock speed as Turbo mode was enabled).
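
The "4.7GHz Opteron" figure comes from a simple extrapolation, sketched below. It assumes that Opteron performance would scale linearly with clock speed, which is optimistic at such clock speeds; the 2.7GHz base is the clock of the Opteron 8384 we tested.

```python
# Clock speed a competitor would need to close a given performance gap.
# Assumption: performance scales linearly with clock speed.


def equivalent_clock_ghz(competitor_ghz: float, advantage: float) -> float:
    """Clock needed by the competitor to match a relative advantage."""
    return competitor_ghz * (1 + advantage)


needed = equivalent_clock_ghz(2.7, 0.76)  # Nehalem is 76% faster
print(f"Opteron clock needed to keep up: {needed:.2f} GHz")
```

The result is about 4.75GHz, in line with the rounded figure in the text.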



Website: MCS eFMS (Windows 2003 32-bit EE)
Operating System: Windows 2003 R2 - 32-bit
Software: MCS eFMS 9.2
Benchmark software: vApus + real world "MCS" PHP site
Typical error margin: 1-2%

One very interesting and processing intensive application that we encountered was the modular MCS Enterprise Facility Management Software (MCS eFMS), developed by MCS. The objective of eFMS is to integrate the management of space usage (buildings), assets and equipment (such as furniture, beamers etc.), cabling infrastructure, and others while keeping track of costs. MCS eFMS stores all information in a central Oracle database.


MCS eFMS integrates space management, room reservations and much more…

What makes the application interesting to us as IT researchers is the integration of three key technologies. First, it uses a web-based front end that integrates CAD drawings and gets its information from a rather complex, ERP-like Oracle database. Second, it provides building overview trees of all rooms available and their reservations in a certain building. Finally, it allows users to drill down using the CAD drawing to get more detail. MCS eFMS is one of the most demanding web applications we have encountered so far. MCS eFMS uses the following software:

  • Microsoft IIS 6.0 (Windows 2003 Server Standard Edition R2)
  • PHP 4.4.0
  • FastCGI
  • Oracle 9.2

Large international companies such as Siemens, Ernst & Young, and Startpeople use MCS eFMS daily, which makes testing this application even more attractive. We tested with both single CPU and dual CPU configurations.

MCS eFMS 9.2 website
(*) CPU load was at 50-55%.

For once, the Opteron stays in the slipstream of the Xeon X5570. If you look at the single CPU results, you can see that something went wrong: the Xeon X5570 with HT enabled is about 37% faster than the Opteron at 2.7GHz, but once we add a second Xeon X5570, the website refuses to perform better. Without Hyper-Threading, adding four more cores leads to a 32% performance increase from 51.2 responses per second to 68 responses per second. This seems to be something PHP related, as there are too few PHP threads to actually absorb CPU power. The result is that the dual Xeon X5570 with Hyper-Threading enabled is only loaded at 50%-55%, clearly indicating that the PHP site is not making use of the 16 logical cores. Windows 2003 R2 does not seem to schedule its threads optimally in that case.
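
The scaling from one to two CPUs without Hyper-Threading can be quantified with the two throughput numbers quoted above. The "scaling efficiency" metric below is our own illustrative definition: the achieved speedup divided by the increase in CPU count.

```python
# Scaling from a single CPU to a dual CPU configuration.


def scaling_gain(single: float, dual: float) -> float:
    """Relative throughput gain from adding the second CPU."""
    return dual / single - 1


def scaling_efficiency(single: float, dual: float, factor: int = 2) -> float:
    """Achieved speedup divided by the ideal speedup (`factor`)."""
    return (dual / single) / factor


gain = scaling_gain(51.2, 68.0)
eff = scaling_efficiency(51.2, 68.0)
print(f"Gain from second CPU: {gain:.1%}, scaling efficiency: {eff:.1%}")
```

A gain of roughly a third from doubling the core count means about one third of the theoretical dual CPU throughput is left unused.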

This real world test shows that not all applications are capable of scaling up easily. We know from previous tests that this application scales out when you use load balancing. If we only look at the single CPU performance, which is quite common in the website market, we can conclude that Xeon X5570 is about 37% faster than its best competitor and 39% faster than its predecessor (Xeon 54xx). At that point both CPUs are running at 80 to 100% CPU load. You would almost be disappointed after all the spectacular performance numbers that the latest Xeon produced, but a 37% performance advantage is still very impressive.



Collaboration and infrastructure software: MS Exchange 2007
Operating System: Windows 2008 Enterprise RTM (64-bit)
Software: MS Exchange 2007 SP1 (64-bit)
Benchmark software: LoadGen 08.02.004
Typical error margin: 1-2%

Collaborative and infrastructure servers account for about 50% of the server market. Even if we subtract the fileservers and print servers (which rarely demand a lot of processing power), it is still the most important market for servers. Today we're introducing MS Exchange 2007 in our server CPU benchmark suite.

For our Exchange 2007 test we used Microsoft LoadGen in stress mode. This means that instead of actually simulating a business day, LoadGen will fire as many actions at the server as it can handle for the specified duration of the test, which in our case is slightly more than 1 hour. We limited the mailbox for each of the 2000 users to 30MB instead of the default 750MB to reduce the load on our storage system. All users are logged on before the actual test starts.

The LoadGen test results tend to vary wildly when you use the default settings. Even when we tested for 8 hours, the results were not within an acceptable margin of error. To remedy this, we limited the different actions to just SendMail, ReadAndProcessMessages, BrowseContacts, and BrowseCalendar. It is not perfect, but at least we get very repeatable results. As we are relative newbies when it comes to benchmarking the Exchange groupware, expect some improvements to this benchmark in the future.

MS Exchange 2007 LoadGen

Our testing shows that the Opteron 2384 achieves the same initial throughput as the Xeon 5472, but for some reason the testing breaks off or slows down to an incredibly slow pace. That is why we cannot give you the final results right now; we'll update the results when we solve this problem. Nevertheless, there is little doubt in our minds that the newest Xeon X5570 is running circles around everyone else: it is capable of performing twice as many operations as its older brother.



Rendering: 3ds Max 2008
Operating System: Windows 2008 Enterprise RTM (64-bit)
Software: 3ds Max 2008
Benchmark software: Built-in timer
Typical error margin: 1-2%

Render servers are only a small part of the server market. We used the "architecture" scene included in the SPEC APC 3DS Max test. All tests were done with 3ds Max's default scanline renderer, SSE enabled, and we rendered at HD 720p (1280x720) resolution. We measured the time it takes to render 10 frames (from 20 to 29) and then calculated (3600 seconds * 10 frames / time recorded) how many frames a certain CPU configuration could render in one hour. Results are reported as rendered images per hour.
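
The conversion from render time to images per hour is shown below as a short sketch. The 120 second render time is a hypothetical value chosen for illustration, not one of our results.

```python
# Convert the time needed to render a batch of frames into frames per hour.

SECONDS_PER_HOUR = 3600


def frames_per_hour(frames_rendered: int, seconds_elapsed: float) -> float:
    """Frames a configuration could render in one hour at the measured pace."""
    if seconds_elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return SECONDS_PER_HOUR * frames_rendered / seconds_elapsed


# Hypothetical example: frames 20 to 29 (10 frames) rendered in 120 seconds.
print(f"{frames_per_hour(10, 120.0):.0f} images per hour")
```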


We used the 32-bit version of 3ds Max 2008 on 64-bit Windows 2008 RTM. The 64-bit version of Windows 2008 is a bit slower (especially when you use the scanline renderer). All CPU configurations are dual, unless we indicate otherwise.

3ds Max 2008 32-bit - architecture scene

When it comes to floating point and SSE, the performance gains over several CPU generations are a bit smaller. The Xeon 5570 again shatters all records, but it's "only" three times faster than the Xeon 5080. There are two reasons for this. First, the Xeon 5080 is based on the Pentium 4 architecture. Thanks to its high clock speed, it can deliver relatively high FLOPS (Floating Point Operations per Second). The high branch misprediction penalty, the relatively low hit rate of the trace cache, and the very high memory latency, which all made the Pentium 4 based Xeons very inefficient in integer code, are of no real importance when running floating point intensive applications such as 3ds Max.

Second, improvements have been slower in this area. The Xeon 51xx introduced 128-bit SSE units (AMD followed with Barcelona, the Opteron 23xx), and the Harpertown Xeon (Xeon 54xx) added a faster radix-16 divider, which resolves 4 bits per iteration. We analyzed this in great detail previously: while the Opterons are still better at divisions, the Xeon 54xx is faster in multiplications, which are much more common. When it comes to floating point, the Xeon 55xx "Nehalem" is almost identical to the Xeon 54xx "Harpertown", while the AMD "Shanghai" core is identical to the AMD "Barcelona" core. Notice how the Nehalem at 2.93GHz (in reality 3.1GHz) settles between the 3GHz and 3.3GHz Xeon 54xx. This confirms that floating point code hardly sees a difference between a Harpertown and a Nehalem… unless it is limited by the bandwidth available to the core, of course. Nehalem can still beat its older brothers thanks to SMT, once again underlining what a powerful weapon SMT is.

While the Xeon X5570 is only 24% faster than the Xeon 5450, that is good enough to make the current 4-way servers completely useless for rendering. The dual Xeon "Nehalem" offers the same performance at much lower price points, while consuming a lot less power.



Virtualization (ESX 3.5 Update 2/3)

More than 50% of servers are bought to be virtualized. Virtualization is thus the killer application and the most important benchmark available. VMware is by far the market leader with about 80% of the market. However, we encountered - once again - serious issues in getting ESX installed and running on the newest platform. ASUS told us we need ESX 3.5 Update 4, which we do not yet have in the labs. We are doing all we can to make sure that our long awaited hypervisor comparison will be online in April, so stay tuned. Since we have not been able to carry out our own virtualization benchmarking, we turn to VMware's VMmark.

VMware VMmark is a benchmark of consolidation. Several virtual machines performing different tasks are consolidated, creating a tile. A VMmark tile consists of:

  • MS Exchange VM
  • Java App VM
  • Idle VM
  • Apache web server VM
  • MySQL database VM
  • SAMBA fileserver VM

The first three run on a Windows 2003 guest OS and the last three on SUSE SLES 10.


Let us first see how many tiles (six VMs per tile) each server can support:

VMware VMmark number of tiles

The newest Xeon is shattering records again: with 13 tiles (in 72GB) it can consolidate by far the most VMs in a dual socket server. It is already dangerously close to the quad socket servers with up to 128GB of RAM. It is important to note that once you use more than one DIMM per channel, the maximum DDR3 speed is 1066MHz. Once you fill up all slots (three DIMMs per channel, nine DIMMs per CPU), the DDR3 memory runs at 800MHz. Intel's official validation results can be found here.

Nevertheless, the performance impact of lower DDR3 speeds is not large enough to offset the advantage of three DIMMs per channel: up to 18 DIMMs in a dual configuration is a record. So far, AMD's latest Opteron held the record with eight DIMMs per CPU, or a maximum of 16 per dual socket server. AMD supports up to three DIMMs per channel at 800MHz. Once you use four DIMMs per channel (eight per CPU), the clock speed falls back to 533MHz. That is also a reason, besides pure performance, why Intel can support 13 tiles or 78 light VMs per server: Intel used 72GB of DDR3 at 800MHz. AMD is stuck at eight tiles for the moment: the dual Opteron servers get 64GB (at 533MHz) at the most.
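The DIMM and VM counts above follow from simple multiplication. The sketch below spells that out; the class and field names are illustrative, and the channel counts (three per Xeon, two per Opteron) are derived here from the per-CPU DIMM totals quoted in the text rather than stated directly.

```python
from dataclasses import dataclass

VMS_PER_TILE = 6  # a VMmark tile consists of six VMs


@dataclass(frozen=True)
class MemoryConfig:
    sockets: int
    channels_per_cpu: int   # derived: DIMMs per CPU / DIMMs per channel
    dimms_per_channel: int
    total_gb: int

    @property
    def dimm_count(self) -> int:
        return self.sockets * self.channels_per_cpu * self.dimms_per_channel

    @property
    def gb_per_dimm(self) -> float:
        return self.total_gb / self.dimm_count


# Fully populated dual socket configurations as described in the text.
xeon = MemoryConfig(sockets=2, channels_per_cpu=3, dimms_per_channel=3, total_gb=72)
opteron = MemoryConfig(sockets=2, channels_per_cpu=2, dimms_per_channel=4, total_gb=64)

print(xeon.dimm_count, xeon.gb_per_dimm)        # 18 4.0
print(opteron.dimm_count, opteron.gb_per_dimm)  # 16 4.0
print(13 * VMS_PER_TILE)                        # 78
```

Both maximum configurations work out to 4GB per DIMM, so the capacity difference comes purely from the two extra slots.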

After a benchmark run, the workload metrics for each tile are computed and aggregated into a score for that tile. This aggregation is performed by first normalizing the different performance metrics such as MB/second and database commits/second with respect to a reference system. Then, a geometric mean of the normalized scores is computed as the final score for the tile. The resulting per-tile scores are then summed to create the final metric.
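The aggregation described above (normalize, take the geometric mean per tile, then sum across tiles) can be sketched as follows. This is not VMware's implementation; the workload names, metric values, and reference numbers are made up for illustration.

```python
from statistics import geometric_mean


def tile_score(measured: dict, reference: dict) -> float:
    """Geometric mean of each workload metric normalized to the reference system."""
    ratios = [measured[name] / reference[name] for name in reference]
    return geometric_mean(ratios)


def final_score(tiles: list, reference: dict) -> float:
    """Sum of the per-tile scores."""
    return sum(tile_score(tile, reference) for tile in tiles)


# Hypothetical reference system and two hypothetical tiles (not real VMmark data).
reference = {"web_mb_per_s": 100.0, "db_commits_per_s": 200.0}
tiles = [
    {"web_mb_per_s": 200.0, "db_commits_per_s": 400.0},  # ratios 2 and 2
    {"web_mb_per_s": 100.0, "db_commits_per_s": 800.0},  # ratios 1 and 4
]

print(round(final_score(tiles, reference), 6))  # 4.0
```

Because the per-tile scores are summed, a server that sustains more tiles at similar per-tile throughput ends up with a proportionally higher final score.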

VMware VMmark
(*) preliminary benchmark data

World switch times from VM to hypervisor have been reduced to 40% of those of Clovertown (Xeon 53xx), and EPT is good for a 27% performance increase. Add a massive amount of memory bandwidth, and we understand why the Nehalem EP shines in this benchmark. The scores for the Xeon X5570 are however preliminary: we have seen scores range from 17.9 to 19.51, but always with 13 tiles. The ESX version was not an official version ("VMware ESX Build 140815") which will probably morph into ESX 3.5 Update 4. AMD's results might also get a bit better with ESX 3.5 Update 4, so take the results with a grain of salt, but they give a good first idea. There is little doubt that the newest Xeon is also the champion in virtualization.

Both AMD and Intel emphasize that you can "vmotion" across several generations. AMD demonstrated that it is possible to migrate from the hex-core Istanbul to the quad-core Barcelona, while Intel demonstrated vmotion between "Harpertown" and "Nehalem".


It will be interesting to see how far you can go with this in practice. In theory you can go from Woodcrest to Nehalem. It is funny to see that Intel (and AMD to a lesser degree) have to clean up the mess they made with the incredibly chaotic ISA SIMD extensions: from MMX to more SSE extensions than we care to remember.



HPC Market

Contrary to virtualization, web servers, and databases, we have little expertise in our lab to perform and fully understand HPC benchmarks. Nevertheless, we can get an impression from AMD's and Intel's own benchmarking. There are two kinds of HPC applications: those that are completely CPU processing limited (dense matrices) and those that are mostly bandwidth limited (sparse matrices). A good example of the first type is LINPACK. We still have to verify our testing but the first results show about a 15% advantage for Intel. The intensity of the LINPACK benchmark does not allow turbo mode to kick in. LINPACK shows that when it comes to raw FPU performance, the newest Intel is only a few percent faster clock for clock than its competitor.
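To see why a 15% lead translates into only a few percent clock for clock, divide the speedup by the clock ratio. The sketch below assumes the comparison is the 2.93GHz Xeon X5570 against the 2.7GHz Opteron 2384 with turbo mode inactive; the function name is illustrative.

```python
def per_clock_advantage(speedup: float, clock_a_ghz: float, clock_b_ghz: float) -> float:
    """Speedup of A over B after removing the effect of A's higher clock speed."""
    return speedup / (clock_a_ghz / clock_b_ghz)


# 15% faster overall, 2.93GHz versus 2.7GHz (assumed clocks, no turbo).
advantage = per_clock_advantage(1.15, 2.93, 2.7)
print(f"{(advantage - 1) * 100:.1f}%")  # 6.0%
```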

The second type of HPC applications is far more common. We have found a few LSDyna (crash simulation) numbers.

LSDyna 3 car collision (AMD and Intel numbers)

The Xeon X5570's results in business applications were fantastic, and the HPC applications are no exception. The newest Xeon is no less than 101% (!) faster than the previous generation of Xeons and almost 60% faster than the best Opteron.



Power Consumption

The most power hungry 2.93GHz Nehalems are sold in the desktop market (130W TDP), while the "greenest" ones are sold in the server market (95W). It is clear that Intel understands that performance alone is not good enough and the performance/watt metric is getting more popular each day. A direct power comparison was not possible, as the servers are too different: different power supplies, form factors, and so on. Therefore, we tested in a different way. First, we tested the server with two CPUs. Second, we tested the server with one CPU, while we kept the number of DIMMs the same. That way we could subtract both numbers and calculate the difference that one CPU made. It is not very accurate, but it's good enough to get a rough idea. The CPUs were running at about 80% CPU load, running the DVD-store benchmark for 10 minutes. Below you find the average power consumption.
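The subtraction method boils down to averaging the power samples of each run and taking the difference. The sketch below uses placeholder wattages purely to show the arithmetic; they are not our measurements, and the function name is illustrative.

```python
from statistics import mean


def per_cpu_power(two_cpu_samples: list, one_cpu_samples: list) -> float:
    """Estimate the power drawn by one CPU as the difference between the
    average power of the two-CPU run and the one-CPU run (same DIMM count)."""
    return mean(two_cpu_samples) - mean(one_cpu_samples)


# Placeholder readings in watts (hypothetical, for illustration only).
two_cpus = [298.0, 302.0, 300.0]
one_cpu = [239.0, 241.0, 240.0]

print(per_cpu_power(two_cpus, one_cpu))  # 60.0
```

Anything that changes between the two runs besides the CPU itself, such as power supply efficiency at a different load level, ends up in this difference, which is one reason the estimate is only rough.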
 
Power Consumption

The method we used does not allow us to determine the absolute idle power numbers very accurately, but it seems that Xeon X5570 consumes 8W to 10W less when running at idle. Again, all these numbers have a pretty high margin of error, but they are accurate enough to say that the Opteron 2384 consumes quite a bit less at full load while the latest Xeon is clearly the winner when you are running idle. If your application is running close to idle most of the time, with a few spikes at some parts of the day, the Xeon is the performance/watt champion.

The only question is what happens if the server is running most of the time at relatively high load (for example thanks to virtualization)? Then we have to remember that the CPU is only part of a complete server. Let us assume that the Nehalem server consumes 320W (which is close to what we measured). A similar AMD Opteron server can then save about 18W per CPU, and 1W per DIMM as high speed DDR3 is a bit more power hungry than DDR2 (which runs at a lower speed). We assume that we use six DIMMs per CPU.

Power Comparison
| CPU | Power consumption (W) | Performance | Performance/Watt |
|---|---|---|---|
| Intel X5570 2.93GHz | 320 | 116399 | 363.7469 |
| AMD | 270 | 70034 | 259.3852 |

We could say that the Nehalem is winning by a margin of about 40%. Now, it is clear that the absolute winner is difficult to determine; it all depends on your applications. Still, it is clear that when you compare the best Intel and AMD CPUs, the best performance/Watt figures come from Intel by a pretty large margin.
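The numbers in the power comparison can be reproduced in a few lines. The sketch below uses the assumptions stated above (320W for the Nehalem server, 18W saved per CPU and 1W per DIMM on the Opteron side, six DIMMs per CPU); the variable names are illustrative.

```python
NEHALEM_WATTS = 320
CPUS = 2
DIMMS_PER_CPU = 6

# 320W minus 18W per CPU minus 1W per DIMM gives 272W, which the table rounds to 270W.
opteron_estimate = NEHALEM_WATTS - CPUS * 18 - CPUS * DIMMS_PER_CPU * 1
print(opteron_estimate)  # 272


def perf_per_watt(performance: float, watts: float) -> float:
    return performance / watts


intel = perf_per_watt(116399, 320)
amd = perf_per_watt(70034, 270)

print(round(intel, 1), round(amd, 1))     # 363.7 259.4
print(f"{(intel / amd - 1) * 100:.0f}%")  # 40%
```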



Pricing

Pricing steers most of the purchasing decisions so let's look at how the best server CPUs from Intel and AMD compare. We compare the 45nm "Nehalem" Xeon with the 45nm "Shanghai" Opteron.

Pricing
| Intel Xeon model | Speed (GHz) / TDP (W) | Price | AMD Opteron model | Speed (GHz) / TDP (W) | Price |
|---|---|---|---|---|---|
| X5570 | 2.93 / 95W | $1386 | | | |
| X5560 | 2.80 / 95W | $1172 | | | |
| X5550 | 2.66 / 95W | $958 | 2389 | 2.9 / 75-115W | $989 |
| | | | 2387 | 2.8 / 75W | $873 |
| E5540 | 2.53 / 80W | $744 | 2384 | 2.7 / 75-115W | $698 |
| E5530 | 2.4 / 80W | $530 | 2382 | 2.6 / 75-115W | $523 |
| L5520 | 2.26 / 60W | $530 | 2376 HE | 2.3 / 55-79W | $575 |
| L5510 | 2.13 / 60W | $423 | 2374 HE | 2.2 / 55-79W | $450 |
| E5520 | 2.26 | $373 | 2380 | 2.5 / 75-115W | $377 |
| E5506 | 2.13 | $266 | 2378 | 2.4 / 75-115W | $255 |
| E5504 | 2 | $224 | | | |
| E5502 | 1.86 | $188 | 2376 | 2.3 / 75-115W | $174 |

A few interesting observations can be made. First, AMD's 45nm process is a lot healthier than the 65nm process. Only a few months after the introduction of a 2.7GHz part, AMD is not only capable of boosting the clock speed to 2.9GHz but it does so without increasing the TDP. It is also interesting that AMD CPUs are covering a very narrow clock speed band at 75W from 2.4GHz to 2.9GHz. This indicates that AMD is really getting some good clock speeds out of the 45nm CPUs. This is a huge contrast with what we saw in 2007 and the first half of 2008. We were used to seeing AMD stuck at 2.3GHz, and those clock speeds are now all low energy parts.

AMD recognizes that Intel has the faster micro architecture and positions the 2.9GHz Shanghai at the level of the X5550. Intel is untouchable at the high-end but leaves AMD some chances at the low end. It positions a 1.86GHz chip without any Hyper-Threading or Turbo mode against a 2.3GHz Opteron. Unfortunately, those chips are not in our lab so we can't draw any conclusions.



Market Analysis

We'll wrap up with a quick look at the complete market to see how the most interesting CPUs from Intel and AMD compare. In the first column you will find the market. The second column shows the percentage of server shipments to this market. Some markets, like ERP, OLTP, and OLAP, generate more revenue for server manufacturers; however, we have no recent numbers on this so we'll just keep it in mind. The green zones of the market are the ones where we have a decent benchmark that AMD wins, the blue ones represent the Intel zone, and the red parts are - for now - unknown. Let's first look back at the situation from a few months ago.

AMD "Shanghai" Opteron 2.7GHz vs. Xeon "Harpertown" 3GHz
| Market | Importance | First bench | Second bench | Benchmarks/remarks |
|---|---|---|---|---|
| ERP, OLTP | 10-14% | 21% | 5% | SAP, Oracle |
| Reporting, OLAP | 10-17% | 27% | | MySQL |
| Collaborative | 14-18% | N/a | | |
| Software Dev. | 7% | N/a | | |
| e-mail, DC, file/print | 32-37% | N/a | | |
| Web | 10-14% | 2% | | |
| HPC | 4-6% | 28% | -3% to 66% | LS-DYNA, Fluent |
| Other | 2%? | -18% | -15% | 3DSMax, Cinebench |
| Virtualization | 33-50% | 34% | | VMmark |

The market was almost completely green. AMD's "Shanghai" Opteron was reigning supreme in the HPC and virtualization market. It was clearly in the lead in the OLTP and OLAP market and it had a small advantage in the web market and probably also in the collaborative software market. Since the AMD servers also consumed less power (the Xeons used power hungry FB-DIMMs), you could say that AMD was the "smarter" choice in about 90-98% of the market.

Then a tsunami called "Nehalem" was launched…

   
Nehalem Performance Overview
| Server Software Market | Importance | Benchmarks used | Intel Xeon X5570 vs. Opteron 2384 | Intel Xeon X5570 vs. Xeon 5450 |
|---|---|---|---|---|
| ERP, OLTP | 10-14% | SAP SD 2-tier (Industry Standard benchmark) | 81.40% | 119% |
| | | Oracle Charbench (Free available benchmark) | 84.70% | 94% |
| | | Dell DVD Store (Open Source benchmark tool) | 66.20% | 78% |
| Reporting, OLAP | 10-17% | MS SQL Server (Real world vApus benchmark) | 76.50% | 107% |
| Collaborative | 14-18% | MS Exchange LoadGen (MS own load generator for MS Exchange) | Estimated 75-95% | 93% |
| e-mail, DC, file/print | 32-37% | See MS Exchange | | |
| Software Dev. | 7% | None | | |
| Web | 10-14% | MCS eFMS (Real world vApus benchmark) | 36.80% | 39% |
| HPC | 4-6% | LS-DYNA (Industry Standard) | 57.00% | 101% |
| | <1% | LINPACK | 15.00% | 1% |
| Other | 2%? | 3DSMax (Our own bench) | 50.30% | 24% |
| Virtualization | 50% | VMmark (Industry standard) | 58.70% | 114% |

…and nothing that was not called Xeon X55xx was still standing. The Xeon X55xx series simply crushes the competition and reduces the older Xeons to expensive space heaters, with the exception of the rendering and dense matrix HPC market. If you are consolidating your servers, buying a new heavyweight back end database server or mail server, there is only one choice at this moment: the Xeon X55xx series. Period.

AMD after the Sledgehammer blow

Is this the end of the line for the Sunnyvale based company? Is the launch of Bulldozer the day that never comes? Is AMD broken, beat and scarred? Scarred: who would not be after this kind of blow? Beaten? For now. But not broken; AMD dies hard. After more than a full year of rather poor execution (Q2 2007 to Q3 2008), AMD is finally shaping up and executing like in the K7-K75 days. The 45nm process technology is very healthy and the speed path problems of Barcelona have been fixed in Shanghai. The result is that only four months after the successful launch of the 2.7GHz Shanghai, we are already seeing a speed bump while the power dissipation stays the same. The 2.9GHz chip was flying towards our lab while I was writing this conclusion; we'll add it as soon as possible.

The 2.9GHz part will not be able to come close to the top Nehalems; however, with the right pricing it might be an attractive alternative to the lower end Xeon 55xx series. Considering that a triple channel board equipped with DDR3 will result in a somewhat more expensive server, AMD might still be able to compete at the lower end. What is more, faster versions of Shanghai strengthen the position of AMD in the small but profitable octal CPU market. For example, 2.9GHz will allow SUN and HP to produce massive monster servers that can support more than 20 tiles and performance scores above 30 in VMmark. Faster versions of Shanghai with vast amounts of memory should also keep the 4-way server market open for AMD.

The hex-core version of Shanghai "Istanbul" is already running VMware ESX 3.5, which indicates that the launch of AMD's hex-core is going to be sooner than expected. AMD will have to surprise us with better than expected power consumption and clock speeds, but if they do, AMD might be in the race again. We doubt AMD will be able to outperform the best Xeon 55xx, but at least it has a chance to stay competitive with the midrange Intel options. Until then, aggressive pricing is the only weapon left.



The Bottom Line

An investor might lose some sleep over the Intel versus AMD war, but an ICT professional cares about return on investment. Does it pay off to invest in Xeon 55xx servers if you want to replace your 3-5 year old dual Xeon 50xx, quad Xeon 70xx or even slower Xeon 51xx based servers? Our power measurements show that the ASUS server (dual Xeon 5570) consumes about 285W to 330W under load, with 24GB of RAM (six DIMMs). To consolidate, you need a bit more as you need at least 48GB (12 DIMMs). We assume 320W on average for simplicity's sake. Our 5080-based servers consume 460W to 480W under load, with 16GB of RAM. We assume that all our servers have between 8GB and 16GB, and simplify our calculation by assuming they need 450W.

Nehalem Power Comparison
| Server Application | Intel Xeon X5570 vs. 3 year old server based on 50xx CPUs | Power consumption + 50% cooling (before) | Power consumption + 50% cooling (after) | Power consumption saving per year | Energy savings per year ($0.10 per kWh) |
|---|---|---|---|---|---|
| SAP SD 2-tier (Industry Standard benchmark) | 4.87 x faster | (5 x 450W) * 1.5 = 3.3 kW | 320W * 1.5 = 0.48 kW | 24364 kWh | $2436 |
| Oracle Charbench (Free available benchmark) | 4.44 x faster | (4 x 450W) * 1.5 = 2.7 kW | 320W * 1.5 = 0.48 kW | 19180 kWh | $1918 |
| Dell DVD Store (Open Source benchmark tool) | 3.96 x faster | (4 x 450W) * 1.5 = 2.7 kW | 320W * 1.5 = 0.48 kW | 19180 kWh | $1918 |
| MS SQL Server (Real world vApus benchmark) | 7.14 x faster | (7 x 450W) * 1.5 = 4.7 kW | 320W * 1.5 = 0.48 kW | 36676 kWh | $3668 |
| MS Exchange LoadGen (MS own load generator for MS Exchange) | 5.57 x faster | (5 x 450W) * 1.5 = 3.3 kW | 320W * 1.5 = 0.48 kW | 24364 kWh | $2436 |
| MCS eFMS (Real world vApus benchmark) | 2.84 x faster | (3 x 450W) * 1.5 = 1.9 kW | 320W * 1.5 = 0.48 kW | 12052 kWh | $1200 |
| 3DSMax (Our own bench) | 3.13 x faster | (3 x 450W) * 1.5 = 1.9 kW | 320W * 1.5 = 0.48 kW | 12052 kWh | $1200 |

Power consumption alone is paying back about half to one third of the investment in the server (which is probably in the $4000-$6000 range). In the case of Oracle, MS SQL server, SAP, and Exchange you may add significant savings in software licensing too. One server is far easier to manage than three to seven servers, so there are lots of cost savings in terms of manpower. Less rack space saves quite a bit of money too… and so on. It is clear that the new generation is well worth the investment even if we didn't make a detailed TCO calculation.
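For readers who want to plug in their own numbers, the energy savings estimate can be sketched as below. The 450W, 320W, 50% cooling overhead, and $0.10 per kWh figures come from the text; the 8,640 hours per year (360 days) is an assumption chosen because it reproduces the four-server and seven-server rows of the table, and other rows may differ slightly because of rounding. The function name is illustrative.

```python
def annual_savings_kwh(
    old_servers: int,
    old_watts: float = 450.0,
    new_watts: float = 320.0,
    cooling_factor: float = 1.5,     # power consumption + 50% cooling
    hours_per_year: float = 8640.0,  # assumption: 360 days of 24 hours
) -> float:
    """Energy saved per year when one new server replaces `old_servers` old ones."""
    before_kw = old_servers * old_watts * cooling_factor / 1000
    after_kw = new_watts * cooling_factor / 1000
    return (before_kw - after_kw) * hours_per_year


PRICE_PER_KWH = 0.10

for replaced in (4, 7):
    kwh = annual_savings_kwh(replaced)
    print(f"{replaced} servers: {kwh:.0f} kWh, ${kwh * PRICE_PER_KWH:.0f} per year")
# 4 servers: 19181 kWh, $1918 per year
# 7 servers: 36677 kWh, $3668 per year
```

Dividing an assumed server price by the yearly figure gives a rough payback estimate on energy alone, before software licensing, management, and rack space savings are counted.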

Conclusion

The Nehalem architecture only caused a small ripple in the desktop world, mostly due to high pricing and performance that only shines in high-end applications. However, it has created a giant tsunami in the server world. The Xeon 5570 doubles the performance of its predecessor in applications that matter to more than 80% of the server market. Pulling this off without any process technology or clock speed advantage, without any significant increase in power consumption, is nothing but a historic achievement for the ambitious and talented team of Ronak Singhal.

With native quad-core, fast interconnects between the CPUs, a shared L3 cache that allows faster cache coherency synchronization, and an integrated memory controller, Intel's team followed in the footsteps of AMD's team. However, they were determined to do better in every aspect, especially the memory controller, and they could count on a much more potent integer processing engine. It will be interesting to see how the clearly motivated AMD engineering teams will react. The trend of the past few months is good, but it will take some brilliant ideas and flawless execution to stay in the slipstream of today's Intel.

For the IT professional in these difficult economic times, the new generation of server CPUs is an excellent investment. Especially if you are consolidating on fewer but more powerful servers, the investment will pay off quickly and generate cost savings after 1 to 1.5 years or even less.
